Bounty: PHP internationalisation ascii / UTF-8 problem

by oneafrikan on September 14, 2006

I have a small issue that has me completely stumped: we’re seeing different results on our dev and live servers, which sit with the same hosting company and apparently run the same stack. It’s really weird and completely frustrating.

Basically, on one box foreign-language text is submitted and comes back to us in ASCII, which we’ve planned and accounted for; on the other box the submission is in ASCII, but it comes back to us in the foreign language itself, without any processing on our part.

So, if anyone can help with this, then I’m willing to pay a bounty for successful resolution ;-)

Any takers?

6 comments

Hi Gareth – you don’t mention how much the bounty is?

There are a few problems with this site/page, in order of severity:

– you should never submit a search form with POST (see http://trac.seagullproject.org/wiki/Standards/CorrectUseOfGetAndPost)

– if you correctly set the character encoding you should not *also* run the output through htmlentities() as you’ve done

– in the HTML head you correctly set the charset to utf-8, but you should also use the PHP header() function to set the equivalent charset in the Content-Type header

– modern browsers send x-www-form-urlencoded data to the server in the CHARSET that was determined to be that of the *form*, however that determination was made
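As a rough illustration of the two charset points above, here is a minimal sketch, assuming PHP 7 or later and a script saved as UTF-8. The helper name escape_utf8() is made up for this example. It sends the charset in the HTTP header and escapes only the HTML-special characters, leaving the Cyrillic text itself untouched. (In PHP versions before 5.4, htmlentities() called without an explicit charset assumed ISO-8859-1, which is one way UTF-8 text can end up mangled into entities.)

```php
<?php
// Hypothetical helper (not from the original site): escape a UTF-8 string
// for HTML output without converting non-ASCII characters into entities.
function escape_utf8(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Declare the encoding in the HTTP header as well as in the HTML head.
// header() only works before any output has been sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Only <, >, & and quotes are escaped; the Cyrillic letters pass through.
echo escape_utf8('Поиск <b>"тест"</b>'), "\n";
// prints: Поиск &lt;b&gt;&quot;тест&quot;&lt;/b&gt;
```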

I don’t actually see the difference between the two pages you list, except that only the dev example shows the first SELECT query above the search form. Both of the second queries successfully use Cyrillic text in the WHERE clause (I’m using FF).

by Demian Turner on September 14, 2006 at 12:28 pm.

Thanks Demian ;-)

Have to dash off now (to Germany), so will get back to this asap!

by Gareth Knight on September 14, 2006 at 4:42 pm.

You should never submit a search form with a POST? That’s new to me. If you’re doing any .NET development, any form element with runat="server" and an event handler will post back using a POST (function __doPostBack(eventTarget, eventArgument)).

I think http://trac.seagullproject.org/wiki/Standards/CorrectUseOfGetAndPost is simply saying: protect yourself against SQL injection attacks by not running any potentially malicious GET data directly against your DB, which is certainly good practice whether it’s a POST or a GET.
The link off your link ( http://www.cs.tut.fi/~jkorpela/forms/methods.html ) seems to be saying the same thing: “one should normally use METHOD="POST" if and only if the form submission may cause changes.”

I don’t see the difference between the pages either (FF and IE). I’d start by diffing the php.ini files, any .htaccess files, and the httpd.conf files between the boxes once you’re sure the webroots are identical.
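Alongside diffing the config files, one option is to dump the runtime settings that tend to affect character handling and compare the output from each box. This is only a sketch, assuming PHP 7 or later; the function name charset_settings() and the choice of settings are assumptions, and it covers the PHP side only (it says nothing about Apache directives or the database connection charset).

```php
<?php
// Hypothetical helper: collect a few PHP settings that commonly influence
// how text is encoded, so the output of two servers can be diffed.
function charset_settings(): array
{
    $hasMb = extension_loaded('mbstring');

    return [
        'php_version'          => PHP_VERSION,
        'default_charset'      => ini_get('default_charset'),
        'mbstring_loaded'      => $hasMb,
        'mb_internal_encoding' => $hasMb ? mb_internal_encoding() : null,
        // Passing '0' asks setlocale() for the current value without changing it.
        'locale_ctype'         => setlocale(LC_CTYPE, '0'),
    ];
}

// One "key = value" line per setting, easy to save and diff.
foreach (charset_settings() as $name => $value) {
    echo $name, ' = ', var_export($value, true), "\n";
}
```

Saving the output from the dev and live servers to two files and running a diff over them should show quickly whether the PHP configuration differs.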

by Jeff on September 14, 2006 at 10:48 pm.

@Demian:

First, thanks for your reply dude ;-)

The bounty: I didn’t have a fixed idea as to how much I wanted to offer, but did have a max figure that I wouldn’t have gone higher than. I was kinda hoping to either tie it to an hourly/daily rate; or just throw a figure out there and see what happened.

I’m not sure about never submitting a search form with POST… I can see why it would be a bad idea, but in this case:
a) the client didn’t want to see ugly or long URLs
b) we’d inherited a system from someone else and had to work within those boundaries; and
c) we’re only really doing select statements from the DB using predetermined dropdowns (I know that someone skilled enough probably could try a SQL injection attack, but we’ve tried to cover that by fixing it so that we’re only returning results if the request comes from the same server).
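For point (c), a commonly recommended approach is to check each dropdown value against the list of values the form actually offers, and to pass it to the database through a prepared statement. Checking where a request came from is usually not treated as an injection defence on its own, since request headers can be forged. The sketch below assumes PHP 7.1 or later with PDO; the region list, table, and column names are invented for illustration.

```php
<?php
// Hypothetical list of values offered by one of the dropdowns.
const ALLOWED_REGIONS = ['moscow', 'kiev', 'minsk'];

// Return the submitted value only if it is one of the predetermined
// options; anything else (including injection attempts) becomes null.
function validate_choice(string $value, array $allowed): ?string
{
    return in_array($value, $allowed, true) ? $value : null;
}

// Run the SELECT with a bound parameter so the value is never spliced
// into the SQL string. Table and column names are assumptions.
function find_listings(PDO $db, string $region): array
{
    $stmt = $db->prepare('SELECT id, title FROM listings WHERE region = :region');
    $stmt->execute([':region' => $region]);

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

var_dump(validate_choice('kiev', ALLOWED_REGIONS));           // string(4) "kiev"
var_dump(validate_choice("kiev' OR 1=1", ALLOWED_REGIONS));   // NULL
```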

What makes you think we’ve used htmlentities()?

Thanks for the heads up regarding the HTML and PHP charset stuff ;-)

Yup – somewhere along the line, the live server was spitting out correct Cyrillic, but the dev server was spitting out ASCII, so that was kinda weird – so you might have seen ASCII in the select statement if you’d viewed the source.

However, we tackled this issue late last week, and when we came back to the dev server, it appeared that both the dev and live servers were now spitting out the correct Cyrillic (no more ASCII), which made the problem easier to solve ‘cos we had a dev server that replicated the problem on the live server.

by Gareth Knight on October 2, 2006 at 10:24 am.

@Jeff:

Thanks for the reply ;-)

Yup – we asked the hosting company to do a check while we were seeing the problems, and when we came back to it the behaviour was consistent across both boxes, so I guess they either fixed it or made the appropriate changes to make things consistent… Will drop them a mail to ask what it was they did.

by Gareth Knight on October 2, 2006 at 10:29 am.

This comment may be far too late, but I just stumbled upon this blog entry looking around for info about securing against SQL injection with a UTF-8-encoded database.

Your clients may want a short URL, but you (and they) should be aware of all of the possible ramifications:

If the search form POSTs and then the POSTed page displays output:

1) The browser’s back button will not work properly: If they back up through a search page, the user will get a warning message asking to re-submit a form.

2) HTTP-level caching will not work (either in the browser or in an intermediate HTTP cache such as Squid), since caches can’t operate on POST data.

3) Search results cannot be bookmarked. Not sure if this is a problem for you or not, but in general query-based form results should be bookmarkable.

If the search form POSTs, then the search results are stored somewhere intermediate, and the POSTed page redirects to something like “searchresults.php?searchid=13242&page=1”:

1) The URL is more likely to be acceptable to your clients. You can also play with URL rewriting to get something like “/searchresults/13242/1”.

2) The browser’s back button will operate as expected

3) The URL will probably expire at some point, unless you keep search results around forever. This will still prevent users from bookmarking search results, and will interfere with HTTP-level caching.
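The POST-then-redirect variant described above could be sketched roughly as follows, assuming PHP 7 or later. The function names, the shape of the stored data, and the use of an array as the store are assumptions; in a real page the store would more likely be the session or a database table.

```php
<?php
// Hypothetical: keep the submitted criteria under a generated id so the
// results page can look them up later. $store stands in for the session
// or a database table.
function store_search(array &$store, array $criteria): string
{
    $id = bin2hex(random_bytes(8));
    $store[$id] = ['criteria' => $criteria, 'created' => time()];

    return $id;
}

// Build the URL the POST handler redirects to.
function results_url(string $searchId, int $page = 1): string
{
    return 'searchresults.php?'
        . http_build_query(['searchid' => $searchId, 'page' => $page], '', '&');
}

// Handle a POSTed search: store it and return the redirect target.
function handle_search_post(array $post, array &$store): string
{
    $id = store_search($store, ['q' => (string) ($post['q'] ?? '')]);

    return results_url($id);
}

$store = [];
$location = handle_search_post(['q' => 'тест'], $store);
echo $location, "\n";

// In a real request the handler would then finish with something like:
// header('Location: ' . $location, true, 303); exit;
```

The 303 status tells the browser to fetch the results page with a GET, which is what keeps the back button from prompting for a re-submit.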

If the search form GETs instead of POSTs:

1) Your clients have to deal with a long URL

2) Your clients can bookmark search results indefinitely

3) The browser’s back button will operate as expected

4) Intermediate caches (as well as your server) will include query terms in their logfiles. For example, forms containing sensitive information like social security numbers or credit card numbers should always be POSTed, to prevent a disgruntled admin from hunting through logs for identity theft. Having search terms in logfiles may be bad (if they’re sensitive searches) or good (if you can mine popular topics from logs and improve your services with that knowledge).
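For the GET variant, a bookmarkable search URL with a non-ASCII term might be built and read back as in this sketch. It assumes PHP 7.1 or later, the mbstring extension, and a script saved as UTF-8; the function names and the parameter names q and page are made up for the example.

```php
<?php
// Hypothetical: build a bookmarkable GET URL; http_build_query()
// percent-encodes the UTF-8 bytes of the search term.
function search_url(string $base, string $term, int $page = 1): string
{
    return $base . '?'
        . http_build_query(['q' => $term, 'page' => $page], '', '&', PHP_QUERY_RFC3986);
}

// Hypothetical: read the term back from the query array (e.g. $_GET),
// rejecting empty values and byte sequences that are not valid UTF-8.
function read_term(array $query): ?string
{
    $term = $query['q'] ?? '';
    if (!is_string($term) || $term === '' || !mb_check_encoding($term, 'UTF-8')) {
        return null;
    }

    return $term;
}

echo search_url('search.php', 'тест'), "\n";
// prints: search.php?q=%D1%82%D0%B5%D1%81%D1%82&page=1
```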

Anyway, this is a bit over the top and not really what you were asking for in your blog, but hopefully you, your other readers, or your clients will find something useful in it.

by Tom Barta on May 22, 2007 at 10:12 pm.
