Common mistakes with character encodings - part 1

posted: February 14th, 2007 · by: Sven

in: Globalization, Programming

Ok, I’m going to collect some gotchas, pitfalls, mistakes and traps relating to Unicode, UTF-8, character encoding in general etc. Hopefully this keeps me and others from being bitten (again), or at least helps to find the culprit more easily.

So if you have any additions here: please let me know! You read that? If you’ve encountered some kind of common problem with character encodings, please, let me know! There’s a comment form to use below and you also can always send me a mail. Thanks in advance :)

Let’s start with some basic stuff …

Know your encodings!

This might be pretty obvious to you once you’ve encountered it. But once in a while I meet somebody who’s stuck with weird hassles because he’s simply using a different encoding than the one he’s declaring.

As long as you’re just using relatively safe 7-bit ASCII characters everything might seem to work just fine. But as soon as you dare to move outside that range and try to use (e.g.) a German umlaut or another “extended” 8-bit character, all hell breaks loose and the sky comes falling down … if your environment’s character encodings don’t match for some reason.
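To make that a bit more tangible, here is a minimal Ruby sketch (assuming Ruby 1.9 or later; the helper name hex_bytes and the sample strings are made up for illustration). It prints the bytes a string is stored as under Latin-1 and under UTF-8: the pure ASCII string comes out identical in both, the one with the umlaut does not.

```ruby
# Return the bytes a string is stored as under the given encoding,
# formatted as space-separated hexadecimal numbers.
# (hex_bytes is a hypothetical helper, not part of any library.)
def hex_bytes(string, encoding)
  string.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

ascii_only  = "Rails"
# "Ärger", written with an escape so the example does not depend
# on the encoding this file itself happens to be saved in.
with_umlaut = "\u00C4rger"

puts hex_bytes(ascii_only, "ISO-8859-1")   # 52 61 69 6C 73
puts hex_bytes(ascii_only, "UTF-8")        # 52 61 69 6C 73
puts hex_bytes(with_umlaut, "ISO-8859-1")  # C4 72 67 65 72
puts hex_bytes(with_umlaut, "UTF-8")       # C3 84 72 67 65 72
```

So a mismatch can stay invisible for a long time and only shows up once the first non-ASCII character sneaks in.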

For example:

With Globalize for Ruby on Rails and other l10n/i18n tools you’re hardcoding strings in your templates that, when run, are handed to the database (in the case of Globalize) or some other persistence mechanism. E.g. a typical Globalized template might contain: "I wonder how I'm going to be encoded".t. Other tools such as GLoc might use: _("I wonder how I'm going to be encoded") instead.

Either way the string is used as hardcoded data and is processed by the software. In the case of Globalize it’s going to be inserted into the database as a key for subsequent lookups.

Now when your source code editor for some reason encodes your template as Latin-1 while your database is expecting you to provide UTF-8, you’re in trouble.

Recently somebody asked me about an error he got from MySQL 5.0 when following the instructions in my Globalize tutorial. His database told him: “Mysql::Error: #22001 Data too long or column ‘tr_key’ at row 1: INSERT INTO globalize_translations (`item_id`, `pl…”.

It turned out that he’d been bitten by exactly this problem: he had encoded his files in Latin-1 while his database table was configured to use UTF-8. So these encodings clashed, quite understandably: Rails handed a Latin-1 encoded string to a database that expected UTF-8.

The backstory was: while normally being used to vi on a Linux box, he was now working with RadRails for Eclipse on Windows XP. Windows XP file dialogs seem to offer “ANSI” as the default encoding for newly created files.

The MySQL error message was pretty misleading (nothing had actually been “too long”, whatever that was meant to tell us in the first place), and this has been recognized and fixed in the meantime.

The lesson from this seems to be: Your files are encoded somehow. So, know about their encoding!
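One rough way to find out is to check whether a file's raw bytes are valid UTF-8 at all. The following is only a sketch under a few assumptions: Ruby 2.1 or later, a hypothetical helper called valid_utf8?, and temporary files that stand in for templates saved by two differently configured editors.

```ruby
require "tempfile"

# Returns true if the raw bytes form a valid UTF-8 sequence.
# A true result does not prove the file was saved as UTF-8 (pure
# ASCII passes as well); a false result proves that it was not.
# (valid_utf8? is a hypothetical helper, not part of any library.)
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

text = "\u00C4rger" # "Ärger"
results = {}

# Simulate two editors saving the same text in different encodings.
["UTF-8", "ISO-8859-1"].each do |encoding|
  Tempfile.create("template") do |file|
    File.binwrite(file.path, text.encode(encoding))
    results[encoding] = valid_utf8?(File.binread(file.path))
  end
end

results.each do |encoding, valid|
  puts "saved as #{encoding}: valid UTF-8? #{valid}"
end
```

Mind that this only works in one direction: it can tell you that a file is definitely not UTF-8, but it can't tell you which of the many 8-bit encodings was used instead.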

Your files are encoded!

That’s what all this character encoding and charset stuff is all about. :)

You might want to start reading up on this. But for starters, it has to do with the fact that every character needs to be saved as bits and bytes. Basically, charsets are conventions that determine how characters are encoded.

An application that consumes some chunk of data, e.g. a file, needs to know which character encoding has been used to save the data. Likewise, a browser that receives an HTML page from a web server needs to know (or guess) the character encoding. It needs to decode the bits and bytes one way or another.

For example: The commonly used standard character encoding ISO 8859-1, or less formally Latin-1, will cause a character like the German umlaut Ä to be saved (encoded) as the hexadecimal byte or number C4 (which equals decimal 196).

But the same character will be encoded to an entirely different byte or number, namely hexadecimal 80 or decimal 128, when you tell your application to save (encode) this character using the Mac OS Roman character encoding. There, the byte C4 represents a completely unrelated character instead, namely the florin sign ƒ.

Now, when you try to open any such file with another application on another computer and probably even another operating system (browsing the web you’re doing this all the time) ... how would that software know what that number C4 that’s contained in the file is meant to be? Is it the German umlaut or is it that florin sign?
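You can watch this ambiguity happen in Ruby. The sketch below assumes Ruby 1.9 or later (where strings carry an encoding) and uses two made-up helpers, encode_char and decode_byte. It encodes the umlaut Ä under Latin-1 and Mac OS Roman, and then decodes one and the same byte, hexadecimal C4, under both.

```ruby
# Encode a character and return its bytes under the given encoding.
# (encode_char and decode_byte are hypothetical helpers.)
def encode_char(char, encoding)
  char.encode(encoding).bytes
end

# Interpret one raw byte according to a declared encoding and
# return the resulting character as a UTF-8 string.
def decode_byte(byte, encoding)
  [byte].pack("C").force_encoding(encoding).encode("UTF-8")
end

# Format a character's Unicode code point, e.g. "U+00C4".
def code_point(char)
  format("U+%04X", char.ord)
end

a_umlaut = "\u00C4" # Ä

p encode_char(a_umlaut, "ISO-8859-1")           # [196], hex C4
p encode_char(a_umlaut, "macRoman")             # [128], hex 80

puts code_point(decode_byte(0xC4, "ISO-8859-1")) # U+00C4, the umlaut
puts code_point(decode_byte(0xC4, "macRoman"))   # U+0192, the florin sign
```

The byte itself carries no hint about which of the two readings is the right one. That information has to come from somewhere else.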


artweb design
Sven Fuchs
Grünberger Str. 65
10245 Berlin, Germany


http://www.artweb-design.de

Fon +49 (30) 47 98 69 96
Fax +49 (30) 47 98 69 97