So much has been written about Unicode and Python. But Unipain is the best. Although its ugly head surfaces at times, I’ve somehow got around unicode and python 2.7 problems and never given it the due respect it deserves. But a few months ago on a Sunday morning, I found myself in deep Unipain. This is an attempt of recalling how I got out of that mess.
My exploration in program source code analysis generally involves munging text files all day. Up until now, most projects, it has been text files with ASCII strings. Most of them came from open source projects with the code being written by developers who speak English. However, while working on Entrancer – we found the dataset that comes with TraceLab (a platform for software traceability research) contained source code from Italian developers. All my Python 2.7 scripts exception-ed miserably when they tried to chew on those files.
An example of one of the input files is here. What confused me more was that all of the *nix tools (sort, uniq etc) I had to access through Cygwin were happily operating on these input files, but the file utility appeared confused about the encoding.
After a bit of random googling, I found Chardet by David Cramer which guesses the encoding of text files.
So no help there. Why would Italian text be encoded in the Central European character set? RTFM-ing the codecs docs didn’t lead anywhere. Soon I had drifted to reading hackernews.
Ok. Luckily the Internet has made this – Character Encoding Recommendation for Languages. I tried all variants like 8859-1, 8859-3, 8859-9, and 8859-15 and all have similar reactions. Thankfully, Jeff Hinrichs on the Chicago Python Mailing list pointed out “If it is in fact looking like 8559-1 then you should be using cp-1252, that is what HTML5 does”. According to Wikipedia“This encoding is a superset of ISO 8859-1, but differs from the IANA’s ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.”
While this worked for the file of a particular dataset, soon enough another file started biting at the scripts with its encoding fangs. And at this point you find yourself asking Why does Python print unicode characters when the default encoding is ASCII? and your REPL hurls a joke at you!
You are heartbroken at your failure to appreciate the arcane reasons for choosing the file system encoding as UTF-8 and leaving the default string encoding as ascii. You try coaxing Python by telling your .dot profiles to use UTF-8,
export PYTHONIOENCODING=UTF-8 … but Python doesn’t care!
Almost by noon, you realize its time to let the inner purist go and whatever-works-ugly-hacker take over.
I vim /path/to/site.py +491-ed and changed the goddamned “ascii” to “utf-8” in the site.py file. In your heart you know this is the least elegant way of solving the problem, as it would break dictionary hashes, and this code should never be allowed to talk to other python systems expecting a default ascii encoding. But its too easy to revert back. If you are interested, this is the /path/to/Python27/Lib/site.py file in your system. Read more on why is this a bad idea.
And lo and behold! All problems solved. But a safe way to do this might be to beg Python directly at the bangline of your script as described here.
With Python encoding out of the way, it was MySql’s turn to come biting. We needed Wordnet for the Italian Language for our project and it uses MySql for storing the data. Though you have to get approval before using it, its free and the guys maintaining it are super helpful at helping you out.
While importing the data the first ouch was the following:
Well clearly it doesn’t understand the difference between acute and grave accent marks. Luckily MySql Workbench is verbose enough to tell you where it is getting things wrong:
This stackoverflow post says that you have do an
ALTER shcema – in MySQL workbench you can right click on the schema and find it on the menu. It drops you infront of a drop down to change the default encoding while importing.
But it was back to square one again: how do I know the encoding of these strings embedded in the sql statements? May be Chardet knows?
Nice. Next all you need is to find out what you should select to enable this charset – and luckily its in the official docs. Turns out, I needed
But unfortunately this did not change the auto-generated import sql statement that the MySql Workbench was using. It was still using –default-character-set=utf8.
Forget GUI! Back to command line. Under Plugins in MySql Workbench you’ll find “Start shell for MySQL Utilities” and you’ll be dropped to a shell where you can issue the above command with the password flag like this:
Note the error message saying it could not open the default file due to lack of file permissions, but that did not stop it from importing the data properly. Ok! works for me 😉
Finally everything tertiary was working. That means it was time to go back to writing the actual code!