I used to dread the days when I ran into a question mark, in a black diamond shape or not, in the middle of some text on a website or elsewhere, but now I find it fun. I feel like I’m finally getting a grip on the character encodings question, and I also like the idea of looking behind the scenes into what is really going on inside a file, or a program, or a process, or a database, a website, or anything else. Well, one tiny little baby step at a time, of course.
Today I found a question mark in a PDF, and wanted to know why. I went into the database to look at the text that was being printed in the PDF, and just intuitively copied the text to then paste it into the shell and do a hexdump on it. I didn’t know if the copy-paste would modify it in some way, but I thought I’d give it a shot and see what happens.
I pretty much did this:
$ echo "Example of a possible phrase: single right quote’s problem" | hexdump -C
00000000  45 78 61 6d 70 6c 65 20  6f 66 20 61 20 70 6f 73  |Example of a pos|
00000010  73 69 62 6c 65 20 70 68  72 61 73 65 3a 20 73 69  |sible phrase: si|
00000020  6e 67 6c 65 20 72 69 67  68 74 20 71 75 6f 74 65  |ngle right quote|
00000030  e2 80 99 73 20 70 72 6f  62 6c 65 6d 0a           |...s problem.|
That is UTF-8 Unicode for you. It is backwards compatible with ASCII, so where it can it uses just one byte per character. And this one strange character took 3 bytes. To make a long story short, we were passing this nice UTF-8 to the PHP function utf8_decode() (the choice of coders other than us), which was turning it into single-byte ISO-8859-1. ISO-8859-1 has no right single quote, so utf8_decode() put a question mark in its place. Because, as I was told today, if you translate something complex into something simpler, a rich encoding into a poor one, you lose something. And I was losing the right single quote.
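Here is a small Python sketch of the same loss, not the PHP code we actually had. It encodes a phrase as UTF-8, then squeezes it into ISO-8859-1 with errors="replace", which I am assuming is close enough to what utf8_decode() does with a character it cannot represent. The variable names are mine.

```python
# Sketch only: a Python stand-in for the PHP utf8_decode() behaviour, not the real code.
phrase = "single right quote\u2019s problem"  # \u2019 is the right single quote

utf8_bytes = phrase.encode("utf-8")
print(utf8_bytes.hex(" "))  # the quote shows up as the three bytes e2 80 99

# ISO-8859-1 has no slot for U+2019, so ask Python to substitute "?" instead of failing.
latin1_bytes = phrase.encode("iso-8859-1", errors="replace")
print(latin1_bytes.decode("iso-8859-1"))  # single right quote?s problem
```

The ISO-8859-1 version is two bytes shorter: three bytes of quote went in, one byte of question mark came out.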
Why is this so entertaining?
Because it’s everywhere, and because often what we think is going on is not what really is going on, and sometimes this can create problems. Haven’t you noticed how tricky it is to just copy and paste simple ASCII text these days without carrying markup and whatnot along with it? That alone requires a blog post. And if you paste a non-printing character into a website database or a website file, that can create strange problems. The non-printing character is not seen and yet it breaks the layout, since the browser tries to work around it, closes the tag, and makes what it can of the mess.
Character encodings are everywhere. Web sites have them (check out View->character encoding in Firefox or Wrench menu->Tools->Encoding in Chrome). Databases have character sets and collations. Web applications too. And all these have to be aligned and also made to work well for different users. We have to keep track of encodings when moving between programs and operating systems. We need to know how much we can trust our text editors to not create problems. Etc.
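If you suspect an invisible character has sneaked into some text, something like this Python sketch can point at it. The function name find_invisible and the sample string are made up for illustration, and treating every Unicode "C" category character except newline and tab as suspicious is my own rule of thumb, not a standard.

```python
import unicodedata

def find_invisible(text):
    """Return (index, code point, name) for control/format characters in text."""
    hits = []
    for i, ch in enumerate(text):
        # Categories starting with "C" include control (Cc) and format (Cf) characters.
        if unicodedata.category(ch).startswith("C") and ch not in "\n\t":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

# Hypothetical snippet with a zero-width space pasted in between two words.
sample = "<b>hello\u200bworld</b>"
print(find_invisible(sample))  # [(8, 'U+200B', 'ZERO WIDTH SPACE')]
```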
A phrase is made of bytes, which are numbers. We see letters, digits, symbols, ideograms, and whatnot, but underneath it’s numbers, and encodings give them meaning. When we need to print or read letters and numbers and symbols, computer programs translate them into numbers and back again.
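A quick way to see that the numbers stay the same and only the meaning changes is to decode the same three bytes twice. This is a sketch in Python, using the bytes from my hexdump. Windows-1252 (cp1252) is just one example of a "wrong" encoding I picked.

```python
raw = bytes([0xE2, 0x80, 0x99])
print(list(raw))  # the same bytes as plain numbers: [226, 128, 153]

as_utf8 = raw.decode("utf-8")     # read as UTF-8: one character
as_cp1252 = raw.decode("cp1252")  # read as Windows-1252: three characters

print(repr(as_utf8))    # '’'
print(repr(as_cp1252))  # 'â€™'
```

That second result is the familiar garbage you see on web pages where a quote should be.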
If it was simple ASCII, every character would just be represented by one 7-bit byte. This means that there are 128 (0-127) ASCII characters, printable and not, and the 128th character, decimal 127, has a binary value of 1111111. You are probably thinking that I miscalculated the bits, since they are 7, not 8 like in the byte we are used to. No, seven they are. Someone told me that in the old teleprinter days the 8th bit was used for parity, as a sort of a checksum, so you could check if the rest of it came through plausibly right. The idea was kept through the modem days, apparently (see XMODEM and ZMODEM file transfer protocols). But don’t worry, there is extended ASCII (or high ASCII), a loose name for the 8-bit or larger character encodings that keep the ASCII characters and add more in the higher numbers. Letters with accents are in the extended part, and some old school computer fanatics still refuse to have anything to do with them in e-mail and such and stick to the basics :) Oh yeah, there’s also the ASCII Ribbon Campaign against HTML in e-mails which makes more sense every day in this world of malicious JavaScript code passed in img tags and such. Oh well.
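To make the 7 bits plus parity idea concrete, here is a Python sketch. I am assuming even parity with the parity bit placed in the top (8th) position. Real equipment could be set up for odd parity too, and with_even_parity is a name I invented.

```python
def with_even_parity(ch):
    """Return the 8-bit value of a 7-bit ASCII character plus an even-parity top bit."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")
    parity_bit = ones % 2  # 1 only when the count of 1 bits is odd
    return (parity_bit << 7) | code

print(format(127, "b"))  # 1111111, the highest 7-bit value

for ch in "AC":
    print(ch, format(ord(ch), "07b"), format(with_even_parity(ch), "08b"))
# A 1000001 01000001
# C 1000011 11000011
```

"A" already has an even number of 1 bits, so its parity bit stays 0. "C" has three, so the parity bit is set to make the total even.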
And now go see some ASCII art to relax. A bit. Ha ha. Not funny 🙂
So, I found the weird 3 bytes (“e2 80 99”) in my string and Googled them. It turned out to be the right single quotation mark, U+2019. It is a standard Unicode character, not a Microsoft-specific one, though it often comes from the “smart quotes” in Microsoft Word, and it takes a full 3 bytes in UTF-8. UTF-8 uses different numbers of bytes to represent characters: 1 for ASCII, and 2, 3, or 4 for everything else. UCS-2 is fixed length at 2 bytes per character, UTF-16 uses 2 or 4 bytes, and so on. ISO and IEC have their series of standards for 8-bit character encodings. Microsoft has their own character sets.
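Instead of Googling, you can also ask Python what a character is and how many bytes it takes in a few encodings. This is only a sketch: byte_lengths is a helper name I made up, the four sample characters are arbitrary, and I use "utf-16-le" so that Python does not add a byte order mark to the count.

```python
import unicodedata

def byte_lengths(ch):
    """Return how many bytes one character takes in UTF-8 and in UTF-16."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le")}

quote = "\u2019"
print(unicodedata.name(quote))  # RIGHT SINGLE QUOTATION MARK

# A plain letter, an accented letter, the quote, and an emoji.
for ch in ["A", "\u00e9", "\u2019", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2}
# U+2019 {'utf-8': 3, 'utf-16-le': 2}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4}

# In Microsoft's Windows-1252 the same quote is the single byte 0x92.
print(quote.encode("cp1252"))  # b'\x92'
```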
The character encoding page on Wikipedia, and the web resources it links to, are a good place to read more.
I could go on, but I’ll just admit I still have a lot more to study, and some day maybe I’ll put together a tutorial on the subject.
In the meantime I’d like to remind you to learn something new every day, big or small makes no difference, and keep a diary of what you learn. It will help you see the day in a new light. Oh, BTW don’t compare yourself to others. Mostly they have a lot to learn, but hide it well.