Character Encoding — A Simple Explanation
In a nutshell, character encoding is how you make your computer understand what stuff like
\n is. In other words, it’s the conversion/translation of a character into a binary representation using an encoding system.
In this Geekswipe edition, you will learn the basics of encoding, the shortcomings of ASCII, and how we solved it and ended up with a standard like Unicode.
One of the earliest encodings we used is Morse code, where the characters are represented as dots and dashes. For example, the word SMS in Morse code is
... -- ..., which is three dots for the letter
S, two dashes for the letter
M, and then three dots again for S.
This way you can encode the English alphabet, some punctuation marks, and numbers—all into dots and dashes for easy communication and electric transmission.
Here is what a portion of the Morse code table looks like.

|Symbol|Morse Code|Symbol|Morse Code|
|---|---|---|---|
|A|.-|N|-.|
|E|.|O|---|
|I|..|S|...|
|M|--|T|-|
In a similar way, computers don’t read anything other than a series of bits. So to encode characters like letters, numbers, punctuation, and control characters (like ESC, NUL, DEL, tabs, and other non-printable control stuff) into a machine-readable format, which is binary, humans came up with another neat table like the Morse code one, called ASCII.
The ASCII table is made up of 128 rows, with the index starting from 0, listing all the control characters, special characters, numbers, and letters, each 7 bits in size.
For example, for the letter
A, the corresponding index is the integer 65. This is also called the code point. And the binary form of 65 is
100 0001, which the computer can read. The process of encoding A as 1000001 is known as character encoding. In this example, it is an ASCII encoding of size 7 bits.
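You can see this mapping for yourself. As a quick sketch (using Python’s built-in `ord` and `format` functions purely for illustration):

```python
# ord() returns the code point of a character.
code_point = ord("A")
print(code_point)  # 65

# format(..., "07b") shows the same integer as a 7-bit binary string,
# which is exactly the 7-bit ASCII encoding of 'A'.
print(format(code_point, "07b"))  # 1000001
```

The same works in reverse: `chr(65)` gives you back the character `A`.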
But there was a problem with ASCII. The world has so many characters, expressions, and glyphs that the 7-bit ASCII or the 8-bit extended ASCII encodings couldn’t hold them all. So we needed a bigger, better encoding scheme that is also backward compatible, so that older systems and software don’t break. And thus came the Unicode standard, with its family of encodings known as Unicode Transformation Formats (UTF), and a whopping 1,114,112 code points. That’s … pretty huge, isn’t it?
UTF-8, one of Unicode’s three encoding schemes, has a variable byte size for different characters. For example, the first 128 characters of UTF-8 are identical to ASCII and are 1 byte in size (pretty cool, eh?). But the subsequent characters are from two to four bytes in size, allowing for a very wide range of characters in all languages around the world.
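You can watch the byte size grow as the code points get bigger. A small sketch in Python (the four sample characters here are just illustrative picks):

```python
# UTF-8 uses 1 to 4 bytes per character, depending on the code point.
for ch in ["A", "é", "€", "🙂"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)
# A is 1 byte (same as ASCII), é is 2 bytes,
# € is 3 bytes, and the emoji is 4 bytes.
```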
Difference between Unicode, UTF-8, UTF-16, and UTF-32
It could be a bit confusing reading about Unicode and the encoding schemes like UTF-8 and the lesser-used UTF-16 and UTF-32. A simplified way to understand this is to think of Unicode as a standard table that just assigns an integer (a code point, usually written in hexadecimal, like U+0041 for A) to different characters. And UTF-8 is how you encode those characters by converting the integer to bytes.
And the difference between the three UTF schemes is the byte sizes they use. UTF-8, despite the ‘8’ in the name, uses a variable byte size (8-bit, 16-bit, 24-bit, or 32-bit units) as mentioned above. And UTF-16 is also variable-width, using up to 32 bits per character (16-bit or 32-bit). UTF-32, though, uses a fixed 32 bits (4 bytes) for everything.
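The size difference is easy to verify. A small sketch in Python (note: the “-le” codec variants are used here because Python’s plain “utf-16”/“utf-32” codecs prepend a byte-order mark, which would inflate the counts):

```python
# Compare the encoded size of a few characters across the three schemes.
for ch in ["A", "€", "🙂"]:
    sizes = {enc: len(ch.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(ch, sizes)
# 'A' is 1 byte in UTF-8, 2 in UTF-16, and 4 in UTF-32;
# '🙂' is 4 bytes in every scheme.
```

This is also why UTF-8 dominates on the web: for mostly-ASCII text it is the most compact of the three.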
This post was first published on October 11, 2012.