UTF-8: Bits, Bytes, and Benefits
Posted on Friday, March 5, 2010.
UTF-8 is a way to encode Unicode code points—integer values from 0 through 10FFFF—into a byte stream, and it is far simpler than many people realize. The easiest way to make it confusing or complicated is to treat it as a black box, never looking inside. So let's start by looking inside. Here it is:
| Unicode code points | | UTF-8 encoding (binary) |
|---|---|---|
| 00-7F | (7 bits) | 0tuvwxyz |
| 0080-07FF | (11 bits) | 110pqrst 10uvwxyz |
| 0800-FFFF | (16 bits) | 1110jklm 10npqrst 10uvwxyz |
| 010000-10FFFF | (21 bits) | 11110efg 10hijklm 10npqrst 10uvwxyz |
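To make the table concrete, here is a minimal encoder sketch in C that follows the bit patterns above. The function name utf8_encode and its return convention are my own, not a standard API, and a production encoder would also reject the surrogate range D800-DFFF.

```c
#include <stdio.h>

/* Encode the code point cp as UTF-8 into buf (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp is out of range.
 * A minimal sketch of the table above; real code would also
 * reject the surrogate range D800-DFFF. */
int utf8_encode(unsigned int cp, unsigned char *buf) {
    if (cp <= 0x7F) {                  /* 0tuvwxyz */
        buf[0] = cp;
        return 1;
    }
    if (cp <= 0x7FF) {                 /* 110pqrst 10uvwxyz */
        buf[0] = 0xC0 | (cp >> 6);
        buf[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp <= 0xFFFF) {                /* 1110jklm 10npqrst 10uvwxyz */
        buf[0] = 0xE0 | (cp >> 12);
        buf[1] = 0x80 | ((cp >> 6) & 0x3F);
        buf[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    if (cp <= 0x10FFFF) {              /* 11110efg 10hijklm 10npqrst 10uvwxyz */
        buf[0] = 0xF0 | (cp >> 18);
        buf[1] = 0x80 | ((cp >> 12) & 0x3F);
        buf[2] = 0x80 | ((cp >> 6) & 0x3F);
        buf[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;
}

int main(void) {
    unsigned char buf[4];
    int n = utf8_encode(0x4E16, buf);  /* 世, code point 4E16 */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);       /* prints: E4 B8 96 */
    printf("\n");
    return 0;
}
```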
The convenient properties of UTF-8 are all consequences of the choice of encoding.
1. All ASCII files are already UTF-8 files.

   The first 128 Unicode code points are the 7-bit ASCII character set, and UTF-8 preserves their one-byte encoding.

2. ASCII bytes always represent themselves in UTF-8 files. They never appear as part of other UTF-8 sequences.

   All the non-ASCII UTF-8 sequences consist of bytes with the high bit set, so if you see the byte 0x7A in a UTF-8 file, you can be sure it represents the character z.

3. ASCII bytes are always represented as themselves in UTF-8 files. They cannot be hidden inside multibyte UTF-8 sequences.

   The ASCII z, 01111010, cannot be encoded as the two-byte UTF-8 sequence 11000001 10111010. Code points must be encoded using the shortest possible sequence; a corollary is that decoders must treat long-winded sequences as invalid. In practice, it is useful for a decoder to use the Unicode replacement character, code point FFFD, as the decoding of an invalid UTF-8 sequence rather than stop processing the text.

4. UTF-8 is self-synchronizing.

   Let's call a byte of the form 10xxxxxx a continuation byte. Every UTF-8 sequence is a byte that is not a continuation byte followed by zero or more continuation bytes. If you start processing a UTF-8 file at an arbitrary point, you might not be at the beginning of a UTF-8 encoding, but you can easily find one: skip over continuation bytes until you find a non-continuation byte. (The same applies to scanning backward.) A decoding sketch illustrating properties 3 and 4 appears after this list.

5. Substring search is just byte string search.

   Properties 2, 3, and 4 imply that given a string of correctly encoded UTF-8, the only way those bytes can appear in a larger UTF-8 text is when they represent the same code points. So you can use any 8-bit-safe byte-at-a-time search function, like strchr or strstr, to run the search (see the second example after this list).

6. Most programs that handle 8-bit files safely can handle UTF-8 safely.

   This also follows from Properties 2, 3, and 4. I say “most” programs, because programs that take apart a byte sequence expecting one character per byte will not behave correctly, but very few programs do that. It is far more common to split input at newline characters, or split whitespace-separated fields, or do other similar parsing around specific ASCII characters. For example, Unix tools like cat, cmp, cp, diff, echo, head, tail, and tee can process UTF-8 files as if they were plain ASCII files. Most operating system kernels should also be able to handle UTF-8 file names without any special arrangement, since the only operations done on file names are comparisons and splitting at /. In contrast, tools like grep, sed, and wc, which inspect arbitrary individual characters, do need modification.

7. UTF-8 sequences sort in code point order.

   You can verify this by inspecting the encodings in the table above. This means that Unix tools like join, ls, and sort (without options) don't need to handle UTF-8 specially.

8. UTF-8 has no “byte order.”

   UTF-8 is a byte encoding. It is not little endian or big endian. Unicode defines a byte order mark (BOM) code point, FEFF, which is used to determine the byte order of a stream of raw 16-bit values, like UCS-2 or UTF-16. It has no place in a UTF-8 file. Some programs like to write a UTF-8-encoded BOM at the beginning of UTF-8 files, but this is unnecessary (and annoying to programs that don't expect it).
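As mentioned in properties 3 and 4, here is a minimal decoding sketch in C. It is only an illustration: the helper names utf8_decode and utf8_sync are mine, invalid or overlong input decodes to the replacement character FFFD, and a production decoder would also reject surrogate code points.

```c
#include <stddef.h>

#define REPLACEMENT 0xFFFD  /* Unicode replacement character */

/* Decode one UTF-8 sequence starting at s[0], writing the code point to *cp
 * and returning the number of bytes consumed. Invalid input, including
 * overlong (long-winded) encodings, decodes to U+FFFD and consumes one byte
 * so the caller can keep going. A sketch only; it does not reject surrogates. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned int *cp) {
    if (n == 0) { *cp = REPLACEMENT; return 0; }
    if (s[0] < 0x80) { *cp = s[0]; return 1; }                 /* ASCII */

    size_t len;
    unsigned int c, min;
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else { *cp = REPLACEMENT; return 1; }   /* stray continuation or bad byte */

    if (len > n) { *cp = REPLACEMENT; return 1; }
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = REPLACEMENT; return 1; }
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF) { *cp = REPLACEMENT; return 1; }  /* overlong or out of range */
    *cp = c;
    return len;
}

/* Property 4: starting at an arbitrary offset i, find the beginning of the
 * next UTF-8 sequence by skipping continuation bytes (10xxxxxx). */
size_t utf8_sync(const unsigned char *s, size_t i, size_t n) {
    while (i < n && (s[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

Fed the overlong pair 11000001 10111010 from property 3, utf8_decode returns FFFD rather than z.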
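And property 5 in action: since substring search in UTF-8 is plain byte string search, the standard strstr works unchanged. This tiny program is only an illustration and assumes the source file and both literals use the same UTF-8 code point sequence.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Both strings are UTF-8; strstr compares raw bytes and finds the match. */
    const char *text = "Καλημέρα κόσμε";
    const char *needle = "κόσμε";
    const char *p = strstr(text, needle);
    printf("found at byte offset %ld\n", p ? (long)(p - text) : -1L);
    return 0;
}
```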
UTF-8 does give up the ability to do random access using code point indices. Programs that need to jump to the nth Unicode code point in a file or on a line—text editors are the canonical example—will typically convert incoming UTF-8 to an internal representation like an array of code points and then convert back to UTF-8 for output, but most programs are simpler when written to manipulate UTF-8 directly.
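Manipulating UTF-8 directly is often simpler than it sounds. For example, counting the code points on a line only requires counting the bytes that are not continuation bytes; a small sketch, with a function name of my own choosing:

```c
#include <stddef.h>

/* Count the Unicode code points in a UTF-8 byte string of length n
 * by counting bytes that are not continuation bytes (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_count(const unsigned char *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```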
Programs that make UTF-8 more complicated than it needs to be are typically trying to be too general, not wanting to make assumptions that might not be true of other encodings. But there are good tools to convert other encodings to UTF-8, and it is slowly becoming the standard encoding: even the fraction of web pages written in UTF-8 is nearing 50%. UTF-8 was explicitly designed to have these nice properties. Take advantage of them.
For more on UTF-8, see “Hello World or Καλημέρα κόσμε or こんにちは 世界,” by Rob Pike and Ken Thompson, and also this history.
Notes: Property 6 assumes the tools do not strip the high bit from each byte.
Such mangling was common years ago but is very uncommon now.
Property 7 assumes the comparison is done treating the bytes as unsigned, but such behavior is mandated by the ANSI C standard for memcmp, strcmp, and strncmp.
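A small check of property 7 together with this note: comparing the raw UTF-8 bytes with strcmp, which the C standard requires to compare as unsigned char, gives the same order as comparing code points. The example strings are arbitrary.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Code point order: e (0065) < é (00E9) < € (20AC).
     * strcmp on the UTF-8 bytes gives the same order. */
    const char *a = "e";             /* 65 */
    const char *b = "\xC3\xA9";      /* é: C3 A9 */
    const char *c = "\xE2\x82\xAC";  /* €: E2 82 AC */
    printf("%d %d\n", strcmp(a, b) < 0, strcmp(b, c) < 0);  /* prints: 1 1 */
    return 0;
}
```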
(Comments originally posted via Blogger.)
Andrew Wooster (March 5, 2010 4:47 PM) One nit on substring search: searching for a series of bytes is almost never what you want when doing a substring search in Unicode. For most substring search needs, you'll need to take into account Unicode equivalence: http://en.wikipedia.org/wiki/Unicode_equivalence
Additionally, there are more complications when doing case/diacritic-insensitive substring searches and the like.
This is one of the reasons many languages/frameworks use normalized UTF-16 strings internally, as using UTF-8 as the internal representation can lead to subtle bugs when developers don't understand the same string can be represented in multiple different ways.
Russ Cox (March 5, 2010 5:52 PM) That's not really true: many substring searches are for fixed ASCII strings, and those work just fine. And a lot of the time you do care about finding a specific sequence of code points, not some equivalent ones. So much of the time, it's fine.
Even if you do want to do Unicode equivalence in the string search, that's not incompatible with UTF-8. The output of a normalization process can be UTF-8 as easily as any other encoding.
Also, one nit: I think (hope) you mean UCS-2 not UTF-16. UTF-16 has all of the drawbacks of UTF-8 (no random indexing based on code points) without any of the benefits.
A modern implementation would need UCS-4 instead to allow code points outside the BMP, so normalizing to UCS-4 would take 4x as much memory for mostly ASCII text as UTF-8 would. Unless you need the random access, it's probably all pain and no gain.
Andrew Wooster (March 5, 2010 6:57 PM) I actually meant UTF-16. If a developer knows enough to access the raw storage and chop off a byte, they probably know they shouldn't be doing it. (Yeah, lame, I know.) It also encodes the BMP directly. You still don't get random access, but it's a tradeoff. AFAIK (and I'm not an expert), Java, .Net, and Cocoa/CoreFoundation all use UTF-16 internally.
bdauvergne (March 7, 2010 1:51 AM) Java, like NTFS and a lot of 'let's make a half-bad choice' systems, uses UCS-2 internally, to keep pseudo performance/compatibility with C-string indexing logic; that's why its String does not really support Unicode (no support outside the BMP).
Russ Cox (March 7, 2010 12:14 PM) Well, the Unicode consortium kind of stabbed everyone in the back on UCS-2. In the early days people believed that Unicode would stop at 64k code points, so UCS-2 (== UTF-16 at the time) was a completely reasonable internal representation.
It was a big mistake, even then, to use it as an external representation, because then you have byte order-dependent ASCII-incompatible files. I think Windows support for Unicode was set back at least five years if not more by the insistence on using UCS-2 in files and file names.
And then when Unicode bloated beyond the BMP, that hurt people using UCS-2 both as internal and external representations, essentially dooming legacy programs to mishandling non-BMP code points. It hurt the "UCS-2 as external representation" code more, but even changing an internal-only representation requires recompiling, discovering places where assumptions of internal==16-bits lie hidden, and so on.
We converted the Plan 9 tools (http://swtch.com/plan9port) from UCS-2 to UCS-4 for internal representation recently, but it required recompiling the tools that were dependent on that (text editors, window systems, regular expressions, etc). And we've probably still missed a few places. But it was far easier than if we had to convert all our text files and file names and system calls too.
DAGwyn (May 19, 2010 4:26 AM) ISO 10646, predating the name "Unicode", started out with a 31-bit universal code space and treated potential warts such as "compositing" characters as distinct points in the space. I don't know why, at the time that Windows NT first adopted 16-bit characters, "people believed that Unicode would stop at 64k code points"; it was obviously not the intent of 10646, nor was it even plausible after one tallied up all the known glyphs in use.
Anyway, the original definition of UTF (aka FSS/UTF) included encodings up to the full 31 bits needed for 10646 (exercise: is that the maximum encodable using the obvious extension of the 21-bit UTF-8 scheme?), although I don't think the original implementation supported more than 16 bits.
I pushed FSS/UTF in an article long ago in JCLT, which may have helped bring it to the attention of people who had been working on less desirable solutions.
kkll2 (January 5, 2011 6:33 AM) @rsc: that's a red herring. The Unicode equivalence problem is at the codepoint level, not at the byte level.
No matter what encoding you use to represent codepoints, you still have to deal with alternative forms (NFC/NFD) and multi-*codepoint* (i.e. 8 bytes or more in UCS-4) characters.
So using UCS-4 gives you only false sense of random access, but you won't be able to correctly index, say “Z̎ͪ̒͑͏̜͓̲̥̹͙͖̝̟̘̦͇̺̮́A̷̡̜̗̰͙̲̱͖̳̰̰̖͍̙̻̞̍̂̐͂͐̅͡L̛͔̳̦̫̬̩̣̭̝ͫ͆̈́̿̌ͥ̉͂̑̍ͧ̽͊͟G̠̩̤̞̭͎͖̯̘̳͉̪̫̰͙̰͖̼ͥ̊͐̒͐͗̏̉̀Ó”.
So if you have to deal with non-indexable strings whose characters can be represented in multiple ways, you might as well use UTF-8, because UCS-4 doesn't solve this at all!