Characters, Symbols and the Unicode Miracle – Computerphile


Representing symbols, characters and letters that are used worldwide is no mean feat, but Unicode managed it – how? Tom Scott explains how the web has settled on a standard.

This video was filmed and edited by Sean Riley.

Computerphile is a sister project to Brady Haran’s Numberphile. See the full list of Brady’s video projects at:



  1. Another advantage of UTF-8 that wasn't mentioned is that if you want to sort strings by Unicode value, you can just treat it as though each byte were a separate character, and it'll just work.

    The only real downside to UTF-8 is that you can't seek to the character at a specific index without walking the string character by character up to that point.
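Both points in this comment can be checked with a short Python sketch. The helper names `seq_len` and `char_at` are made up for illustration, and the code assumes well-formed UTF-8 input; it is not a production decoder.

```python
def seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1            # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2            # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3            # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4            # 11110xxx
    raise ValueError("not a lead byte")


def char_at(data: bytes, index: int) -> str:
    """Return character number `index` by walking the bytes from the start."""
    pos = 0
    for _ in range(index):
        pos += seq_len(data[pos])   # skip one whole character per step
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")


words = ["zebra", "éclair", "apple", "日本", "😀", "Zulu"]

# Python compares str values code point by code point.
by_code_point = sorted(words)
# Comparing the raw UTF-8 bytes gives the same order.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

data = "aé日😀z".encode("utf-8")
print(by_code_point == by_utf8_bytes)
print(char_at(data, 3))
```

Note that `char_at` takes time proportional to the index, which is the seeking cost the comment describes.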

  2. You forgot to mention that UTF-8 gives up fixed-size characters, which makes every operation on text, search for example, a lot more expensive. Indexing is needed on large text. In C++ we still don't have a standard UTF string. Some security issues have been found. Windows is still using UTF-16. It is not all good and well as you described.
    But yes it is better than having an infinite number of encodings.

  3. So does that mean non-English text takes more space to store? If I translate a document from English to, say, Arabic, wouldn't that double or triple its file size? That sounds like a pretty big problem to me.
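The per-character cost behind this question can be measured directly. The sketch below uses three sample greetings chosen only for illustration; a real translation also changes the number of characters, so it says nothing certain about whole documents.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes without a byte order mark)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "English": "hello",
    # The Arabic greeting "marhaba", written with escapes to avoid
    # right-to-left rendering problems in editors.
    "Arabic": "\u0645\u0631\u062d\u0628\u0627",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    utf8_bytes, utf16_bytes = encoded_sizes(text)
    print(f"{language}: {len(text)} characters, "
          f"{utf8_bytes} bytes in UTF-8, {utf16_bytes} bytes in UTF-16")
```

In these samples each Arabic letter takes two bytes in UTF-8 and each Japanese character takes three, while ASCII letters take one.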

  4. For the people wanting to know where this vid was taken: it's in a cafe called the Booking Office in St Pancras station. I know because I have been there once. It's pretty popular.

  5. Very interesting as always Tom, but I couldn't watch this video – I listened to it, but the camera work made it unwatchable. The constant side-to-side swaying made me seriously nauseous and the random rapid zooms just jarred. Please, tell your cameraman to get a tripod and to use it and to stop playing with the zoom lever.

  6. I am confused by a phrase at 2:13, "languages that don't use alphabets at all." If there is no written system, what are they typing?

  7. Additionally, UTF-8 does not have a "byte order". The "always store 32 bits for each character" encoding (a.k.a. UTF-32) has the problem that when a little-endian computer and a big-endian computer exchange data in this format, they have to add a prefix which tells the other computer "I'm sending the bytes of each character in ascending order" or "… in descending order". Then software needs logic to understand this prefix, to eliminate this prefix, to guess what to do when this prefix is missing, and so on. The UTF-16 encoding, which is used by Microsoft Windows internally, has the same problem. Whereas UTF-8 just gets away without it. Simple and beautiful!
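The byte order issue described in this comment is easy to see with Python's built-in codecs. This is a small sketch using one sample character; the variable names are illustrative.

```python
import codecs

text = "é"  # U+00E9

# UTF-16 and UTF-32 come in two byte orders for the same character.
utf16_le = text.encode("utf-16-le")   # low byte first
utf16_be = text.encode("utf-16-be")   # high byte first
utf32_le = text.encode("utf-32-le")

# UTF-8 is defined as a sequence of bytes, so there is only one form.
utf8 = text.encode("utf-8")

# Reading big-endian data as little-endian silently gives another character.
misread = utf16_be.decode("utf-16-le")

# The prefix the comment mentions is the byte order mark (BOM). Python's
# generic "utf-16" decoder reads it, picks the byte order, and drops it.
with_bom = codecs.BOM_UTF16_BE + utf16_be
recovered = with_bom.decode("utf-16")

print(utf16_le, utf16_be, utf32_le, utf8)
print(hex(ord(misread)), recovered)
```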

  8. A miracle would have given us one unified encoding, not the mess we have now! The video is misleading: UTF-8 is NOT the de facto standard, not even for the Internet.

  9. So why isn't the header for everything that doesn't fit into two bytes (e.g. 110xxxxx 10xxxxxx) just 110 as well? Or why does the header need to specify how many more bytes there are? It could also just say: "there are more bytes to come" and the program reading it would just look for the next header (or the end of the data) and "use" all the bytes in between… Or am I missing something?
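One way to explore this question is to look at what the length-carrying header lets a reader do. The sketch below is not a statement of the designers' reasoning; `classify` and `char_start` are hypothetical helpers written for this illustration.

```python
def classify(byte: int) -> str:
    """Name the role of a single byte from its leading bits."""
    if byte >> 7 == 0b0:
        return "single"          # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"    # 10xxxxxx
    if byte >> 5 == 0b110:
        return "lead-2"          # 110xxxxx
    if byte >> 4 == 0b1110:
        return "lead-3"          # 1110xxxx
    if byte >> 3 == 0b11110:
        return "lead-4"          # 11110xxx
    return "invalid"


def char_start(data: bytes, pos: int) -> int:
    """Step back from `pos` to the first byte of the character containing it."""
    while pos > 0 and data[pos] >> 6 == 0b10:
        pos -= 1
    return pos


data = "a€b".encode("utf-8")     # b"a\xe2\x82\xacb"
roles = [classify(b) for b in data]
print(roles)

# The lead byte announces three bytes, so a decoder can tell that this
# input was cut short without waiting for another header to arrive.
try:
    b"\xe2\x82".decode("utf-8")
    truncated_detected = False
except UnicodeDecodeError:
    truncated_detected = True
print(truncated_detected)
```

With the count in the lead byte, a reader knows how many bytes to expect before it sees them, and can flag a sequence that ends early.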

  10. I thought there was a Klein bottle on the left side behind Tom. I got excited, and then I got sad when I realized it wasn't…. :'(

  11. I watched this video like 5 times over a long period now. Keep coming back to it, I so love the explanation and the storytelling!

