ASCII and Unicode Character Set - UTF-8, UTF-16 & UTF-32 with example using Python

We all know, or have at least heard, that computers store everything in binary format. Numbers are easy to convert to binary, but how are letters and other characters handled?

ASCII (American Standard Code for Information Interchange) is a character set that represents English characters as numbers. Standard ASCII is a 7-bit set, which means it can represent 128 possible characters. Many computer systems and file formats use ASCII, or an encoding compatible with it, for storing text. There are also variations of ASCII that use all 8 bits of the byte (e.g. Extended ASCII, ISO 8859-1, also known as Latin-1).
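
To make the character-to-number mapping concrete, here is a minimal sketch using Python's built-in ord() and str.encode(). The sample string "Hi!" is only an assumption chosen for the demo.

```python
# Map characters to their ASCII numbers and back to bytes.
text = "Hi!"  # sample text, chosen only for illustration

codes = [ord(ch) for ch in text]
print(codes)  # [72, 105, 33]

# Every standard ASCII code fits in 7 bits (0..127).
print([format(c, "07b") for c in codes])
print(all(c < 128 for c in codes))

# Encoding to ASCII yields exactly one byte per character.
ascii_bytes = text.encode("ascii")
print(len(ascii_bytes))  # 3
```

Encoding a character outside the 128-character range with "ascii" raises UnicodeEncodeError, which is the limitation Unicode addresses.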

Unicode was introduced to support characters that can't be represented in the ASCII standard, such as Greek, Arabic and emoji. The Unicode standard is a character set designed to support the worldwide interchange, processing and display of the written texts of the diverse languages and technical disciplines of the modern world.

In Unicode, every character is identified by its code point. The Unicode code space contains 1,114,112 code points (U+0000 to U+10FFFF), which are divided into 17 planes. Each plane is a contiguous group of 65,536 code points. The planes are numbered 0 to 16, which corresponds to the hex values 00-10 in the leading digits of the representation U+hhhhhh. Plane 0 is the Basic Multilingual Plane (BMP), which contains the most commonly used characters. For details about the planes, see the Wikipedia article on Unicode planes.
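
The plane arithmetic can be sketched in a few lines. The helper name plane_of and the sample characters are assumptions made for this illustration; since each plane holds 0x10000 code points, shifting the code point right by 16 bits gives the plane number.

```python
def plane_of(ch):
    """Return the Unicode plane number (0-16) of a single character."""
    return ord(ch) >> 16  # each plane spans 0x10000 (65,536) code points

# Sample characters written as escapes: A, Greek alpha, euro sign, an emoji.
for ch in ["A", "\u03b1", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")

# 17 planes of 65,536 code points give the full code space.
total_code_points = 17 * 0x10000
print(total_code_points)  # 1114112
```

The first three samples fall in the BMP (plane 0), while the emoji U+1F600 lies in plane 1.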

UTF-8, UTF-16 & UTF-32 are different ways of encoding Unicode code points as bytes. A brief description of each is given below :-

UTF-32 :
UTF-32 is a fixed-width encoding in which 32 bits (4 bytes) are used to represent every code point.

UTF-16 :
UTF-16 is a variable-length encoding in which either 16 bits (one code unit) or 32 bits (two code units, called a surrogate pair) are used, depending on the character.
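
For characters above U+FFFF, UTF-16 uses two 16-bit units known as a surrogate pair. The sketch below shows the standard arithmetic; the function name to_surrogates is hypothetical, and the result is cross-checked against Python's own utf_16_be codec.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against Python's built-in encoder.
print("\U0001F600".encode("utf_16_be").hex())  # d83dde00
```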

UTF-8 :
UTF-8 is a variable-length encoding in which 8, 16, 24 or 32 bits (1 to 4 bytes) are used, depending on the character.
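
A quick way to see the difference between the three encodings is to compare how many bytes each one needs for the same character. This is a small sketch; the sample characters and the dictionary names are assumptions picked for the demo, and the explicit big-endian codecs are used so that no BOM is counted.

```python
# Sample characters (written as escapes) covering 1- to 4-byte UTF-8 cases.
samples = {
    "A": "A",               # U+0041
    "alpha": "\u03b1",      # U+03B1
    "euro": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",  # U+1F600
}

encodings = ("utf_8", "utf_16_be", "utf_32_be")
sizes = {}
for name, ch in samples.items():
    # Byte count of the same character in each encoding.
    sizes[name] = tuple(len(ch.encode(enc)) for enc in encodings)
    print(name, sizes[name])
# A (1, 2, 4)
# alpha (2, 2, 4)
# euro (3, 2, 4)
# emoji (4, 4, 4)
```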

Endianness - Big Endian vs Little Endian :-
Endianness refers to the sequential order in which bytes are arranged. In a big-endian system, the big end (most significant byte) is stored first, at the lowest storage address. In a little-endian system, the little end (least significant byte) is stored first.

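e.g. In a big-endian system, the two bytes required for the hexadecimal value 4F52 would be stored as 4F at storage address 1000 and 52 at address 1001. In a little-endian system the order is reversed: 52 at address 1000 and 4F at address 1001.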
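
The 4F52 example can be reproduced with Python's int.to_bytes(), which takes the byte order as an argument. This is only a sketch of the byte layout, not of real memory addresses.

```python
import sys

value = 0x4F52

big = value.to_bytes(2, byteorder="big")        # most significant byte first
little = value.to_bytes(2, byteorder="little")  # least significant byte first

print(big.hex())     # 4f52
print(little.hex())  # 524f

# Byte order of the machine running this script ("little" or "big").
print(sys.byteorder)
```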

Endianness in UTF-16 and UTF-32 :-
Since UTF-16 and UTF-32 use code units wider than 8 bits, endianness matters when storing values in these encodings. Unicode provides the BOM (byte order mark), which can be used as a signature identifying the byte order and encoding form. The table below shows the BOM signature for each encoding :

00 00 FE FF	UTF-32, big-endian
FF FE 00 00	UTF-32, little-endian
FE FF	UTF-16, big-endian
FF FE	UTF-16, little-endian
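
The signatures in the table are available as constants in Python's codecs module, so a rough BOM check can be sketched as below. The function name sniff_bom and the BOMS list are assumptions for this illustration. Note that the UTF-32 signatures must be tested first, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Longer signatures first: FF FE 00 00 starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]

def sniff_bom(data):
    """Return a label for the BOM at the start of data, or None if absent."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

print(sniff_bom(b"\xff\xfeA\x00"))  # UTF-16, little-endian
print(sniff_bom(b"A"))              # None
```

This is a heuristic only: a file without a BOM returns None, and a UTF-16 little-endian file whose first character is U+0000 would be indistinguishable from UTF-32 little-endian.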

Example using Python-3 :-
In the example code below, we will store the letter "A" in different encodings and view the data in binary form using the command (xxd -b <*.txt>)

# UTF-32 in Big-Endian
with open("test_utf32_be.txt","w",encoding="utf_32_be") as fo:
  fo.write("A")
# UTF-32 in Little-Endian
with open("test_utf32_le.txt","w",encoding="utf_32_le") as fo:
  fo.write("A")
# UTF-32 without any endianness (Python will use the platform's native byte order and write a BOM)
with open("test_utf32.txt","w",encoding="utf_32") as fo:
  fo.write("A")
# UTF-16 in Big-Endian
with open("test_utf16_be.txt","w",encoding="utf_16_be") as fo:
  fo.write("A")
# UTF-16 in Little-Endian
with open("test_utf16_le.txt","w",encoding="utf_16_le") as fo:
  fo.write("A")
# UTF-16 without any endianness
with open("test_utf16.txt","w",encoding="utf_16") as fo:
  fo.write("A")
# UTF-8
with open("test_utf8.txt","w",encoding="utf_8") as fo:
  fo.write("A")

Output for test_utf16.txt

00000000: 11111111 11111110 01000001 00000000                    ..A.

Output for test_utf16_be.txt

00000000: 00000000 01000001                                      .A

Output for test_utf16_le.txt

00000000: 01000001 00000000                                      A.

Output for test_utf32.txt

00000000: 11111111 11111110 00000000 00000000 01000001 00000000  ....A.

00000006: 00000000 00000000                                      ..

Output for test_utf32_be.txt

00000000: 00000000 00000000 00000000 01000001                    ...A

Output for test_utf32_le.txt

00000000: 01000001 00000000 00000000 00000000                    A...

Output for test_utf8.txt

00000000: 01000001                                               A
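
If xxd is not available, roughly the same bit patterns can be printed from Python without writing any files. The helper name bits is an assumption for this sketch. Only the codecs with an explicit byte order are shown, because the plain utf_16 and utf_32 codecs add a BOM and use the machine's native byte order, so their output can differ between platforms.

```python
def bits(data):
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(format(b, "08b") for b in data)

for enc in ("utf_16_be", "utf_16_le", "utf_32_be", "utf_32_le", "utf_8"):
    print(f"{enc:10} {bits('A'.encode(enc))}")
# utf_16_be  00000000 01000001
# utf_16_le  01000001 00000000
# utf_32_be  00000000 00000000 00000000 01000001
# utf_32_le  01000001 00000000 00000000 00000000
# utf_8      01000001
```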

Sarbjit Singh
