Python String and Unicode

Sreedhar Bukya
1 min read · Aug 26, 2016

This week was interesting for me: I was building a feature to send SMS messages in the local language, which meant working out how strings, Unicode, ASCII, UTF-8, UTF-16, and so on fit together.

ASCII: It is a 7-bit character encoding. The ASCII table contains 128 characters. Later, various 8-bit "extended ASCII" encodings added another 128, for 256 characters in total.
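As a quick check of the ASCII range, here is a minimal sketch. It is written for Python 3 (an assumption; there every string is Unicode by default), and the sample strings are only illustrations.

```python
# Every ASCII character has a code point below 128.
english = "SMS"
print([ord(ch) for ch in english])   # [83, 77, 83]
ascii_bytes = english.encode("ascii")

# A Devanagari letter lies far outside that range.
hindi_letter = "\u0938"              # the letter स
print(ord(hindi_letter))             # 2360

# Encoding it as ASCII is therefore impossible.
try:
    hindi_letter.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)                 # False
```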

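UTF-8: ASCII initially supported English characters only. The 8-bit extensions added more symbols, but they could not cover languages such as Russian, Chinese, or Hindi in a single table. In the early 1990s, the Unicode standard was defined to give every character of every human language its own number (code point). Unicode started out with 16 bits per character instead of 8 and was later extended beyond that. UTF-8 is an encoding of Unicode built on 8-bit units: each character takes one to four bytes, and ASCII characters keep their single-byte values. UTF-8 is the most widely used Unicode encoding.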

UTF-16: It is an encoding of Unicode built on 16-bit units: each character takes one or two units (two or four bytes). It is less widely used than UTF-8 for files and the web.
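The difference between the two encodings is easiest to see by counting bytes. The sketch below assumes Python 3; the four sample characters are my own picks (a Latin letter, an accented Latin letter, a Devanagari letter, and an emoji).

```python
# Compare how many bytes each encoding needs for one character.
samples = ["A", "\u00e9", "\u0938", "\U0001F600"]   # A, é, स, 😀

sizes = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    # The explicit big-endian variant adds no byte-order mark.
    utf16_len = len(ch.encode("utf-16-be"))
    sizes[ch] = (utf8_len, utf16_len)
    print("U+%04X  utf-8: %d bytes  utf-16: %d bytes" % (ord(ch), utf8_len, utf16_len))
```

For Hindi text, UTF-16 is actually the more compact of the two (two bytes per letter instead of three), while plain English text is half the size in UTF-8.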

Example:

Case 1: Handling non-ASCII character

name_in_hindi = "सरीधर बूकि"
name_in_hindi.encode('UTF-8')

Result:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

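Failed.

Reason: In Python 2, a plain string literal is a byte string. Calling encode() on it makes Python first decode those bytes with the default ASCII codec, and Hindi characters are not in the ASCII table, so the decode fails.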

Correction:

name_in_hindi = u"सरीधर बूकि"
name_in_hindi.encode('UTF-8')

Bingo. It works!

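Reason: The u prefix makes the literal a Unicode string, so encode() converts it straight to UTF-8 bytes and no ASCII decode is involved. :)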
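What Python 2 does behind the scenes in the failing case can be spelled out explicitly. The sketch below is written for Python 3 (an assumption; the snippets above are Python 2), where bytes and text are separate types, so the hidden decode step has to be written by hand.

```python
# Raw UTF-8 bytes: this is what a Python 2 plain string literal holds.
raw_bytes = "सरीधर बूकि".encode("utf-8")

# Python 2's str.encode() first decodes the bytes with the default codec,
# which is normally ASCII. Doing that by hand reproduces the error.
try:
    raw_bytes.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)

# Decoding with the right codec gives text, which can be encoded freely.
text = raw_bytes.decode("utf-8")
print(len(raw_bytes), len(text))   # 28 bytes, 10 code points
```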

In my application, I need to send the text as a hexadecimal string encoded in UTF-16BE (the encoding is specified by the service provider).

In Python:

import binascii
utf8_hex = binascii.hexlify(name_in_hindi.encode('utf-8'))
utf16_hex = binascii.hexlify(name_in_hindi.encode('utf-16BE'))
print(utf8_hex)
print(utf16_hex)

The result is as follows (hexlify returns lowercase hex digits; they are shown here in uppercase):

E0A4B8E0A4B0E0A580E0A4A7E0A4B020E0A4ACE0A582E0A495E0A4BF
093809300940092709300020092C09420915093F
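For building and checking such payloads, a small pair of helpers can be handy. This is only a sketch for Python 3; the function names to_utf16be_hex and from_utf16be_hex are hypothetical and not part of any service provider's API.

```python
import binascii

def to_utf16be_hex(text):
    """Return text as an uppercase hex string of its UTF-16BE bytes."""
    hex_bytes = binascii.hexlify(text.encode("utf-16-be"))
    return hex_bytes.decode("ascii").upper()

def from_utf16be_hex(hex_string):
    """Decode a UTF-16BE hex string back to text."""
    return binascii.unhexlify(hex_string).decode("utf-16-be")

name_in_hindi = "सरीधर बूकि"
payload = to_utf16be_hex(name_in_hindi)
print(payload)   # 093809300940092709300020092C09420915093F
```

Decoding the payload again with from_utf16be_hex is a cheap way to confirm that nothing was lost before the message goes out.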

Please let me know your comments below. Thank you.

Originally published at www.sreedharbukya.com on August 26, 2016.
