Python String and Unicode
This week was an interesting one for me: I was building a feature to send SMS in the local language, and along the way I had to understand strings, Unicode, ASCII, UTF-8, UTF-16, and so on.
ASCII: It is a 7-bit character encoding. The original ASCII table contains 128 characters. Later, 8-bit extensions (often called extended ASCII) added another 128, for 256 characters in total.
UTF-8: ASCII initially covered English characters and was later extended with more symbols, but no single 8-bit table could cover other languages such as Russian, Chinese, and French at the same time. In the early 1990s, the Unicode standard was defined to support encoding for any human language. Unicode started out with 16-bit characters instead of 8-bit ones. UTF-8 encodes Unicode using 8-bit units, with one to four bytes per character, and plain ASCII text stays unchanged. UTF-8 is the most widely used Unicode encoding.
UTF-16: It encodes Unicode using 16-bit units, with two or four bytes per character. It is less widely used than UTF-8.
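To make the size difference concrete, here is a minimal sketch, assuming Python 3 (where every str is Unicode). The helper name encoded_sizes and the sample characters are my own picks for illustration.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16BE byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# An ASCII letter, an accented Latin letter, a Hindi letter, and an emoji.
for ch in ["A", "é", "स", "😀"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(ch, hex(ord(ch)), utf8_len, utf16_len)

# Only code points below 128 fit in the ASCII table.
print(ord("A") < 128)   # True
print(ord("स") < 128)   # False
```

An ASCII letter takes one byte in UTF-8 and two in UTF-16BE, while a Hindi letter takes three bytes in UTF-8 and two in UTF-16BE.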
Case 1: Handling non-ASCII characters
name_in_hindi = "सरीधर बूकि"
name_in_hindi.encode('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
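Reason: In Python 2, a plain string literal is a byte string. Calling encode() on it makes Python first decode those bytes with the default ASCII codec, and that step fails because Hindi characters are not mapped in the ASCII table.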
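The hidden decode step can be reproduced explicitly. This is a rough sketch, assuming Python 3, where the decode has to be written out by hand; the variable names are only for illustration.

```python
# UTF-8 bytes for the single Hindi letter स (U+0938).
raw = "स".encode("utf-8")
print(raw)  # b'\xe0\xa4\xb8'

# Decoding those bytes as ASCII fails on the very first byte, 0xe0.
try:
    raw.decode("ascii")
    message = ""
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The message names the same byte and position as the error above.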
name_in_hindi = u"सरीधर बूकि"
name_in_hindi.encode('UTF-8')
Bingo. It works!
Reason: The u prefix makes the literal a Unicode string, so encode() has nothing to decode first. :)
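The same idea applies when the text arrives as raw bytes instead of a literal: decode with the right codec first, then encode to the target. A minimal sketch, assuming Python 3 and assuming the incoming bytes are UTF-8:

```python
# Assumed input: UTF-8 bytes for the Hindi letter स, as they might arrive from a file or socket.
raw = b"\xe0\xa4\xb8"

text = raw.decode("utf-8")          # bytes -> Unicode text
utf16 = text.encode("utf-16-be")    # Unicode text -> UTF-16BE bytes

print(text == "स")   # True
print(utf16.hex())   # 0938
```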
In my application, I need to send the Unicode text as hexadecimal strings encoded in UTF-16BE (the required encoding is specified by the service provider).
import binascii

utf8_hex = binascii.hexlify(name_in_hindi.encode('utf-8'))
utf16_hex = binascii.hexlify(name_in_hindi.encode('utf-16BE'))
print(utf8_hex)
print(utf16_hex)
The result is as follows:
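As a small check that can be verified by hand, here is a sketch of the same conversion for the single letter स (U+0938), written for Python 3. The helper names to_utf16be_hex and from_utf16be_hex are hypothetical, not part of any library.

```python
import binascii

def to_utf16be_hex(text):
    """Encode text as UTF-16BE and return the bytes as a hex string."""
    return binascii.hexlify(text.encode("utf-16-be")).decode("ascii")

def from_utf16be_hex(hex_string):
    """Reverse of to_utf16be_hex: turn the hex string back into text."""
    return binascii.unhexlify(hex_string).decode("utf-16-be")

print(to_utf16be_hex("स"))              # 0938
print(from_utf16be_hex("0938") == "स")  # True
```

Decoding the hex string back is a cheap way to confirm the payload before handing it to the SMS provider.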
Please let me know your comments below. Thank you.