Data Coding Scheme

Data Coding Scheme is a one-octet field in Short Messages (SM) and Cell Broadcast Messages (CB) that carries basic information on how the recipient handset should process the received message.

| Integer | Binary | Translation | Notes |
|---|---|---|---|
| 0 | 00000000 | SMSC Default Alphabet | ASCII for short and long code; GSM for toll-free |
| 1 | 00000001 | ASCII for short and long code, Latin 9 for toll-free | ISO-8859-9 |
| 2 | 00000010 | Octet Unspecified | 8-bit binary |
| 3 | 00000011 | Latin 1 | ISO-8859-1 |
| 4 | 00000100 | Octet Unspecified | 8-bit binary |
| 5 | 00000101 | JIS | X 0208-1990 |
| 6 | 00000110 | Cyrillic | ISO-8859-5 |
| 7 | 00000111 | Latin/Hebrew | ISO-8859-8 |
| 8 | 00001000 | UCS2/UTF-16 | ISO/IEC-10646 |
| 9 | 00001001 | Pictogram Encoding | |
| 10 | 00001010 | Music Codes | ISO-2022-JP |
| 13 | 00001101 | Extended Kanji JIS | X 0212-1990 |
| 14 | 00001110 | Korean Graphic Character Set | KS C 5601/KS X 1001 |
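
The table above can be held in a small lookup structure. The following is a minimal sketch: the names `DATA_CODING` and `describe_data_coding` are hypothetical, the entries are copied from the table, and values that the table does not list are reported as such rather than guessed.

```python
# Entries copied from the table above; the dictionary name is illustrative.
DATA_CODING = {
    0: "SMSC Default Alphabet",
    1: "ASCII for short and long code, Latin 9 for toll-free",
    2: "Octet Unspecified",
    3: "Latin 1",
    4: "Octet Unspecified",
    5: "JIS",
    6: "Cyrillic",
    7: "Latin/Hebrew",
    8: "UCS2/UTF-16",
    9: "Pictogram Encoding",
    10: "Music Codes",
    13: "Extended Kanji JIS",
    14: "Korean Graphic Character Set",
}


def describe_data_coding(value: int) -> str:
    """Return the 8-bit binary form of the value and its name from the table."""
    if not 0 <= value <= 0xFF:
        raise ValueError("the field is a single octet (0-255)")
    name = DATA_CODING.get(value, "not listed in the table")
    return f"{value:08b}: {name}"
```

For example, `describe_data_coding(8)` yields `00001000: UCS2/UTF-16`.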

Message Character Sets
A special 7-bit encoding, the GSM 7-bit default alphabet, was designed for the Short Message System in GSM. The alphabet contains the most frequently used symbols of most Western European languages, plus some Greek uppercase letters. Some ASCII characters and the euro sign did not fit into the GSM 7-bit default alphabet and must be encoded using two septets. These characters form the GSM 7-bit default alphabet extension table. Support for the GSM 7-bit alphabet is mandatory for GSM handsets and network elements.
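
The cost of the extension table can be illustrated by counting septets. The sketch below assumes the character repertoire of the default alphabet and its extension table as defined in 3GPP TS 23.038; the strings were transcribed by hand and should be checked against the specification before use. The function name `gsm7_septet_count` is hypothetical.

```python
from typing import Optional

# Assumed basic table (128 positions, in code order). Position 0x1B is the
# escape code that introduces an extension-table character.
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Assumed extension table: form feed plus nine printable characters.
GSM7_EXTENSION = "\f^{}\\[~]|€"

# The escape code is not a text character in its own right.
_BASIC_SET = set(GSM7_BASIC) - {"\x1b"}
_EXTENSION_SET = set(GSM7_EXTENSION)


def gsm7_septet_count(text: str) -> Optional[int]:
    """Count the septets needed for text, or return None if some character
    is in neither the basic nor the extension table."""
    total = 0
    for ch in text:
        if ch in _BASIC_SET:
            total += 1  # one septet
        elif ch in _EXTENSION_SET:
            total += 2  # escape septet + extension septet
        else:
            return None
    return total
```

Under these assumptions, `gsm7_septet_count("10€")` gives 4, because the euro sign takes two septets, while a Cyrillic string gives `None`.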

Languages that use the Latin script but need characters absent from the GSM 7-bit default alphabet often replace the missing accented characters with the corresponding characters without diacritics. The result is not entirely satisfactory for users, but it is often accepted. To include the missing characters, the 16-bit UTF-16 encoding (called UCS-2 in GSM) may be used, at the price of reducing the length of a non-segmented message from 160 to 70 characters.
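
Both options can be sketched in a few lines. The example assumes that the user data of one message is 140 octets, which is what the 160 and 70 figures imply (140 × 8 / 7 = 160 and 140 × 8 / 16 = 70). The helper names are hypothetical, and the diacritic stripping relies on Unicode decomposition, so letters that have no decomposition (for example "ł") are left unchanged.

```python
import unicodedata

MAX_PAYLOAD_OCTETS = 140  # assumption: user data size of one message


def max_chars(bits_per_char: int, payload_octets: int = MAX_PAYLOAD_OCTETS) -> int:
    """Number of whole characters of the given width that fit in the payload."""
    return payload_octets * 8 // bits_per_char


def strip_diacritics(text: str) -> str:
    """Replace accented letters with their base letters by decomposing
    them (NFD) and dropping the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(max_chars(7))    # 160 characters with the 7-bit alphabet
print(max_chars(16))   # 70 characters with 16-bit code units
print(strip_diacritics("Příliš žluťoučký"))  # Prilis zlutoucky
```

The trade-off is visible directly: keeping the diacritics costs more than half of the message capacity, which is why the lossy replacement is often accepted.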

Messages in Chinese, Korean or Japanese must be encoded using the UTF-16 character encoding. The same was true for other languages written in non-Latin scripts, such as Russian, Arabic, Hebrew and various Indian languages. 3GPP TS 23.038 version 8.0.0, published in 2008, introduced a new feature, the extended national language shift table; in version 11.0.0, published in 2012, it covers Turkish, Spanish, Portuguese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu. The mechanism replaces the GSM 7-bit default alphabet code table and/or the extension table with national table(s), as indicated by special information elements in the User Data Header. A non-segmented message using national language shift table(s) may carry up to 155 (or 153) 7-bit characters.
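
The reduced capacity follows from the space taken by the User Data Header. The sketch below is an illustration under two assumptions that are not stated in the text: the user data holds 140 octets (160 septets), and the header, including its own length octet, is padded up to the next septet boundary before the text starts. The function name `septets_available` is hypothetical.

```python
def septets_available(udh_octets: int, payload_octets: int = 140) -> int:
    """Septets left for text when the User Data Header occupies
    udh_octets octets (counting the header length octet itself)."""
    if udh_octets < 0 or udh_octets > payload_octets:
        raise ValueError("header does not fit in the payload")
    total_septets = payload_octets * 8 // 7
    # Round the header size up to whole septets (ceiling division).
    header_septets = (udh_octets * 8 + 6) // 7
    return total_septets - header_septets


print(septets_available(0))  # 160: no header
print(septets_available(4))  # 155: e.g. length octet + one 3-octet element
print(septets_available(6))  # 153
```

With a 4-octet header, which is the assumed size of a header carrying a single national language shift element, the arithmetic gives the 155 characters quoted above.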

GSM recognizes only two encodings for text messages and one encoding for binary messages:

GSM 7-bit default alphabet (including the national language shift tables)
UCS-2
8-bit data
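
How a receiver might tell these three encodings apart can be sketched as follows. The example assumes the general data coding group of 3GPP TS 23.038, in which bits 7 and 6 of the octet are 00 and bits 3 and 2 select the alphabet; other coding groups are deliberately not handled, and the function name `dcs_alphabet` is hypothetical. This bit layout is an assumption of the sketch and is separate from the integer values in the table earlier in this article.

```python
ALPHABETS = (
    "GSM 7-bit default alphabet",  # bits 3..2 = 00
    "8-bit data",                  # bits 3..2 = 01
    "UCS-2",                       # bits 3..2 = 10
    "reserved",                    # bits 3..2 = 11
)


def dcs_alphabet(dcs: int) -> str:
    """Name the alphabet selected by a Data Coding Scheme octet,
    for the general data coding group only."""
    if not 0 <= dcs <= 0xFF:
        raise ValueError("the Data Coding Scheme is a single octet")
    if dcs & 0xC0 != 0x00:
        # Bits 7..6 are not 00: a different coding group applies.
        return "other coding group (not handled in this sketch)"
    return ALPHABETS[(dcs >> 2) & 0b11]
```

For instance, `dcs_alphabet(0x08)` returns `UCS-2` and `dcs_alphabet(0x04)` returns `8-bit data`.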