unicode codec help for GSM0338

Tue Oct 8 04:48:03 EDT 2002

Anthony Baxter <anthony at interlink.com.au> writes:

> The problem is with the "extension characters" stuff. For instance,
> a left square bracket gets turned into 0x1B 0x3C.  When I try to
> use the attached file with
[...]
> Worse yet, the spec says that 0x1B maps to NBSP, but only if not followed
> by a valid escape sequence. I'm not sure if the python codec system can
> even handle that at all.
> 
> Anyone out there understand the python codec stuff at all that would
> care to lend a hand here? Or do I have to break out the *shudder* regexps?

Since this is not a single-byte character set, you cannot create a
charmap codec. However, Python can support multi-byte character sets
just fine; the UTF-8 codec, and the
JapaneseCodecs/KoreanCodecs/ChineseCodecs packages are prominent
examples.

Writing a codec for this encoding does not look very difficult. Leaving
out the NBSP case for the moment, you'd get

class Codec:
  def encode(self, input, errors = 'strict'):
    result = []
    for c in input:
      try:
        result.append(regular_encode_dict[c])
      except KeyError:
        try:
          result.append(escape_encode_dict[c])
        except KeyError:
          if errors == 'strict': raise UnicodeError("invalid SMS character")
          elif errors == 'replace': result.append(chr(0x3f)) # question mark
          elif errors == 'ignore': pass
          else: raise UnicodeError("unknown error handling")
    return ''.join(result)

Here, regular_encode_dict is the dictionary mapping the right column
of GSM0338.TXT to the left column - *including* the commented-out
mappings, e.g. map u'\u0391' to '\x41' (you may follow the charmap
codec convention of using integers as keys and values).

escape_encode_dict maps the remaining characters to escaped sequences.
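
To make the two tables concrete, here is a small sketch with a handful
of entries. It is written for Python 3 (so encoded data is bytes), the
tables are an illustrative subset and not the full GSM0338.TXT, and
encode_sketch is a hypothetical helper name, not part of any library.

```python
# Illustrative subset only; a real table would cover all of GSM0338.TXT.
regular_encode_dict = {
    u"@": b"\x00",
    u"\u00a3": b"\x01",      # POUND SIGN
    u"$": b"\x02",
    u"A": b"\x41",
    u"\u0391": b"\x41",      # GREEK CAPITAL LETTER ALPHA (commented out in the table)
}

# Characters from the extension table are encoded as 0x1B plus one more byte.
escape_encode_dict = {
    u"[": b"\x1b\x3c",
    u"]": b"\x1b\x3e",
    u"~": b"\x1b\x3d",
    u"\u20ac": b"\x1b\x65",  # EURO SIGN
}


def encode_sketch(text):
    """Look each character up in the regular table first, then the escape table."""
    out = []
    for ch in text:
        if ch in regular_encode_dict:
            out.append(regular_encode_dict[ch])
        else:
            out.append(escape_encode_dict[ch])  # raises KeyError if unmappable
    return b"".join(out)
```

Note that including the commented-out mappings makes encoding lossy:
both u"A" and u"\u0391" come out as 0x41.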

Decoding is a little bit more tricky:

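  def decode(self, input, errors = 'strict'):
    result = []
    index = 0
    while index < len(input):
      c = input[index]
      index += 1
      if c == '\x1b':
        c = input[index]
        index += 1
        result.append(escape_decode_dict[c])
      else:
        try:
          result.append(regular_decode_dict[c])
        except KeyError:
          # error handling: unassigned byte, must be > 0x7f
          # (handled the same way as in encode)
          if errors == 'strict': raise UnicodeError("invalid SMS byte")
          elif errors == 'replace': result.append(u'\ufffd') # REPLACEMENT CHARACTER
          elif errors == 'ignore': pass
          else: raise UnicodeError("unknown error handling")
    return u"".join(result)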

regular_decode_dict maps the left column to the right column, possibly
excluding 0x1b (doesn't matter for that algorithm). escape_decode_dict
maps the escape sequences.

Now for NBSP: On decoding, you get a KeyError in
escape_decode_dict. In that case, append NBSP, and decrement index (to
consume the next character in the next round). On encoding, the
encoding is underspecified: how do I encode NBSP + EQUAL SIGN? It
probably is best to ignore those problem cases, and encode NBSP as
0x1b. Alternatively, treat NBSP cases that would result in a
valid escape sequence as an error.
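
The decoding rule for NBSP might look roughly like the sketch below.
It assumes Python 3 (indexing bytes yields integers), uses tiny
illustrative tables, and peeks at the next byte instead of
decrementing index, which has the same effect. decode_sketch is a
hypothetical name.

```python
ESC = 0x1B
NBSP = u"\xa0"

# Illustrative subsets keyed by byte value (assumptions, not the full table).
regular_decode_dict = {0x00: u"@", 0x3D: u"=", 0x41: u"A"}
escape_decode_dict = {0x3C: u"[", 0x3E: u"]", 0x3D: u"~"}


def decode_sketch(data):
    """Decode bytes, treating 0x1B as NBSP unless a valid escape follows."""
    result = []
    index = 0
    while index < len(data):
        b = data[index]
        index += 1
        if b == ESC:
            # Peek at the following byte, if there is one.
            nxt = data[index] if index < len(data) else None
            if nxt in escape_decode_dict:
                result.append(escape_decode_dict[nxt])
                index += 1
            else:
                # Not a valid escape sequence: 0x1B stands for NBSP, and the
                # following byte is decoded on the next pass through the loop.
                result.append(NBSP)
        else:
            result.append(regular_decode_dict[b])  # KeyError if unassigned
    return u"".join(result)
```

With these tables, 0x1B 0x3D decodes to a tilde, while 0x1B 0x41
decodes to NBSP followed by "A". A trailing 0x1B at the end of the
input is also treated as NBSP here, which is a design choice of the
sketch.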

If you are *really* pedantic, you'll have to write special-cased
stream readers and writers. If somebody does

writer.write(u"\xa0") # NBSP
writer.write(u"\x3d")

you would need to preserve state in the writer to know whether the
last character read or written was an escape character. You can ignore
this if you know that you won't need stream writers, or that you
always use them in line mode, or that you don't care about reads or
writes that go in the middle of an escape sequence.
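
As a rough illustration of the state a writer would have to keep, the
sketch below remembers whether the last byte written was a bare 0x1B
and refuses a following character that would complete a valid escape
sequence. It assumes Python 3, uses small illustrative tables, and
StatefulWriterSketch is a hypothetical class that does not plug into
the codecs module's StreamWriter machinery.

```python
ESC = b"\x1b"

# Illustrative subsets of the tables (assumptions, not the full GSM0338.TXT).
regular_encode_dict = {u"A": b"\x41", u"=": b"\x3d", u"\xa0": ESC}
escape_encode_dict = {u"~": b"\x1b\x3d", u"[": b"\x1b\x3c"}

# Second bytes that would complete a valid escape sequence after 0x1B.
escape_second_bytes = {seq[1:] for seq in escape_encode_dict.values()}


class StatefulWriterSketch:
    def __init__(self):
        self.chunks = []
        self.pending_nbsp = False  # True if the last byte written was a bare 0x1B

    def write(self, text):
        for ch in text:
            if ch in regular_encode_dict:
                encoded = regular_encode_dict[ch]
            else:
                encoded = escape_encode_dict[ch]
            # NBSP followed by e.g. "=" would be read back as an escape sequence.
            if self.pending_nbsp and encoded[:1] in escape_second_bytes:
                raise UnicodeError(
                    "NBSP followed by %r would read as an escape sequence" % ch)
            self.chunks.append(encoded)
            self.pending_nbsp = (encoded == ESC)

    def getvalue(self):
        return b"".join(self.chunks)
```

Because the flag survives between calls, writing NBSP and then "=" in
two separate write() calls is rejected just as it would be in a single
call.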

HTH,
Martin