[I18n-sig] CJKCodecs 1.0b1 is released

Sun, 13 Jul 2003 03:55:05 +0900

On Sat, Jul 12, 2003 at 07:32:21PM +0200, "Martin v. Löwis" wrote:
> Hye-Shik Chang wrote:
> 
> >  *) UTF-7, UTF-16, UTF-16BE and UTF-16LE codec is added.
> 
> What is the rationale for this change? Python already distributed codecs 
> for these.
> 

Python's utf-7 codec is slightly broken for StreamReaders, and it was
not easy for me to fix.

Simple tests:
(the decoder doesn't handle surrogate pairs on a UCS-2 build)

>>> u'\U00012345'.encode('utf-7')
'+2AjfRQ-'
>>> '+2AjfRQ-'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-7 decoding error: code pairs are not supported
>>> '+2AjfRQ-'.decode('cjkcodecs.utf-7')
u'\U00012345'

(broken encoding for unichar > 0xffff)

>>> u'\U00012345'.encode('utf-7')
'+I0U-'
>>> u'\U00012345'.encode('cjkcodecs.utf-7')
'+2AjfRQ-'
>>> '+2AjfRQ-'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: code pairs are not supported
>>> '+2AjfRQ-'.decode('cjkcodecs.utf-7')
u'\U00012345'
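
The byte strings in the two tests above can be reproduced by hand. The
following is a minimal sketch for a current Python 3 interpreter (an
assumption; the sessions above are Python 2), and the helper names
utf7_shift_sequence and surrogate_pair are made up for illustration. It
builds one UTF-7 shifted block as '+', the unpadded base64 of the UTF-16BE
bytes, and '-'. Truncating U+12345 to 16 bits gives U+2345, which
reproduces the broken '+I0U-' output, so truncation appears to be what the
broken encoder does.

```python
import base64


def surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf7_shift_sequence(text):
    """Hypothetical helper: encode text as a single UTF-7 shifted block."""
    raw = text.encode('utf-16-be')
    b64 = base64.b64encode(raw).decode('ascii').rstrip('=')
    return '+' + b64 + '-'


high, low = surrogate_pair(0x12345)

# Correct form: both surrogates go through base64.
correct = utf7_shift_sequence('\U00012345')

# Assumed failure mode: the code point is cut down to its low 16 bits.
truncated = utf7_shift_sequence(chr(0x12345 & 0xFFFF))

print(hex(high), hex(low))
print(correct)
print(truncated)
```

The surrogate pair is 0xd808 0xdf45, whose four bytes encode to '2AjfRQ',
while the truncated value 0x2345 encodes to 'I0U'.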

(problem with long utf-7 sequences)

>>> import StringIO, codecs
>>> s=StringIO.StringIO((u'\uac00' * 20).encode('utf-7'))
>>> rs = codecs.getreader('utf-7')(s)
>>> rs.read(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/codecs.py", line 262, in read
    object, decodedbytes = decode(data, self.errors)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-19: unterminated shift sequence
>>> s=StringIO.StringIO((u'\uac00' * 20).encode('utf-7'))
>>> rs = codecs.getreader('cjkcodecs.utf-7')(s)
>>> rs.read(10)
u'\uac00\uac00\uac00'
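
A reader that survives this case has to carry state across chunk
boundaries inside a shifted block. As a rough illustration (a sketch for
a current Python 3 interpreter, not the cjkcodecs implementation), the
standard incremental decoder can be fed the same data ten bytes at a
time. The arithmetic at the top shows why three characters is the most a
10-byte read can yield: one '+' plus nine base64 characters is 54 bits,
which holds three complete 16-bit units.

```python
import base64
import codecs

text = '\uac00' * 20

# One shifted UTF-7 block built by hand:
# '+', unpadded base64 of the UTF-16BE bytes, then '-'.
data = b'+' + base64.b64encode(text.encode('utf-16-be')).rstrip(b'=') + b'-'

# First 10 bytes: 1 shift byte + 9 base64 characters = 54 bits,
# i.e. three complete 16-bit units with 6 bits left over.
complete_units = ((10 - 1) * 6) // 16

# Feed the data in 10-byte chunks; the decoder keeps whatever it cannot
# decode yet and continues when the next chunk arrives.
decoder = codecs.getincrementaldecoder('utf-7')()
pieces = []
for start in range(0, len(data), 10):
    pieces.append(decoder.decode(data[start:start + 10]))
pieces.append(decoder.decode(b'', final=True))
decoded = ''.join(pieces)

print(complete_units)
print(decoded == text)
```

How much text comes out of each individual call is an implementation
detail; only the joined result is checked here.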

I also wrote utf-8 and utf-16 codecs for cjkcodecs, just for fun.
I shipped them because they are somewhat faster than Python's equivalents.

(StreamReader benchmarks on a typical 10 KB Chinese text)
(all values are in iterations/sec)

            Python  CJKCodecs
read(16)    14      187
read(256)   221     1645
read(512)   468     1990
readline    361     921
readlines   785     1193
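
The harness behind these numbers is not shown. One way such
iterations/sec figures could be measured is sketched below for a current
Python 3 interpreter. Everything in it is an assumption: the function
names, the placeholder sample text, and the use of the built-in utf-8
codec as the codec under test. Absolute values will not match the table.

```python
import codecs
import io
import time


def iterations_per_second(encoded, codec_name, consume, min_seconds=0.1):
    """Count how often `consume` can drain a fresh StreamReader per second."""
    reader_factory = codecs.getreader(codec_name)
    count = 0
    start = time.perf_counter()
    while True:
        consume(reader_factory(io.BytesIO(encoded)))
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return count / elapsed


def read_in_chunks(size):
    """Build a workload that calls read(size) until the stream is empty."""
    def consume(reader):
        while reader.read(size):
            pass
    return consume


# Placeholder sample of roughly 9 KB; not the text used in the post.
sample = ('\u4e2d\u6587\u6587\u672c\n' * 700).encode('utf-8')

workloads = {
    'read(16)': read_in_chunks(16),
    'read(256)': read_in_chunks(256),
    'readlines': lambda reader: reader.readlines(),
}
results = {name: iterations_per_second(sample, 'utf-8', workload)
           for name, workload in workloads.items()}

for name, rate in results.items():
    print('%-10s %10.1f' % (name, rate))
```

Passing a different codec name as the second argument would give the
other column of the comparison.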

They are not large, and they don't replace Python's codecs by default
(their aliases are distributed commented out in cjkcodecs/aliases.py).
So I think they are worth including for their size.

> Regards,
> Martin
> 
> 

Regards,
    Hye-Shik =)