[I18n-sig] CJKCodecs 1.0b1 is released
Hye-Shik Chang
perky@i18n.org
Sun, 13 Jul 2003 03:55:05 +0900
On Sat, Jul 12, 2003 at 07:32:21PM +0200, "Martin v. Löwis" wrote:
> Hye-Shik Chang wrote:
>
> > *) UTF-7, UTF-16, UTF-16BE and UTF-16LE codec is added.
>
> What is the rationale for this change? Python already distributed codecs
> for these.
>
Python's utf-7 codec is slightly broken for StreamReaders, and fixing
it was not easy for me.
Simple tests:
(it doesn't handle surrogate pairs on a UCS-2 build)
>>> u'\U00012345'.encode('utf-7')
'+2AjfRQ-'
>>> '+2AjfRQ-'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-7 decoding error: code pairs are not supported
>>> '+2AjfRQ-'.decode('cjkcodecs.utf-7')
u'\U00012345'
(broken encoding for a unichar > 0xffff, which can only occur on a UCS-4 build)
>>> u'\U00012345'.encode('utf-7')
'+I0U-'
>>> u'\U00012345'.encode('cjkcodecs.utf-7')
'+2AjfRQ-'
>>> '+2AjfRQ-'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: code pairs are not supported
>>> '+2AjfRQ-'.decode('cjkcodecs.utf-7')
u'\U00012345'
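The expected output in the two tests above can be checked by hand. The following is a minimal sketch in Python 3 (an assumption; the sessions above are Python 2), and the helper names `surrogate_pair` and `utf7_shift_sequence` are hypothetical, not part of Python or CJKCodecs. It builds the UTF-7 shifted run from UTF-16 code units, and shows that '+I0U-' is consistent with the code point being truncated to its low 16 bits instead of being split into a surrogate pair.

```python
import base64
import struct


def surrogate_pair(cp):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf7_shift_sequence(code_units):
    """Encode 16-bit code units as one UTF-7 shifted run: '+', base64, '-'."""
    raw = b"".join(struct.pack(">H", u) for u in code_units)
    # UTF-7 uses base64 without the trailing '=' padding.
    b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
    return "+" + b64 + "-"


cp = 0x12345
correct = utf7_shift_sequence(surrogate_pair(cp))
# Keeping only the low 16 bits reproduces the broken output shown above.
truncated = utf7_shift_sequence([cp & 0xFFFF])

print(correct)    # +2AjfRQ-
print(truncated)  # +I0U-
```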
(problem with a long utf-7 sequence read through a StreamReader)
>>> s=StringIO.StringIO((u'\uac00' * 20).encode('utf-7'))
>>> rs = codecs.getreader('utf-7')(s)
>>> rs.read(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/codecs.py", line 262, in read
    object, decodedbytes = decode(data, self.errors)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-19: unterminated shift sequence
>>> s=StringIO.StringIO((u'\uac00' * 20).encode('utf-7'))
>>> rs = codecs.getreader('cjkcodecs.utf-7')(s)
>>> rs.read(10)
u'\uac00\uac00\uac00'
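The failure above comes from decoding a chunk that ends in the middle of a shifted run. A decoder that keeps the undecoded tail between calls avoids it. The sketch below uses the standard `codecs.getincrementaldecoder` API in Python 3 (an assumption; it is not the Python 2.3 or CJKCodecs code discussed here) and feeds the same 20-character string in 10-byte chunks.

```python
import codecs

text = "\uac00" * 20
data = text.encode("utf-7")

# The incremental decoder buffers an incomplete shifted run between calls
# instead of raising "unterminated shift sequence".
decoder = codecs.getincrementaldecoder("utf-7")()
pieces = []
for i in range(0, len(data), 10):
    pieces.append(decoder.decode(data[i:i + 10]))
# Signal end of input so any buffered bytes are flushed or reported.
pieces.append(decoder.decode(b"", final=True))

result = "".join(pieces)
print(result == text)
```

How many characters each individual call returns depends on where the chunk boundaries fall, so only the joined result is compared here.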
I also wrote utf-8 and utf-16 codecs for cjkcodecs, just for fun.
I shipped them because they are somewhat faster than Python's equivalents.
(StreamReader benchmarks with a typical 10 KB Chinese text)
(all values are in iterations/sec)
             Python  CJKCodecs
read(16)         14        187
read(256)       221       1645
read(512)       468       1990
readline        361        921
readlines       785       1193
They are not very big, and they don't replace Python's codecs by default
(the aliases are distributed commented out in cjkcodecs/aliases.py).
So I think they are useful enough to justify their size.
> Regards,
> Martin
>
>
Regards,
Hye-Shik =)