Problem processing Chinese character with Python
Anthony Liu
antonyliu2002 at yahoo.com
Sun Mar 7 03:01:36 EST 2004
Hey, I fiddled with the Chinese punctuation, and it
works elegantly now.
Thanks a lot!
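
For anyone reading this in the archive, here is a minimal sketch of what handling the Chinese punctuation might look like. It assumes the text has already been decoded to characters. The name `sentences`, the `SENTENCE_ENDS` constant, and the particular set of terminators are illustrative assumptions, not the code from the earlier message.

```python
# Assumed set of sentence-ending marks: the ASCII full stop, the
# ideographic full stop (U+3002), and the full-width "!" and "?".
SENTENCE_ENDS = u'.\u3002\uff01\uff1f'


def sentences(chars, terminators=SENTENCE_ENDS):
    """Yield sentences from an iterable of already-decoded characters."""
    current = []
    for char in chars:
        current.append(char)
        if char in terminators:
            # A terminator closes the current sentence.
            yield u''.join(current).strip()
            current = []
    # Emit any trailing text that has no closing terminator.
    tail = u''.join(current).strip()
    if tail:
        yield tail


mixed = list(sentences(u'\u4f60\u597d\u3002Hello world. \u518d\u89c1'))
```

Because the terminators are a parameter, the same generator should work for English-only text or for a different set of marks.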
--- Andrew Bennetts <andrew-pythonlist at puzzling.org> wrote:
> On Sat, Mar 06, 2004 at 02:05:11AM -0800, Anthony Liu wrote:
> > Andrew gave me sample code which lets me read a text
> > file sentence by sentence.
> >
> > Suppose I just want to read the part between two full
> > stops each time.
> >
> > It works nicely with English text files, where the
> > full stop is a dot (.).
> >
> > But when I tried to read Chinese text files, I found
> > that it sometimes reads a few sentences at one time.
>
> Yep -- you'll notice I'm reading bytes, but the sentences generator is
> expecting characters. That assumption holds for ASCII, but not many other
> encodings.
>
> You need some way of reading *characters*, rather than bytes from the file.
> To do this you need to know the encoding of the file (of course), and then I
> guess you need to try to decode the bytes as you read them in. I'm just a
> boring mono-lingual English speaker, so I haven't really played with unicode
> much, but I guess something along these lines would work:
>
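> def characters(textFile, encoding):
>     # textFile must be opened in binary mode so read() returns bytes.
>     buf = b''
>     for byte in iter(lambda: textFile.read(1), b''):
>         buf += byte
>         try:
>             # Succeeds only once buf holds a complete character.
>             yield buf.decode(encoding)
>         except UnicodeDecodeError:
>             # Incomplete multi-byte sequence; read another byte.
>             pass
>         else:
>             buf = b''
>     if buf:
>         # Leftover bytes at EOF; raises if they are not a valid character.
>         yield buf.decode(encoding)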
>
> Hopefully someone who knows more about unicode will tell me if I've somehow
> got this completely wrong.
>
> Again, reading one byte at a time is pretty inefficient. You can probably
> optimise fairly easily by reading and decoding large chunks.
>
> -Andrew.
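
The chunked reading suggested above can be sketched with the standard library's incremental decoders, which hold back an incomplete multi-byte sequence until the rest of it arrives. This is only a sketch under assumptions: the file is opened in binary mode, the encoding is known, and the `chunk_size` parameter and its default are illustrative choices, not part of the original code.

```python
import codecs
import io


def characters(binary_file, encoding, chunk_size=4096):
    """Yield decoded characters one at a time, reading the file in chunks."""
    # The incremental decoder buffers a partial multi-byte sequence
    # until the remaining bytes show up in a later chunk.
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in iter(lambda: binary_file.read(chunk_size), b''):
        for char in decoder.decode(chunk):
            yield char
    # final=True flushes the decoder; it raises UnicodeDecodeError
    # if the file ends in the middle of a character.
    for char in decoder.decode(b'', final=True):
        yield char


# Small demonstration: GBK uses two bytes for each of these characters,
# so a chunk size of 3 deliberately splits characters across reads.
sample = u'\u4f60\u597d\u3002\u4e16\u754c\u3002'
decoded = u''.join(characters(io.BytesIO(sample.encode('gbk')), 'gbk', chunk_size=3))
```

If character-by-character access is not needed, `codecs.open(path, encoding=...)` or, on Python 3, `open(path, encoding=...)` returns decoded text directly and is likely the simpler choice.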