Problem processing Chinese character with Python

Sat Mar 6 23:01:00 EST 2004

On Sat, Mar 06, 2004 at 02:05:11AM -0800, Anthony Liu wrote:
> Andrew gave me some sample code which lets me read a text
> file sentence by sentence.
> 
> Suppose I just wanna read the part between 2 full
> stops each time.
> 
> It works nicely with English text files, where the
> full stop is a dot (.).  
> 
> But when I tried to read Chinese text files, I found
> that it sometimes reads a few sentences at one time.

Yep -- you'll notice I'm reading bytes, but the sentences generator is
expecting characters.  Treating one byte as one character works for ASCII,
but not for the multi-byte encodings commonly used for Chinese text.
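
To make the byte/character mismatch concrete, here is a small sketch.  It
assumes the file is in GBK or UTF-8 (I don't know which encoding your files
actually use) and that sentences end with the ideographic full stop U+3002;
the variable names are just for illustration:

```python
# The Chinese full stop is a single character...
full_stop = u'\u3002'  # IDEOGRAPHIC FULL STOP

# ...but it takes more than one byte in common encodings.
gbk_bytes = full_stop.encode('gbk')
utf8_bytes = full_stop.encode('utf-8')

print(len(full_stop))   # 1 character
print(len(gbk_bytes))   # 2 bytes
print(len(utf8_bytes))  # 3 bytes

# A single byte of the sequence is not a decodable character on its own.
try:
    gbk_bytes[:1].decode('gbk')
    first_byte_decodes = True
except UnicodeDecodeError:
    first_byte_decodes = False
print(first_byte_decodes)  # False
```

So code that compares one byte at a time against a full stop never sees the
whole character.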

You need some way of reading *characters*, rather than bytes, from the file.
To do this you need to know the encoding of the file (of course), and then
decode the bytes as you read them in.  I'm just a boring monolingual English
speaker, so I haven't really played with Unicode much, but I guess something
along these lines would work:

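def characters(textFile, encoding):
    # textFile must be opened in binary mode ('rb') so read() returns bytes.
    buf = b''
    for byte in iter(lambda: textFile.read(1), b''):
        buf += byte
        try:
            # Fails until buf holds a complete character.
            char = buf.decode(encoding)
        except UnicodeDecodeError:
            continue
        yield char
        buf = b''
    if buf:
        # Leftover bytes mean the file ended mid-character; this raises.
        yield buf.decode(encoding)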

Hopefully someone who knows more about Unicode will tell me if I've somehow
got this completely wrong.

Again, reading one byte at a time is pretty inefficient.  You can probably
optimise fairly easily by reading and decoding large chunks.
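
As a rough sketch of that chunked approach: the standard codecs module (in
Python versions that provide codecs.getincrementaldecoder) can buffer a
multi-byte character that is split across two reads.  The chunk_size
parameter, the sentences helper and the GBK sample text below are my own
assumptions for illustration, not the generator from the earlier message:

```python
import codecs
import io


def characters(binary_file, encoding, chunk_size=4096):
    """Yield decoded characters one at a time, reading the file in chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in iter(lambda: binary_file.read(chunk_size), b''):
        # The decoder keeps any incomplete trailing byte sequence and
        # completes it when the next chunk arrives.
        for char in decoder.decode(chunk):
            yield char
    # final=True raises UnicodeDecodeError if the file ended mid-character.
    for char in decoder.decode(b'', final=True):
        yield char


def sentences(chars, full_stop=u'\u3002'):
    """Hypothetical splitter: group characters up to each full stop."""
    buf = []
    for char in chars:
        buf.append(char)
        if char == full_stop:
            yield u''.join(buf)
            buf = []
    if buf:
        yield u''.join(buf)


# Two short sentences encoded as GBK; a chunk size of 3 deliberately
# splits some two-byte characters across reads.
data = u'你好。再见。'.encode('gbk')
result = list(sentences(characters(io.BytesIO(data), 'gbk', chunk_size=3)))
print(result)
```

With a real file you would open it in binary mode and pass it in place of
the io.BytesIO object.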

-Andrew.