Problem processing Chinese character with Python
Andrew Bennetts
andrew-pythonlist at puzzling.org
Sat Mar 6 23:01:00 EST 2004
On Sat, Mar 06, 2004 at 02:05:11AM -0800, Anthony Liu wrote:
> Andrew gave me a sample code which let me read a text
> file sentence by sentence.
>
> Suppose I just wanna read the part between 2 full
> stops each time.
>
> It works nicely with English text files, where the
> full stop is a dot (.).
>
> But when I tried to read Chinese text files, I found
> that it sometimes reads a few sentences at one time.
Yep -- you'll notice I'm reading bytes, but the sentences generator is
expecting characters. That assumption holds for ASCII, where one byte is one
character, but not for many other encodings.
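To make the mismatch concrete, here is a small sketch. It assumes the file is
in GBK or UTF-8 (the original post doesn't say which encoding is in use), and
the helper name encoded_lengths is made up for illustration. It shows that the
Chinese full stop (U+3002) takes more than one byte in both encodings, so no
single byte read from the file will ever compare equal to it:

```python
def encoded_lengths(text, encodings=('gbk', 'utf-8')):
    """Return {encoding: number of bytes needed to encode text}."""
    return dict((enc, len(text.encode(enc))) for enc in encodings)

# U+3002 is the ideographic full stop used in Chinese text.
print(encoded_lengths(u'\u3002'))
# The ASCII dot, by contrast, is a single byte in both encodings.
print(encoded_lengths(u'.'))
```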
You need some way of reading *characters*, rather than bytes from the file.
To do this you need to know the encoding of the file (of course), and then I
guess you need to try to decode the bytes as you read them in. I'm just a
boring mono-lingual English speaker, so I haven't really played with unicode
much, but I guess something along these lines would work:
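    def characters(textFile, encoding):
        # textFile must be opened in binary mode ('rb').
        buf = b''
        for byte in iter(lambda: textFile.read(1), b''):
            buf += byte
            try:
                char = buf.decode(encoding)
            except UnicodeDecodeError:
                # Incomplete multi-byte sequence so far: keep reading.
                continue
            buf = b''
            yield char
        if buf:
            # Leftover bytes at end of file; this raises if they are invalid.
            yield buf.decode(encoding)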
Hopefully someone who knows more about unicode will tell me if I've somehow
got this completely wrong.
Again, reading one byte at a time is pretty inefficient. You can probably
optimise fairly easily by reading and decoding large chunks.
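One possible way to do the chunked version is with the incremental decoders
in the standard codecs module, which remember a partial multi-byte character
between calls. This is only a sketch: the names chunked_characters and
chunk_size are made up for illustration, and the sentences function below is
a simplified stand-in for the generator from the earlier message, extended on
the assumption that the Chinese full stop (U+3002) should also end a sentence.

```python
import codecs
import io

def chunked_characters(binary_file, encoding, chunk_size=4096):
    """Yield decoded characters from a file opened in binary mode."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = binary_file.read(chunk_size)
        # On the last (empty) read, final=True makes the decoder raise
        # if the file ended in the middle of a multi-byte character.
        text = decoder.decode(chunk, final=not chunk)
        for char in text:
            yield char
        if not chunk:
            break

def sentences(chars, stops=u'.\u3002'):
    """Group characters into sentences ending at any character in stops."""
    current = []
    for char in chars:
        current.append(char)
        if char in stops:
            yield u''.join(current)
            current = []
    if current:
        yield u''.join(current)

# Demo on an in-memory GBK "file"; chunk_size=3 deliberately splits
# two-byte characters across reads.
data = u'\u4f60\u597d\u3002\u4e16\u754c\u3002'.encode('gbk')
for sentence in sentences(chunked_characters(io.BytesIO(data), 'gbk', 3)):
    print(repr(sentence))
```

With a real file you would pass open(filename, 'rb') instead of the BytesIO
object, along with whatever encoding the file actually uses.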
-Andrew.