How best to handle Unicode where only 8-bit chars are now?
Skip Montanaro
skip at pobox.com
Sat Feb 8 00:21:27 EST 2003
The csv module supporting PEP 305 doesn't do Unicode yet. All string
manipulation is currently done using null-terminated C strings for speed.
I'm looking for suggestions about how best to incorporate Unicode string
handling into the code. I see three possibilities:
1. Try to treat unicode objects the same as string objects: extract the
   raw data and handle it as bytes.
2. Provide two different state machines: the current fast one, which
   operates only on C strings representing ASCII data, and a slow one,
   which operates on unicode objects.
3. Rewrite the state machine to operate at the level of string or unicode
   objects, even though it will slow down the common case significantly.
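As a rough illustration of option 2, the choice between the two state
machines could be made per call. This is only a sketch under stated
assumptions: it uses Python 3's bytes and str as stand-ins for 8-bit strings
and unicode objects, and the names split_fast, split_slow, choose_path and
split_fields are hypothetical, not part of the csv module.

```python
def split_fast(line, delimiter):
    """Hypothetical stand-in for the fast state machine (8-bit data only)."""
    return line.split(delimiter)


def split_slow(line, delimiter):
    """Hypothetical stand-in for the slow state machine (Unicode text)."""
    return line.split(delimiter)


def choose_path(line, delimiter):
    """Pick the fast path only when every input is 8-bit data."""
    if isinstance(line, bytes) and isinstance(delimiter, bytes):
        return "fast"
    return "slow"


def split_fields(line, delimiter):
    """Dispatch to one of the two parsers based on the input types."""
    if choose_path(line, delimiter) == "fast":
        return split_fast(line, delimiter)
    # A single Unicode input forces everything onto the slow path.
    # Assumption: any 8-bit input here is plain ASCII.
    if isinstance(line, bytes):
        line = line.decode("ascii")
    if isinstance(delimiter, bytes):
        delimiter = delimiter.decode("ascii")
    return split_slow(line, delimiter)


print(choose_path(b"a,b", b","))  # both inputs are 8-bit
print(choose_path(b"a,b", ","))   # a Unicode delimiter changes the path
```

In this sketch a Unicode delimiter is enough to move an otherwise all-ASCII
line off the fast path, even though the data itself has not changed.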
Option 1 seems doomed because you'd be trying to mix processing of wide and
narrow characters. Option 2 seems the least disruptive, but if somehow a
unicode object snuck into the system (say, a single field was unicode or the
delimiter was specified as u'"' even though it was really ASCII), the whole
system might mysteriously slow down. Option 3 seems the cleanest, but would
slow everything down significantly because character extraction and
comparison would require a function call instead of an array index operation
or a simple comparison.
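The problem with option 1 can be seen in a small experiment. The sketch
below is written in Python 3 and assumes the raw data is UTF-16-LE, which is
only one possible internal layout. It picks U+012C because that character's
low byte is 0x2C, the same value as an ASCII comma.

```python
# Field text containing U+012C, followed by a real comma delimiter.
text = "a\u012cb,c"

# Assumption: the raw wide-character data is laid out as UTF-16-LE.
raw = text.encode("utf-16-le")

# Byte-level split, as a narrow-character parser would do it.
byte_fields = raw.split(b",")

# Character-level split, which is what the caller actually wants.
char_fields = text.split(",")

print(len(byte_fields))  # 3: the byte 0x2C inside U+012C was taken as a comma
print(len(char_fields))  # 2
```

The byte-level parser finds a delimiter in the middle of a character and
reports one field too many, and the fragments it returns no longer decode to
the original text.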
Skip