How best to handle Unicode where only 8-bit chars are now?

Skip Montanaro skip at pobox.com
Sat Feb 8 00:21:27 EST 2003


The csv module that implements PEP 305 doesn't handle Unicode yet.  All
string manipulation is currently done on null-terminated C strings for
speed.  I'm looking for suggestions about how best to incorporate Unicode
string handling into the code.  I see three possibilities:

    1 Try to treat unicode objects the same as string objects - extract the
      raw data and handle it as bytes.

    2 Provide two different state machines, the current fast one which
      operates only on C strings representing ASCII data and a slow one
      which operates on unicode objects.

    3 Rewrite the state machine to operate at the level of string or unicode
      objects even though it will slow down the common case significantly.
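
To make option 1 concrete, here is a small illustration (not code from the
csv module) of what the raw data of a wide string looks like to byte-wise
code.  It assumes UTF-16-LE as a stand-in for the internal wide
representation, which really depends on how Python was built.

```python
# Illustration only: UTF-16-LE is an assumed stand-in for a wide internal
# representation; the real layout depends on the Python build.
text = "a,b"
raw = text.encode("utf-16-le")

# Every ASCII character is followed by a zero byte.
print(raw)        # b'a\x00,\x00b\x00'

# A null-terminated C string ends at the first zero byte, so code that
# walks the buffer as a C string would see only this much of the row.
visible = raw.split(b"\x00", 1)[0]
print(visible)    # b'a'
```

The embedded zero bytes are what would trip up a parser built around
null-terminated C strings.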

Option 1 seems doomed because you'd be trying to mix processing of wide and
narrow characters.  Option 2 seems the least disruptive, but if somehow a
unicode object snuck into the system (say, a single field was unicode or the
delimiter was specified as u'"' even though it was really ASCII), the whole
system might mysteriously slow down.  Option 3 seems the cleanest, but would
slow everything down significantly because character extraction and
comparison would require a function call instead of an array index operation
or a simple comparison.
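
A rough sketch of how option 2's dispatch might behave, to show why one
stray unicode object would push a whole row onto the slow path.  This is
illustrative Python, not the module's C code: join_row, join_fast and
join_slow are hypothetical names, and bytes and str stand in for the 8-bit
string and unicode types.

```python
# Hypothetical sketch of option 2: choose a code path per row.
# bytes stands in for 8-bit strings, str for unicode objects.

def join_fast(fields, delimiter):
    # Fast path: every input is 8-bit data.
    return delimiter.join(fields)

def join_slow(fields, delimiter):
    # Slow path: widen everything to str (8-bit input assumed to be ASCII).
    def widen(value):
        return value.decode("ascii") if isinstance(value, bytes) else value
    return widen(delimiter).join(widen(f) for f in fields)

def join_row(fields, delimiter=b","):
    # A single unicode field or a unicode delimiter selects the slow path.
    all_narrow = isinstance(delimiter, bytes) and all(
        isinstance(f, bytes) for f in fields)
    if all_narrow:
        return join_fast(fields, delimiter)
    return join_slow(fields, delimiter)
```

With this sketch, join_row([b"a", b"b"]) stays on the fast path, while
join_row([b"a", "b"]) or join_row([b"a", b"b"], delimiter=",") silently
takes the slow one even though all the data is ASCII.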

Skip




