Removing Unicode from Python?

Thu Oct 30 14:39:32 EST 2003

> In general I love Python for text manipulation but at our company we
> have the need to manipulate large text values stored in either a SQL
> Server database or text files. This data is stored in a "text" field
> type and is definitely not unicode though it is often very strange
> text since it is either OCR or some kinda electronic file extraction.

Issue #1: You need to find out what encoding is being used. You cannot
hope to gloss over this fact, ignore Unicode altogether, and just
"get on with life". Read
http://www.joelonsoftware.com/articles/Unicode.html for a recent,
well-written plea on this issue. You should probably not finish reading
my post until you've read that, or my comments will seem needlessly
biting.

> Unfortunately when it is retrieved into a string type in python it is
> invariably a unicode type string.

Issue #2: What method is being used to "retrieve into a string type"?
Show some code.

> The best I can do is try and encode
> it to 'latin-1' but that will often throw an error; if I use the
> ignore parameter then it will wack my data with a bunch of "?".

Again, you need to find out what encoding the text is in before the
"retrieval". MSSQL2k, for example, can have a different code page in use
for each collation. You need to find out what that is for your data.
Then, when the unicode() coercion is performed, you will be able to
tell that factory function exactly which encoding you're sending it.
However you find out, for heaven's sake, don't GUESS.
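
To make the "don't guess" point concrete, here is a minimal sketch. It is
written in Python 3 syntax, where bytes plays the role of the old str and
str plays the role of unicode. The byte values and the choice of code page
1252 are assumptions made up for illustration; substitute whatever your
collation really uses.

```python
# Hypothetical raw bytes as they might sit in a non-Unicode "text" column.
# Assumption: the column's collation uses Windows code page 1252.
raw = b"na\xefve \x93quoted\x94 text"

# Decoding with the code page you have verified yields the intended text,
# including the curly quotes stored as 0x93 and 0x94.
right = raw.decode("cp1252")

# Guessing latin-1 never raises an error, but 0x93 and 0x94 silently turn
# into invisible C1 control characters instead of curly quotes.
guessed = raw.decode("latin-1")

print(repr(right))
print(repr(guessed))
print("same text?", right == guessed)
```

Both calls "work", which is exactly the trap: only the one that names the
real encoding gives you the characters that were stored.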

Issue #3: Your data is probably being "wacked" on your display, not in
and of itself. In other words, the characters themselves are probably
correct, functioning unicode; however, your display (which is... what?
a web page? a DOS prompt?) is unable to display unicode using the default
encoding. It may be possible for you to find out which encodings your
display supports and properly encode() the data before it is sent to the
display device.
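
A rough sketch of what is probably happening, again in Python 3 syntax
(str is the Unicode type there). The sample text and the two target
encodings are assumptions for illustration, not a claim about what your
display actually supports.

```python
# Assumed sample: correct Unicode text containing curly quotes.
text = "na\u00efve \u201cquoted\u201d text"

# Strict encoding to latin-1 fails because curly quotes are not in latin-1.
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# errors="replace" puts "?" into the OUTPUT bytes only; text is untouched.
for_display = text.encode("latin-1", errors="replace")
print(for_display)

# An encoding that does cover these characters keeps every one of them.
lossless = text.encode("cp1252")
print(lossless)
```

The "?" characters appear only in the bytes produced for an encoding that
cannot represent the text. The Unicode string itself is still intact, so
picking a suitable target encoding is the fix.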

> I am just not understanding why python is thinking stuff is unicode
> and why it is failing on conversion. There is no way that a byte can
> not be between 0 and 255 right?

Right. Unless a character is actually two bytes, like UCS-2 (which IIRC
SQL Server uses extensively), and your database or other application is
trying to insulate the poor little overworked programmer (you) from that
fact. Or UTF-8, which can take up to four bytes per character. Etcetera.
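
As a small illustration that "one character" and "one byte" are different
things, the sketch below (Python 3 syntax) encodes a few sample characters
and counts the resulting bytes. The sample characters are arbitrary picks
of mine, and utf-16-le stands in here for a two-byte-unit encoding in the
UCS-2 family.

```python
# Arbitrary sample characters: A, e-acute, euro sign, musical G clef.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

# For each character, record (bytes in UTF-8, bytes in UTF-16).
sizes = []
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, no byte order mark
    sizes.append((len(utf8), len(utf16)))
    print(f"U+{ord(ch):04X}: utf-8 = {len(utf8)} bytes, utf-16 = {len(utf16)} bytes")
```

Every individual byte is still between 0 and 255; it is the number of
bytes per character that changes with the encoding.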

> This problem can be so haunting that I will
> start to wish I had coded the solution in VB where at least a string
> is a string is a string.

Or at least, a string is something which VB isn't going to elaborate on,
because some poor overworked programmers don't really *need* to
understand Unicode--we'll "help them out" by not mentioning it, and hope
it doesn't come back to bite them.

> Is there a way to modify Python so that all
> strings will always be single byte strings since we have no need for
> Unicode support?

unicode = str? >:) Just kidding.

Issue #4: You have a need for Unicode support. You just don't know it
yet. ;)

> Any solutions or suggestions to my biggest Python
> annoyance would be greatly appreciated.

From my point of view, it seems like a Unicode (and those who would
"protect" you from it) annoyance, not a Python one.

    "There's a set of rules that anything that was in the world
     when you were born is normal and natural. Anything invented
     between when you were 15 and 35 is new and revolutionary and
     exciting, and you'll probably get a career in it. Anything
     invented after you're 35 is against the natural order of things."

Douglas Adams

Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org