[Pythonmac-SIG] Unicode and split

Fri May 23 18:16:26 CEST 2008

Jeremy Reichman wrote:
> I have some characters in line strings in a file I'm processing that appear
> to be Unicode. (When I print them to the shell from my script, they are
> Asian characters for files like fonts in the Mac OS X filesystem.)
> 
> When I run a.split() on the affected line strings, they split on what I'm
> guessing is considered a Unicode whitespace character. Specifically, the
> culprit seems to be '\xe1':
> 
> $ python -c 'print "\xe1"'
> ?

Actually, u'\xe1' is a lowercase accented a: á (if the unicode comes 
through email OK), so I doubt that Python is splitting on that.
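
A quick way to check this for yourself, as a rough sketch (the sample 
string is made up, and I've written it so it runs on a current Python 3; 
the u prefix is kept only to match the examples in this message):

```python
import unicodedata

ch = u"\xe1"
name = unicodedata.name(ch)      # official Unicode name of the character
is_space = ch.isspace()          # would split() treat it as whitespace?
parts = u"caf\xe1 bar".split()   # made-up sample: splits only at the real space

print(name)
print(is_space)
print(parts)
```

If isspace() comes back False for the character, a unicode split() with no 
arguments will not break the string there.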

Also, when you do the above, you're creating a regular string, not a 
unicode object. If you do:

$ python -c 'print u"\xe1"'
á

You may get the right thing, if your terminal is set up right to 
display unicode.
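
To make the difference concrete, here is a small sketch (variable names 
are mine, and I'm assuming a UTF-8 terminal, which is a guess about your 
setup). The regular string holds a single byte 0xE1, while the unicode 
object holds the character U+00E1, which becomes two bytes in UTF-8:

```python
raw = b"\xe1"                   # one byte: what "\xe1" is in a regular (byte) string
text = u"\xe1"                  # one Unicode character, U+00E1
encoded = text.encode("utf8")   # the bytes that represent that character in UTF-8

# A lone 0xE1 byte is not a complete UTF-8 sequence, so decoding it fails.
try:
    raw.decode("utf8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(repr(encoded))
print(lone_byte_is_valid_utf8)
```

That would be consistent with the "?" you saw: the terminal was handed a 
byte that doesn't make sense on its own in UTF-8.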

I suspect your problem is that you aren't decoding the input file 
correctly. The whole problem with unicode (and indeed, any non-ASCII 
encoding) is that you need to know what encoding your data is in, in order 
to use it. If it looks mostly OK when interpreted as ASCII, then it 
MIGHT be utf8, so try reading in your file and decoding it this way:

contents = myfile.read().decode('utf8')

Then do your splitting. If it's not utf8, then you'll need to figure out 
what it is.
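
Putting that together, a minimal sketch of "decode first, then split". 
The helper name read_text, the candidate list, and the sample file name 
are all made up for illustration. Note that a clean decode only means 
the bytes are valid in that encoding, not that the guess is right, so 
treat the result as a hint:

```python
import os
import tempfile

def read_text(path, encodings=("utf8", "utf16")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:      # read raw bytes, decode explicitly
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked: %r" % (encodings,))

# Demo with a made-up line containing two Japanese characters, stored as UTF-8.
sample = u"Hiragino \u89d2\u30b4 Pro.otf\n"
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
    tmp.write(sample.encode("utf8"))
    tmp.close()
    text, enc = read_text(tmp.name)
    words = text.split()             # split the decoded text, not the raw bytes
finally:
    os.remove(tmp.name)

print(enc)
print(words)
```

The point is only that the split happens on the decoded text, so 
multi-byte characters stay in one piece.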

First, read this:
http://www.joelonsoftware.com/articles/Unicode.html

Then take a look at some of the Python unicode tutorials; this is only 
one of them:

http://www.reportlab.com/i18n/python_unicode_tutorial.html

There are other good ones.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov