[Pythonmac-SIG] Unicode and split
Christopher Barker
Chris.Barker at noaa.gov
Fri May 23 18:16:26 CEST 2008
Jeremy Reichman wrote:
> I have some characters in line strings in a file I'm processing that appear
> to be Unicode. (When I print them to the shell from my script, they are
> Asian characters for files like fonts in the Mac OS X filesystem.)
>
> When I run a.split() on the affected line strings, they split on what I'm
> guessing is considered a Unicode whitespace character. Specifically, the
> culprit seems to be '\xe1':
>
> $ python -c 'print "\xe1"'
> ?
Actually, u'\xe1' is a lower case accented a: á (if the unicode comes
through email OK), so I doubt that Python is splitting on that.
Also, when you do the above, you're creating a regular string, not a
unicode object. If you do:
$ python -c 'print u"\xe1"'
á
You may get the right thing, if your terminal is set up right to
display unicode.
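To make the difference between the character and its bytes concrete, here
is a small sketch. It assumes Python 3 (where the u"..." prefix is still
accepted), not the Python 2 one-liners above, and the variable names are
made up for illustration:

```python
import unicodedata

text = u"\xe1"                        # one code point, U+00E1

name = unicodedata.name(text)         # the character's Unicode name
as_utf8 = text.encode("utf-8")        # the same character as UTF-8 bytes
as_latin1 = text.encode("latin-1")    # the same character as Latin-1 bytes
is_space = text.isspace()             # does Unicode treat it as whitespace?

print(name, as_utf8, as_latin1, is_space)
```

The point is that the single byte 0xE1 is this character only in an
encoding like Latin-1; in UTF-8 the character takes two bytes, and in
neither case is it whitespace.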
I suspect your problem is that you aren't decoding the input file
correctly. The whole problem with unicode (and indeed, any non-ascii
encoding), is that you need to know what encoding your data is, in order
to use it. If it looks mostly OK when interpreted as ASCII, then it
MIGHT be utf8, so try reading in your file and decoding it this way:
contents = myfile.read().decode('utf8')
Then do your splitting. If it's not utf8, then you'll need to figure out
what it is.
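A minimal sketch of "decode first, then split", assuming the data really
is utf8. It is written for Python 3, the helper name split_utf8 is
hypothetical, and the sample bytes stand in for whatever myfile.read()
returns when the file is opened in binary mode:

```python
def split_utf8(raw_bytes):
    """Decode bytes as UTF-8, then split the text on whitespace."""
    text = raw_bytes.decode("utf-8")   # raises UnicodeDecodeError if not UTF-8
    return text.split()

# Stand-in for the file contents: ASCII, two CJK characters, and an accented a.
sample = u"Osaka \u5927\u962a caf\xe1\n".encode("utf-8")
fields = split_utf8(sample)
print(fields)

# A lone 0xE1 byte is not valid UTF-8, so the wrong guess fails loudly.
try:
    split_utf8(b"\xe1")
    wrong_guess_detected = False
except UnicodeDecodeError:
    wrong_guess_detected = True
```

If the decode raises UnicodeDecodeError, that is a strong hint the file is
in some other encoding, and guessing a different one is the next step.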
First, read this:
http://www.joelonsoftware.com/articles/Unicode.html
Then take a look at some of the Python unicode tutorials; this is only
one of them:
http://www.reportlab.com/i18n/python_unicode_tutorial.html
There are other good ones.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov