Ascii to Unicode.

Mark Tolonen metolone+gmane at gmail.com
Thu Jul 29 20:43:07 EDT 2010


"Joe Goldthwaite" <joe at goldthwaites.com> wrote in message 
news:5A04846ED83745A8A99A944793792810 at NewMBP...
> Hi Steven,
>
> I read through the article you referenced.  I understand Unicode better
> now.  I wasn't completely ignorant of the subject.  My confusion is more
> about how Python is handling Unicode than Unicode itself.  I guess I'm
> fighting my own misconceptions.  I do that a lot.  It's hard for me to
> understand how things work when they don't function the way I *think*
> they should.
>
> Here's the main source of my confusion.  In my original sample, I had
> read a line in from the file and used the unicode function to create a
> unicodestring object;
>
> unicodestring = unicode(line, 'latin1')
>
> What I thought this step would do is translate the line to an internal
> Unicode representation.

Correct.

> The problem character \xe1 would have been translated into a correct
> Unicode representation for the accented "a" character.

Which just so happens to be u'\xe1', which probably adds to your confusion 
later :^)  The first 256 Unicode code points map to latin1.
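
For example, at the interactive prompt (Python 2):

    >>> unicodestring = unicode('\xe1', 'latin1')
    >>> unicodestring
    u'\xe1'

The repr looks like the original byte string, but it is a unicode object, 
not bytes.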

>
> Next I tried to write the unicodestring object to a file thusly;
>
> output.write(unicodestring)
>
> I would have expected the write function to request the byte string from
> the unicodestring object and simply write that byte string to a file.  I
> thought that at this point, I should have had a valid Unicode latin1
> encoded file.  Instead I get an error that the character \xe1 is invalid.

Incorrect.  The unicodestring object doesn't save the original byte string, 
so there is nothing to "request".
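
What actually happens when you hand a unicode object to a plain file's 
write() is that Python implicitly encodes it with the default codec 
(ASCII), which can't represent \xe1.  Roughly:

    >>> f = open('output.txt', 'w')
    >>> f.write(u'\xe1')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

That's the "character \xe1 is invalid" error you hit.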

> The fact that the \xe1 character is still in the unicodestring object
> tells me it wasn't translated into whatever python uses for its internal
> Unicode representation.  Either that or the unicodestring object returns
> the original string when it's asked for a byte stream representation.

Both incorrect.  As I mentioned earlier, the first 256 Unicode code points 
map to latin1.  It *was* translated to a Unicode code point whose value 
(but not internal representation!) is the same as the latin1 byte.
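
For example:

    >>> u = unicode('\xe1', 'latin1')
    >>> ord(u)              # the code point's value equals the latin1 byte 0xE1
    225
    >>> u.encode('latin1')  # encoding back to latin1 gives that byte again
    '\xe1'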

> Instead of just writing the unicodestring object, I had to do this;
>
> output.write(unicodestring.encode('utf-8'))

This is exactly what you need to do...explicitly encode the Unicode string 
into a byte string.
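
For the example character, the explicit encode is what finally produces 
bytes for the file:

    >>> u'\xe1'.encode('utf-8')   # one code point, two UTF-8 bytes
    '\xc3\xa1'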

> This is doing what I thought the other steps were doing.  It's
> translating the internal unicodestring byte representation to utf-8 and
> writing it out.  It still seems strange and I'm still not completely
> clear as to what is going on at the byte stream level for each of these
> steps.

I'm surprised that by now no one has mentioned the codecs module.  You 
originally stated you are using Python 2.4.4, which I looked up and which 
does support the codecs module.

    import codecs

    infile = codecs.open('ascii.csv', 'r', 'latin1')    # decode latin1 on read
    outfile = codecs.open('unicode.csv', 'w', 'utf-8')  # encode utf-8 on write
    for line in infile:
        outfile.write(line)
    infile.close()
    outfile.close()

As you can see, codecs.open takes a parameter for the encoding of the file. 
Lines read are automatically decoded into Unicode; Unicode lines written are 
automatically encoded into a byte stream.
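
It's roughly equivalent to doing the decode/encode yourself on ordinary 
file objects, just less error-prone:

    infile = open('ascii.csv', 'rb')
    outfile = open('unicode.csv', 'wb')
    for line in infile:
        outfile.write(line.decode('latin1').encode('utf-8'))
    infile.close()
    outfile.close()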

-Mark




