[Chicago] understanding unicode problems

Fri Nov 16 17:50:13 CET 2007

Carl,

I'm afraid the only way to get deeper into this is to back up and soak
in the fundamentals.  This is a great article to start with:
http://www.joelonsoftware.com/articles/Unicode.html ... and Pete
posted some good links: http://del.icio.us/pfein/unicode

If you want the answers to your questions below, skip right to the
Unicode section in that first article, though it is worth reading in
full.  After that, the "real world" is the best way to learn things (for
me anyway), so if you post some more code (off the list is fine too),
perhaps I can post back some solutions in code.

On Nov 16, 2007 10:36 AM, Carl Karsten <carl at personnelware.com> wrote:
> Kumar McMillan wrote:
> > On Nov 16, 2007 9:07 AM, Carl Karsten <carl at personnelware.com> wrote:
> >> Kumar McMillan wrote:
> >>> I wrote up a little something about it when it finally clicked for me:
> >>> http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/
> >>> (I was in the same spot, I knew I *should* use UTF-8 but wasn't sure
> >>> how or why or what that even implied)
> >> "However, it's not always possible to work with unicode all the time because not
> >> everything supports it. As just one example, you'll need to create a wrapper
> >> that temporarily encodes / decodes data when reading a csv file using the
> >> standard csv module."
> >>
> >> Is there a standard way of encoding?
> >
> > I suppose the standard way is to find all the boundaries of your
> > application (where you accept strings from files or user input) and
> > convert it all to unicode then deal with it everywhere internally as
> > unicode.  Whenever you need to send output to stdout, a file,
> > whatever, then you encode it.
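
To make that boundary idea a bit more concrete, here is a minimal sketch.
It is written for Python 3, where str is the unicode type and bytes is
the byte string (in Python 2 those are unicode and str).  The function
names, the io.BytesIO stand-ins for real files, and the choice of UTF-8
are all just assumptions for illustration:

```python
import io


def read_text(raw_stream, encoding="utf-8"):
    """Input boundary: decode raw bytes into unicode text, once."""
    return raw_stream.read().decode(encoding)


def shout(text):
    """Internal logic: only ever sees unicode text, never bytes."""
    return text.upper()


def write_text(raw_stream, text, encoding="utf-8"):
    """Output boundary: encode unicode text back into bytes, once."""
    raw_stream.write(text.encode(encoding))


# Stand-ins for files opened in binary mode (hypothetical sample data).
incoming = io.BytesIO("caf\u00e9".encode("utf-8"))
outgoing = io.BytesIO()

write_text(outgoing, shout(read_text(incoming)))
result_bytes = outgoing.getvalue()
```

The point is only that the encoding name shows up at the two edges and
nowhere in between.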
> >
> >> A string (unicode or not) is a bunch of bytes.  unicode chars may use more than
> >> one byte.
> >
> > unicode is actually represented internally as "code points;" it's not
> > stored in bytes while it's "unicode."
>
> Um, what's a "code point"?  and what are you calling "bytes", cuz in my
> vocabulary, everything is stored as a set of bytes, those 8 bit things that the
> CPU reads and writes to ram and disk drives.
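
A code point is just the number Unicode assigns to a character.  You are
right that in RAM everything ends up as bytes; the point is that a
unicode object hides its in-memory layout, and you only get a definite
sequence of bytes once you pick an encoding.  A small sketch, written
for Python 3 (str there is what Python 2 calls unicode); the variable
names are mine:

```python
text = "\u00e9"  # one character: LATIN SMALL LETTER E WITH ACUTE

# The code point is a plain integer, independent of any encoding.
code_point = ord(text)

# The same code point becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
utf16_bytes = text.encode("utf-16-be")

# Length of the text counts characters; length of bytes counts bytes.
char_count = len(text)
utf8_byte_count = len(utf8_bytes)
```

So "how many bytes is this character" has no answer until you say which
encoding you mean.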
>
> >
> >> What I don't understand:  Why do I need to encode / decode?
> >
> > Because you can't write unicode to a file, for example.  A file
> > contains bytes and unicode has arbitrary byte representations.  When
> > you encode unicode as UTF-8 the bytestring will look different than if
> > you encode it as LATIN-1.  The reason this is so confusing is that
> > Python will **try** to do the encoding/decoding for you automatically.
> >  This is also why the errors you see are often very confusing (if you
> > don't know Python is doing this under the hood).
> >
>
> This will make more sense once I get a grip on what a byte is.
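
Here is a rough sketch of why the encode/decode step matters and where
the confusing errors come from.  It is Python 3 syntax; Python 3 refuses
to mix text and bytes (TypeError), while Python 2 quietly tries an ASCII
decode for you, which is what the explicit decode("ascii") below
imitates.  The sample string is made up:

```python
text = "caf\u00e9"

# Same text, two encodings, two different byte strings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding with the wrong codec does not fail here; it silently
# produces the wrong characters.
garbled = utf8_bytes.decode("latin-1")

# What Python 2 attempts behind the scenes when a byte string meets
# unicode: decode as ASCII, which fails on any byte above 0x7F.
try:
    utf8_bytes.decode("ascii")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
```

If you see a UnicodeDecodeError mentioning the 'ascii' codec on a line
where you never called decode yourself, that implicit conversion is
usually the cause.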
>
>
> >>  I get
> >> the feeling the error caused is a reminder "so that you know that you need to do
> >> the other operation later."
> >
> > if you post a little bit more of the error I can try and give some
> > specific suggestions for solving it.  I wasn't clear exactly what code
> > was raising the exception you posted earlier.
>
> code that errored wasn't mine - it was Paul's, and I think he fixed it.  I am
> back to helping flesh out your unicode talk :)
>
>
> Carl K