Why does the "".join(r) do this?

Thu May 20 21:04:37 EDT 2004

"Jim Hefferon" <jhefferon at smcvt.edu> wrote in message
news:545cb8c2.0405201645.16ac3364 at posting.google.com...
> Peter Otten <__peter__ at web.de> wrote
> > So why doesn't it just concatenate? Because there is no way of knowing how
> > to properly decode chr(174) or any other non-ascii character to unicode:
> >
> > >>> chr(174).decode("latin1")
> >  u'\xae'
> > >>> chr(174).decode("latin2")
> >  u'\u017d'
> > >>>
>
> Forgive me, Peter, but you've only rephrased my question: I'm going to
> decode them later, so why does the concatenator insist on decoding
> them now?  As I understand it (perhaps this is my error),
> encoding/decoding is stuff that you do external to manipulating the
> arrays of characters.

Maybe I can simplify it? The result has to be a single type, which
will be unicode if any of the strings is a unicode string. 7-bit ASCII
is a proper subset of Unicode, so there is no difficulty with the
concatenation. The 8-bit encodings disagree about what the bytes above
127 mean, so the concatenation checks that any normal strings are, in
fact, 7-bit ASCII. The implicit ASCII decoding is really doing a
validity check, not an encoding conversion.
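
A minimal sketch of that validity check, for illustration only. It spells
out by hand the ASCII decode that the join performs implicitly; the helper
name implicit_decode is made up here, and the byte strings are written with
a b prefix so the snippet should also run on later Python versions:

```python
def implicit_decode(raw):
    """Mimic the implicit step: accept 7-bit ASCII bytes only."""
    return raw.decode("ascii")

plain = b"report"     # pure 7-bit ASCII
odd = b"caf\xae"      # contains byte 174, outside 7-bit ASCII

ok = implicit_decode(plain)   # decodes without complaint

try:
    implicit_decode(odd)
    failed = False
except UnicodeDecodeError:
    # The codec refuses to guess what byte 174 means.
    failed = True
```

The first call succeeds because every byte is below 128; the second raises
UnicodeDecodeError, which is the same kind of error the mixed join reports.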

The only way the system could do a clean concatenation between
unicode and one of the 8-bit encodings is to know beforehand which
of the 8-bit encodings it is dealing with, and there is no way that it
currently has of knowing that.

The people who implemented unicode (in 2.0, I believe) seem to
have decided not to guess. That's in line with the "explicit is better
than implicit" principle.

> > Use either unicode or str, but don't mix them. That should keep you out of
> > trouble.
>
> Well, I got this string as the filename of some kind of Macintosh file
> (I'm on Linux but I'm working with an archive that contains some pre-X
> Mac stuff) while calling some os and os.path functions.  So I'm taking
> strings from a Python library function (and using % to stuff them into
> strings that will end up on the web, which should preserve
> unicode-type-ness, right?) and then .join-ing them.

Ah. The issue then is rather simple: what is the encoding of the normal
strings? I'd presume Latin-1. So simply run the list of strings through a
function that converts any normal string to unicode using the Latin-1
codec, and then they should concatenate fine.
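
Something along these lines, as a rough sketch. The function name
to_unicode and the sample list are invented for the example, and Latin-1
is only my presumption about the data, so swap in the real codec if it
turns out to be something else:

```python
def to_unicode(s, encoding="latin-1"):
    """Decode byte strings with the given codec; pass unicode through."""
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s

# A mixed list: unicode strings plus one 8-bit string with byte 0xE9.
parts = [u"index of ", b"Caf\xe9", u".html"]

# Normalize everything to unicode first, then join.
joined = u"".join([to_unicode(p) for p in parts])
```

Once every element is unicode, the join has nothing left to guess about.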

As far as the web goes, I'd suggest you make sure you specify UTF-8
in both the HTTP headers and in a <meta> tag in the HTML header,
and make sure that what you write out is, indeed, UTF-8.
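
A small sketch of what I mean, assuming you build the page as one unicode
string and encode it once at the very end. The render_page function and
the CHARSET constant are hypothetical names, and real code would also
need to HTML-escape the title:

```python
CHARSET = "utf-8"

def render_page(title):
    """Return (HTTP header line, UTF-8 encoded body) for a tiny page."""
    html = (
        u"<html><head>"
        u'<meta http-equiv="Content-Type" content="text/html; charset=%s">'
        u"<title>%s</title></head><body>%s</body></html>"
    ) % (CHARSET, title, title)
    header = "Content-Type: text/html; charset=%s" % CHARSET
    # Encode exactly once, at the point where the text leaves the program.
    return header, html.encode(CHARSET)

header, body = render_page(u"Caf\xe9")
```

The header and the <meta> tag name the same charset, and the bytes written
out really are in that charset.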

John Roth

>
> I didn't go into the whole story when posting, because I tried to boil
> the question down.  Perhaps I should have.
>
> Thanks; I am often struck by how helpful this group is,
> Jim