choosing a default text-encoding in Python programs (was: To unicode or not to unicode)

Sun Feb 22 20:16:47 EST 2009

On Feb 23, 11:46 am, Joshua Judson Rosen <roz... at geekspace.com> wrote:
> Denis Kasak <denis.ka... at gmail.com> writes:
>
> > > > > Python "assumes" ASCII and if the decoded/encoded text doesn't
> > > > > fit that encoding it refuses to guess.
>
> > > > Which is reasonable given that Python is a programming language where it's
> > > > better to have a more conservative assumption about encodings so errors
> > > can be more quickly diagnosed.  A newsreader however is a different
> > > beast, where it's better to make a less conservative assumption that's
> > > more likely to display messages correctly to the user.  Assuming ISO
> > > > 8859-1 in the absence of any specified encoding allows the message to be
> > > correctly displayed if the character set is either ISO 8859-1 or ASCII.
> > > Doing things the "pythonic" way and assuming ASCII only allows such
> > > messages to be displayed if ASCII is used.
>
> > > Reading this paragraph, I've begun thinking that we've misunderstood
> > each other. I agree that assuming ISO 8859-1 in the absence of
> > specification is a better guess than most (since it's more likely to
> > display the message correctly).
>
> So, yeah--back on the subject of programming in Python and supporting
> character sets beyond ASCII:
>
> If you have to make an assumption, I'd really think that it'd be
> better to use whatever the host OS's default is, if the host OS has
> such a thing--using an assumption of ISO 8859-1 works only in select
> regions on unix systems, and may fail even in those select regions on
> Windows, Mac OS, and other systems; without the OS considerations,
> just the regional constraints are likely to make an ISO-8859-1
> assumption result in /incorrect/ results anywhere eastward of central
> Europe. Is a user in Russia (or China, or Japan) *really* most likely
> to be using ISO 8859-1?
>
> As a point of reference, here's what's in the man-pages that I have
> installed (note the /complete/ and conspicuous lack of references to
> even some notable eastern languages or character-sets, such as Chinese
> and Japanese, in the /entire/ ISO-8859 spectrum):

1. As a point of reference for what?
2. The ISO 8859 character sets were deliberately restricted to scripts
that would fit in 8 bits. So Chinese, Japanese, Korean and Vietnamese
aren't included. Note that Chinese and Japanese already each had
*multiple* legacy (i.e. non-Unicode) character sets ... they (and the
rest of the world) don't want/need yet another character set for each
language and never did want/need one.
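
For anyone who wants to poke at the trade-offs discussed in this thread,
here is a rough sketch, not anything taken from the posts above. It
assumes Python 3 syntax (the thread itself predates that), and the sample
strings and variable names are made up for illustration. It shows ASCII
refusing to guess, ISO 8859-1 accepting any byte (and therefore silently
misreading KOI8-R text), the host default as reported by the locale
module, and the fact that none of the ISO 8859 codecs shipped with Python
can encode a Chinese character.

```python
import locale

# Byte 0xE9 is "e with acute" in ISO 8859-1 but is not valid ASCII.
raw = b"caf\xe9"

# ASCII refuses to guess: any byte >= 0x80 raises UnicodeDecodeError.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ISO 8859-1 maps every possible byte to a character, so decoding
# never fails...
latin1_text = raw.decode("iso8859_1")

# ...which also means it silently produces the wrong text when the
# bytes were really written in another encoding, e.g. KOI8-R.
russian = "\u043f\u0440\u0438\u0432\u0435\u0442"   # "privet" in Cyrillic
koi8_bytes = russian.encode("koi8_r")
misread = koi8_bytes.decode("iso8859_1")           # no error, wrong text

# What the host environment reports as its preferred encoding.
# The value depends entirely on the machine this runs on.
host_default = locale.getpreferredencoding(False)

# The ISO 8859 codecs bundled with Python (there is no part 12).
iso_parts = ["iso8859_%d" % n
             for n in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)]

# Collect any part that can encode a Chinese character (U+4E2D).
cjk_encodable = []
for name in iso_parts:
    try:
        "\u4e2d".encode(name)
        cjk_encodable.append(name)
    except UnicodeEncodeError:
        pass

print("ASCII decode succeeded:", ascii_ok)
print("ISO 8859-1 decode:", ascii(latin1_text))
print("KOI8-R read as ISO 8859-1:", ascii(misread))
print("Host preferred encoding:", host_default)
print("ISO 8859 parts that can encode U+4E2D:", cjk_encodable)
```

The misread KOI8-R case is the one that matters for the "just assume
ISO 8859-1" argument: nothing fails, so nothing tells you the guess was
wrong.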