[Python-Dev] repr vs. str and locales again

Peter Funk pf@artcom-gmbh.de
Sun, 21 May 2000 17:54:06 +0200 (MEST)


Hi!

Ka-Ping Yee:
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the
> > string_print API which is used for the tp_print slot,
> 
> Very sorry!  I didn't actually look to see where the patch
> was being applied.
> 
> But then how can this have any effect on squishdot's indexing?

Sigh.  Let me explain this in some detail.

What do you see here: äöüÄÖÜß?  If all went well, you should
see some umlauts, which occur quite often in German words like
"Begrüssung", "ätzend" or "Grützkacke" and so on.

During the late 80s we here in Germany spent a lot of our free time
patching open source tools like 'elm', 'B-News', 'less' and others
to make them "8-bit clean", for example on ancient Unices like
SCO Xenix, where the implementations of C library functions like
'isprint' and 'islower' were out of reach.

After several years everybody seemed to agree on ISO-8859-1 as the new
European standard character set, which was also often loosely called
8-bit ASCII, because ASCII is a proper subset of ISO Latin-1.  Even
Windows, at least in its German versions, used ISO-8859-1.

As the WWW began to gain popularity, nobody in their right mind really
used these splendid ASCII escapes, for example '&auml;' instead
of 'ä'.  The same holds true for the TeX user community, where everybody
was happy to type real umlauts instead of the ugly backslash escape
sequences used before: \"a\"o\"u ...

To make a long story short: a lot of effort has been spent to make *ALL*
programs 8-bit clean, that is, to move the bytes through without
translating them from or into a bunch of incompatible multi-byte
sequences, which nobody can read or even wants to look at.
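
A minimal sketch of what "8-bit clean" means in practice.  It is written
in present-day Python 3, not in anything that was available back then,
and both function names are made up for illustration.  The first one
mimics a 7-bit tool that clears the high bit of every byte; the second
simply moves the bytes through.

```python
def strip_high_bit(data: bytes) -> bytes:
    # What a tool that is not 8-bit clean effectively does:
    # bit 7 of every byte is cleared, so 0xE4 ('ä') becomes 0x64 ('d').
    return bytes(b & 0x7F for b in data)


def pass_through(data: bytes) -> bytes:
    # An 8-bit clean tool hands every byte on untouched.
    return bytes(data)


original = "ätzend".encode("latin-1")   # b'\xe4tzend' in ISO-8859-1

mangled = strip_high_bit(original)      # b'dtzend'
intact = pass_through(original)         # b'\xe4tzend'
```

After the first function the word reads "dtzend"; only the second leaves
something that still decodes to "ätzend" as ISO-8859-1.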

Now to get back to your question:  There are several nice HTML indexing
engines out there.  I personally use HTDig.  At least on Linux these
programs deal fine with HTML files containing 8-bit chars.

But if for some reason umlauts end up as octal escapes ('\344' instead of 'ä')
due to the use of a Python 'print some_tuple' during the creation of HTML
files, a search engine will be unable to find the words containing those
escaped umlauts.
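
The effect can be reproduced in a few lines.  This is only a sketch in
present-day Python 3, where the 8-bit string has to be modelled as a
bytes object and the escape shows up as hex ('\xe4') rather than octal
('\344').  The tuple contents and the found() helper are made up for
illustration, with a plain substring test standing in for a real
indexing engine.

```python
word = "ätzend".encode("latin-1")      # 8-bit string, b'\xe4tzend'
row = (word, 1)                        # hypothetical tuple of page data

# Formatting the whole tuple uses repr() on each element,
# so the umlaut comes out as a backslash escape.
as_tuple = str(row)                    # "(b'\\xe4tzend', 1)"

# Formatting the element itself keeps the character.
as_text = word.decode("latin-1")       # 'ätzend'

page_bad = "<p>%s</p>" % as_tuple
page_good = "<p>%s</p>" % as_text


def found(page: str, term: str) -> bool:
    # Crude stand-in for an indexing engine: plain substring search.
    return term in page


hit_bad = found(page_bad, "ätzend")    # False, the page holds '\xe4tzend'
hit_good = found(page_good, "ätzend")  # True
```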

Best regards, Peter
P.S.: Hope you didn't find my explanation boring or off-topic.