string processing question

Fri May 1 19:58:50 EDT 2009

Piet van Oostrum wrote:
>>>>>> Kurt Mueller <mu at problemlos.ch> (KM) wrote:
> 
>> KM> But from the command line python interprets the code
>> KM> as 'latin_1' I presume. That is why I have to convert
>> KM> the "ä" with unicode().
>> KM> Am I right?
> 
> There are a couple of stages:
> 1. Your terminal emulator interprets your keystrokes, encodes them in a
>    sequence of bytes and passes them to the shell. How the characters
>    are encoded depends on the encoding used in the terminal emulator. So
>    for example when the terminal is set to utf-8, your "ä" is converted
>    to two bytes: \xc3 and \xa4.
> 2. The shell passes these bytes to the python command. 
> 3. The python interpreter must interpret these bytes with some decoding.
>    If you use them in a bytes string they are copied as such, so in the
>    example above the string "ä" will consist of the 2 bytes '\xc3\xa4'.
>    If your terminal encoding would have been iso-8859-1, the string
>    would have had a single byte '\xe4'. If you use it in a unicode
>    string the Python parser has to convert it to unicode. If there is an
>    encoding declaration in the source then that is used. Of course it
>    should be the same as the actual encoding used by the shell (or the
>    editor when you have a script saved in a file) otherwise you have a
>    problem. If there is no encoding declaration in the source Python has
>    to guess. It appears that in Python 2.x the default is iso-8859-1 but
>    in Python 3.x it will be utf-8. You should avoid making any
>    assumptions about this default.
> 4. During runtime unicode characters that have to be printed, written to
>    a file, passed as file names or arguments to other processes etc.
>    have to be encoded again to a sequence of bytes. In this case Python
>    refuses to guess. Also you can't use the same encoding as in step 3,
>    because the program can run on a completely different system than
>    where it was compiled to byte code. So if the (unicode) string isn't
>    ASCII and no encoding is given you get an error. The encoding can be
>    given explicitly, or depending on the context, by sys.stdout.encoding,
>    sys.getdefaultencoding or PYTHONIOENCODING (from 2.6 on). 
> 
> Unfortunately there is no equivalent to PYTHONIOENCODING for the
> interpretation of the source text, it only works at run time.
> 
> Example:
> python -c 'print len(u"ä")'
> prints 2 on my system, because my terminal is utf-8 so the ä is passed
> as 2 bytes (\xc3\xa4), but these are interpreted by Python 2.6.2 as two
> iso-8859-1 bytes.
> 
> If I do 
> python -c 'print u"ä"' in my terminal I therefore get two characters: Ã¤
> but if I do this in Emacs I get:
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 0-1: ordinal not in range(128)
> because my Emacs doesn't pass the encoding of its terminal emulation.
> 
> However:
> python -c '# -*- coding:utf-8 -*-
> print len(u"ä")'
> will correctly print 1.
===============================
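
For the cheat sheet, here is a rough sketch of the stages Piet describes,
as I understand them. It is written in Python 3 syntax (not the 2.x used
in the thread), and the variable names are only for illustration.

```python
# Python 3 syntax; the thread itself uses Python 2.x.
text = "\u00e4"  # the character ä

# Stage 1: the terminal encodes the keystroke into bytes.
utf8_bytes = text.encode("utf-8")         # two bytes: \xc3 \xa4
latin1_bytes = text.encode("iso-8859-1")  # one byte: \xe4

# Stage 3: decoding with the matching encoding recovers one character.
assert utf8_bytes.decode("utf-8") == text

# Decoding the UTF-8 bytes as iso-8859-1 yields two characters instead,
# which is the len() == 2 case from the example above.
mojibake = utf8_bytes.decode("iso-8859-1")
print(len(mojibake))

# Stage 4: encoding to ASCII fails for non-ASCII text.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)
```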

Thank you. I knew there had to be something simpler than brute force.

I have missed seeing the explanations for:
     python -c '# -*- coding:utf-8 -*-
in the 2.5 docs. Where can I find these? (the python -c  is for config, 
I presume?)

By the way - the however: python...\nprint... snippet bombs in 2.5.2
1st bomb:  looking for closing '    #so I add one and remove one below
2nd bomb:  bad syntax               # I play awhile and join EMACS
3rd bomb:   Non-ASCII character '\xe4' in file....no encoding declared..

Python flatly states it's not ASCII and quits. Python print refuses to 
handle high bit set bytes in 2.5.2....
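
As far as I can tell, that third bomb is the case the coding line is
meant to cover. A small sketch of its effect, assuming Python 3 (not the
2.5.2 above) and using a made-up helper name, count_chars: the same two
bytes are compiled twice, once under each declared encoding.

```python
def count_chars(declared_encoding):
    """Compile a tiny source given as raw bytes and return len() of its literal."""
    # The string literal holds the two UTF-8 bytes for ä (\xc3 \xa4).
    source = (
        b"# -*- coding: " + declared_encoding.encode("ascii") + b" -*-\n"
        b'n = len("\xc3\xa4")\n'
    )
    namespace = {}
    # compile() honors the coding declaration when given bytes.
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["n"]


print(count_chars("utf-8"))       # declaration matches the bytes
print(count_chars("iso-8859-1"))  # declaration does not match the bytes
```

The declaration only helps when it matches the bytes actually sent, which
is Piet's point about the shell or editor encoding.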

The thank you is for pointing out how it works. I can use sed to fix for 
file listing purposes.  (Python won't like them, but a second pass thru 
sed can give me something python can use and the two names can go on a 
line on the cheat sheet.)

Barry, Kurt - do you understand using sed to change the incoming names?
Put the python in a box and use the Linux mc, ls, sed and echo routines 
to get the names into a form python can use while making the cheat sheet 
at the same time.  Substitutions like  a for ä  will generally be 
acceptable. Yes or No?  The cheat sheet can show the ä in the original 
name because the OS functions allow it. I have no doubt there will be 
some exceptions. :(
Once the names are "ASCII" you can get the python out & put it to work.
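
For comparison, the same a-for-ä substitution can be sketched inside
Python instead of sed. This is only an illustration, in Python 3 syntax,
and to_ascii is a name I made up; it assumes the names have already been
decoded to text correctly.

```python
import unicodedata


def to_ascii(name):
    """Replace accented letters with their base letter; drop the rest."""
    # NFKD splits "ä" into "a" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Anything still outside ASCII (the combining marks) is discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("M\u00e4dchen.txt"))
```

One of the exceptions: a character with no decomposition, such as ß, is
dropped outright rather than replaced.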

Just to head off the comments that it's not .... whatever

ls -1 | cheater.scr | python_program.py    IS PURE UNIX

Unix is designed for this. Files from different parts of the world?  If 
you can see the name as something besides ????? make a cheater for each 
'Page'.    mc /path/to/dir/of/choice
            ls -1 >dummy
            highlight dummy
            F3
            F4    and read the hex
takes me longer to type it in here than to do it. (leading spaces)  :)

Today: 20090430

Steve

ps. Piet - thanks for including the version specifics.  It makes a huge 
difference in expectations and allowances.