right adjusted strings containing umlauts

kurt.alfred.mueller at gmail.com
Wed Aug 28 07:17:51 EDT 2013


On Wednesday, August 28, 2013 12:23:12 PM UTC+2, Dave Angel wrote:
> On 28/8/2013 04:01, Kurt Mueller wrote:
> > Because I cannot switch to Python 3 for now my life is not so easy:-)
> > For some text manipulation tasks I need a template to split lines
> > from stdin into a list of strings the way shlex.split() does it.
> > The encoding of the input can vary.
> > For further processing in Python I need the list of strings to be in unicode.
> According to:
>    http://docs.python.org/2/library/shlex.html
> """Prior to Python 2.7.3, this module did not support Unicode
> input"""
> I take that to mean that if you upgrade to Python 2.7.3, 2.7.4, or
> 2.7.5, you'll have Unicode support.

I have Python 2.7.3

> Presumably that would mean you could decode the string before calling
> shlex.split().

Yes, see new template.py:
###############################################################
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# decode before shlex
# Muk 2013-08-28
# Python 2.7.3

from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True  # shlex: skip comments
bool_posx = True  # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
    print( 'enco_type=' + repr( enco_type ) )
    strg_unic = inpt_line.decode( enco_type )                       # decode the input line into unicode
    print( 'strg_unic=' + repr( strg_unic ) )                       # unicode input line
    try:
        strg_inpt = shlex.split( strg_unic, bool_cmnt, bool_posx, ) # check if shlex works on unicode
    except Exception as errr:                                       # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings

###############################################################

$ python -V
Python 2.7.3
$ echo -e "a b c d e\na Ö u 1 2" | template.py
inpt_line='a b c d e\n'
enco_type='ascii'
strg_unic=u'a b c d e\n'
strg_inpt=['a', 'b', 'c', 'd', 'e']
inpt_line='a \xc3\x96 u 1 2\n'
enco_type='utf-8'
strg_unic=u'a \xd6 u 1 2\n'
error=''ascii' codec can't encode character u'\xd6' in position 2: ordinal not in range(128)' on inpt_line='a Ö u 1 2'
$ echo -e "a b c d e\na Ö u 1 2" | recode utf8..latin9 | ./split_shlex_unicode.py 
inpt_line='a b c d e\n'
enco_type='ascii'
strg_unic=u'a b c d e\n'
strg_inpt=['a', 'b', 'c', 'd', 'e']
inpt_line='a \xd6 u 1 2\n'
enco_type='windows-1252'
strg_unic=u'a \xd6 u 1 2\n'
error=''ascii' codec can't encode character u'\xd6' in position 2: ordinal not in range(128)' on inpt_line='a � u 1 2'
$

As can be seen, with Python 2.7.3 shlex.split() accepts a unicode string only if it contains nothing but ASCII characters; any non-ASCII character triggers the 'ascii' codec error shown above.
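
One possible workaround, sketched under the assumption that the failure comes from shlex pushing the unicode string through an ASCII-only path: encode the decoded line to UTF-8, let shlex split the byte string, and decode each token back to unicode. The helper name split_unicode is made up for illustration and is not part of shlex. On Python 3, where shlex accepts text directly, it simply calls shlex.split().

```python
# -*- coding: utf-8 -*-
# Sketch only: split_unicode is a hypothetical helper, not part of shlex.
from __future__ import print_function
import shlex
import sys


def split_unicode(text, comments=True, posix=True):
    """Split a unicode string like shlex.split() and return unicode tokens."""
    if sys.version_info[0] >= 3:
        # Python 3: shlex works on text strings directly.
        return shlex.split(text, comments, posix)
    # Python 2: hand shlex a UTF-8 byte string, then decode every token.
    raw_tokens = shlex.split(text.encode('utf-8'), comments, posix)
    return [token.decode('utf-8') for token in raw_tokens]


# Same second input line as in the transcript above, already decoded.
print(repr(split_unicode(u'a \xd6 u 1 2\n')))
```

The round trip through UTF-8 should be safe for splitting because the bytes of a multi-byte UTF-8 sequence never coincide with ASCII whitespace, quotes, or the comment character. In template.py this would replace the direct shlex.split( strg_unic, ... ) call; the chardet detection and decode steps stay as they are.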

-- 
Kurt Müller


