split lines from stdin into a list of unicode strings

Dave Angel davea at davea.name
Wed Aug 28 07:13:36 EDT 2013


On 28/8/2013 04:32, Kurt Mueller wrote:

> This is a follow up to the Subject
> "right adjusted strings containing umlauts"

You started a new thread, with a new subject line.  So presumably we're
starting over with a clean slate.

>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.

Does that mean it'll vary from one run of the program to the next, or
it'll vary from one line to the next?  Your code below assumes the
latter.  That can greatly increase the unreliability of the already
dubious chardet algorithm.

> For further processing in Python I need the list of strings to be in unicode.
>
> Here is template.py:
>
> ##############################################################################################################
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
> # split lines from stdin into a list of unicode strings
> # Muk 2013-08-23
> # Python 2.7.3
>
> from __future__ import print_function
> import sys
> import shlex
> import chardet

Is this the one?
    https://pypi.python.org/pypi/chardet

>
> bool_cmnt = True  # shlex: skip comments
> bool_posx = True  # shlex: posix mode (strings in quotes)
>
> for inpt_line in sys.stdin:
>     print( 'inpt_line=' + repr( inpt_line ) )
>     enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
>     print( 'enco_type=' + repr( enco_type ) )
>     try:
>         strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode

But shlex does, since you're using Python 2.7.3.

>     except Exception, errr:                                         # usually 'No closing quotation'
>         print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
>         continue
>     print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
>     strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
>     print( 'strg_unic=' + repr( strg_unic ) )                       # list of unicode strings
> ##############################################################################################################
>
> $ cat <some-file> | template.py
>

Why not have a separate filter that converts from a (guessed) encoding
into utf-8, and have the later stage(s) assume utf-8?  That way, the
filter could be fed clues by the user, or replaced entirely, without
affecting the main code you're working on.
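
A minimal sketch of such a recoding filter (the function name and the
latin-1 fallback are my own choices, not anything from your template):

```python
# Rough sketch of a stdin -> utf-8 recoding filter.  guess_and_recode
# and the latin-1 fallback are illustrative choices.
import sys

def guess_and_recode(raw, fallback='latin-1'):
    # Use chardet if it is installed; otherwise fall back to a
    # single-byte encoding that can decode any byte sequence.
    try:
        import chardet
        enc = chardet.detect(raw)['encoding'] or fallback
    except ImportError:
        enc = fallback
    return raw.decode(enc, 'replace').encode('utf-8')

if __name__ == '__main__':
    # sys.stdin.buffer on Python 3, plain sys.stdin on Python 2.
    for raw_line in getattr(sys.stdin, 'buffer', sys.stdin):
        sys.stdout.write(guess_and_recode(raw_line).decode('utf-8'))
```

Run it once over the whole input (`cat some-file | recode.py | template.py`)
and the guessing happens in one replaceable place instead of per line.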

Alternatively, just add a command-line argument with the encoding, and
parse it into enco_type.
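
That variant could be as small as this (split_line and the --encoding
flag are names I made up for the sketch):

```python
# Sketch: take the input encoding from the command line instead of
# guessing it per line.  --encoding and split_line are made-up names.
import argparse
import shlex
import sys

def split_line(line, encoding='utf-8'):
    # Decode first, then split -- shlex then only ever sees text,
    # with comments skipped and POSIX quoting, as in the template.
    if isinstance(line, bytes):
        line = line.decode(encoding)
    return shlex.split(line, comments=True, posix=True)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--encoding', default='utf-8')
    args = parser.parse_args()
    for raw in getattr(sys.stdin, 'buffer', sys.stdin):
        print(split_line(raw, args.encoding))
```

No chardet, no per-line surprises: whoever runs the script tells it what
the input is.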


-- 
DaveA
