Why do my list go uni-code by itself?

Mon Dec 20 17:19:29 EST 2010

Martin Hvidberg wrote:
> I'm reading a fixed-format text file, line by line. The code is below;
> I have <snipped> the parts not related to reading the file.
> The only relevant detail left out is lstCutters, which looks like this:
> [[1, 9], [11, 21], [23, 48], [50, 59], [61, 96], [98, 123], [125, 150]]
> It specifies the first and last character position of each token in the
> fixed format of the input line.
> All this works fine, and is only to explain where I'm going.
>
> The code, in the function definition, is broken up in more lines than
> necessary, to be able to monitor the variables, step by step.
>
> --- Code start ------
>
> import codecs
>
> <snip>
>
> def CutLine2List(strIn,lstCut):
>     strIn = strIn.strip()
>     print '>InNextLine>',strIn
>     # skip if line is empty
>     if len(strIn)<1:
>         return False
>     lstIn = list()
>     for cc in lstCut:
>         strSubline =strIn[cc[0]-1:cc[1]-1].strip()
>         lstIn.append(strSubline)
>         print '>InSubline2>'+lstIn[len(lstIn)-1]+'<'
>     del strIn, lstCut,cc
>     print '>InReturLst>',lstIn
>     return lstIn
>
> <snip>
>
> filIn = codecs.open(
>     strFileNameIn,
>     mode='r',
>     encoding='utf-8',
>     errors='strict',
>     buffering=1)
> for linIn in filIn:
>     lstIn = CutLine2List(linIn,lstCutters)
>
> --- Code end ------
>
> A sample output, representing one line from the input file looks like this:
>
>  >InNextLine> I 30 2002-12-11 20:01:19.280 563 FANØ
> 2001-12-12-15.46.12.734502 2001-12-12-15.46.12.734502
>  >InSubline2>I<
>  >InSubline2>30<
>  >InSubline2>2002-12-11 20:01:19.280<
>  >InSubline2>563<
>  >InSubline2>FANØ<
>  >InSubline2>2001-12-12-15.46.12.73450<
>  >InSubline2>2001-12-12-15.46.12.73450<
>  >InReturLst> [u'I', u'30', u'2002-12-11 20:01:19.280', u'563',
> u'FAN\xd8', u'2001-12-12-15.46.12.73450', u'2001-12-12-15.46.12.73450']
>
>
> Question:
> In the last printout, tagged >InReturLst>, all entries turn into
> Unicode. What happens here?
> Look for the word 'FANØ'. This word changes from 'FANØ' to u'FAN\xd8'.
> That's a problem for me, and I don't want it to change like this.
>
> What do I do to stop this behavior?
>
> Best Regards
> Martin
>
>
If you don't want Unicode, why do you specify that the file is encoded
as utf-8? If the file is ASCII, just open it without using a utf-8
codec. Of course, then you'll have to fix the input file to make it ASCII.
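
A minimal sketch of what the codec does, assuming Python 3 (the original code is Python 2, where the same decoding yields a `unicode` object). The byte string below is made up to mirror the 'FANØ' token; it is not read from the actual input file.

```python
# Hypothetical sample: the UTF-8 bytes for "FAN" followed by U+00D8.
raw = b"FAN\xc3\x98"

# This is what a utf-8 codec hands back for those bytes: a Unicode string.
text = raw.decode("utf-8")
print(len(raw), len(text))    # 5 bytes on disk, 4 characters in memory

# Encoding again returns the original bytes, so nothing was altered.
assert text.encode("utf-8") == raw

# Plain ASCII cannot represent the last character at all.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("not ASCII:", exc.reason)
```

Decoding and re-encoding round-trips exactly, which is why the data itself is unchanged; only its in-memory type differs from a plain byte string.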

The character in the input file following the letters "FAN" is not a
zero (0030 in the Unicode table). It's a different character, 00D8,
which is LATIN CAPITAL LETTER O WITH STROKE.
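
One way to check this yourself, sketched for Python 3 with the standard `unicodedata` module. The string is written with an escape so the example does not depend on the encoding of the source file.

```python
import unicodedata

ch = "FAN\xd8"[-1]   # the character that follows "FAN"

# Code point and official Unicode name of that character.
print(hex(ord(ch)), unicodedata.name(ch))

# The digit zero, for comparison, is a different code point.
print(hex(ord("0")), unicodedata.name("0"))
```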

It didn't "change" in the InReturLst line. You were reading Unicode
strings from the file all along. When you print a Unicode string, it is
encoded in whatever encoding your console device specifies. But when you
print a list, it uses repr() on the elements, so you get to see what
their real type is.
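
A small sketch of the difference, assuming Python 3 and a made-up token list. Python 3's repr() no longer shows the u prefix or escapes printable characters, so ascii() is used here as the closest analogue of what Python 2's repr() displayed.

```python
tokens = ["I", "30", "FAN\xd8"]   # hypothetical tokens from one input line

# Converting the list itself uses repr() on every element (note the quotes).
as_list = str(tokens)

# Joining the elements works on the strings themselves, with no repr() involved.
as_text = " ".join(tokens)

# ascii() escapes non-ASCII characters, much like Python 2's repr() did.
escaped = ascii(tokens[2])
print(escaped)    # 'FAN\xd8'
```

So if the goal is readable output rather than a debugging view, print the joined string (or each element in turn) instead of the list object. In the Python 2 code above, that would mean printing something like `u' '.join(lstIn)` rather than `lstIn`.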

DaveA