[Tutor] why is unicode converted file double spaced?

Marc Tompkins marc.tompkins at gmail.com
Tue Apr 7 23:52:57 CEST 2009


On Tue, Apr 7, 2009 at 11:59 AM, Pirritano, Matthew <MPirritano at ochca.com> wrote:

> I did get an error…
>
> Traceback (most recent call last):
>   File "C:\Projects\unicode_convert.py", line 8, in <module>
>     outp.write(outLine.strip()+'\n')
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 640-641: ordinal not in range(128)
>
> Should I be worried about this? And where does this message indicate that
> the error is? And what is the error?

That error is telling you that your input file contains character(s) that
don't have a valid representation in ASCII (which is the AMERICAN Standard
Code for Information Interchange - no furrin languages need apply!)  I came
into this conversation late (for which I apologize), so I don't actually
know what your input file contains; the fact that it's encoded in UTF-16
indicates to me that its creators anticipated that there might be
non-English symbols in it.
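
Here's a minimal sketch of the same failure on a made-up string (sample is
hypothetical; I don't know what your file actually contains). It also shows
that the exception carries the offending positions with it, as indices into
the string being encoded:

```python
# Hypothetical sample text; the real file's contents are unknown.
sample = u"na\u00efve caf\u00e9"

try:
    # Encoding to ASCII fails at the first character above code point 127.
    sample.encode("ascii")
except UnicodeEncodeError as err:
    # err.start and err.end index into the string being encoded
    # (start is inclusive, end is exclusive).
    bad_start, bad_end = err.start, err.end
    bad_chars = sample[bad_start:bad_end]
    print("cannot encode %r at positions %d-%d"
          % (bad_chars, bad_start, bad_end - 1))
```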

Where is the error... well, it's at positions 640-641 of _some_ line.  Those
positions are offsets, counted from zero, into the string being written (the
line after strip() has been applied), not byte offsets into the UTF-16 file.
Unfortunately we don't know which line, 'cause we're not keeping track of
line numbers.  Something I often do in similar situations: add print
statements temporarily (printing to the screen is a significant performance
drag in the middle of a loop, so I comment out or delete the prints when
it's time for production.)

Two approaches: add an index variable and print just the number, or print
the data you're about to try to save to the file.  In your case, I'd go with
the index, 'cause your lines are apparently much longer than screen width.

index = 1
for outLine in inp:
    print(index)
    outp.write(outLine.rstrip() + '\n')
    index += 1

You'll have a long string of numbers, followed by the error/traceback.  The
last number before the error should be the number of the line that caused
the error; open your file in a decent text editor, go to that line, and look
around character 640.  The reported positions count from zero, so in an
editor that counts columns from one that's columns 641-642, provided strip()
didn't remove any leading whitespace.  That will show you which characters
caused your problem.
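
If you'd rather not scroll through a screenful of numbers, another option is
to catch the exception and report the line number along with the positions
it carries.  A rough sketch, not your actual script: first_unencodable and
demo_lines are names I made up for illustration, and demo_lines stands in
for the lines read from your input file.

```python
def first_unencodable(lines, encoding="ascii"):
    """Return (line_number, start, end) for the first line that cannot be
    encoded with the given encoding, or None if every line encodes cleanly.
    line_number counts from 1; start/end are zero-based string indices,
    with end exclusive."""
    for line_number, line in enumerate(lines, 1):
        try:
            line.encode(encoding)
        except UnicodeEncodeError as err:
            return line_number, err.start, err.end
    return None


# Made-up stand-in for the lines read from the input file.
demo_lines = [u"plain ascii\n", u"still fine\n", u"caf\u00e9 au lait\n"]
result = first_unencodable(demo_lines)
print(result)
```

In a real run you would pass the opened input file instead of demo_lines,
since a file object can be iterated line by line just like a list.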

What to do about it, though: the "codecs" module has several ways of dealing
with errors in encoding/decoding; it defaults to 'strict', which is what
you're seeing. Other choices include 'replace' (when encoding, replaces each
character that can't be represented with a question mark) and 'ignore'
(which just drops the offending character from the output.)

The traceback shows the failure happening in outp.write, i.e. while
encoding, so the error handler has to go on the file you're writing to
rather than the one you're reading from.  I don't know the name of your
output file, so out_path below is a placeholder for your own path.  Change
the line that opens your output file to:

outp = codecs.open(out_path, 'w', 'ascii', 'replace')
or
outp = codecs.open(out_path, 'w', 'ascii', 'ignore')

depending on your preference.
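
To see what the handlers actually do when encoding to ASCII, here's a small
sketch on a made-up string (sample is hypothetical).  It uses the string's
own encode() method, which accepts the same error-handler names as
codecs.open:

```python
# Hypothetical sample; e-acute and the euro sign have no ASCII representation.
sample = u"caf\u00e9 \u20ac5"

try:
    sample.encode("ascii")  # no handler given, so 'strict' applies
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'replace' substitutes a question mark for each unencodable character.
replaced = sample.encode("ascii", "replace")

# 'ignore' silently drops each unencodable character.
ignored = sample.encode("ascii", "ignore")

print(strict_ok)
print(replaced)
print(ignored)
```

With 'replace' the sample comes out as "caf? ?5", and with 'ignore' as
"caf 5".  Either way information is lost, so it's worth checking which
characters are affected before settling on one.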

Have fun -
-- 
www.fsrtechnologies.com