[Tutor] unicode utf-16 and readlines [using the 'codecs' unicode file reading module]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue Jan 7 18:34:17 2003


> Thanks for the info! Using the codecs module is much better. It's
> interesting to note, though, that when using the codecs module on a real
> utf-16 text file, Python's automatic handling of new line characters
> seems to break down. For example:
>
>   >>> import codecs
>   >>> fh = codecs.open('0022data2.txt', 'r', 'utf-16')
>   >>> a = fh.read()
>   >>> a
> u'\u51fa\r\n'
>   >>> print a
> ??
>
>
>   >>> a = a.strip()
>   >>> print a
> ?



Hi Poor Yorick!

I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() just returns the
entire contents of the file as a single string.

Can you explain more what you mean by automatic newline handling?  Do you
mean a conversion of '\r\n' to '\n'?
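
If that conversion is what you mean, here is a small sketch of the
difference as I understand it. It assumes a modern Python 3 interpreter,
and the sample text and temporary file are made up for illustration; they
are not your 0022data2.txt. codecs.open() works on the underlying file in
binary mode, so '\r\n' should come back untouched, while io.open() in text
mode translates line endings by default:

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample data: one CJK character plus a Windows line ending.
text = u"\u51fa\r\n"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write the raw UTF-16 bytes ourselves so nothing touches the line ending.
    with open(path, "wb") as f:
        f.write(text.encode("utf-16"))

    # codecs.open() reads the file in binary mode and only decodes it,
    # so the '\r\n' pair is returned exactly as stored.
    with codecs.open(path, "r", "utf-16") as fh:
        raw = fh.read()

    # io.open() in text mode applies universal-newline translation by
    # default, so '\r\n' is converted to '\n' after decoding.
    with io.open(path, "r", encoding="utf-16") as fh:
        translated = fh.read()
finally:
    os.remove(path)

print(repr(raw))
print(repr(translated))
```

So a '\r\n' showing up in the result of codecs.open(...).read() would be
expected behavior rather than a breakdown.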




>   >>> a
> u'\u51fa\r\n'
>   >>> print a
> ??
>
>
>   >>> a = a.strip()
>   >>> print a
> ?


I think strip() is working.  If we look at the repr() of the strings
instead of printing them, both the carriage return and the newline are
being removed:

###
>>> a = u"\u51fa\r\n"
>>> a
u'\u51fa\r\n'
>>> a.strip()
u'\u51fa'
###

I can't actually use print or str() to display the character itself on
my console, since u'\u51fa' isn't an ASCII character; looking at the
repr(), as above, is the reliable way to inspect the string.
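
To make that a bit more concrete, here is a rough sketch of what happens
when the stripped string is encoded in different ways. It assumes a modern
Python 3 interpreter, and the variable names are just for illustration.
The 'replace' case is only one plausible way a '?' can end up on a
console; I can't say for sure that it is what your terminal is doing:

```python
a = u"\u51fa\r\n"
stripped = a.strip()

# An escaped form is always safe to display, whatever the console supports.
escaped = stripped.encode("unicode_escape")

# A strict ASCII encode fails, because U+51FA has no ASCII equivalent.
try:
    stripped.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With errors="replace", the unencodable character becomes a question mark.
replaced = stripped.encode("ascii", "replace")

# Encoding to UTF-8 explicitly gives bytes that a UTF-8 terminal can show.
utf8 = stripped.encode("utf-8")

print(escaped, ascii_ok, replaced, utf8)
```

Encoding explicitly to whatever your console really understands is
usually the safer route than relying on the default.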


Best of wishes to you!