[Tutor] unicode utf-16 and readlines [using the 'codecs' unicode
file reading module]
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Tue Jan 7 18:34:17 2003
> Thanks for the info! Using the codecs module is much better. It's
> interesting to note, though, that when using the codecs module on a real
> utf-16 text file, Python's automatic handling of new line characters
> seems to break down. For example:
>
> >>> import codecs
> >>> fh = codecs.open('0022data2.txt', 'r', 'utf-16')
> >>> a = fh.read()
> >>> a
> u'\u51fa\r\n'
> >>> print a
> ??
>
>
> >>> a = a.strip()
> >>> print a
> ?
Hi Poor Yorick!
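I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() returns the entire
contents of the file as a single string. On top of that, codecs.open()
opens the underlying file in binary mode, so a '\r\n' in the file is not
translated to '\n' on the way in.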
Can you explain more what you mean by automatic newline handling? Do you
mean a conversion of '\r\n' to '\n'?
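
Here is a small sketch of what I'd expect read() to hand back. The
temporary file and the sample text are stand-ins I made up for
illustration (not your actual 0022data2.txt), and the manual replace()
is just one possible way to normalize line endings:

```python
import codecs
import os
import tempfile

# Hypothetical sample data: one CJK character plus a Windows line ending.
text = u'\u51fa\r\n'

# Write the sample out as UTF-16 (with a byte order mark) in binary mode.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'wb') as out:
    out.write(text.encode('utf-16'))

# codecs.open() reads the raw bytes and decodes them, so the '\r\n'
# pair comes back exactly as it was stored.
fh = codecs.open(path, 'r', 'utf-16')
try:
    data = fh.read()
finally:
    fh.close()
os.remove(path)

# If '\n'-only line endings are wanted, convert them by hand...
normalized = data.replace(u'\r\n', u'\n')

# ...or let splitlines() split on either convention and drop the endings.
lines = data.splitlines()
print(len(lines))
```

If that matches what you're seeing, then nothing is breaking down; the
'\r\n' is simply being passed through untouched.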
> >>> a
> u'\u51fa\r\n'
> >>> print a
> ??
>
>
> >>> a = a.strip()
> >>> print a
> ?
I think strip() is working. If we look at the repr() of the strings
instead of printing them, both the carriage return and the newline are
being removed:
###
>>> a = u"\u51fa\r\n"
>>> a
u'\u51fa\r\n'
>>> a.strip()
u'\u51fa'
###
I can't actually use print or str() to look at the Unicode characters on
my console, since that '\u51fa' character isn't ASCII.
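
One way around that, as a rough sketch: ask for an ASCII-only spelling
of the string before printing it. The variable names below are just for
illustration, and I'm assuming the 'unicode_escape' codec is acceptable
for debugging output:

```python
a = u'\u51fa\r\n'
stripped = a.strip()

# 'unicode_escape' spells out non-ASCII and control characters as
# backslash escapes, so the result is safe to print on any console.
escaped = a.encode('unicode_escape')
escaped_stripped = stripped.encode('unicode_escape')

# Listing the code points is another way to see exactly what is there.
code_points = [hex(ord(ch)) for ch in a]

print(escaped)
print(escaped_stripped)
print(code_points)
```

The escaped form of the stripped string should no longer contain the
'\r' or '\n' escapes, which confirms what strip() did without relying on
how the console renders the character.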
Best of wishes to you!