UTF16, BOM, and Windows Line endings

Mon Feb 6 17:34:15 EST 2006

Neil Hodgson wrote:
> Fuzzyman:
>
> > How should I handle line-endings for UTF16 ? Is it possible that other
> > programs (on windows) will have line endings as u'\r\n' ?
>
>     Yes, try Notepad and save as Unicode. For the text
>
> Fuzzy
> End of lines
>
>  >>> contents = open("C:\\fuzzy.txt", "rb").read()
>  >>> contents
> '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
>  >>>
>
>     The '\r\x00\n\x00' is a u'\r\n'.
>
> > When saving
> > files for that platform should I make the line endings u'\r\n' ? (This
> > sequence obviously encodes to four bytes in UTF16). I would only do
> > this to ensure compatibility with other programs the user may use to
> > create the text files.
>
>     Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
> applications are OK with other line ends but '\r\n' and u'\r\n' are
> safest on Windows.
>

Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.
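
A minimal sketch of the decode-then-split order, using the bytes from the
Notepad example quoted above. It is written for a current Python 3 (bytes
literals, str as Unicode), which is an assumption and not what this thread
was using; the variable names are only for illustration.

```python
import codecs

# Bytes from the Notepad "Unicode" example: BOM followed by UTF-16-LE text.
raw = (b'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00'
       b'E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00')

# Splitting the raw bytes on b'\n' cuts a two-byte code unit in half:
# the second piece starts with the stray '\x00' that belonged to '\n'.
broken = raw.split(b'\n')

# Decode first. The 'utf-16' codec reads the BOM and strips it.
text = raw.decode('utf-16')

# Now splitting works on characters, so '\r\n' is a single line ending.
lines = text.splitlines()

# Writing for Windows: join with '\r\n', then encode, then prepend the BOM.
out = codecs.BOM_UTF16_LE + '\r\n'.join(lines).encode('utf-16-le')
```

Here `out` comes back byte-for-byte identical to `raw`, so the round trip
keeps both the BOM and the four-byte line ending that Notepad expects.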

In another thread I've posted a small function that *guesses* line
endings in use.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

>     Neil