[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

Marc-Andre Lemburg report at bugs.python.org
Tue Oct 8 11:25:11 CEST 2013


Marc-Andre Lemburg added the comment:

On 08.10.2013 11:03, Antoine Pitrou wrote:
> 
>>> utf-16 isn't that widely used, so it's probably fine if it becomes
>>> a bit slower.
>>
>> It's the default encoding for Unicode text files and APIs on Windows,
>> so I'd say it *is* widely used :-)
> 
> I've never seen any UTF-16 text files. Do you have other data?

See the link I posted.

MS Notepad and MS Office save Unicode text files in UTF-16-LE,
unless you explicitly specify UTF-8, just like many other Windows
applications that support Unicode text files:

http://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx
http://superuser.com/questions/294219/what-are-the-differences-between-linux-and-windows-txt-files-unicode-encoding

This is simply due to the fact that MS introduced Unicode plain
text files as UTF-16-LE files and only later added the option
of saving them as UTF-8 with a BOM.
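
To illustrate the two file flavors mentioned above, here is a minimal
sketch using Python 3's standard codecs. The sample string and the
variable names are made up for illustration, and the byte layout is
only assumed to match what Notepad writes (BOM followed by
little-endian data):

```python
import codecs

text = "héllo"  # illustrative sample text

# "utf-16-le" writes little-endian code units and no BOM.
le_no_bom = text.encode("utf-16-le")

# Assumed layout of a Windows-style UTF-16-LE text file: BOM + LE data.
windows_style = codecs.BOM_UTF16_LE + le_no_bom

# The generic "utf-16" decoder reads the BOM to pick the byte order
# and strips it from the result.
decoded = windows_style.decode("utf-16")

# "utf-8-sig" is the UTF-8-with-BOM variant: it writes the BOM on
# encoding and drops it on decoding.
utf8_bom = text.encode("utf-8-sig")
round_trip = utf8_bom.decode("utf-8-sig")
```

Decoding with "utf-16-le" instead of "utf-16" would leave the BOM in
the string as U+FEFF, which is why the generic codec is the safer
choice for files of unknown origin.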

> APIs are irrelevant. You only pass very small strings to them (e.g.
> file paths).

You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
is all around you when working on Windows, not only in the OS APIs,
but also in most other Unicode APIs you find on Windows:

http://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061%28v=vs.85%29.aspx
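
As a rough illustration of what UTF-16 code units look like, and of
why lone surrogates matter for this issue, here is a small Python 3
sketch. The helper names utf16_code_units and
codec_rejects_lone_surrogate are hypothetical, and whether the codec
actually rejects a lone surrogate depends on the Python version in
use:

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# A BMP character fits into a single 16-bit code unit.
bmp = utf16_code_units("A")

# A non-BMP character is stored as a surrogate pair (two code units).
astral = utf16_code_units("\U0001F600")


def codec_rejects_lone_surrogate():
    """Report whether this interpreter's UTF-16 codec refuses U+D800."""
    try:
        "\ud800".encode("utf-16-le")
    except UnicodeEncodeError:
        return True
    return False
```

A surrogate code point that is not part of such a pair has no valid
UTF-16 representation, which is the case the issue title refers to.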

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________