[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Mathias Panzenböck grosser.meister.morti at gmx.net
Mon Jun 10 18:03:59 CEST 2013


On 06/09/2013 09:22 AM, Andrew Barnert wrote:
> From: Mathias Panzenböck <grosser.meister.morti at gmx.net>
>
> Sent: Saturday, June 8, 2013 6:34 PM
>
>
>> On 06/08/2013 08:02 PM, Stephen J. Turnbull wrote:
>>>
>>>   Unicode is a
>>>   universal character set in the sense that it can encode all
>>>   characters,
>>
>> I guess Japanese people beg to differ. There are some Japanese symbols that
>> aren't covered by Unicode,
>> or at least not to the extent Japanese people would like it to be. Which is why
>> they use (Shift-)JIS a
>> lot of the time. Basically Shift-JIS <-> Unicode is not round trip safe.
>
>
> That's not true. Shift-JIS <-> Unicode 6.0 is completely round-trip safe. And there hasn't been a practical problem for UTF-8 or UTF-32 since they were introduced in Unicode 2.0 in the mid-90s.
>
> The problem is with UTF-16. Many early Unicode apps were built around UCS-2, a fixed-width 16-bit encoding. UCS-2 didn't have room for the extra characters that Japanese (among other languages) needed, so it was replaced with UTF-16, a variable-width 16-or-32-bit encoding. But historically, there's been a lot of software that treated UTF-16 as fixed-width (after all, you can test with hiragana and common kanji and it seems to work), which means it breaks if you give it any of the new characters added since the original version. This is sometimes still a problem today for Windows native apps. But again, it does not affect UTF-8, just UTF-16.
>
> Another reason people used Shift-JIS until a few years ago was emoji. But today, Unicode supports more emoji than Shift-JIS, and in fact people complain about only having the original 176 if they're forced to use Shift-JIS.
>
>
> Some Japanese people still refuse to use Unicode because of the Unihan controversy. Briefly: Characters like 刃 (U+5203) are drawn differently in Japanese and Chinese, but Unicode considers them the same character (to get the Chinese variation, you have to use a Chinese font). This is a problem—but Shift-JIS has the exact same problem.
>

That's what I meant, but I thought Shift-JIS didn't have this problem? I don't work with such encodings; I have only read about those problems.

See also "More Information" here:
http://support.microsoft.com/kb/170559
...although that isn't where I first read about this. I can't find the original source.
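
For what it's worth, the round-trip question can be checked empirically instead of argued about. Below is a rough sketch in Python 3, not a definitive test: the function name round_trip_failures is my own invention, and I am assuming that Python's "cp932" codec is the relevant stand-in for the Microsoft code page that the KB article talks about. It walks over two-byte sequences, decodes each one, encodes the result again, and records every sequence that does not come back unchanged.

```python
def round_trip_failures(codec):
    """Return (original, re_encoded) byte pairs that do not survive
    bytes -> str -> bytes through the given codec.

    Hypothetical helper for illustration; re_encoded is None when the
    decoded text cannot be encoded again at all.
    """
    failures = []
    # Brute-force scan of the two-byte range used by Shift-JIS variants.
    for lead in range(0x81, 0xFD):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid sequence in this codec, skip it
            try:
                back = text.encode(codec)
            except UnicodeEncodeError:
                failures.append((raw, None))
                continue
            if back != raw:
                failures.append((raw, back))
    return failures


if __name__ == "__main__":
    for codec in ("shift_jis", "cp932"):
        print(codec, len(round_trip_failures(codec)))
```

An empty list for a codec would mean that every sequence in the scanned range survives bytes -> Unicode -> bytes. Any entries show exactly which byte sequences come back as different bytes. Note that this only tests the codecs shipped with Python, not what other software does.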

> Finally, for typical Japanese text, Shift-JIS takes the fewest bits per character of any major charset. Shift-JIS takes 1 byte for ASCII, 2 bytes for everything else. UTF-8 takes 3 bytes for kana and kanji, so if you, e.g., download an article and store it as UTF-8, it'll get almost 50% bigger. UTF-16 solves that by making kana and most kanji 2 bytes (although uncommon kanji are 4), but it makes ASCII 2 bytes instead of 1, which means you double the size of many files. Shift-JIS is a pretty good compromise for compactness.
>
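
The size figures quoted above are easy to reproduce. This is a minimal sketch, assuming Python 3 and its built-in "shift_jis", "utf-8" and "utf-16-le" codecs. The helper name encoded_sizes and the sample strings are made up for illustration. I use "utf-16-le" so that the byte order mark does not inflate the UTF-16 numbers.

```python
def encoded_sizes(text, encodings=("shift_jis", "utf-8", "utf-16-le")):
    """Return the encoded length in bytes of text for each encoding.

    Hypothetical helper for illustration only.
    """
    return {enc: len(text.encode(enc)) for enc in encodings}


# Made-up samples: plain ASCII, hiragana, and common kanji.
samples = {
    "ascii": "hello",
    "kana": "\u3072\u3089\u304c\u306a",  # hiragana "hiragana", 4 characters
    "kanji": "\u65e5\u672c\u8a9e",       # "nihongo", 3 characters
}

for name, text in samples.items():
    print(name, len(text), encoded_sizes(text))

# A kanji outside the Basic Multilingual Plane (U+20BB7) needs a
# surrogate pair in UTF-16, which is the "uncommon kanji are 4" case.
print(len("\U00020BB7".encode("utf-16-le")))
```

For the three kanji this gives 6 bytes in Shift-JIS, 9 in UTF-8 and 6 in UTF-16, and for the ASCII sample 5, 5 and 10, which matches the trade-off described in the quoted paragraph.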

