Newbie question about text encoding

Rustom Mody rustompmody at gmail.com
Fri Mar 6 11:20:58 EST 2015


On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> 
> > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> 
> [snip example of an analogous situation with NULs]
> 
> > Strawman.
> 
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

Did you miss my addition? Here it is again, with the grammar slightly corrected.

===========
Ah well if you insist on pursuing the nul-char example...
- No, the Unicode Consortium (or its ASCII equivalent) is not wrong in allocating codepoint 0

- No, the code that "can't cope with a perfectly normal character" is not wrong

- It is C that is wrong for designing a buggy string data structure that cannot
contain a valid char.
===========

In fact Chris' nul-char example supports my argument (the bugginess of UTF-16) so strongly
that it is perhaps too strong even for me.

To elaborate:
Take the buggy-plane analogy I gave in
http://blog.languager.org/2015/03/whimsical-unicode.html

If a plane model crashes once in 10,000 flights, compared to others that crash once in
one million flights, we can call it bug-prone though not strictly buggy: it does fly
9,999 times safely!
OTOH if a plane is guaranteed to crash we can call it a buggy plane.

C's string is not bug-prone; it's plain buggy, as it cannot represent strings
with NULs.
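
To make the NUL point concrete, here is a minimal sketch using Python's ctypes
module to view the same bytes the way a C string would. The sample data is made
up for illustration; the only calls used are len() and ctypes.c_char_p.

```python
import ctypes

data = b"ab\x00cd"  # five bytes, with a NUL in the middle

# A Python bytes object stores its length explicitly, so NUL is just data.
python_len = len(data)

# A C-style string has no length field; reading it stops at the first NUL.
c_view = ctypes.c_char_p(data).value

print(python_len)  # 5
print(c_view)      # b'ab'
```

Everything after the NUL is invisible to anything that treats the buffer as a
C string, which is the sense in which the data structure cannot contain a
valid char.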

I would not go that far for UTF-16.
It is bug-inviting, but it can also be implemented correctly.
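
A small sketch of what "bug-inviting but implementable" can look like in
practice. It encodes a three-character string to UTF-16 and counts it two
ways: the naive way (one 16-bit unit per character) and a way that treats a
surrogate pair as one code point. The helper name count_code_points is
hypothetical, chosen only for this illustration.

```python
text = "a\U0001F4A9b"  # three code points, one outside the BMP

raw = text.encode("utf-16-le")  # little-endian, no BOM
code_units = [int.from_bytes(raw[i:i + 2], "little")
              for i in range(0, len(raw), 2)]

# Buggy assumption: every character is exactly one 16-bit unit.
naive_count = len(code_units)


def count_code_points(units16):
    """Count code points, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units16):
        is_high = 0xD800 <= units16[i] <= 0xDBFF
        has_low = (i + 1 < len(units16)
                   and 0xDC00 <= units16[i + 1] <= 0xDFFF)
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


print([hex(u) for u in code_units])   # ['0x61', '0xd83d', '0xdca9', '0x62']
print(naive_count)                    # 4
print(count_code_points(code_units))  # 3
```

The naive count is right for every BMP-only string, which is exactly why the
fixed-width assumption survives testing and then fails on the first astral
character.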
> 
> 
> > Lets please stick to UTF-16 shall we?
> > 
> > Now tell me:
> > - Is it broken or not?
> 
> The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> encoding, and considerably better than most other variable-width encodings.
> 
> However, many implementations of UTF-16 are faulty, and assume a
> fixed-width. *That* is broken, not UTF-16.
> 
> (The difference between specification and implementation is critical.)
> 
> 
> > - Is it widely used or not?
> 
> It's quite widely used.
> 
> 
> > - Should programmers be careful of it or not?
> 
> Programmers should be aware whether or not any specific language uses UTF-16
> and whether the implementation is buggy. That will help them decide whether
> or not to use that language.
> 
> 
> > - Should programmers be warned about it or not?
> 
> I'm in favour of people having more knowledge rather than less. I don't
> believe that ignorance is bliss, except perhaps in the case that a giant
> asteroid the size of Texas is heading straight for us.
> 
> Programmers should be aware of the limitations or bugs in any UTF-16
> implementation they are likely to run into. Hence my general
> recommendation:
> 
> - For transmission over networks or storage on permanent media (e.g. the
> content of text files), use UTF-8. It is well-implemented by nearly all
> languages that support Unicode, as far as I know.
> 
> - If you are designing your own language, your implementation of Unicode
> strings should use something like Python's FSR, or UTF-8 with tweaks to
> make string indexing O(1) rather than O(N), or correctly-implemented
> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)

FSR is possible in Python for very specific Pythonic reasons:
- dynamicness
- immutable strings

Drop either and FSR is impossible.
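
For readers who have not looked at the FSR (PEP 393), the following sketch
estimates how many bytes CPython spends per character for strings of different
"widths". It is CPython-specific and assumes version 3.3 or later;
bytes_per_char is a hypothetical helper that differences two sys.getsizeof
results to cancel out the fixed object header.

```python
import sys


def bytes_per_char(ch):
    """Estimate per-character storage by differencing two string sizes.

    CPython-specific: relies on sys.getsizeof and the PEP 393 layout.
    """
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000


widths = {
    "ASCII 'a'": bytes_per_char("a"),
    "U+0101": bytes_per_char("\u0101"),
    "U+1F4A9": bytes_per_char("\U0001F4A9"),
}
for label, width in widths.items():
    print(label, width)
```

On such a CPython this should report 1, 2 and 4 bytes respectively. The width
is chosen once, when the string is created, from its widest character; that
only works because the string can never change afterwards.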

> If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed 
> 2-byte per code point format, you fail.

Seems obvious enough.
So let's see...
Here's a 2-line Python program that runs well enough when run as a command.
Program:
=========
pp = "💩"
print (pp)
=========
Try opening it in idle3 and you get (at least I get):

$ idle3 ff.py 
Traceback (most recent call last):
  File "/usr/bin/idle3", line 5, in <module>
    main()
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
    if flist.open(filename) is None:
  File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
    edit = self.EditorWindow(self, filename, key)
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
    EditorWindow.__init__(self, *args)
  File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
    if io.loadfile(filename):
  File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
    self.text.insert("1.0", chars)
  File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
    self.top.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
    self.addcmd(InsertCommand(index, chars, tags))
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
    cmd.do(self.delegate)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
    text.insert(self.index1, self.chars, self.tags)
  File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
    self.delegate.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
    return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl

So who/what is broken?
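
One way to sidestep that TclError, sketched under assumptions: substitute
every character above U+FFFF before the text reaches a Tk widget built with
this limit. The function replace_non_bmp is hypothetical, not something IDLE
or tkinter provides, and the substitution is lossy, so it is suitable for
display only, never for saving the file back.

```python
def replace_non_bmp(text, replacement="\uFFFD"):
    """Return text with every code point above U+FFFF replaced.

    Hypothetical helper for illustration; the replacement is lossy.
    """
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


source = 'pp = "\U0001F4A9"\nprint (pp)\n'
safe = replace_non_bmp(source)

print(ascii(source))
print(ascii(safe))
```

This hides the symptom rather than fixing anything: the limit stays in the
Tcl build, and the caller has to know about it.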

> 
> - If you are using an existing language, be aware of any bugs and
> limitations in its Unicode implementation. You may or may not be able to
> work around them, but at least you can decide whether or not you wish to
> try.
> 
> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

Correct.
Windows is broken for using UTF-16.
Linux is broken for conflating UTF-8 and byte strings.

A lot of breakage out here, don't you think?
Maybe related to the equation

UTF-16 = UCS-2 + Duct-tape

??
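
On the Linux side of that breakage, a short sketch of what the bytes/UTF-8
conflation means for a program that wants file names as text. The file name is
invented for illustration. It shows a name that is legal on Linux but is not
valid UTF-8, and the surrogateescape error handler (PEP 383) that Python uses
to carry such names through str and back without loss.

```python
raw_name = b"caf\xe9.txt"  # Latin-1 bytes; legal file name, invalid UTF-8

try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_trip = as_text.encode("utf-8", "surrogateescape")

print(strict_ok)               # False
print(ascii(as_text))          # 'caf\udce9.txt'
print(round_trip == raw_name)  # True
```

The resulting str round-trips, but it contains a lone surrogate and so cannot
be encoded as strict UTF-8, which is the price of pretending that arbitrary
bytes are text.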


