Thinking Unicode

wxjmfauth at gmail.com
Thu Aug 8 03:52:36 EDT 2013


As I have written many times on this list, the ascii (generic name
for "byte string") world and the unicode world are two incompatible
worlds. There are bridges, but basically they are incompatible;
they require thinking differently.

There is an interesting case on the dev list:
http://mail.python.org/pipermail/python-dev/2013-July/127420.html


There is nothing wrong with polishing the documentation, but
interestingly the discussion turned to the
usage of "--" and "---" instead of real en-dashes and
em-dashes; in other words, use ascii and not unicode.

It has been argued that TeX uses "--" and "---". True for
the pre-unicode engines. It is no longer the case.

Steven proposed the usage of \N{EM DASH}, ...
Good point, a real step towards unicode, but why
use ascii when one can directly use "–", "—"?
Is the purpose not to use unicode in a utf-8
file, as many recommend?
If utf-8 is (and has been created to be) compatible with ascii,
it seems today the usage is to make utf-8 compatible with ascii!
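
A minimal sketch of the point above, assuming Python 3 and a source
file saved as utf-8 (the variable names are only illustrative). The
\N{...} escape and the \u escape denote exactly the same characters as
the literal dashes; they are just spelled with ascii in the source:

```python
import unicodedata

# Two ascii-only spellings of each dash in Python source.
en_dash = "\N{EN DASH}"
em_dash = "\N{EM DASH}"

# Both escape forms produce the same one-character strings.
assert en_dash == "\u2013"
assert em_dash == "\u2014"

# The names can be recovered from the characters themselves.
print(unicodedata.name(en_dash))  # EN DASH
print(unicodedata.name(em_dash))  # EM DASH

# In a utf-8 file each dash is stored as three bytes.
print(en_dash.encode("utf-8"))  # b'\xe2\x80\x93'
print(em_dash.encode("utf-8"))  # b'\xe2\x80\x94'
```

Typing the literal characters into a utf-8 file gives the same strings
as the escapes; the difference is only in what the source looks like.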

The .rst files have been touched and at my last check,
1-2 days ago, the --------- had been replaced by
-------------. No trace of real en-dashes or em-dashes
in the diffs.

What happens if the confusion reappears? Simple:
reopen a discussion and continue not to solve
problems.



-----

Somebody wrote:
"... (and nobody really wants to type three hyphens except
for a handful of typographical nuts)..."

Completely "out of phase". Beyond that comment (or that kind of comment),
(I have been "spying" on the misc. lists for years), it is not a surprise
that Python and Unicode never work.


jmf



PS

>>> '–—'.encode('cp1252')
b'\x96\x97'
>>> '–—'.encode('mac-roman')
b'\xd0\xd1'
>>> '–—'.encode('latin-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
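
The same check as in the session above can be written as a small
script, assuming Python 3. The helper can_encode is a hypothetical name
introduced here, not something from the standard library; it only wraps
str.encode and reports whether the codec has a slot for every character:

```python
def can_encode(text, codec):
    """Return True if every character of text exists in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# En dash and em dash, written with escapes so the script itself is ascii.
dashes = "\u2013\u2014"

results = {}
for codec in ("cp1252", "mac-roman", "latin-1", "ascii", "utf-8"):
    results[codec] = can_encode(dashes, codec)
    print(codec, results[codec])
```

cp1252 and mac-roman have the two dashes (at different byte values),
latin-1 and ascii do not have them at all, and utf-8 can encode any
character.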



