Best ways of managing text encodings in source/regexes?

tvn tinkerbarbet at gmail.com
Sun Dec 9 12:36:31 EST 2007


Please see the correction from Cliff pasted here after this excerpt.
Tim

> the byte string is ASCII which is a subset of Unicode (ISO-8859-1
> isn't).)

The one comment I'd make is that ASCII and ISO-8859-1 are both subsets
of Unicode (which relates to the abstract code points), but ASCII is
also a subset of UTF-8 on the bytestream level, while ISO-8859-1 is not
a subset of UTF-8, nor, as far as I can tell, of any other Unicode
*encoding*.

Thus a file encoded in ASCII *is* in fact a UTF-8 file.  There is no
way to distinguish the two.  But an ISO-8859-1 file that contains
non-ASCII characters is not the same (on the bytestream level) as a
file with identical content in UTF-8 or any other Unicode encoding.
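
A small sketch of the byte-level point, assuming Python 3 (where str
holds code points and encode() produces bytes).  The sample strings and
the helper name encodings_agree are made up for illustration:

```python
# Text containing only ASCII characters, and text with one non-ASCII character.
ascii_text = "hello"
latin_text = "caf\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE


def encodings_agree(text, first, second):
    """Return True if text encodes to the same bytes in both encodings."""
    return text.encode(first) == text.encode(second)


# ASCII-only text yields identical bytes in ASCII, UTF-8 and ISO-8859-1.
ascii_same_as_utf8 = encodings_agree(ascii_text, "ascii", "utf-8")
ascii_same_as_latin1 = encodings_agree(ascii_text, "ascii", "iso-8859-1")

# Same code point, different bytes once a non-ASCII character is involved:
# one byte (0xE9) in ISO-8859-1, two bytes (0xC3 0xA9) in UTF-8.
latin1_bytes = latin_text.encode("iso-8859-1")
utf8_bytes = latin_text.encode("utf-8")

# The ISO-8859-1 bytes are not even well-formed UTF-8 here.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

Both strings decode back to the same code points from either encoding;
only the bytes on disk differ, which is why the encoding of a
non-ASCII file has to be known rather than guessed.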


