ensuring valid latin-1

John Machin sjmachin at lexicon.net
Wed Nov 29 17:20:30 EST 2006


Chris Curvey wrote:
> Hey all,
>
> I'm trying to write something that will "fail fast" if one of my users
> gives me non-latin-1 characters.  So I tried this:
>
> >>> testString = "\x80"
> >>> foo = unicode(testString, "latin-1")
> >>> foo
> u'\x80'
>
> I would have thought that that should have raised an error, because
> \x80 is not a valid character in latin-1 (according to what I can
> find).  Is this the expected behavior, or am I missing something?

Depends on what you call 'latin-1'. The ISO 8859-1 standard itself
defines only the displayable characters (0x20-0x7E and 0xA0-0xFF). By
that definition, even the basic ASCII carriage return, line feed and
tab would raise an error. However, according to Wikipedia:

"""In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the extra
hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the
Internet. This map assigns the C0 and C1 control characters to the code
values 00-1F, 7F, and 80-9F. It thus provides for 256 characters
via every possible 8-bit value."""

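In Python, 'latin-1' and 'iso-8859-1' are two names for the same codec,
and it follows the IANA definition: every byte value from 0x00 to 0xFF
decodes without error. That is why your "\x80" came back as u'\x80'
instead of raising an exception.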
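
A quick way to see this for yourself is sketched below. It assumes
Python 3 syntax, where bytes.decode() plays the role of
unicode(s, "latin-1"); the variable names are only for illustration.

```python
import codecs

# Both spellings should resolve to the same registered codec.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

# Decode every possible 8-bit value; the latin-1 codec accepts all of them.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Byte value N becomes code point N, so the mapping round-trips exactly.
code_points_match = [ord(c) for c in decoded] == list(range(256))
round_trips = decoded.encode("latin-1") == all_bytes
```

So decoding with latin-1 can never serve as a validity check on its
own; any rejection has to come from a rule you add on top.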

If you articulate your definition of "valid latin-1", we should be able
to help you with some Python code to check it for you.
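
For instance, if "valid latin-1" turns out to mean "the displayable
characters of ISO 8859-1, plus tab, line feed and carriage return",
something like the following might do. That definition is my
assumption, not something you said, and the function names are made up
for this sketch (Python 3 syntax).

```python
# Assumed policy: these three C0 controls are acceptable in user input.
ALLOWED_CONTROLS = frozenset(b"\t\n\r")


def first_invalid_offset(data):
    """Return the offset of the first unacceptable byte, or None."""
    for offset, byte in enumerate(data):
        if byte in ALLOWED_CONTROLS:
            continue
        # Displayable ranges of ISO 8859-1: 0x20-0x7E and 0xA0-0xFF.
        if 0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF:
            continue
        return offset
    return None


def decode_strict_latin1(data):
    """Decode bytes as latin-1, failing fast on disallowed byte values."""
    offset = first_invalid_offset(data)
    if offset is not None:
        raise ValueError(
            "byte 0x%02X at offset %d is not acceptable latin-1"
            % (data[offset], offset)
        )
    return data.decode("latin-1")
```

With this, decode_strict_latin1(b"\x80") raises ValueError, while
ordinary accented text such as b"caf\xe9" decodes normally. Adjust
ALLOWED_CONTROLS to suit whatever definition you settle on.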

>
> I'm on Windows, but I have explicitly set the character set to be
> latin-1 in sitecustomize.py

Why??

Don't do that. That's a self-inflicted double whammy.
(1) You should *not* assume that all the legacy str data your machine
will ever process is in only one encoding.
(2) On a Windows machine, your legacy data is extremely likely to be
encoded in a Microsoft-developed encoding (like cp1252), not latin-1.
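
To illustrate point (2), here is a small sketch (Python 3 syntax, with
made-up sample data) of how the same byte means different things in the
two encodings. In cp1252, 0x80 is the euro sign; in latin-1 it is an
invisible C1 control character.

```python
raw = b"\x80 12.50"  # hypothetical bytes as typed on a Windows machine

as_latin1 = raw.decode("latin-1")  # first character is the control U+0080
as_cp1252 = raw.decode("cp1252")   # first character is the euro sign U+20AC


def decodes_as_cp1252(data):
    """Return True if every byte has a mapping in Python's cp1252 codec."""
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return False
    return True
```

Note that, unlike latin-1, Python's cp1252 codec leaves a few byte
values (0x81, for example) unmapped, so decodes_as_cp1252(b"\x81")
returns False. Whether that is the check you want depends on where the
data really comes from.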

>
> >>> import sys
> >>> sys.getdefaultencoding()
> 'latin-1'

HTH,
John



