[Python-Dev] unicode() and its error argument

Tim Peters tim.one@comcast.net
Sat, 15 Jun 2002 12:21:03 -0400


[Skip Montanaro]
> The unicode() builtin accepts an optional third argument, errors, which
> defaults to "strict".  According to the docs if errors is set to "ignore",
> decoding errors are silently ignored.  I seem to still get the occasional
> UnicodeError exception, however.  I'm still trying to track down an actual
> example (it doesn't happen often, and I hadn't wrapped unicode() in a
> try/except statement, so all I saw was the error raised, not the input
> string value).

Play with this:

"""
def generrors(encoding, errors, maxlen, maxtries):
    from random import choice, randint
    bytes = [chr(i) for i in range(256)]
    paste = ''.join
    for dummy in xrange(maxtries):
        n = randint(1, maxlen)
        raw = paste([choice(bytes) for dummy in range(n)])
        try:
            u = unicode(raw, encoding, errors)
        except UnicodeError, detail:
            print 'fail w/ errors', errors, '- raw data', repr(raw)
            print '    UnicodeError', str(detail)

errors = ('strict', 'replace', 'ignore')

generrors('mac-turkish', errors[2], 10, 1000)
"""

Plug in your favorite encoding and let it do the work of finding examples.
It generates plenty of errors with 'strict', but so far I haven't seen it
generate one with 'replace' or 'ignore'.
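
(For anyone trying this on a modern interpreter: here is a rough Python 3
port of the same fuzzer.  bytes.decode() replaces the unicode() builtin,
and the error-handler names are unchanged.  The seed parameter and the
failure count it returns are my additions, for reproducibility -- they
are not in the original.)

```python
import random

def generrors(encoding, errors, maxlen, maxtries, seed=None):
    """Fuzz bytes.decode() with random byte strings and report any
    UnicodeError that slips past the given error handler.  Returns
    the number of failures seen."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(maxtries):
        n = rng.randint(1, maxlen)
        raw = bytes(rng.randrange(256) for _ in range(n))
        try:
            raw.decode(encoding, errors)
        except UnicodeError as detail:
            failures += 1
            print('fail w/ errors', errors, '- raw data', repr(raw))
            print('    UnicodeError', detail)
    return failures

errors = ('strict', 'replace', 'ignore')
generrors('mac-turkish', errors[2], 10, 1000)
```

As in the original, 'strict' produces plenty of failures for multi-byte
encodings such as UTF-8, while 'replace' and 'ignore' should produce none.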