Interpreting string containing \u000a

Peter Otten __peter__ at web.de
Wed Jun 18 08:21:18 EDT 2008


Francis Girard wrote:

> I have an ISO-8859-1 file containing things like
> "Hello\u000d\u000aWorld", i.e. the character '\', followed by the
> character 'u' and then '0', etc.
> 
> What is the easiest way to automatically translate these codes into
> Unicode characters?

If the file really contains the escape sequences, use "unicode-escape" as the
encoding:

>>> "Hello\\u000d\\u000aWorld".decode("unicode-escape")
u'Hello\r\nWorld'

If it contains the raw bytes, use "iso-8859-1":

>>> "Hello\x0d\x0aWorld".decode("iso-8859-1")
u'Hello\r\nWorld'
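
For readers on Python 3, where str no longer has a decode() method, the two
cases above can be sketched with bytes literals. This is only an illustration
and not part of the original reply; the variable names are made up, and it
assumes the file content is already available as bytes.

```python
# Python 3 sketch (assumption: the data has been read as bytes).
# Case 1: the file holds a literal backslash, 'u', and four hex digits.
escaped = b"Hello\\u000d\\u000aWorld"
# Case 2: the file holds the actual CR and LF bytes.
raw = b"Hello\x0d\x0aWorld"

text_from_escapes = escaped.decode("unicode_escape")
text_from_raw = raw.decode("iso-8859-1")

print(repr(text_from_escapes))  # 'Hello\r\nWorld'
print(repr(text_from_raw))      # 'Hello\r\nWorld'
```

Both calls give the same text here. The unicode_escape codec treats bytes
outside an escape sequence as Latin-1, which is why it suits an ISO-8859-1
file.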

Open the file with

import codecs
f = codecs.open(filename, encoding=encoding_as_determined_above)

instead of the builtin open().
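
A self-contained sketch of that approach, written for Python 3, might look
like the following. The temporary file and its content are hypothetical
stand-ins for the real file, and "unicode_escape" is assumed to be the
encoding chosen because the file holds literal escape sequences.

```python
import codecs
import os
import tempfile

# Hypothetical sample file: bytes containing literal \u escape sequences.
sample = b"Hello\\u000d\\u000aWorld"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    filename = tmp.name

try:
    # codecs.open decodes while reading, so read() returns text, not bytes.
    with codecs.open(filename, encoding="unicode_escape") as f:
        text = f.read()
finally:
    os.remove(filename)  # clean up the temporary sample file

print(repr(text))
```

Unlike the builtin open() in text mode, codecs.open() does not translate line
endings, so the decoded carriage return and line feed are kept as they are.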

Peter


