[Python-ideas] Lossless bulletproof conversion to unicode (backslashing)

Paul Moore p.f.moore at gmail.com
Wed May 27 17:28:30 CEST 2015


On 26 May 2015 at 19:30, anatoly techtonik <techtonik at gmail.com> wrote:
> In real world you have to deal with broken and invalid
> output and UnicodeDecode crashes is not an option.
> The unicode() constructor proposes two options to
> deal with invalid output:
>
> 1. ignore  - meaning skip and corrupt the data
> 2. replace  - just corrupt the data

There are other error handlers, specifically surrogateescape is
designed for this use. Only in Python 3.x admittedly, but this list is
about future versions of Python, so that's what matters here.

> The solution is to have filter preprocess the binary
> string to escape all non-unicode symbols so that the
> following lossless transformation becomes possible:
>
>    binary -> escaped utf-8 string -> unicode -> binary
>
> How to accomplish that with Python 2.x?

That question is for python-list. Language changes will only be made
to 3.x - python-ideas isn't appropriate for questions about how to
achieve something in 2.x.

Paul


More information about the Python-ideas mailing list