"convert" string to bytes without changing data (encoding)

Wed Mar 28 17:37:53 EDT 2012

On 3/28/2012 1:43 PM, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x.

I strongly agree with that unless you have reason to use 2.7. Python 3.3 
(.0a1 is nearly out) has an improved unicode implementation, among other 
things.

> The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You are assuming, or must assume, that the text is in an 
ascii-compatible encoding, meaning that bytes 0-127 really represent 
ascii chars. Otherwise, you cannot reliably interpret anything, let 
alone change it.
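
To make "ascii-compatible" concrete, here is a small sketch. The helper 
name and the sample string are made up for illustration; the only thing 
it checks is whether pure-ascii text comes out as the same bytes that 
ascii itself would produce.

```python
def is_ascii_transparent(encoding, sample="key=value"):
    # Hypothetical helper: True if this encoding stores ascii characters
    # as the same single bytes (0-127) that the ascii codec produces.
    return sample.encode(encoding) == sample.encode("ascii")

for name in ("latin-1", "utf-8", "cp1252", "utf-16-le", "cp037"):
    print(name, is_ascii_transparent(name))
```

latin-1, utf-8 and cp1252 pass this check; utf-16-le and cp037 (an 
EBCDIC code page) do not, so none of the approaches below would be safe 
for data in those encodings.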

This problem of knowing that much but not the specific encoding is 
unfortunately common. It has been discussed among core developers and 
others the last few months. Different people prefer one of the following 
approaches.

1. Keep the bytes as bytes and use bytes literals and bytes functions as 
needed. The danger, as you noticed, is forgetting the 'b' prefix.

2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1' 
chars. When done, encode back to 'latin-1' and the non-ascii chars will 
be as they originally were. The danger is forgetting the pretense, and 
perhaps passing on the string (as a string, not bytes) to other 
modules that will not know the pretense.

3. Decode using encoding='ascii', errors='surrogateescape'. This 
reversibly maps each unknown non-ascii byte to an 'illegal' non-char 
(a lone low-surrogate code point, U+DC80 to U+DCFF). This is probably the 
safest in that invalid operations on the non-chars should raise an 
exception. Re-encoding with the same settings will reproduce the original 
hi-bit bytes. The main danger is passing the illegal strings out of your 
local sandbox.
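
A rough sketch of the three approaches side by side, in case it helps. 
The input bytes and the 'name='/'user=' edit are invented for the 
example; the point is only that the ascii part gets changed while the 
unknown byte (0xf6 here) passes through untouched. Note that the error 
handler is spelled 'surrogateescape', one word.

```python
# Hypothetical input: mostly ascii, plus one byte of unknown encoding.
raw = b"name=J\xf6rg\n"

# 1. Stay in bytes: every literal needs the b prefix.
out1 = raw.replace(b"name=", b"user=")

# 2. Pretend the data is latin-1; each byte maps to exactly one char
#    and back again, so the round trip is lossless.
s2 = raw.decode("latin-1")
out2 = s2.replace("name=", "user=").encode("latin-1")

# 3. Decode as ascii with surrogateescape; byte 0xf6 becomes the lone
#    surrogate U+DCF6 and is restored on re-encoding.
s3 = raw.decode("ascii", errors="surrogateescape")
out3 = s3.replace("name=", "user=").encode("ascii", errors="surrogateescape")

assert out1 == out2 == out3 == b"user=J\xf6rg\n"
assert "\udcf6" in s3

# Letting the escaped string leak into a strict encoder raises.
try:
    s3.encode("utf-8")
    leaked = True
except UnicodeEncodeError:
    leaked = False
print(out3, leaked)
```

With approach 2 the same leak would go through silently (s2 encodes to 
utf-8 without complaint, just with a possibly wrong character), which 
is the difference in failure mode between 2 and 3.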

-- 
Terry Jan Reedy