"convert" string to bytes without changing data (encoding)
Terry Reedy
tjreedy at udel.edu
Wed Mar 28 17:37:53 EDT 2012
On 3/28/2012 1:43 PM, Peter Daum wrote:
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x.
I strongly agree with that unless you have reason to use 2.7. Python 3.3
(.0a1 is nearly out) has an improved unicode implementation, among other
things.
> The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.
You are assuming, or must assume, that the text is in an
ascii-compatible encoding, meaning that bytes 0-127 really represent
ascii chars. Otherwise, you cannot reliably interpret anything, let
alone change it.
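
To illustrate what "ascii-compatible" means here, the following is a minimal
sketch, not a definitive test. The helper name is_ascii_compatible is
hypothetical; it only checks that a codec maps the 128 ascii characters to
the same single byte values, which is the property the paragraph above
relies on.

```python
def is_ascii_compatible(encoding: str) -> bool:
    """Return True if the codec encodes all 128 ascii chars as bytes 0-127 unchanged.

    Hypothetical helper for illustration; it says nothing about bytes 128-255.
    """
    ascii_bytes = bytes(range(128))
    ascii_text = ascii_bytes.decode("ascii")
    try:
        return ascii_text.encode(encoding) == ascii_bytes
    except UnicodeEncodeError:
        # The codec cannot represent some ascii character at all.
        return False


for name in ("utf-8", "latin-1", "cp1252", "utf-16", "cp500"):
    print(name, is_ascii_compatible(name))
```

Codecs such as utf-8, latin-1 and cp1252 pass this check; utf-16 and the
EBCDIC codec cp500 do not, so none of the approaches below would be safe
for data in those encodings.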
This problem of knowing that much but not the specific encoding is
unfortunately common. It has been discussed among core developers and
others the last few months. Different people prefer one of the following
approaches.
1. Keep the bytes as bytes and use bytes literals and bytes functions as
needed. The danger, as you noticed, is forgetting the 'b' prefix.
2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
chars. When done, encode back to 'latin-1' and the non-ascii chars will
be as they originally were. The danger is forgetting the pretense, and
perhaps passing the string on (as a string, not bytes) to other
modules that will not know the pretense.
3. Decode using encoding='ascii', errors='surrogateescape'. This
reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
(using the surrogate-pair second-half code units). This is probably the
safest in that invalid operations on the non-chars should raise an
exception. Re-encoding with the same setting will reproduce the original
hi-bit chars. The main danger is passing the illegal strings out of your
local sandbox.
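
The three approaches can be sketched side by side as follows. This is only
an illustration under stated assumptions: the sample bytes, the "Subject:"
edit, and all function names are hypothetical, and the only library calls
used are bytes.replace, bytes.decode and str.encode.

```python
# Hypothetical sample: ascii text mixed with bytes of unknown encoding.
SAMPLE = b"Subject: caf\xe9 \xff\x80 menu\n"


def edit_as_bytes(raw: bytes) -> bytes:
    # Approach 1: stay in bytes; every literal needs the b prefix.
    return raw.replace(b"Subject:", b"SUBJECT:")


def edit_via_latin1(raw: bytes) -> bytes:
    # Approach 2: pretend the data is latin-1; every byte maps to one char.
    text = raw.decode("latin-1")
    text = text.replace("Subject:", "SUBJECT:")
    return text.encode("latin-1")


def edit_via_surrogateescape(raw: bytes) -> bytes:
    # Approach 3: non-ascii bytes become lone surrogates and round-trip back.
    text = raw.decode("ascii", errors="surrogateescape")
    text = text.replace("Subject:", "SUBJECT:")
    return text.encode("ascii", errors="surrogateescape")


def strict_encode_fails(raw: bytes) -> bool:
    # A strict encoder rejects the escaped chars, so a leaked string
    # fails loudly instead of silently producing wrong bytes.
    text = raw.decode("ascii", errors="surrogateescape")
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


expected = b"SUBJECT: caf\xe9 \xff\x80 menu\n"
assert edit_as_bytes(SAMPLE) == expected
assert edit_via_latin1(SAMPLE) == expected
assert edit_via_surrogateescape(SAMPLE) == expected
assert strict_encode_fails(SAMPLE)
```

All three leave the non-ascii bytes untouched. The difference shows up when
a string escapes your code: the latin-1 string encodes to utf-8 without
complaint (producing different bytes), while the surrogateescape string
raises UnicodeEncodeError.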
--
Terry Jan Reedy