"convert" string to bytes without changing data (encoding)

Wed Mar 28 14:13:11 EDT 2012

Am 28.03.2012 19:43, schrieb Peter Daum:
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things gets
> syntactically pretty awkward and error-prone (something as innocently
> looking like "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

It seems that you're mixing things up wrt. the string/bytes 
distinction; it's not as "complicated" as it might seem.

1) Strings

s = "This is a test string"
s = 'This is another test string with single quotes'
s = """
And this is a multiline test string.
"""
s = 'c' # This is also a string...

all create/refer to string objects. How Python internally stores them 
is none of your concern (actually, that's rather complicated anyway, at 
least with the upcoming Python 3.3), and processing a string basically 
means that you'll work on the natural language characters present in the 
string. Python strings can store (pretty much) all characters and 
surrogates that unicode allows, and when the python interpreter/compiler 
reads strings from input (I'm talking about source files), a default 
encoding defines how the bytes in your input file get interpreted as 
unicode codepoint encodings (generally, it depends on your system locale 
or file header indications) to construct the internal string object 
you're using to access the data in the string.

There is no such thing as a type for a single character; single 
characters are simply strings of length 1 (and so indexing also returns 
a [new] string object).

Single/double quotes work no different.

The internal encoding used by the Python interpreter is of no concern 
to you.

2) Bytes

s = b'this is a byte-string'
s = b'\x22\x33\x44'

The above define bytes. Think of the bytes type as arrays of 8-bit 
integers, only representing a buffer which you can process as an array 
of fixed-width integers. Reading from stdin/a file gets you bytes, and 
not a string, because Python cannot automagically guess what format the 
input is in.

Indexing the bytes type returns an integer (which is the clearest 
distinction between string and bytes).

Being able to input "string-looking" data in source files as bytes is a 
debatable "feature" (IMHO; see the first example), simply because it 
breaks the semantic difference between the two types in the eye of the 
programmer looking at source.

3) Conversions

To get from bytes to string, you have to decode the bytes buffer, 
telling Python what kind of character data is contained in the array of 
integers. After decoding, you'll get a string object which you can 
process using the standard string methods. For decoding to succeed, you 
have to tell Python how the natural language characters are encoded in 
your array of bytes:

b'hello'.decode('iso-8859-15')

To get from string back to bytes (you want to write the natural 
language character data you've processed to a file), you have to encode 
the data in your string buffer, which gets you an array of 8-bit 
integers to write to the output:

'hello'.encode('iso-8859-15')

Most output methods will happily do the encoding for you, using a 
standard encoding, and if that happens to be ASCII, you're getting 
UnicodeEncodeErrors which tell you that a character in your string 
source is unsuited to be transmitted using the encoding you've 
specified.

If the above doesn't make the string/bytes-distinction and usage 
clearer, and you have a C#-background, check out the distinction between 
byte[] (which the System.IO-streams get you), and how you have to use a 
System.Encoding-derived class to get at actual System.String objects to 
manipulate character data. Pythons type system wrt. character data is 
pretty much similar, except for missing the "single character" type 
(char).

Anyway, back to what you wrote: how are you getting the input data? Why 
are "high bytes" in there which you do not know the encoding for? 
Generally, from what I gather, you'll decode data from some source, 
process it, and write it back using the same encoding which you used for 
decoding, which should do exactly what you want and not get you into any 
trouble with encodings.

-- 
--- Heiko.