Problems with struct.pack()

Carsten Haese carsten at uniqsys.com
Thu Oct 11 07:51:39 EDT 2007


On Wed, 10 Oct 2007 22:19:49 -0500, Robert Dailey wrote:
> Hi, 
> 
> Thanks for responding. I apologize for the lack of detail; I was in a
> hurry when I wrote the initial question. I'll provide more details.
> 
> Basically, I'm attempting to write out unicode strings (16 bits per
> character) to a file. Before each string, I write out 4 bytes containing the
> number of characters (NOT BYTES) the string contains. I suppose the confusion
> comes in because I'm writing out both text information AND binary data at the
> same time. I suppose the consistent thing to do would be to write out the
> strings as binary instead of as text? I'm originally a C++ programmer and I'm
> still learning Python, so figuring out this problem is a little difficult for me.
> 
> In my initial inquiry, I was writing out 5000 as an example; however, this
> number would realistically be the number of characters in the string: len(
> u"Hello World" ). Once I write out these 4 bytes, I then write out the string
> "Hello World" immediately after the 4 bytes. You may be wondering why the
> crazy file format. The reason is that this Python script is writing out
> data that will later be read in by a C++ application.
> 
> The following works fine for ASCII strings: 
> 
> mystring = "Hello World" 
> file = open( "somefile.txt", "wb" ) 
> file.write( struct.pack ( "I", len(mystring) ) ) 
> file.write( mystring ) 
> 
> Again I do apologize for the lack of detail. If I've still been unclear
> please don't hesitate to ask for more details.

This is much clearer, and it explains why you need to mix arbitrary binary
data with unicode text. Because of this mixing, as you have surmised, you're
going to have to treat the file as a binary file in Python. In other words,
don't open the file with the codecs module; open it in binary mode yourself
and encode each unicode string explicitly before writing it, like so:

import struct

mystring = u"Hello World"
file = open( "somefile.txt", "wb" )
# write the 4-byte character count first, then the UTF-16-LE encoded text
file.write( struct.pack( "I", len(mystring) ) )
file.write( mystring.encode("utf-16-le") )
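
Since a C++ program will eventually read this data back, it may be worth
sanity-checking the format from Python first. Here's a rough sketch of the
reading side; it assumes the file written above and that every character
encodes to exactly two bytes (i.e. nothing outside the Basic Multilingual
Plane):

import struct

infile = open( "somefile.txt", "rb" )
# first 4 bytes: the character count (not byte count) written above
(numchars,) = struct.unpack( "I", infile.read( struct.calcsize("I") ) )
# each character is two bytes in UTF-16-LE, so read numchars * 2 bytes
text = infile.read( numchars * 2 ).decode("utf-16-le")
infile.close()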

(Note that I've guessed that you want little-endian byte order in the
encoding. Without an explicit byte order, i.e. encoding with plain "utf-16",
encode() would put a byte order mark at the beginning of the encoded string,
which you probably don't want.)
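
To see concretely what that byte order mark costs, a quick interactive check
(Python 2 shown, to match the u"" literals above) compares the two encodings:

>>> len(u"Hello World")
11
>>> len(u"Hello World".encode("utf-16-le"))
22
>>> len(u"Hello World".encode("utf-16"))
24

Those two extra bytes at the front of the plain "utf-16" result are the BOM,
and they would throw off a reader that expects exactly two bytes per
character.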

Hope this helps,

Carsten.



