not quite 1252

Fri Apr 28 08:37:13 EDT 2006

On 28/04/2006 9:21 PM, Anton Vredegoor wrote:
> Serge Orlov wrote:
> 
>> I extracted content.xml from a test file and the header is:
>> <?xml version="1.0" encoding="UTF-8"?>
>>
>> So any xml library should handle it just fine, without you trying to
>> guess the encoding.
> 
> Yes my header also says UTF-8. However some kind person sent me an 
> e-mail stating that since I am getting \x94 and such output when using 
> repr (even if str is giving correct output) there could be some problem 
> with the XML-file not being completely UTF-8.

I deduce that I am the allegedly kind person.

Firstly you have a problem with the "even if" part of "I am getting \x94 
and such output when using repr (even if str is giving correct output)".

Let txt = "\x93hello\x94". So you print repr(txt) and the result appears 
as '\x93hello\x94'. That is absolutely correct. It is an unambiguous 
REPRresentation of the string. In IDLE (or similar) on a Windows box 
(where the default encoding is cp1252) if you print str(txt) [or merely 
print txt] the display shows hello preceded/followed by the LEFT/RIGHT 
DOUBLE QUOTATION MARK (U+201C/U+201D) -- or some other pair of 
left/right thingies. That is also correct enough.
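
A minimal sketch of the repr/str difference described above. It assumes 
Python 3, where the 8-bit string has to be written as a bytes literal and 
decoded explicitly; the names shown_by_repr and shown_as_text are only 
illustrative.

```python
# Assumption: Python 3. In the Python 2 of this thread, txt would be a plain str.
txt = b"\x93hello\x94"                 # 8-bit string holding cp1252 bytes

# repr() escapes every non-ASCII byte, so the result is unambiguous.
shown_by_repr = repr(txt)

# Decoding as cp1252 is what a cp1252 display effectively does with the bytes.
shown_as_text = txt.decode("cp1252")

print(shown_by_repr)                   # b'\x93hello\x94'
print(ascii(shown_as_text))            # '\u201chello\u201d'
```

Both outputs describe the same data: the first shows the bytes, the second 
shows the characters those bytes stand for under cp1252.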

Secondly, I stated no such thing about the XML file. We were discussing 
"extracting from content.xml using OOopy's getiterator function". My 
point was that if you were seeing \x94 anywhere, THE OUTPUT FROM THAT 
FUNCTION must be encoded as cp1252.
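
To see why a bare \x94 points at cp1252 rather than UTF-8, here is a small 
sketch (again assuming Python 3) that encodes the RIGHT DOUBLE QUOTATION 
MARK both ways. The variable names are made up for the illustration.

```python
# Assumption: Python 3 str/bytes semantics.
right_quote = "\u201d"                     # RIGHT DOUBLE QUOTATION MARK

as_cp1252 = right_quote.encode("cp1252")   # one byte
as_utf8 = right_quote.encode("utf-8")      # three bytes

print(as_cp1252)                           # b'\x94'
print(as_utf8)                             # b'\xe2\x80\x9d'
```

In UTF-8 the same character shows up as \xe2\x80\x9d in repr() output, so a 
lone \x94 cannot have come from UTF-8 encoded data.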

Here is the relevant part:
==========================
AV>> First I noticed that by extracting from content.xml using OOopy's 
getiterator function, some \x94 codes were left inside the document.

AV>> But that was an *artifact*, because if one prints something using 
s.__repr__() as is used for example when printing a list of strings 
(duh) the output is not the same as when one prints with 'print s'. I 
guess what is called then is str(s).

JM> Don't *guess*!!!

AV>> Ok, now we have that out of the way, I hope.

JM>> No, not quite. If you saw \x94 in the repr() output, but it looked 
"OK" when displayed using str(), then the only reasonable hypotheses are 
(a) the data was in an 8-bit string, presumably encoded as cp1252 
(definitely NOT UTF-8), rather than a Unicode string (b) you displayed 
it via a file whose encoding was 'cp1252'.

JM>> "... assuming something in the open office xml file was not quite 
1252. In fact it wasn't, it was UTF-8 ..." --- another problem was 
assuming that the encoding used for the output of the OOopy interface 
(apparently cp1252; is there no documentation?) would be the same as in 
the .sxw file (UTF-8).

=== end of extract ===
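
Rather than guessing, one can inspect the value directly. The following is 
only a sketch under the assumption of Python 3; describe() is a 
hypothetical helper, not part of OOopy or any other library, and it says 
nothing about what OOopy actually returns.

```python
# Assumption: Python 3. describe() is a hypothetical helper for illustration.
def describe(value):
    """Say whether value is Unicode text or bytes, and if bytes, whether
    those bytes form valid UTF-8."""
    if isinstance(value, str):
        return "unicode text"
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return "bytes, not valid UTF-8"
    return "bytes, valid UTF-8"


print(describe(b"\x93hello\x94"))                        # bytes, not valid UTF-8
print(describe("\u201chello\u201d"))                     # unicode text
print(describe("\u201chello\u201d".encode("utf-8")))     # bytes, valid UTF-8
```

Note that passing the UTF-8 check does not prove the data was meant as 
UTF-8; failing it does prove that it was not.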

> Or is there some other 
> reason I'm getting these \x94 codes?

You are getting "these \x94 codes" when you do *WHAT* exactly?

I refer you to Martin's unanswered questions:

"""What is the problem then? If you interpret
the document as cp1252, and it contains \x93 and \x94, what is
it that you don't like about that? In yet other words: what actions
are you performing, what are the results you expect to get, and
what are the results that you actually get?"""