A question on Encoding and Decoding.

John Machin sjmachin at lexicon.net
Tue Nov 14 01:30:53 EST 2006


kath wrote:
> Hi all,
>
> Platform:            winxp
> Version:             Python 2.3
>
> I have a task of reading files in a folder and creating an one excel
> file with sheets, one sheet per file, with sheet named as filename. I
> am facing problem in handling special characters. I am using XLRD and
> XLW package to read/write  from/to file.

The name of the first package is "xlrd", *NOT* "XLRD".

By "XLW"; do you mean python2xlw? If so, why don't you upgrade to
Python 2.4 or 2.5 so that you can use  pyExcelerator (which is the most
modern and least unmaintained of the pure-Python Excel-writing
gadgets)? If you can't upgrade, consider using pyXLWriter instead.

Note that python2xlw is *NOT* intended to be a general-purpose Excel
file writer. From the top of its XLW.py:
'''Simple XLW file maker.
Can create data sheets and Scatter Charts from the data sheets.
Data sheet is defined as a list of rows.
The first row contains labels, subsequent rows are numeric.
Charts are defined by adding XY series.
Series are defined as zero-based indeces to the dataSheet, XColumn, and
YColumn.
The entire column will be plotted.'''

Are you sure that the description "The first row contains labels,
subsequent rows are numeric." fits your data?

At this stage, sentient beings who are looking for a general-purpose
Excel writer and who are not skeptics would probably turn away. Kinda
makes you believe it won't handle *any* kind of string (either str or
Unicode) at rowx == 76. Game over. "77 Sunset Row" :-)

It turns out that what happens (quite independently of row number) is:
Ints and reals are spat out as Excel NUMBER records.
For *any* other Python type, it attempts (without benefit of try/accept
or any other prophylactic) to write str(value)[:254] to an Excel LABEL
record.
Apart from silent truncation of data, this will cause:
* unicode data that contains non-ASCII characters to produce a
UnicodeDecodeError
* a long item e.g. 1234567890L to appear as an Excel text type with
value"1234567890", not an Excel number type wirth value 1234567890.0

Oh, and if you forgot to call the XLW.XLW.addChart method, it kindly
remedies your deficiency by supplying one. What if you consciously
don't want any ferschlugginer charts?

Unfortunately, outside the Redmond Pale, knowledge of charts in .XLW
files is scant. Given such a file:
* Gnumeric 1.6.something (Windows version) crashes
* OpenOffice.org's Calc 2.0.something (Windows version) thinks silently
for a second or two, then leaves you with an empty worksheet, and no
charts, and no message.
* xlrd grumbles that it was expecting a worksheet, and gives up -- I'll
fix this; it'll expect a chart as a possibility, and ignore it (as with
XLS files).

> But facing problem in handling
> special characters. I am getting encode error.

Have you read any of these:
(a) The notes on Unicode in the xlrd documentation?
(b) The Unicode howto (http://www.amk.ca/python/howto/unicode)
?

>
> UnicodeDecodeError 'ascii' codec can't encode character u'\xdf' in
> position 19: ordinal not in range(128)
> row:  76
>
> the cell value at rowx = 76, colx = 0  is
> 'Activest-Aktien-Großbritannien'

:-)
Gross Brits? Must be talking about the "Barmy Army" :-)
:-)

>
> I used Latin-1 encoding,

If, as it seems, you think you've found out what the problem is, fixed
it, and continued on, why bother telling us?

> but after the file is created I get an error
> 'Unable to read the file'.

*WHAT* program is producing this error message? Under what
circumstances?

>
> When I get the exception I want to format the string so that I can use
> it to write to a file and also query database.

Never mind what you want; what you *need* is to encode your data with
the appropriate encoding, so that you *don't* get the exception. Then
it should be a good old legacy str instance, suitable for writing
anywhere within reason. You can query your database with str and/or
unicode data, depending on how it is stored and whether the database
interface converts on the fly -- read the database-specific docs and/or
ask a specific question.

>
> Can anybody guide how to solve this problem. what encoding I should
> use, and after wring to file I should the same special character.

My parser barfed on that last clause:
ParseFailure: After "wring", expected "neck"
:-)

The encoding that you should use is one that encompasses all your data.
If latin1 doesn't hack it, switch to pyExcelerator (as already
recommended); it will write the latest (only 9 years old) version of
Excel files which are recorded in utf_16_le.

What makes you think that the encoding problem (which you appear to
have fixed (at least temporarily)) is anything at all to do with the
mystery program saying 'Unable to read the file"? Have you tried a test
input file which has *only* ASCII text in it?

>
> Also python IDLE is able to output the same character corretly when I
> say print and why not I?

I have to go out now, so I'll leave something for someone else to
answer.

>
>
> Any suggestions would be greatly appreciated.
>

If it all becomes too hard, try sending me (1) a two-sheet input test
file for your app (2) the corresponding error-message-causing output
file and (3) your code, and I'll have a look at it.

Cheers,
John




More information about the Python-list mailing list