getting data with proper encoding to the finish

Mon Mar 14 19:29:10 EST 2005

Ksenia Marasanova wrote:
> > > There is some amount of data in a database (PG) that must be
inserted
> > > into Excel sheet and emailed. Nothing special, everything works.
> > > Except that non-ascii characters are not displayed properly.
> > > The data is stored as XML into a text field.
> >
> > This sentence doesn't make much sense. Explain.
>
> Sorry, I meant: I use field of the type 'text' in a Postgres table to
> store my data. The data is a XML string.
>
> > Instead of "print data", do "print repr(data)" and show us what you
> > get. What *you* see on the screen is not much use for diagnosis;
it's
> > the values of the bytes in the file that matter.
>
> Thanks for this valuable tip. I take letter "é" as an example.
>
> "print repr(data)" shows this:
> u'\xe9'
>
>
> > Open the spreadsheet with Microsoft Excel, copy-and-paste some data
to
> > a Notepad window, save the Notepad file as Unicode type named (say)
> > "junk.u16"  then at the Python interactive prompt do this:
> >
> > file("junk.u16", "rb").read().decode("utf16")
> >
> > and show us what you get.
>
> (I am on a Mac so I used Textedit to create a UTF-16 encoded file,
right?)
> The result from Python is:
>
> u'\u0439'

So Unicode U+00E9 has become U+0439? Magic! I suspect that there is a
conversion step or two in there that you haven't mentioned. Are you
talking about the spreadsheet after it is created by your script on the
machine that created it [which is what I asked], or are you talking
about the spreadsheet on the recipient's machine, or are you talking
about the spreadsheet after the recipient has e-mailed it back to you,
hopefully untouched?

>
> In Excel sheet it is shown as: й
>
> (Russian again?!)

This is probably indicative that the Latin-1 e-acute (0xE9) is being
converted to Unicode U+0439 by something that thinks it is actually in
an 8-bit Cyrillic encoding (0xE9 is "Cyrillic small letter short I" in
some 8-bit encodings) but the u-umlaut becoming GHE example doesn't fit
this story.

Please do a test where you put several different accented Latin letters
in the one field in your database. No, put ALL the non-ASCII characters
that you expect to be transmitted unchanged into test fields -- this
will make you think about what language(s)/locale(s) your database is
designed for and what language(s)/locale(s) your e-mail targets use.
Having this test data will be useful in the future for verifying that
your system works. Repeat all the above steps. Tell us what you see in
Excel on your machine and on the recipient's machine.

>
> >
> > > and Python email package to email it... and the resulting sheet
> > > is not good:
> >
> > E-mailed how? To whom? [I.e. what country / what cultural
background /
> > on what machine / what operating system / viewed using what
software]
>
> Emailed with Python, please see the code at the end of the message.
> The receiving system is OS X  with languages priority: Dutch,
English,
> German, Russian and Hebrew. Viewer: MS Office 2004.

Sending system is ...?

>
>
> > You are saying (in effect) U+0413 (Cyrillic upper case letter GHE)
is
> > displayed instead of U+00FC (Latin small letter U with diaeresis).
> >
> > OK, we'd already guessed your background from your name :-)
>
> :-)
> >
> > However, what you see isn't necessarily what you've got. How do you
> > know it's not U+0393 (Greek capital letter GAMMA) or something else
> > that looks the same? Could even be from a line-drawing set (top
left
> > corner of a box). What you need to do is find out the ordinal of
the
> > character being displayed.
> >
> > This type of problem arises when a character is written in one
encoding
> > and viewed using another. I've had a quick look through various
> > likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
> > cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
> > encodings that would reproduce anything like your "umlauted-u
becomes
> > gamma-or-similar" problem. Please supply more than 1 example.
>
> Thank you very much for your help and explanation! The "é" letter is
> what I could find till now, see examples above..
>
>
>
>
> The following code fragment is used for creating Excel sheet and
sending email:
> # ##################
> # Create Excel sheet
> #
> f = StringIO()
> workbook  = xl.Writer(f)
> worksheet = workbook.add_worksheet()
> bold = workbook.add_format(bold=1)
> border = workbook.add_format(border=1)
>
> worksheet.write_row('A1', ['First Name', 'Last Name'], border)
> i = 1
> for row in result:
>      print repr(row['firstname'])
>     datarow = [row.get('firstname'), row.get('surname')]

You print the repr() of row['firstname'] but pass row.get('firstname')
to the XLS writer -- do they have the same value? For believability,
print what you pass!!

>     i += 1
>     worksheet.write_row('A%s' % i, datarow)

Check the documentation for the XLS writer module to see if it is doing
an implicit conversion here.

> workbook.close()
> ####################
> # Create email message
>
> # Create the container (outer) email message.
> msg = MIMEMultipart()
> # Attach Excel sheet
> xls = MIMEBase('application', 'vnd.ms-excel')
> xls.set_payload(f.getvalue())
> Encoders.encode_base64(xls)
> xls.add_header('Content-Disposition', 'attachment', filename='some
> file name %s-%s-%s.xls' % (today.day, today.month, today.year))
> msg.attach(xls)
>
> msg['Subject'] = subject
> msg['From'] = fromaddr
> msg['To'] = toaddr
> # Guarantees the message ends in a newline
> msg.epilogue = ''
>
> # Send message
> s = smtplib.SMTP(smtp_host)
> s.sendmail(fromaddr, to_list, msg.as_string())
> 
> -- 
> Ksenia