getting data with proper encoding to the finish

Mon Mar 14 18:34:17 EST 2005

> > There is some amount of data in a database (PG) that must be inserted
> > into Excel sheet and emailed. Nothing special, everything works.
> > Except that non-ascii characters are not displayed properly.
> > The data is stored as XML into a text field.
> 
> This sentence doesn't make much sense. Explain.

Sorry, I meant: I use a field of type 'text' in a Postgres table to
store my data. The data is an XML string.

> Instead of "print data", do "print repr(data)" and show us what you
> get. What *you* see on the screen is not much use for diagnosis; it's
> the values of the bytes in the file that matter.

Thanks for this valuable tip. I'll take the letter "é" as an example.

"print repr(data)" shows this:
u'\xe9'
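
For comparison, here is a minimal sketch of what the same letter looks
like as a text string and after encoding. It is written for Python 3,
where ascii() plays the role that repr() plays above; the variable
names are only illustrative.

```python
# Compare the code point of a text string with its encoded bytes.
letter = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, the example above

as_text = ascii(letter)               # escaped form of the text string
as_latin1 = letter.encode("latin-1")  # a single byte
as_utf8 = letter.encode("utf-8")      # two bytes

print(as_text)
print(as_latin1)
print(as_utf8)
```

A single code point U+00E9 suggests that the value coming out of the
database is already a proper Unicode string, so the damage presumably
happens somewhere later on the way to the sheet.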

> Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
> a Notepad window, save the Notepad file as Unicode type named (say)
> "junk.u16"  then at the Python interactive prompt do this:
> 
> file("junk.u16", "rb").read().decode("utf16")
> 
> and show us what you get.

(I am on a Mac, so I used TextEdit to create a UTF-16 encoded file; is that right?)
The result from Python is:

u'\u0439'

In the Excel sheet it is shown as: й

(Russian again?!)
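
Since й (U+0439) turns up where é (U+00E9) was written, one way to
narrow this down is to try pairs of encodings and see which pair turns
the first letter into the second. The following is only a sketch for
Python 3; the list of candidate encodings is a guess and the function
name find_pairs is made up for this example.

```python
# Find (written_as, read_as) encoding pairs that turn one character
# into another. The candidate list is an assumption, not exhaustive.
CANDIDATES = ["latin-1", "cp1252", "utf-8", "mac_roman",
              "cp1251", "koi8_r", "iso8859_5", "mac_cyrillic"]

def find_pairs(written, seen, encodings=CANDIDATES):
    pairs = []
    for enc_out in encodings:
        try:
            raw = written.encode(enc_out)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the character
        for enc_in in encodings:
            if enc_in == enc_out:
                continue
            try:
                decoded = raw.decode(enc_in)
            except UnicodeDecodeError:
                continue  # these bytes are not valid in enc_in
            if decoded == seen:
                pairs.append((enc_out, enc_in))
    return pairs

pairs = find_pairs("\u00e9", "\u0439")
for enc_out, enc_in in pairs:
    print("written as %s, read as %s" % (enc_out, enc_in))
```

Among the pairs reported are latin-1 read as cp1251 and latin-1 read as
mac_cyrillic. That would be consistent with the byte 0xE9 ending up in
the sheet and the viewer interpreting it with a Cyrillic code page, but
it is only a hypothesis until more examples confirm it.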

> 
> > and Python email package to email it... and the resulting sheet
> > is not good:
> 
> E-mailed how? To whom? [I.e. what country / what cultural background /
> on what machine / what operating system / viewed using what software]

Emailed with Python, please see the code at the end of the message.
The receiving system is OS X  with languages priority: Dutch, English,
German, Russian and Hebrew. Viewer: MS Office 2004.

> You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
> displayed instead of U+00FC (Latin small letter U with diaeresis).
> 
> OK, we'd already guessed your background from your name :-)

:-) 
> 
> However, what you see isn't necessarily what you've got. How do you
> know it's not U+0393 (Greek capital letter GAMMA) or something else
> that looks the same? Could even be from a line-drawing set (top left
> corner of a box). What you need to do is find out the ordinal of the
> character being displayed.
> 
> This type of problem arises when a character is written in one encoding
> and viewed using another. I've had a quick look through various
> likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
> cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
> encodings that would reproduce anything like your "umlauted-u becomes
> gamma-or-similar" problem. Please supply more than 1 example.
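
The look-alike characters mentioned above can be told apart by their
ordinals and Unicode names. A small sketch for Python 3 follows; the
helper name describe is made up, and the box-drawing character U+250C
is just one possible "top left corner" chosen for illustration.

```python
import unicodedata

def describe(ch):
    # Return the code point and official Unicode name of one character.
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))

# Three characters that can look alike on screen.
for ch in ["\u0413", "\u0393", "\u250c"]:
    print(describe(ch))
```

Pasting the character actually displayed in the sheet into describe()
gives its ordinal, which is what matters for the diagnosis.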

Thank you very much for your help and explanation! So far the letter "é"
is the only further example I could find; see above.

The following code fragment is used for creating the Excel sheet and sending the email:
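# Imports needed by this fragment (Python 2.4 module names)
from StringIO import StringIO
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email import Encoders
# xl is the Excel writer module, presumably: import pyXLWriter as xl

# ##################
# Create Excel sheet
#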
f = StringIO()
workbook  = xl.Writer(f)
worksheet = workbook.add_worksheet()
bold = workbook.add_format(bold=1)
border = workbook.add_format(border=1)

worksheet.write_row('A1', ['First Name', 'Last Name'], border)
i = 1
for row in result:
    print repr(row['firstname'])
    datarow = [row.get('firstname'), row.get('surname')]
    i += 1
    worksheet.write_row('A%s' % i, datarow)
workbook.close()
####################
# Create email message

# Create the container (outer) email message.
msg = MIMEMultipart()
# Attach Excel sheet
xls = MIMEBase('application', 'vnd.ms-excel')
xls.set_payload(f.getvalue())
Encoders.encode_base64(xls)
xls.add_header('Content-Disposition', 'attachment',
               filename='some file name %s-%s-%s.xls'
               % (today.day, today.month, today.year))
msg.attach(xls)

msg['Subject'] = subject
msg['From'] = fromaddr
msg['To'] = toaddr
# Guarantees the message ends in a newline
msg.epilogue = ''

# Send message
s = smtplib.SMTP(smtp_host)
s.sendmail(fromaddr, to_list, msg.as_string())

-- 
Ksenia