getting data with proper encoding to the finish
Ksenia Marasanova
ksenia.marasanova at gmail.com
Mon Mar 14 18:34:17 EST 2005
> > There is some amount of data in a database (PG) that must be inserted
> > into Excel sheet and emailed. Nothing special, everything works.
> > Except that non-ascii characters are not displayed properly.
> > The data is stored as XML into a text field.
>
> This sentence doesn't make much sense. Explain.
Sorry, I meant: I use field of the type 'text' in a Postgres table to
store my data. The data is a XML string.
> Instead of "print data", do "print repr(data)" and show us what you
> get. What *you* see on the screen is not much use for diagnosis; it's
> the values of the bytes in the file that matter.
Thanks for this valuable tip. I take letter "é" as an example.
"print repr(data)" shows this:
u'\xe9'
> Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
> a Notepad window, save the Notepad file as Unicode type named (say)
> "junk.u16" then at the Python interactive prompt do this:
>
> file("junk.u16", "rb").read().decode("utf16")
>
> and show us what you get.
(I am on a Mac so I used Textedit to create a UTF-16 encoded file, right?)
The result from Python is:
u'\u0439'
In Excel sheet it is shown as: й
(Russian again?!)
>
> > and Python email package to email it... and the resulting sheet
> > is not good:
>
> E-mailed how? To whom? [I.e. what country / what cultural background /
> on what machine / what operating system / viewed using what software]
Emailed with Python, please see the code at the end of the message.
The receiving system is OS X with languages priority: Dutch, English,
German, Russian and Hebrew. Viewer: MS Office 2004.
> You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
> displayed instead of U+00FC (Latin small letter U with diaeresis).
>
> OK, we'd already guessed your background from your name :-)
:-)
>
> However, what you see isn't necessarily what you've got. How do you
> know it's not U+0393 (Greek capital letter GAMMA) or something else
> that looks the same? Could even be from a line-drawing set (top left
> corner of a box). What you need to do is find out the ordinal of the
> character being displayed.
>
> This type of problem arises when a character is written in one encoding
> and viewed using another. I've had a quick look through various
> likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
> cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
> encodings that would reproduce anything like your "umlauted-u becomes
> gamma-or-similar" problem. Please supply more than 1 example.
Thank you very much for your help and explanation! The "é" letter is
what I could find till now, see examples above..
The following code fragment is used for creating Excel sheet and sending email:
# ##################
# Create Excel sheet
#
f = StringIO()
workbook = xl.Writer(f)
worksheet = workbook.add_worksheet()
bold = workbook.add_format(bold=1)
border = workbook.add_format(border=1)
worksheet.write_row('A1', ['First Name', 'Last Name'], border)
i = 1
for row in result:
print repr(row['firstname'])
datarow = [row.get('firstname'), row.get('surname')]
i += 1
worksheet.write_row('A%s' % i, datarow)
workbook.close()
####################
# Create email message
# Create the container (outer) email message.
msg = MIMEMultipart()
# Attach Excel sheet
xls = MIMEBase('application', 'vnd.ms-excel')
xls.set_payload(f.getvalue())
Encoders.encode_base64(xls)
xls.add_header('Content-Disposition', 'attachment', filename='some
file name %s-%s-%s.xls' % (today.day, today.month, today.year))
msg.attach(xls)
msg['Subject'] = subject
msg['From'] = fromaddr
msg['To'] = toaddr
# Guarantees the message ends in a newline
msg.epilogue = ''
# Send message
s = smtplib.SMTP(smtp_host)
s.sendmail(fromaddr, to_list, msg.as_string())
--
Ksenia
More information about the Python-list
mailing list