Characters aren't displayed correctly

Tue Mar 3 05:21:20 EST 2009

On Mar 3, 8:49 pm, Hussein B <hubaghd... at gmail.com> wrote:
> On Mar 3, 11:05 am, Hussein B <hubaghd... at gmail.com> wrote:
>
>
>
> > On Mar 2, 5:40 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > > On Mar 3, 1:50 am, Hussein B <hubaghd... at gmail.com> wrote:
>
> > > > On Mar 2, 4:31 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > > > > On Mar 2, 7:30 pm, Hussein B <hubaghd... at gmail.com> wrote:
>
> > > > > > On Mar 1, 4:51 pm, Philip Semanchuk <phi... at semanchuk.com> wrote:
>
> > > > > > > On Mar 1, 2009, at 8:31 AM, Hussein B wrote:
>
> > > > > > > > Hey,
> > > > > > > > I'm retrieving records from a MySQL database that contains non-English
> > > > > > > > characters.
>
> > > > > Can you reveal which language???
>
> > > > Arabic
>
> > > > > > > > Then I create a String that contains HTML markup and column values
> > > > > > > > from the previous result set.
> > > > > > > > +++++
> > > > > > > > markup = u'''<table>.....'''
> > > > > > > > for row in rows:
> > > > > > > >     markup = markup + '<tr><td>' + row['id']
> > > > > > > > markup = markup + '</table>'
> > > > > > > > +++++
> > > > > > > > Then I'm sending the email according to this tip:
> > > > > > > > http://code.activestate.com/recipes/473810/
> > > > > > > > Well, the email contains ????? characters in place of the non-English ones.
> > > > > > > > Any ideas?
>
> > > > > > > There are so many places where this could go wrong and you haven't
> > > > > > > narrowed down the problem.
>
> > > > > > > Are the characters stored in the database correctly?
>
> > > > > > Yes they are.
>
> > > > > How do you KNOW that they are stored correctly? What makes you so
> > > > > sure?
>
> > > > Because MySQL Query Browser displays them correctly, in addition I use
> > > > BIRT as the reporting system and it shows them correctly.
>
> > > > > > > Are they stored consistently (i.e. all using the same encoding, not  
> > > > > > > some using utf-8 and others using iso-8859-1)?
>
> > > > > > Yes.
>
> > > > > So what is the encoding used to store them?
>
> > > > Tables are created with UTF-8 encoding option
>
> > > > > > > What are you getting out of the database? Is it being converted to  
> > > > > > > Unicode correctly, or at all?
>
> > > > > > I don't know, how to make sure of this point?
>
> > > > > You could show us some of the output from the database query. As well
> > > > > as
> > > > >    print the_output
> > > > > you should
> > > > >    print repr(the_output)
> > > > > and show us both, and also tell us what you *expect* to see.
>
> > > > The result of print repr(row['name']) is '??? ??????'
> > > > The '?' characters are supposed to be Arabic characters.
>
> > > Are you expecting 3 Arabic characters, a space, and then 6 Arabic
> > > characters?
>
> > > We now have some interesting evidence: row['name'] is NOT a unicode
> > > object -- otherwise the print would show u'??? ??????'; it's a str
> > > object.
>
> > > So: A utf8-encoded string is being decoded to unicode, and then re-
> > > encoded to some other encoding, using the "replace" (with "?") error-
> > > handling method. That shouldn't be hard to spot! It's about time you
> > > showed us the code you are using to extract the data from the
> > > database, including the print statements you have put in.
>
> > This is how I retrieve the data:
>
> > db = MySQLdb.connect(host = "127.0.0.1", port = 3306,
> >                      user = "username", passwd = "passwd",
> >                      db = "reporting")
> > cr = db.cursor(MySQLdb.cursors.DictCursor)
> > cr.execute(sql)
> > rows = cr.fetchall()
>
> > Thanks all for your nice help.
>
> Hey,
> I added use_unicode and charset keyword params to the connect() method

Hey, that was a brilliant idea -- I was just about to ask you to try
use_unicode=True, charset="utf8" ... what were the actual values that
you used?

Let's suppose that you used charset="XXXX" ... as far as I can tell,
not being a MySQLdb user myself, the fact that you had to specify it
means that your data tables and/or your default connection don't use
XXXX as an encoding. If so, this is an issue you might like to take up
with whoever created the database that you are using.
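
For reference, a minimal sketch of what the retrieval code might look like
with those two keyword parameters added. The host, credentials and database
name are the placeholders from the quoted code; use_unicode=True and
charset="utf8" are the values suggested above, not necessarily the ones
actually used. The names CONNECT_KWARGS and fetch_rows are made up for
illustration.

```python
# Connection settings from the quoted code, plus the two suggested
# keyword parameters. charset="utf8" is an assumption about the tables.
CONNECT_KWARGS = dict(
    host="127.0.0.1",
    port=3306,
    user="username",
    passwd="passwd",
    db="reporting",
    use_unicode=True,   # ask MySQLdb to return text columns as unicode
    charset="utf8",     # character set to use for the connection
)


def fetch_rows(sql):
    """Run a query and return all rows as dictionaries (hypothetical helper)."""
    # Imported here so the sketch can be loaded without the driver installed.
    import MySQLdb
    import MySQLdb.cursors

    db = MySQLdb.connect(**CONNECT_KWARGS)
    try:
        cr = db.cursor(MySQLdb.cursors.DictCursor)
        cr.execute(sql)
        return cr.fetchall()
    finally:
        db.close()
```

With a setup like this, print repr(row['name']) should show a u'...' value
rather than a str full of question marks.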

> and I got the following:
> u'\u062f\u062e\u0648\u0644 \u0633\u0631\u064a\u0639
> \u0634\u0647\u0631'
> So characters are getting converted successfully.

I guess so -- U+06nn sure are Arabic characters :-)

However as suggested above, "converted from what?" might be worth
pursuing if you like to understand what is going on instead of just
applying magic recipes ;-)
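
To make "converted from what?" a bit more concrete, here is a small sketch
of how the earlier '?' output can arise. It assumes (a guess, nothing in
this thread confirms it) that the data was being transcoded to a
single-byte charset such as latin-1 somewhere along the way, and it takes
the line break in the quoted repr above to stand for a space. It is written
for Python 3; under Python 2 the encoded values would be plain str objects.

```python
# The unicode value reported above (line break read as a space).
text = u'\u062f\u062e\u0648\u0644 \u0633\u0631\u064a\u0639 \u0634\u0647\u0631'

# What a UTF-8 table would hold: two bytes per Arabic character here.
stored = text.encode('utf-8')

# Encoding to a charset with no Arabic letters (latin-1 is an assumption)
# using the "replace" error handler turns each such character into '?'.
lossy = text.encode('latin-1', 'replace')
print(repr(lossy))

# Decoding the stored bytes with the right codec gets the text back ...
round_tripped = stored.decode('utf-8')
print(round_tripped == text)

# ... but nothing can recover the text from the '?' version.
print(lossy.decode('latin-1') == text)
```

The point is that the question marks are real data by the time Python sees
them, so the fix has to happen before that step, not after it.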

> Well, using the previous recipe for sending the mail:
> http://code.activestate.com/recipes/473810/
> I got the following error:
>
> Traceback (most recent call last):
>   File "HtmlMail.py", line 52, in <module>
>     s.sendmail(sender, receiver , msg.as_string())

[big snip]

> _handle_text
>     self._fp.write(payload)
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 115-118: ordinal not in range(128)
>
> Again, any ideas guys? :)

That recipe appears to have been written by an ascii bigot for ascii
bigots :-(

Try reading the docs for email.charset (that's the charset module in
the email package).
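
As a rough sketch of where those docs lead: one option is to give the text
part an explicit charset, so that the email package encodes the body itself
instead of falling back to ASCII. The example below is written for Python 3
and uses MIMEText and Header from the standard email package. It is not the
recipe's code; the helper name build_html_message, the addresses and the
subject are made up for illustration. On Python 2 the unicode text would
probably need to be encoded to UTF-8 bytes before being handed to MIMEText.

```python
from email.header import Header
from email.mime.text import MIMEText


def build_html_message(sender, receiver, subject, html):
    """Build an HTML message whose body is declared as UTF-8 (hypothetical helper)."""
    # The third argument sets the charset, so the body gets a transfer
    # encoding that is safe to send over an ASCII-only channel.
    msg = MIMEText(html, "html", "utf-8")
    msg["From"] = sender
    msg["To"] = receiver
    # Header() takes care of non-ASCII characters in the subject line.
    msg["Subject"] = Header(subject, "utf-8")
    return msg


# A value like the one that came back from the database.
name = u"\u062f\u062e\u0648\u0644"
html = u"<table><tr><td>" + name + u"</td></tr></table>"

msg = build_html_message(
    "sender@example.com", "receiver@example.com", u"Report", html
)

# The serialized message is pure ASCII, so this does not raise.
wire_text = msg.as_string()
wire_text.encode("ascii")

# Decoding the payload gives the original markup back.
decoded = msg.get_payload(decode=True).decode("utf-8")
```

The result of msg.as_string() is what would be passed to s.sendmail() in
place of the recipe's message.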

Cheers,
John