Unicode from Web to MySQL

Francis Avila francisgavila at yahoo.com
Sat Dec 20 16:12:05 EST 2003


Bill Eldridge wrote in message ...
>etc., etc., but after numerous tries in the end I still keep getting
>either SQL errors or
>the dreaded 'ascii' codec can't decode byte ... in position ...' errors.

Here's your clue: your string contains a byte which is not representable by
ascii.

>Can someone give me any explanation of how to do this easily?

Quickstart guide:
You first need to decode your string to unicode.  You do this by
'stringfromweb'.decode('encoding-of-the-string').  So if you grab a web page
that's in latin-1, you do 'stringfromweb'.decode('latin-1') and get unicode.
If you later want utf-8 (to plunk into SQL), take that unicode and
.encode('utf8').
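
For instance, pretending the page really is latin-1 (the encoding here is just
an assumption for the example):

>>> stringfromweb = '\xe9t\xe9'                        # "été" as latin-1 bytes
>>> stringfromweb.decode('latin-1')                    # bytes -> unicode
u'\xe9t\xe9'
>>> stringfromweb.decode('latin-1').encode('utf8')     # unicode -> utf-8 bytes
'\xc3\xa9t\xc3\xa9'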

Path-to-understanding:
You need to understand how unicode plays into this first.

Unicode is not an encoding.  Unicode is an abstract mapping of numbers
(called code points) to letters.  Pure, undistilled "Unicode" is what you
see in those huge charts which show a number on the left and a long
uppercase letter/symbol description on the right. Unicode itself has nothing
to do with bytes, or even with computers.

A Python Unicode object is just that: an ordered sequence of unicode code
points.  It has no natural byte representation.  If you want that, you need
to encode it.

Note that unicode objects have no "decode" method.  This is because unicode
is a LACK of encoding!  Encoding maps symbols to byte representations, and a
unicode object is the explicit lack of a byte representation.  So there are
no bytes to decode from.  (Now of course the computer needs *some*
representation, because all it knows is bytes, but that could be anything,
and is entirely an implementation detail that you don't need to know about.
But you can see it with the 'unicode-internal' codec.)

A Python str object is an ordered sequence of 8-bit bytes.  It is not really
a string--that's a holdover from the bygone days of pre-unicode Python.
When you encode a unicode object, you get raw bytes in some representation
of unicode characters, which are held by a str.  When you want a unicode
object, you build it from a str and a *known encoding*.

Now, what is the encoding of a str?  You see this is like a strange Koan,
because bytes is bytes.  Bytes have no intrinsic meaning until we give them
some.  So whenever you decode a string to get unicode, you MUST supply the
encoding of the string!
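
To see why, hand the same two bytes to two different codecs; both succeed,
but only one can be what the author of the page meant:

>>> '\xc3\xa9'.decode('utf8')       # one character: e with acute accent
u'\xe9'
>>> '\xc3\xa9'.decode('latin-1')    # two characters: 'Ã' and '©'
u'\xc3\xa9'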

There are ways to specify a default encoding for strings in Python (see your
site.py and sys.get/setdefaultencoding), but the default default is ascii.
Hence if byte '\xef' is found in a str, any attempt to encode it will choke,
because that byte is not in the 'ascii' encoding and thus the claim that
this str is encoded in ascii is false.  (str.encode(codec) is really
shorthand for str.decode(default-encoding) -> unicode.encode(codec) )
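
You can check the default on your own install:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'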

Now, let's examine:

>>> 'abc'.decode('utf8')
u'abc'

"Take three bytes 'abc', and decode it as if it were a unicode string
encoded as utf8."


>>> '\xef'.encode('utf8')
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)

What you really need to do, then, is:

>>> PureUnicodeUnsulliedByBits = stringfromtheBADBADweb.decode('latin-1')
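
With the troublesome byte from the traceback above, the whole trip looks like
this (again assuming the page was latin-1):

>>> '\xef'.decode('latin-1')
u'\xef'
>>> '\xef'.decode('latin-1').encode('utf8')
'\xc3\xaf'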

Or:

import MySQLdb, urllib

# Fetch the raw bytes of the page.
data = urllib.urlopen('http://localhost/test.html').read()

# Decode to unicode, then re-encode as utf-8 for the database.
data2 = data.decode(<the-encoding-of-this-string>).encode('utf8')
...
# Let the driver quote the value instead of interpolating it into the SQL.
c.execute("INSERT INTO junk (junklet) VALUES (%s)", (data2,))

Finding the encoding of that string from the web is where the tears come in.
If you're lucky, urllib.urlopen('http://....').info().getencoding() will
give it to you.  However, this gets its info from the http headers, and if
they don't specify encoding, it defaults to '7bit'.  But the html page
itself *might* have a different idea about its own encoding in the <meta>
element, 'content' attribute, which may be of the form "text/html;
charset=ISO-8859-1".  Or it might not, who knows?

In other words, there is no standard, 100% reliable method of getting the
encoding of a web page.  In an ideal world, the http header would have it,
and that's that.  In the real world, you have to juggle various combinations
of information, missing information, and disinformation from the http
protocol header's info, the html file's meta info, and charset guessing
algorithms (look for Enca).

There might be a way to get urllib to request an encoding (as browsers do),
so that the http header will at least give some slightly more useful
information back, but I don't know how.  As it is, the response will usually
not specify the charset when urllib is used, forcing you to look in the html
file itself.
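
If you feel like experimenting, urllib2 at least lets you attach your own
request headers; whether sending Accept-Charset persuades the server to
declare anything is another matter (the header and URL below are only an
illustration):

import urllib2

# Ask the server for an encoding we can handle; it is free to ignore us.
req = urllib2.Request('http://localhost/test.html',
                      None,
                      {'Accept-Charset': 'utf-8, ISO-8859-1'})
f = urllib2.urlopen(req)
print f.info().getheader('Content-Type')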

But once you get the encoding, everything is fine....
--
Francis Avila




