utf-8 encoding issue
Fredrik Lundh
fredrik at pythonware.com
Fri Sep 19 08:34:16 EDT 2003
Marc Petitmermet wrote:
> In a web form, the user enters "öttinger" and wants to search with this
> search string. My idea is now to convert the search string (which also
> could be e.g. some cyrillic text) into unicode and then to utf-8:
>
> unicode(search_string).encode('utf-8')
>
> This gives me the utf-8 encoded version of the string but not yet in the
> correct representation. How can I get the correct one (is this the hex
> version? I don't know the correct terminology.)?
>
> In short: how do I e.g. convert a string containing a "ö" into a string
> containing a "%Ö"?
that's not UTF-8, that's HTML/XML-style charrefs.
if mysql translates the charrefs to unicode characters, you can simply
use:
    s = u.encode("ascii", "xmlcharrefreplace")
where "u" is a unicode string.
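As a minimal sketch of that call, assuming a Python 3 interpreter (where str is already Unicode and encode() returns bytes, so the result is decoded back to text), the search term from the question would come out like this. The variable names are illustrative only:

```python
# Python 3 sketch: str is already Unicode, so no unicode() call is needed.
u = "öttinger"

# Every character outside ASCII is replaced by a decimal character
# reference; encode() returns bytes, so decode back to str for display.
s = u.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(s)  # &#246;ttinger
```

The references produced by this error handler are always decimal, which is why a second step is needed if the stored data uses hexadecimal ones.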
if you've stored charrefs as is in the database, you're in for some
serious trouble. assuming that all charrefs are hexadecimal charrefs,
you can use something like:
    import re
    def fixup(m): return "&#" + hex(int(m.group(1)))[1:]
    s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))
to map all non-ASCII characters to charrefs, and then translate all
charrefs to hexadecimal charrefs.
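The same two-step idea can be sketched for Python 3, assuming the input is a str. The helper names to_hex_charref and hex_charrefs are hypothetical and not part of any library:

```python
import re

def to_hex_charref(m):
    # Rewrite a decimal reference such as "&#246" as "&#xf6".
    # The trailing ";" is outside the match and is left in place.
    return "&#x%x" % int(m.group(1))

def hex_charrefs(u):
    # Step 1: map all non-ASCII characters to decimal charrefs.
    ascii_text = u.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: translate the decimal charrefs to hexadecimal ones.
    return re.sub(r"&#(\d+)", to_hex_charref, ascii_text)

print(hex_charrefs("öttinger"))  # &#xf6;ttinger
```

Note that this also rewrites any decimal charrefs that were already present in the input, and that the hex digits come out in lowercase; whether that matches the stored data is something to check against the database.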
decoding the charrefs *before* you add the strings to the database
is a better idea, though.
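One way to do that decoding, as a sketch that assumes Python 3.4 or later, is the standard library's html.unescape. The wrapper name decode_charrefs is hypothetical:

```python
import html

def decode_charrefs(text):
    # html.unescape handles decimal (&#246;) and hexadecimal (&#xf6;)
    # character references, and also named entities such as &amp;.
    return html.unescape(text)

print(decode_charrefs("&#246;ttinger"))  # öttinger
print(decode_charrefs("&#xf6;ttinger"))  # öttinger
```

Because named entities are decoded too, this is only appropriate if the incoming strings are meant to be treated as HTML-escaped text.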
</F>