unicode and strings

Diez B. Roggisch deetsNOSPAM at web.de
Wed Nov 3 05:29:59 EST 2004


Jacob Friis wrote:

> I'm trying to learn Python via Mark's Feedparser.
> 
> <snip src="http://feedparser.org/docs/character-encoding.html">
> If the character encoding can not be determined, Universal Feed Parser
> sets the bozo bit to 1 and sets bozo_exception to
> feedparser.CharacterEncodingUnknown. In this case, parsed values will be
> strings, not Unicode strings.
> </snip>
> 
> I guess this means that all data will be unicode, and to put in a
> database I could use my mycode function. Correct?

No. It means that in this case you get plain strings, not unicode objects.
A plain string is basically a sequence of bytes, and there is no way to be
sure which encoding those bytes are in.
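
A small sketch of why that matters, written in Python 3 terms (where the
unicode type is spelled str and a plain byte string is bytes). The sample
word and the two codecs are just illustrative choices, not anything
feedparser itself does:

```python
# The same text encoded with two codecs gives two different byte sequences.
text = "caf\u00e9"                      # "cafe" with an acute accent on the e
utf8_bytes = text.encode("utf-8")       # the accented letter becomes two bytes
latin1_bytes = text.encode("latin-1")   # the accented letter becomes one byte

# Decoding with the wrong codec can succeed silently and produce garbage.
misread = utf8_bytes.decode("latin-1")

# It can also fail outright: these latin-1 bytes are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

print(repr(utf8_bytes), repr(latin1_bytes), repr(misread), strict_decode_failed)
```

Nothing in the bytes themselves says which codec produced them, so the
encoding has to come from somewhere else (an HTTP header, an XML
declaration, or a guess).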

> 
> def mycode(value):
>     if isinstance(value, unicode):
>         value = value.encode('utf-8')
>     return value

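This will yield either a UTF-8 encoded string (when value is a unicode
object) or, when value is already a plain string, the same bytes unchanged,
in whatever unknown encoding they arrived in.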
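
If you need every value to reach the database as UTF-8, one possible
approach is to make the guess explicit rather than passing byte strings
through untouched. This is only a sketch in Python 3 terms (str instead of
unicode, bytes for plain strings); to_utf8 and its assumed_encoding
parameter are hypothetical names, and the latin-1 default is an arbitrary
assumption you would replace with whatever you know about your feeds:

```python
def to_utf8(value, assumed_encoding="latin-1"):
    """Return UTF-8 encoded bytes for a text or byte-string value.

    assumed_encoding is a caller-supplied guess (hypothetical parameter);
    it is only used when value is already a byte string.
    """
    if isinstance(value, str):
        # Text is unambiguous: encode it directly.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # The real encoding is unknown, so decode with the stated guess,
        # then re-encode. A wrong guess gives wrong characters or an error.
        return value.decode(assumed_encoding).encode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(repr(to_utf8("caf\u00e9")))   # text input
print(repr(to_utf8(b"caf\xe9")))    # byte input, decoded using the guess
```

The guess can still be wrong, but at least it is stated in one place
instead of being hidden in the data.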

-- 
Regards,

Diez B. Roggisch


