minidom xml & non ascii / unicode & files

Fri Aug 5 13:54:06 EDT 2005

webdev wrote:

> lo all,
> 
> some of the questions i'll ask below have most certainly been discussed
> already, i just hope someone's kind enough to answer them again to help
> me out..
> 
> so i started a python 2.3 script that grabs some web pages from the web,
> regex parse the data and stores it localy to xml file for further use..
> 
> at first i had no problem using python minidom and everything concerning
> my regex/xml processing works fine, until i tested my tool on some
> french page with "non ascii" chars and my script started to throw errors
> all over the place..
> 
> I've looked into the matter and discovered the unicode / string encoding
> processes implied when dealing with non ascii texts and i must say i
> almost lost my mind.. I'm loosing it actually..

The general idea is:
- convert everything that's coming in (from the net, database, files) into
unicode
- do all your processing with unicode strings
- encode the strings to your preferred/the required encoding only when you
write them to the net/database/file
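
The three steps above can be sketched roughly like this. It is only an
illustration: the sample bytes and the choice of iso-8859-1 as the input
encoding are assumptions of mine, and the syntax is current Python 3 (there
`str` plays the role of `unicode` in 2.3, and `bytes` the role of the plain
byte string).

```python
# Bytes as they might arrive from the net or a file.
# Assumption for this sketch: they are encoded as iso-8859-1.
raw = b"caf\xe9 cr\xe8me"

# 1. Decode on the way in: bytes -> unicode text.
text = raw.decode("iso-8859-1")

# 2. Do all processing on the unicode text.
text = text.upper()

# 3. Encode on the way out, only when writing to the net/database/file.
out = text.encode("utf-8")
```

Only `raw` and `out` are byte strings here; everything in between is unicode.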

> so here are the few questions i'd like to have answers for :
> 
> 1. when fetching a web page from the net, how am i supposed to know how
> it's encoded.. And can i decode it to unicode and encode it back to a
> byte string so i can use it in my code, with the charsets i want, like
> utf-8.. ?

First look at the HTTP 'Content-Type' header. If it has a 'charset'
parameter, that's the encoding to use, e.g.
Content-Type: text/html; charset=iso-8859-1

If there's no encoding specified in the header, look at the <?xml .. ?>
prolog, if you have an XHTML document at hand (and the prolog is present).
See below for the syntax.

The last fallback is the <meta http-equiv="Content-Type" content="..."> tag.
The content attribute has the same format as the HTTP header.
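
A rough sketch of that lookup order (HTTP header, then XML prolog, then meta
tag) could look like the following. `guess_charset` is a hypothetical helper
of mine, not something from the standard library. The regular expressions are
deliberately simple, and the iso-8859-1 default is an assumption. The syntax
is Python 3.

```python
import re

# Hypothetical patterns for this sketch; real-world pages may need more care.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.I)
XML_DECL_RE = re.compile(
    br"^\s*<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.I)
META_RE = re.compile(br"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.I)


def guess_charset(content_type, body, default="iso-8859-1"):
    """Return a charset name for `body` (bytes), trying the sources in order."""
    # 1. The charset parameter of the HTTP Content-Type header.
    m = CHARSET_RE.search(content_type or "")
    if m:
        return m.group(1).lower()
    # 2. The encoding given in the <?xml ... ?> prolog, if present.
    m = XML_DECL_RE.search(body)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. The <meta http-equiv="Content-Type" content="..."> tag.
    m = META_RE.search(body)
    if m:
        return m.group(1).decode("ascii").lower()
    # Nothing found: fall back to the (assumed) default.
    return default
```

For example, `guess_charset("text/html; charset=iso-8859-1", body)` never
looks at the body, because the header wins.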

But you can still run into UnicodeDecodeErrors, because many websites just
don't get their encoding issues right. Browsers make some (more or less)
educated guesses and often manage to display the document as intended.
You should probably use htmlData.decode(encoding, "ignore") or
htmlData.decode(encoding, "replace") to work around these problems (but
you will lose some characters).
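
Since a UnicodeDecodeError is raised while turning bytes into unicode, the
error handler goes on the decode call. Here is a small illustration of the
two handlers, using made-up sample data that is really iso-8859-1 but
(wrongly) declared as utf-8. The syntax is Python 3.

```python
# iso-8859-1 bytes that a broken site claims are utf-8 (sample data).
data = b"caf\xe9"

# Strict decoding (the default) fails on the stray byte.
try:
    data.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# "ignore" silently drops the bytes that cannot be decoded.
ignored = data.decode("utf-8", "ignore")

# "replace" puts U+FFFD (the replacement character) in their place.
replaced = data.decode("utf-8", "replace")
```

Either way the accented character is gone, which is the loss mentioned above.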

And, as said above: don't encode the unicode strings into bytestrings and
then process the bytestrings in your program - that's a bad idea. Defer the
encoding until it's absolutely necessary (usually file.write()).

> 2. in the same idea could anyone try to post the few lines that would
> actually parse an xml file, with non ascii chars, with minidom
> (parseString i guess).

The parser determines the encoding of the file from the <?xml..?> line. E.g.
if your file is encoded in utf-8, add the line
<?xml version="1.0" encoding="utf-8"?>
at the top of it, if it's not already present.
The parser will then decode everything into unicode strings - all TextNodes,
attributes etc. should be unicode strings.

When writing the manipulated DOM back to disk, use toxml() which has an
encoding argument.
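
A minimal sketch of parsing and re-serializing, assuming a small made-up
document encoded in utf-8. The element names are invented for the example,
and the syntax is Python 3, where toxml(encoding=...) returns a byte string.

```python
from xml.dom import minidom

# A utf-8 encoded document with a non-ASCII title (sample data).
xml_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u"<page><title>\u00e9t\u00e9</title></page>"
).encode("utf-8")

# The parser reads the encoding from the <?xml ...?> line and decodes for us.
doc = minidom.parseString(xml_bytes)

# Text nodes come back as unicode strings.
title = doc.getElementsByTagName("title")[0].firstChild.data

# Serialize in another encoding; the prolog is rewritten to match.
out = doc.toxml(encoding="iso-8859-1")
```

The result in `out` can be written to a file opened in binary mode.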

> Then convert a string grabbed from the net so parts of it can be
> inserted in that dom object into new nodes or existing nodes.
> And finally write that dom object back to a file in a way it can be used
> again later with the same script..

Just insert the unicode strings.
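
Something along these lines, as a sketch only. The <pages>/<page> element
names and the iso-8859-1 source encoding are assumptions for the example, and
the syntax is Python 3.

```python
from xml.dom import minidom

# An existing (here: empty) document, as it might be loaded from disk.
doc = minidom.parseString(b'<?xml version="1.0" encoding="utf-8"?><pages/>')

# Bytes grabbed from the net, decoded to unicode as early as possible.
fetched = b"R\xe9sum\xe9".decode("iso-8859-1")

# Insert the unicode string into a new node.
node = doc.createElement("page")
node.appendChild(doc.createTextNode(fetched))
doc.documentElement.appendChild(node)

# Encode only when serializing; these bytes can go straight into a file.
saved = doc.toxml(encoding="utf-8")

# Reading the saved bytes back gives the same unicode text again.
reloaded = minidom.parseString(saved)
roundtrip = reloaded.getElementsByTagName("page")[0].firstChild.data
```

No explicit encode call appears between fetching and saving; toxml() does
the encoding at the very end.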

> I've been trying to do that for a few days with no luck..
> I can do each separate part of the job, not that i'm quite sure how i
> decode/encode stuff in there, but as soon as i try to do everything at
> the same time i get encoding errors thrown all the time..
> 
> 3. in order to help me understand what's going on when doing
> encodes/decodes could you please tell me if in the following example, s
> and backToBytes are actually the same thing ??
> 
> s = "hello normal string"
> u = unicode( s, "utf-8" )
> backToBytes = u.encode( "utf-8" )
> 
> i knwo they both are bytestrings but i doubt they have actually the same
> content..

Why not try it yourself?
"hello normal string" is just US-ASCII. The utf-8 encoded version of the
unicode string u"hello normal string" will be identical to the ASCII byte
string "hello normal string".
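
For the record, a quick check looks like this. It is written in Python 3
syntax, where the byte string needs a b"..." prefix and decode() takes the
place of unicode(s, "utf-8"). The accented example at the end is my own
addition to show where the encodings start to differ.

```python
s = b"hello normal string"

# Same round trip as in the question: bytes -> unicode -> bytes.
u = s.decode("utf-8")
back_to_bytes = u.encode("utf-8")
same = (s == back_to_bytes)

# With a non-ASCII character the chosen encoding does matter.
accented = u"\u00e9"                        # e with acute accent
as_utf8 = accented.encode("utf-8")          # two bytes
as_latin1 = accented.encode("iso-8859-1")   # one byte
```

So for pure ASCII input the two byte strings are identical; the differences
only show up once non-ASCII characters are involved.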

> 
> 4. I've also tried to set the default encoding of python for my script
> using the sys.setdefaultencoding('utf-8') but it keeps telling me that
> this module does not have that method.. i'm left no choice but to edit
> the site.py file manually to change "ascii" to "utf-8", but i won't be
> able to do that on the client computers so..
> Anyways i don't know if it would help my script at all..

There was a discussion of setdefaultencoding() just recently on various
pythonistic blogs, e.g.
http://blog.ianbicking.org/python-unicode-doesnt-really-suck.html

> 
> any help will be greatly appreciated
> thx
> 
> Marc

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/