minidom xml & non ascii / unicode & files

Fri Aug 5 20:01:49 EDT 2005

webdev wrote:
> 1. when fetching a web page from the net, how am i supposed to know how
> it's encoded.. And can i decode it to unicode and encode it back to a
> byte string so i can use it in my code, with the charsets i want, like
> utf-8.. ?

It depends on the content type. If the HTTP Content-Type header carries
a charset= parameter, use that. Beware: some web servers report the
charset incorrectly. Dealing with that gracefully takes fairly complex
guessing algorithms, which are part of any recent web browser.

If there is no charset= attribute, then
- if the content type is text/html, look at a meta http-equiv tag
  in the content. If that declares a charset, use that.
- if the content type is xml (plain, or xhtml+xml), look at the
  XML declaration. Alternatively, pass the undecoded bytes to your
  XML parser, which reads the declaration itself.
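
The lookup order above can be sketched roughly as follows. This is only
an illustration: guess_charset is a made-up helper name, the regular
expressions are deliberately naive, and it does none of the error
recovery a browser does. It assumes Python 3 style bytes for the body.

```python
import re

def guess_charset(content_type, body):
    """Return the declared charset name, or None if nothing is declared.

    content_type: value of the HTTP Content-Type header (text)
    body:         the raw, undecoded response body (bytes)
    """
    # 1. A charset= parameter in the HTTP header takes precedence.
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.I)
    if m:
        return m.group(1).lower()

    head = body[:2048]  # declarations live near the start of the document
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        # 2. <meta http-equiv="Content-Type" content="...; charset=...">
        m = re.search(br'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    elif media_type.endswith("xml"):
        # 3. <?xml version="1.0" encoding="..."?>
        m = re.search(br'<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)', head, re.I)
    else:
        m = None
    return m.group(1).decode("ascii").lower() if m else None
```

If it returns None for XML, handing the bytes to the parser is still the
safer route, since the parser knows the XML default rules.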

> 2. in the same idea could anyone try to post the few lines that would
> actually parse an xml file, with non ascii chars, with minidom
> (parseString i guess).

import xml.dom.minidom
doc = xml.dom.minidom.parse("foo.xml")
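
If the document is already in memory, parseString works the same way.
A small sketch, with made-up sample data: the point is that you hand the
parser the undecoded bytes and let it honour the encoding declaration,
and what comes back out of the DOM is Unicode.

```python
import xml.dom.minidom

# Hypothetical sample input: UTF-8 bytes with a matching XML declaration.
data = u'<?xml version="1.0" encoding="utf-8"?><root name="caf\u00e9"/>'.encode("utf-8")

# Pass the raw bytes; do not decode them yourself first.
doc = xml.dom.minidom.parseString(data)

# Attribute values come back as Unicode strings, not bytes.
name = doc.documentElement.getAttribute("name")
```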

> Then convert a string grabbed from the net so parts of it can be
> inserted in that dom object into new nodes or existing nodes.

doc.documentElement.setAttribute("bar", text_from_net.decode("koi8-r"))
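
A slightly fuller sketch of the same step, covering both an existing
node and a new one. The KOI8-R bytes and the "item" element name are
invented for the example; the only real rule is to decode to Unicode
before anything touches the DOM.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(b"<root/>")

# Hypothetical bytes fetched from the net, known to be KOI8-R.
text_from_net = u"\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")

# Decode once, at the boundary; the DOM only ever sees Unicode.
text = text_from_net.decode("koi8-r")

root = doc.documentElement
root.setAttribute("bar", text)              # existing node, new attribute

item = doc.createElement("item")            # new node with a text child
item.appendChild(doc.createTextNode(text))
root.appendChild(item)
```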

> And finally write that dom object back to a file in a way it can be used
> again later with the same script..

open("/tmp/foo.txt","wb").write(doc.toxml("utf-8"))
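
To check that the file can be read again later, a round trip along these
lines should do. It is a sketch: the temporary path is arbitrary, and it
assumes you want UTF-8 on disk. Giving toxml an encoding makes it return
encoded bytes with a matching XML declaration, so write them in binary
mode.

```python
import os
import tempfile
import xml.dom.minidom

doc = xml.dom.minidom.parseString(b"<root/>")
doc.documentElement.setAttribute("bar", u"\u043f\u0440\u0438\u0432\u0435\u0442")

# Arbitrary scratch location for the example.
path = os.path.join(tempfile.mkdtemp(), "foo.xml")

# toxml(encoding=...) yields bytes, so open the file in binary mode.
with open(path, "wb") as f:
    f.write(doc.toxml(encoding="utf-8"))

# Reading it back: the parser picks the encoding up from the declaration.
doc2 = xml.dom.minidom.parse(path)
```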

> I've been trying to do that for a few days with no luck..
> I can do each separate part of the job, not that i'm quite sure how i
> decode/encode stuff in there, but as soon as i try to do everything at
> the same time i get encoding errors thrown all the time..

It would help if you would state what precise code you are using,
and what precise error you are getting (for what precise input).

> 
> 3. in order to help me understand what's going on when doing
> encodes/decodes could you please tell me if in the following example, s
> and backToBytes are actually the same thing ??
> 
> s = "hello normal string"
> u = unicode( s, "utf-8" )
> backToBytes = u.encode( "utf-8" )
> 
> i know they both are bytestrings but i doubt they have actually the same
> content..

They do have the same content. There is nothing to a byte string except
for the bytes. If the byte string is meant to represent characters,
they are the same "thing" only if the assumed encoding is the same.
Since the assumed encoding is "utf-8" for both s and backToBytes,
they are the same thing.
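
You can check this yourself. The sketch below uses Python 3 spelling
(bytes literals and .decode instead of unicode()); the second half is an
added illustration of what "assumed encoding" means, with bytes chosen
for the example.

```python
s = b"hello normal string"
u = s.decode("utf-8")
back_to_bytes = u.encode("utf-8")
same = (s == back_to_bytes)        # identical bytes after the round trip

# The same two bytes read under two different assumed encodings
# represent different characters.
raw = b"\xc3\xa9"
as_utf8 = raw.decode("utf-8")      # one character, U+00E9
as_latin1 = raw.decode("latin-1")  # two characters, U+00C3 and U+00A9
```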

> 4. I've also tried to set the default encoding of python for my script
> using the sys.setdefaultencoding('utf-8') but it keeps telling me that
> this module does not have that method.. i'm left no choice but to edit
> the site.py file manually to change "ascii" to "utf-8", but i won't be
> able to do that on the client computers so..

Don't do that. Changing the default encoding is meant as a last resort
for backwards compatibility, and shouldn't be used for new code. Decode
and encode explicitly instead, naming the charset each time.
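
Without a changed default, the conversion you asked about in question 1
looks roughly like this. recode is a hypothetical helper name and the
sample bytes are invented; the charset is always passed explicitly, so
nothing depends on how the interpreter is configured.

```python
def recode(data, source_charset, target_charset="utf-8"):
    """Convert bytes from one charset to another by way of Unicode."""
    return data.decode(source_charset).encode(target_charset)

# Hypothetical input: four Cyrillic letters encoded as KOI8-R.
koi8 = u"\u0442\u0435\u0441\u0442".encode("koi8-r")
utf8 = recode(koi8, "koi8-r")
```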

Regards,
Martin