minidom xml & non ascii / unicode & files

webdev webdev at chaosmedia.org
Fri Aug 5 11:15:16 EDT 2005


lo all,

some of the questions i'll ask below have most certainly been discussed 
already, i just hope someone's kind enough to answer them again to help 
me out..

so i started a python 2.3 script that grabs some web pages, parses the 
data with regexes and stores it locally in an xml file for further use..

at first i had no problem using python minidom and everything concerning 
my regex/xml processing worked fine, until i tested my tool on some 
french pages with "non ascii" chars and my script started to throw errors 
all over the place..

I've looked into the matter and discovered the unicode / string encoding 
processes involved when dealing with non ascii text and i must say i 
almost lost my mind.. I'm losing it actually..

so here are the few questions i'd like to have answers for :

1. when fetching a web page from the net, how am i supposed to know how 
it's encoded.. And can i decode it to unicode and then encode it back to 
a byte string in the charset i want, like utf-8, so i can use it in my 
code.. ?
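
For question 1, one common approach is to read the charset from the 
page's Content-Type header, decode the raw bytes with it, and re-encode 
to utf-8. The lines below are only a minimal sketch of that idea, written 
for a current Python 3 (not the 2.3 mentioned above, where the decode 
step would be unicode(raw, charset)). The function names and the 
iso-8859-1 fallback are assumptions made for illustration, and a charset 
declared inside the html itself is not looked at.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value.

    The fallback value is an assumption, used when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def to_utf8(raw_bytes, content_type):
    """Decode fetched bytes with the declared charset, re-encode as utf-8."""
    charset = charset_from_content_type(content_type)
    text = raw_bytes.decode(charset)   # bytes -> unicode text
    return text.encode("utf-8")        # unicode text -> utf-8 bytes


# "cafe" with an e-acute, as a latin-1 server would send it
latin1_page = b"caf\xe9"
utf8_page = to_utf8(latin1_page, "text/html; charset=ISO-8859-1")
```

With urllib the header value would come from the response headers; here 
it is passed in as a plain string so the sketch runs without a network.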

2. along the same lines, could anyone post the few lines that would 
actually parse an xml file containing non ascii chars with minidom 
(parseString i guess), 
then convert a string grabbed from the net so parts of it can be 
inserted into that dom object, in new or existing nodes, 
and finally write that dom object back to a file in a way that it can be 
read again later by the same script..

I've been trying to do that for a few days with no luck..
I can do each separate part of the job, not that i'm quite sure how i 
decode/encode stuff in there, but as soon as i try to do everything at 
the same time i get encoding errors thrown all the time..
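
The whole round trip asked for in question 2 might look roughly like the 
sketch below. It assumes a current Python 3 rather than 2.3, and the 
element name "page", the sample text, the helper names and the temp file 
path are all made up for illustration. The idea is to keep text as 
unicode inside the dom and only deal with bytes when reading or writing.

```python
import os
import tempfile
from xml.dom import minidom

# An existing utf-8 document with non ascii text ("deja vu" with accents).
SOURCE_XML = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<pages><page>d\u00e9j\u00e0 vu</page></pages>"
).encode("utf-8")


def add_page(dom, text):
    """Append a new <page> element holding the given unicode text."""
    node = dom.createElement("page")
    node.appendChild(dom.createTextNode(text))
    dom.documentElement.appendChild(node)


def save(dom, path):
    """Write the dom as utf-8 bytes, with a matching xml declaration."""
    with open(path, "wb") as f:
        f.write(dom.toxml(encoding="utf-8"))


def load(path):
    """Read the file as bytes and let the parser honour the declaration."""
    with open(path, "rb") as f:
        return minidom.parseString(f.read())


dom = minidom.parseString(SOURCE_XML)

# Bytes as they might arrive from a latin-1 page, decoded before insertion.
fetched = b"caf\xe9 cr\xe8me".decode("iso-8859-1")
add_page(dom, fetched)

path = os.path.join(tempfile.mkdtemp(), "pages.xml")
save(dom, path)

reloaded = load(path)
texts = [p.firstChild.data for p in reloaded.getElementsByTagName("page")]
```

Opening the file in binary mode on both sides leaves the encoding 
decision to the xml declaration, so the saved file can be parsed again 
by the same code.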

3. in order to help me understand what's going on when doing 
encodes/decodes could you please tell me if in the following example, s 
and backToBytes are actually the same thing ??

s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )

i know they are both bytestrings but i doubt they actually have the same 
content..
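
A quick way to check question 3 is to compare the two values directly. 
The sketch below does that in a current Python 3, where the byte string 
is written b"..." and unicode(s, "utf-8") becomes s.decode("utf-8"); the 
second half, with a latin-1 encoded "cafe", is an added illustration and 
not part of the example above.

```python
# Pure ascii bytes are also valid utf-8, so the round trip is lossless.
s = b"hello normal string"
u = s.decode("utf-8")
back_to_bytes = u.encode("utf-8")
same_content = (back_to_bytes == s)

# Bytes in another charset are a different story: a latin-1 e-acute
# (0xe9) is not valid utf-8, so decoding it as utf-8 raises an error.
latin1_bytes = b"caf\xe9"
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the right charset first, then encoding to utf-8, works,
# but the resulting bytes differ from the original ones.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
```

So for plain ascii input the two byte strings compare equal, and the 
contents only change when the source charset and the target charset 
differ.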

4. I've also tried to set the default encoding of python for my script 
using sys.setdefaultencoding('utf-8') but it keeps telling me that 
the module does not have that method.. i'm left with no choice but to 
edit the site.py file manually to change "ascii" to "utf-8", but i won't 
be able to do that on the client computers so.. 
Anyway i don't know if it would help my script at all..

any help will be greatly appreciated
thx

Marc


