minidom xml & non ascii / unicode & files
webdev
webdev at chaosmedia.org
Fri Aug 5 11:15:16 EDT 2005
lo all,
some of the questions i'll ask below have most certainly been discussed
already, i just hope someone's kind enough to answer them again to help
me out..
so i started a python 2.3 script that grabs some pages from the web,
parses the data with regexes and stores it locally in an xml file for further use..
at first i had no problem using python minidom, and everything concerning
my regex/xml processing worked fine, until i tested my tool on a
french page with "non ascii" chars and my script started to throw errors
all over the place..
I've looked into the matter and discovered the unicode / string encoding
steps involved when dealing with non ascii text and i must say i
almost lost my mind.. I'm losing it actually..
so here are the few questions i'd like to have answers for :
1. when fetching a web page from the net, how am i supposed to know how
it's encoded.. And can i decode it to unicode and encode it back to a
byte string so i can use it in my code, with the charsets i want, like
utf-8.. ?
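To make question 1 concrete, here is a minimal sketch of the decode / re-encode step. It assumes the server names the charset in its Content-Type header and falls back to iso-8859-1 when it does not; that fallback is only a guess, since a page can also declare its charset in a meta tag or not at all. The helper names charset_from_content_type and to_utf8 are made up for illustration, and no page is actually fetched here.

```python
import re

def charset_from_content_type(header_value, default="iso-8859-1"):
    # Hypothetical helper: pull "charset=..." out of a Content-Type value.
    # The default is an assumption, not something the server promised.
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", header_value or "",
                      re.IGNORECASE)
    if match:
        return match.group(1)
    return default

def to_utf8(raw_bytes, content_type):
    # Step 1: bytes -> unicode, using the charset the server declared.
    text = raw_bytes.decode(charset_from_content_type(content_type))
    # Step 2: unicode -> bytes again, this time in the charset we want.
    return text.encode("utf-8")

# Stand-in for the body of a fetched page (latin-1 encoded "cafe" with an accent).
raw = u"caf\xe9".encode("iso-8859-1")
utf8 = to_utf8(raw, "text/html; charset=ISO-8859-1")
print(repr(utf8))
```

The point of the sketch is the order of operations: decode once with the source charset, work with unicode, and only encode at the very end.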
2. along the same lines, could anyone post the few lines that would
actually parse an xml file containing non ascii chars with minidom
(parseString i guess).
Then convert a string grabbed from the net so parts of it can be
inserted in that dom object into new nodes or existing nodes.
And finally write that dom object back to a file in a way it can be used
again later with the same script..
I've been trying to do that for a few days with no luck..
I can do each separate part of the job, though i'm not quite sure how i
should decode/encode stuff in there, but as soon as i try to do everything at
the same time i get encoding errors thrown all the time..
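The three steps in question 2 could be sketched roughly like this. Everything concrete in it is an assumption for the sake of the example: the sample document, the element name "page", the iso-8859-1 charset of the fetched text and the output file name pages_sketch.xml. The idea is to hand minidom encoded bytes when parsing, hand it unicode when adding text, and ask toxml for utf-8 bytes when writing.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical existing document, as UTF-8 bytes with a matching declaration.
xml_bytes = (u'<?xml version="1.0" encoding="utf-8"?>'
             u'<pages><page>\xe9t\xe9</page></pages>').encode("utf-8")
doc = minidom.parseString(xml_bytes)

# Stand-in for text grabbed from the web; assumed to be iso-8859-1 here.
fetched = u"fran\xe7ais".encode("iso-8859-1")
text = fetched.decode("iso-8859-1")   # decode BEFORE it goes into the DOM

# Insert the unicode text into a new node.
node = doc.createElement("page")
node.appendChild(doc.createTextNode(text))
doc.documentElement.appendChild(node)

# Serialize to UTF-8 bytes and write them in binary mode.
out_bytes = doc.toxml("utf-8")
path = os.path.join(tempfile.gettempdir(), "pages_sketch.xml")
f = open(path, "wb")
try:
    f.write(out_bytes)
finally:
    f.close()

# Read the file back, as a later run of the script would.
reparsed = minidom.parse(path)
reparsed_texts = [p.firstChild.data
                  for p in reparsed.getElementsByTagName("page")]
print(repr(reparsed_texts))
```

Keeping only unicode inside the DOM means the encoding is decided in exactly two places, on the way in and on the way out.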
3. in order to help me understand what's going on when doing
encodes/decodes could you please tell me if in the following example, s
and backToBytes are actually the same thing ??
s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )
i know they are both bytestrings but i doubt they actually have the same
content..
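One way to check question 3 directly is to compare the two values. This is a small sketch, spelled with .decode() instead of unicode() so that it also runs on a current Python; for pure ascii input the round trip gives back identical bytes. The second half, with a made-up latin-1 byte, shows the case where the first step itself fails because the bytes are not valid utf-8.

```python
# Byte string with ascii-only content, as in the question.
s = "hello normal string".encode("ascii")
u = s.decode("utf-8")               # same role as unicode(s, "utf-8")
back_to_bytes = u.encode("utf-8")
same = (s == back_to_bytes)
print(same)

# A lone latin-1 e-acute byte is not valid utf-8, so decoding it fails.
latin1_bytes = u"\xe9".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```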
4. I've also tried to set the default encoding of python for my script
using sys.setdefaultencoding('utf-8') but it keeps telling me that
the sys module does not have that function.. i'm left with no choice but to edit
the site.py file manually to change "ascii" to "utf-8", but i won't be
able to do that on the client computers so..
Anyway i don't know if it would help my script at all..
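For question 4, a quick check of what the running interpreter actually exposes. As far as i understand, site.py removes sys.setdefaultencoding once startup is done, which would explain the error; the sketch below only inspects the sys module and changes nothing.

```python
import sys

# What implicit str <-> unicode conversions would use.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Whether the setter is still reachable from a normal script.
has_setter = hasattr(sys, "setdefaultencoding")
print(has_setter)
```

If the setter is gone, passing the charset explicitly to every decode and encode call avoids depending on the default at all.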
any help will be greatly appreciated
thx
Marc