expat having problems with entities (&)

nnguyen nguyenn at gmail.com
Fri Dec 11 16:37:35 EST 2009


On Dec 11, 4:23 pm, nnguyen <nguy... at gmail.com> wrote:
> I need expat to parse this block of xml:
>
> <datafield tag="991">
>   <subfield code="b">c-P&amp;P</subfield>
>   <subfield code="h">LOT 3677</subfield>
>   <subfield code="m">(F)</subfield>
> </datafield>
>
> I need to parse the xml and return a dictionary that follows roughly
> the same layout as the xml. Currently the code for the class handling
> this is:
>
> class XML2Map():
>
>     def __init__(self):
>         """ """
>         self.parser = expat.ParserCreate()
>
>         self.parser.StartElementHandler = self.start_element
>         self.parser.EndElementHandler = self.end_element
>         self.parser.CharacterDataHandler = self.char_data
>
>         self.map = [] #not a dictionary
>
>         self.current_tag = ''
>         self.current_subfields = []
>         self.current_sub = ''
>         self.current_data = ''
>
>     def parse_xml(self, xml_text):
>         self.parser.Parse(xml_text, 1)
>
>     def start_element(self, name, attrs):
>         if name == 'datafield':
>             self.current_tag = attrs['tag']
>
>         elif name == 'subfield':
>             self.current_sub = attrs['code']
>
>     def char_data(self, data):
>         self.current_data = data
>
>     def end_element(self, name):
>         if name == 'subfield':
>             self.current_subfields.append([self.current_sub, self.current_data])
>
>         elif name == 'datafield':
>             self.map.append({'tag': self.current_tag, 'subfields': self.current_subfields})
>             # resetting the values for the next subfields
>             self.current_subfields = []
>
> Right now my problem is that when expat parses the subfield element
> containing "c-P&P", it doesn't deliver the data in one piece but
> breaks it into "c-P", "&", and "P". I'm not an expert with expat,
> and I couldn't find much information on how it handles entities.
>
> In the resulting map, instead of:
>
> {'tag': u'991', 'subfields': [[u'b', u'c-P&P'], [u'h', u'LOT 3677'],
> [u'm', u'(F)']], 'inds': [u' ', u' ']}
>
> I get this:
>
> {'tag': u'991', 'subfields': [[u'b', u'P'], [u'h', u'LOT 3677'],
> [u'm', u'(F)']], 'inds': [u' ', u' ']}
>
> In the debugger, I can see that current_data gets assigned "c-P", then
> "&", and then "P".
>
> Any ideas on any expat tricks I'm missing out on? I'm also inclined to
> try another parser that can keep the string together when there are
> entities, or at least ampersands.

I forgot to mention: ignore the "'inds':..." entry in the output above.
It's just another part of the XML I had to parse and isn't relevant to
this discussion.
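
For what it's worth, the behaviour described above matches how expat
reports text: the content of a single element can arrive through several
CharacterDataHandler calls, and an entity reference is a typical place for
a split. Because char_data() assigns instead of appending, only the last
chunk ("P") survives. Below is a minimal sketch of one way around this,
assuming the same handler layout as the class above. The in_subfield flag
and the list of chunks are additions for illustration and are not part of
the original code.

```python
from xml.parsers import expat


class XML2Map(object):
    """Sketch: same layout as the class above, but text is accumulated."""

    def __init__(self):
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data

        self.map = []  # list of dicts, one per datafield
        self.current_tag = ''
        self.current_subfields = []
        self.current_sub = ''
        self.current_data = []    # text chunks of the current subfield
        self.in_subfield = False  # hypothetical flag, added for this sketch

    def parse_xml(self, xml_text):
        self.parser.Parse(xml_text, True)

    def start_element(self, name, attrs):
        if name == 'datafield':
            self.current_tag = attrs['tag']
        elif name == 'subfield':
            self.current_sub = attrs['code']
            self.current_data = []  # start collecting a fresh value
            self.in_subfield = True

    def char_data(self, data):
        # May be called several times per element; keep every chunk.
        if self.in_subfield:
            self.current_data.append(data)

    def end_element(self, name):
        if name == 'subfield':
            # Join the chunks into one string now that the element is closed.
            value = ''.join(self.current_data)
            self.current_subfields.append([self.current_sub, value])
            self.in_subfield = False
        elif name == 'datafield':
            self.map.append({'tag': self.current_tag,
                             'subfields': self.current_subfields})
            self.current_subfields = []
```

Creating an instance, calling parse_xml() on the 991 datafield, and reading
.map should then give the 'b' subfield as one string. pyexpat parsers also
have a buffer_text attribute that reduces the number of character-data
callbacks, but joining the chunks explicitly does not depend on it.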


