HTML to dictionary

Tue Feb 27 06:28:18 EST 2007

On 27 Feb, 11:08, Tina I <tina... at bestemselv.com> wrote:
>
> I have a small, probably trivial even, problem. I have the following HTML:
>> <b>
>>  METAR:
>> </b>
>> ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
>> <br />
>> <b>
>>  short-TAF:
>> </b>
>> ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
>> <br />
>> <b>
>>  long-TAF:
>> </b>
>> ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010
>> BECMG 2124 15012KT
>> <br />

This looks almost like XHTML, which means that you might be able to
use a normal XML parser.
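
A quick way to check this is to wrap the fragment in a single root
element and hand it to a stock XML parser. This is only a sketch: the
"div" wrapper and the shortened fragment are my assumptions, and
xml.dom.minidom from the standard library stands in for whichever
parser you prefer.

```python
from xml.dom import minidom

# A shortened copy of the fragment. The <div> wrapper is an assumption,
# added because XML requires exactly one root element.
fragment = """<div>
<b>
 METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
</div>"""

# parseString raises an ExpatError if the text is not well-formed XML.
doc = minidom.parseString(fragment)
root = doc.documentElement

# List the element children of the wrapper, ignoring text nodes.
element_names = [n.tagName for n in root.childNodes
                 if n.nodeType == n.ELEMENT_NODE]
print(element_names)  # ['b', 'br']
```

If the parse fails on the real page, that is a hint that the document
is tag soup rather than XHTML and needs an HTML-aware parser instead.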

> I need to make this into a dictionary like this:
>
> dictionary = {"METAR:" : "ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004
> NOSIG" , "short-TAF:" : "ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040"
> , "long-Taf:" : "ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000
> SNRA VV010 BECMG 2124 15012KT"}

So what you want to do is to find each "b" element, extract the
contents to produce a dictionary key, and then find all following text
nodes up to the "br" element, extracting the contents of those nodes
to produce the corresponding dictionary value.

Now, with a DOM/XPath library, the first part is quite
straightforward. Let's first parse the document, though:

import libxml2dom              # my favourite ;-)
d = libxml2dom.parse(the_file) # add html=1 if it's HTML

Now, let's get the "b" elements providing the keys:

key_elements = d.xpath("//b")

The above will find all "b" elements throughout the document. If
that's too broad a search, you can give a more specific path. For
example:

key_elements = d.xpath("/html/body/b")

At this point, key_elements should contain a list of nodes, each
corresponding to a "b" element, and you can get the contents of each
element by asking for all the text nodes inside it and joining them
together, stripping the whitespace off each end to make the dictionary
key itself:

def get_key(key_element):
    texts = []
    # Get all text child nodes, collecting the contents.
    for n in key_element.xpath("text()"):
        texts.append(n.nodeValue)
    # Join them together, removing leading/trailing space.
    return "".join(texts).strip()

(Currently, libxml2dom lets you ask an element for its nodeValue,
erroneously returning text inside that element, but I don't want to
promote this as a solution since I may change it at some point.)

The process of getting the dictionary values is a bit more difficult.
What we need to do is to ask for the following siblings of the "b"
element, then to loop over them until we find a "br" element. The
dictionary value is then obtained from the discovered text fragments
by joining them together and stripping whitespace from the ends:

def get_value(key_element):
    texts = []
    # Loop over nodes following the element...
    for n in key_element.xpath("following-sibling::node()"):
        # Stop looping if we find a "br" element.
        if n.nodeType == n.ELEMENT_NODE and n.localName == "br":
            break
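        # Otherwise collect text nodes, skipping anything else.
        if n.nodeType == n.TEXT_NODE:
            texts.append(n.nodeValue)
    # Join the texts, collapsing runs of whitespace (including the line
    # break inside the long TAF) and removing leading/trailing space.
    return " ".join("".join(texts).split())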

So, putting this together, you should get something like this:

dictionary = {}
for key_element in key_elements:
    dictionary[get_key(key_element)] = get_value(key_element)
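
As a cross-check on the expected result, here is a rough equivalent
using the standard library's ElementTree rather than libxml2dom. It is
a sketch under a few assumptions of mine: the fragment is wrapped in a
"div" root, the names SOURCE and squeeze are made up for the example,
and each value is taken from the "tail" text of the "b" element, which
only works because nothing but text sits between each "b" and its
"br".

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in an assumed <div> root.
SOURCE = """<div>
<b>
 METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
<b>
 short-TAF:
</b>
ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
<br />
<b>
 long-TAF:
</b>
ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010
BECMG 2124 15012KT
<br />
</div>"""


def squeeze(text):
    """Collapse runs of whitespace and trim both ends (None becomes '')."""
    return " ".join((text or "").split())


root = ET.fromstring(SOURCE)
result = {}
for b in root.iter("b"):
    # b.text is the text inside the element; b.tail is the text that
    # follows it, up to the next element (here, the <br />).
    result[squeeze(b.text)] = squeeze(b.tail)

for key, value in result.items():
    print(key, "->", value)
```

The keys come out with the capitalisation used in the HTML, so the
last one is "long-TAF:".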

As always with HTML processing, your mileage may vary with such an
approach, but I hope this is helpful. You should also be able to use
something like 4Suite or PyXML with the above code, albeit possibly
slightly modified.
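
To give an idea of what "slightly modified" might mean, here is a
sketch of the same two functions for a DOM library without XPath
support. I have used xml.dom.minidom from the standard library as the
stand-in, not 4Suite or PyXML themselves. The "div" wrapper, the
shortened input and the whitespace collapsing in get_value are my
assumptions.

```python
from xml.dom import minidom

# Two of the three entries, wrapped in an assumed <div> root element.
SOURCE = """<div>
<b>
 METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
<b>
 long-TAF:
</b>
ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010
BECMG 2124 15012KT
<br />
</div>"""


def get_key(key_element):
    # Replaces xpath("text()"): collect the text children directly.
    texts = [n.nodeValue for n in key_element.childNodes
             if n.nodeType == n.TEXT_NODE]
    return "".join(texts).strip()


def get_value(key_element):
    # Replaces xpath("following-sibling::node()"): walk nextSibling.
    texts = []
    n = key_element.nextSibling
    while n is not None:
        # Stop at the first "br" element.
        if n.nodeType == n.ELEMENT_NODE and n.tagName == "br":
            break
        if n.nodeType == n.TEXT_NODE:
            texts.append(n.nodeValue)
        n = n.nextSibling
    # Collapse internal line breaks as well as trimming the ends.
    return " ".join("".join(texts).split())


doc = minidom.parseString(SOURCE)
dictionary = {}
for key_element in doc.getElementsByTagName("b"):
    dictionary[get_key(key_element)] = get_value(key_element)

print(dictionary)
```

The only real changes are in how the nodes are found; the stopping
rule and the joining of text fragments stay the same.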

Paul

P.S. Hopefully, Google Groups won't wrap the code badly. Whatever
happened to the preview option, Google?