[Tutor] Extracting data from HTML files

Fri Dec 30 23:51:38 CET 2005

Kent Johnson wrote:
> > 
> > >>> import re
> > >>> file = open("file1.html")
> > >>> data = file.read()
> > >>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')
> 
> This regex does not agree with the data you originally posted. Your 
> original data was
> <strong>Category:</strong>Category1<br><br>
> 
> Do you see the difference? Your regex has a different ending.

Yes, I took the regex you sent me and modified it for one of the fields I
have to extract from the HTML, which is formatted like this:
      <strong>Title:</strong> Title1 <br><strong>

Will this affect the actual regex? The HTML docs have the same structure,
but not all the data is formatted the same way. For example, the Category
field is formatted as I wrote before:
      <strong>Category:</strong>Category1<br><br>
while the Title field is formatted as shown above.
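
One way to cope with both layouts is a single pattern that allows optional
whitespace around the value and stops at the first <br>. This is only a
minimal sketch, assuming the two sample lines quoted in this thread are
representative of the real files; the helper name extract_field and the
sample string are made up for illustration.

```python
import re

# Sample text built from the two snippets quoted in this thread
# (assumed to be representative of the real HTML files).
sample = """
<strong>Title:</strong> Title1 <br><strong>
<strong>Category:</strong>Category1<br><br>
"""

def extract_field(label, html):
    """Return the text between <strong>label:</strong> and the next <br>.

    Returns None when the label is not found. (Hypothetical helper.)
    """
    pattern = re.compile(
        r"<strong>" + re.escape(label) + r":</strong>\s*(.*?)\s*<br>",
        re.DOTALL,  # let '.' cross line breaks inside a value
    )
    m = pattern.search(html)
    return m.group(1) if m else None

print(extract_field("Title", sample))     # Title1
print(extract_field("Category", sample))  # Category1
print(extract_field("Author", sample))    # None
```

Because the pattern ends at the first <br> instead of <br><strong> or
<br><br>, it does not care which of the two endings follows the value.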

> > # I searched around the docs on regexes I have and found that the "r"
> > # after the re.compile(' will detect repeating words. Why is this
> > # useful in my case? I want to read the whole string even if it has
> > # repeating words. Also, I don't understand the actual regex (.*?).
> > # If I want to match everything inside </strong> and <br><strong>,
> > # shouldn't I just put a "*"? I tried that and it gave me an error
> > # of course.
> 
> As Danny said, the r is not part of the regex, it marks a 'raw' string. 
> In this case it is not needed but I use it always for regex strings out 
> of habit.

Yes, I've been reading the regex HOWTO again. It's not easy stuff, but it is
very powerful and I'm really liking it.
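
To make the raw-string point concrete, here is a small sketch of what the r
prefix changes. The pattern \bcat\b and the test sentence are made up for
illustration and are not part of the original problem.

```python
import re

# In a normal string '\b' is a single backspace character. In a raw string
# it stays as backslash + 'b', which the regex engine reads as a word boundary.
normal = "\bcat\b"
raw = r"\bcat\b"

print(len(normal))  # 5
print(len(raw))     # 7

print(re.search(normal, "a cat sat"))        # None
print(re.search(raw, "a cat sat").group(0))  # cat

# The pattern from this thread has no backslashes, so the prefix changes nothing.
same = (r'<strong>Title:</strong>(.*?)<br><strong>'
        == '<strong>Title:</strong>(.*?)<br><strong>')
print(same)  # True
```

So the r has nothing to do with repeated words; it only stops Python from
interpreting backslashes before the regex engine sees them.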

> 
> The parentheses create a group which you can use to pull out the part of 
> the string which matched inside them. This is the data you want.
> 
> > 
> > 
> > >>> m = catRe.search(data)
> > >>> category = m.group(1)
> > 
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > AttributeError: 'NoneType' object has no attribute 'group'
> 
> In this case the match failed, so m is None and m.group(1) gives an
> error.

So my problem is in the actual regex? I've been trying to match other pieces
of data with no luck. I'll double-check the formatting in the HTML to see if
that is the problem.
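
A way to avoid the AttributeError while debugging is to test the result of
search() before calling group(). The sketch below uses the Title pattern from
this thread against the two sample lines quoted earlier; the helper name
first_group is made up for illustration.

```python
import re

# Pattern from this thread: it expects "<br><strong>" right after the value.
title_re = re.compile(r"<strong>Title:</strong>(.*?)<br><strong>")

def first_group(pattern, text):
    """Return group(1) of the first match, or None when nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group(1)

matching = "<strong>Title:</strong> Title1 <br><strong>"
not_matching = "<strong>Category:</strong>Category1<br><br>"

print(repr(first_group(title_re, matching)))  # ' Title1 '
print(first_group(title_re, not_matching))    # None
```

If the pattern still returns None on the real file, one possible cause
(an assumption, since the full HTML was not posted) is a line break between
the value and <br><strong>: by default "." does not match a newline, and
passing re.DOTALL to re.compile changes that.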

Thank you very much for your help, everyone. This list is an excellent
resource for Python newbies like me.

Oswaldo
