Text over multiple lines
William Park
opengeometry at yahoo.ca
Mon Jun 21 01:06:50 EDT 2004
Rigga <Rigga at hasnomail.com> wrote:
> On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:
>
> > Rigga <Rigga at hasnomail.com> writes:
> >> I am using the HTMLParser to parse a web page, part of the routine
> >> I need to write (I am new to Python) involves looking for a
> >> particular tag and once I know the start and the end of the tag
> >> then to assign all the data in between the tags to a variable, this
> >> is easy if the tag starts and ends on the same line however how
> >> would I go about doing it if its split over two or more lines?
> >
> > I often have variants of this problem too. The simplest way to make
> > it work is to read all the HTML in at once with a single call to
> > file.read(), and then use a regular expression. Note that you
> > probably don't need re.MULTILINE, although you should take a look at
> > what it means in the docs just to know.
> >
> > This works fine as long as you expect your files to be relatively
> > small (under a meg or so).
>
> I'm reading the entire file into a variable at the moment and passing
> it through HTMLParser. I have run into another problem that I am
> having a hard time working out. My data is in this format:
>
> <TD><SPAN class=qv id=EmployeeNo
> title="Employee Number">123456</SPAN></TD></TR>
>
> Sometimes the data is spread over 3 lines like:
>
> <TD><SPAN class=qv id=BusinessName
> title="Business Name">Some Shady Business
> Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>
>
> The data I need to get is the data enclosed in quotes after the word
> title= and data after the > and before the </SPAN, in the case above
> would be: Some Shady Business Group Ltd.
Approach:
1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is
<SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN>
with parenthesized groups giving
submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
submatch[2]='Some Shady Business\nGroup Ltd.'
2. Split submatch[1] into
class=qv
id=BusinessName
title="Business Name"
Homework:
Write a Python script.
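One possible sketch of such a script follows the two steps above, using only the standard re module. The names SPAN_RE, ATTR_RE and parse_spans are made up for illustration, and the attribute pattern assumes values are either double-quoted or a bare word with no spaces, as in the sample data. Because [^>]* and [^<]* also match newlines, no re.MULTILINE or re.DOTALL flag is needed.

```python
import re

text = '''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'''

# Step 1: capture the attribute text and the body of each SPAN.
# The negated character classes match newlines, so multi-line tags work.
SPAN_RE = re.compile(r'<SPAN ([^>]*)>([^<]*)</SPAN>')

# Step 2: split the attribute text into name=value pairs.
# Assumption: a value is either "double quoted" or a bare word.
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_spans(html):
    """Return a list of (attributes dict, body text) for each SPAN found."""
    results = []
    for attr_text, body in SPAN_RE.findall(html):
        attrs = {}
        for name, quoted, bare in ATTR_RE.findall(attr_text):
            # Exactly one of quoted/bare is non-empty (or both are empty).
            attrs[name] = quoted or bare
        # Collapse line breaks and runs of spaces in the body to one space.
        results.append((attrs, ' '.join(body.split())))
    return results


for attrs, body in parse_spans(text):
    print(attrs['title'])   # Business Name
    print(body)             # Some Shady Business Group Ltd.
```

The pattern is case-sensitive, like the one in step 1; passing re.IGNORECASE to re.compile would also accept <span>. A SPAN with nested tags in its body will not match, since [^<]* stops at the first <.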
Bash solution:
First, you need my patched Bash which can be found at
http://freshmeat.net/projects/bashdiff/
You need to apply the patch to the Bash source and compile it. It adds many
Python-like features, particularly regex and array handling. The shell solution is
text='<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'
newf () { # Usage: newf match submatch1 submatch2
    eval $2 # --> class, id, title
    echo $title > title
    echo $3 > name
}
x=()
array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
cat title
cat name
I can explain the steps, but it's rather long. :-)
--
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
No, I will not fix your computer! I'll reformat your harddisk, though.