Text over multiple lines
Rigga
Rigga at hasnomail.com
Mon Jun 21 02:14:39 EDT 2004
On Mon, 21 Jun 2004 05:06:50 +0000, William Park wrote:
> Rigga <Rigga at hasnomail.com> wrote:
>> On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:
>>
>> > Rigga <Rigga at hasnomail.com> writes:
>> >> I am using the HTMLParser to parse a web page, part of the routine
>> >> I need to write (I am new to Python) involves looking for a
>> >> particular tag and once I know the start and the end of the tag
>> >> then to assign all the data in between the tags to a variable, this
>> >> is easy if the tag starts and ends on the same line however how
>> >> would I go about doing it if its split over two or more lines?
>> >
>> > I often have variants of this problem too. The simplest way to make
>> > it work is to read all the HTML in at once with a single call to
>> > file.read(), and then use a regular expression. Note that you
>> > probably don't need re.MULTILINE, although you should take a look at
>> > what it means in the docs just to know.
>> >
>> > This works fine as long as you expect your files to be relatively
>> > small (under a meg or so).
>>
>> I'm reading the entire file into a variable at the moment and passing
>> it through HTMLParser. I have run into another problem that I am
>> having a hard time working out. My data is in this format:
>>
>> <TD><SPAN class=qv id=EmployeeNo
>> title="Employee Number">123456</SPAN></TD></TR>
>>
>> Sometimes the data is spread over 3 lines like:
>>
>> <TD><SPAN class=qv id=BusinessName
>> title="Business Name">Some Shady Business
>> Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>
>>
>> The data I need is the text enclosed in quotes after title=, plus the
>> text after the > and before the </SPAN; in the case above that
>> would be: Some Shady Business Group Ltd.
>
> Approach:
>
> 1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is
>
> <SPAN class=qv id=BusinessName
> title="Business Name">Some Shady Business
> Group Ltd.</SPAN>
>
> with parenthesized groups giving
>
> submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
> submatch[2]='Some Shady Business\nGroup Ltd.'
>
> 2. Split submatch[1] into
>
> class=qv
> id=BusinessName
> title="Business Name"
>
> Homework:
>
> Write a Python script.
>
> Bash solution:
>
> First, you need my patched Bash which can be found at
>
> http://freshmeat.net/projects/bashdiff/
>
> You need to patch the Bash shell and compile it. It has many Python
> features, particularly regex and array. The shell solution is
>
> text='<TD><SPAN class=qv id=BusinessName
> title="Business Name">Some Shady Business
> Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'
>
> newf () { # Usage: newf match submatch1 submatch2
> eval $2 # --> class, id, title
> echo $title > title
> echo $3 > name
> }
> x=()
> array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
> cat title
> cat name
>
> I can explain the steps, but it's rather long. :-)
Thanks for everyone's help. I have now worked out a way that works for me;
your input has helped me immensely.
Many thanks
R
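
For anyone finding this thread later, the two approaches discussed above
(read the whole page into one string and apply a regular expression, or
keep using HTMLParser and collect the text between the tags) might be
sketched in Python roughly as follows. This is only an illustration, not
the solution the poster ended up with, which was not shown. It assumes
Python 3, where the parser lives in html.parser rather than the
HTMLParser module mentioned in the thread; extract_spans, SpanCollector,
SPAN_RE and ATTR_RE are made-up names, and the regex is only suitable for
simple, non-nested SPAN elements like the sample.

```python
import re
from html.parser import HTMLParser

# Sample taken from the thread: one SPAN split across three lines.
TEXT = '''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'''

# [^>]* and [^<]* are negated character classes, so they also match
# newlines; neither re.DOTALL nor re.MULTILINE is needed.
SPAN_RE = re.compile(r'<SPAN ([^>]*)>([^<]*)</SPAN>', re.IGNORECASE)
# name=value pairs, where the value is either "quoted" or a bare word.
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def extract_spans(html):
    """Regex approach: return [(attribute dict, text), ...] per SPAN."""
    results = []
    for attrs_raw, body in SPAN_RE.findall(html):
        attrs = {}
        for name, quoted, bare in ATTR_RE.findall(attrs_raw):
            attrs[name.lower()] = quoted if quoted else bare
        # Collapse line breaks and runs of whitespace into single spaces.
        results.append((attrs, ' '.join(body.split())))
    return results


class SpanCollector(HTMLParser):
    """Parser approach: buffer text seen between <span> and </span>."""

    def __init__(self):
        super().__init__()
        self.spans = []      # finished (attribute dict, text) pairs
        self._attrs = None   # attributes of the SPAN being read, if any
        self._chunks = []    # pieces of text seen inside that SPAN

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == 'span':
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._attrs is not None:
            text = ' '.join(''.join(self._chunks).split())
            self.spans.append((self._attrs, text))
            self._attrs = None


if __name__ == '__main__':
    for attrs, text in extract_spans(TEXT):
        print(attrs.get('title'), '->', text)

    parser = SpanCollector()
    parser.feed(TEXT)
    parser.close()
    for attrs, text in parser.spans:
        print(attrs.get('title'), '->', text)
```

Both variants treat the page as one string, so it does not matter whether
a tag or its text is split across lines. On the sample above each should
yield the title "Business Name" and the text "Some Shady Business Group
Ltd."; the parser version is likely the more robust choice for pages that
are less regular than this one.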