Text over multiple lines

Rigga Rigga at hasnomail.com
Mon Jun 21 02:14:39 EDT 2004


On Mon, 21 Jun 2004 05:06:50 +0000, William Park wrote:

> Rigga <Rigga at hasnomail.com> wrote:
>> On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:
>> 
>> > Rigga <Rigga at hasnomail.com> writes:
>> >> I am using the HTMLParser to parse a web page. Part of the routine
>> >> I need to write (I am new to Python) involves looking for a
>> >> particular tag and, once I know the start and the end of the tag,
>> >> assigning all the data in between the tags to a variable. This
>> >> is easy if the tag starts and ends on the same line, but how
>> >> would I go about doing it if it's split over two or more lines?
>> > 
>> > I often have variants of this problem too. The simplest way to make
>> > it work is to read all the HTML in at once with a single call to
>> > file.read(), and then use a regular expression. Note that you
>> > probably don't need re.MULTILINE, although you should take a look at
>> > what it means in the docs just to know.
>> > 
>> > This works fine as long as you expect your files to be relatively
>> > small (under a meg or so).
>> 
>> I'm reading the entire file into a variable at the moment and passing
>> it through HTMLParser.  I have run into another problem that I am
>> having a hard time working out. My data is in this format:
>> 
>>         <TD><SPAN class=qv id=EmployeeNo
>>         title="Employee Number">123456</SPAN></TD></TR>
>> 
>> Sometimes the data is spread over 3 lines like:
>> 
>>         <TD><SPAN class=qv id=BusinessName
>>         title="Business Name">Some Shady Business
>>         Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>
>> 
>> The data I need to get is the data enclosed in quotes after the word
>> title= and the data after the > and before the </SPAN, which in the
>> case above would be: Some Shady Business Group Ltd.
> 
> Approach:
> 
> 1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is
> 
> 	<SPAN class=qv id=BusinessName
> 	title="Business Name">Some Shady Business
> 	Group Ltd.</SPAN>
> 
>     with parenthesized groups giving
> 
> 	submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
> 	submatch[2]='Some Shady Business\nGroup Ltd.'
> 
> 2. Split submatch[1] into
> 
> 	class=qv
> 	id=BusinessName
> 	title="Business Name"
> 
> Homework:
> 
>     Write a Python script.
> 
> Bash solution:
> 
>     First, you need my patched Bash which can be found at
> 
> 	http://freshmeat.net/projects/bashdiff/
> 
>     You need to patch the Bash shell, and compile.  It has many Python
>     features, particularly regexes and arrays.  The shell solution is
> 
> 	text='<TD><SPAN class=qv id=BusinessName
> 	title="Business Name">Some Shady Business
> 	Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'
> 
> 	newf () {	# Usage: newf match submatch1 submatch2
> 	    eval $2	# --> class, id, title
> 	    echo $title > title
> 	    echo $3 > name
> 	}
> 	x=()
> 	array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
> 	cat title
> 	cat name
> 
>     I can explain the steps, but it's rather long. :-)
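
For reference, steps 1 and 2 of the approach quoted above can be sketched
in Python roughly as follows. This is only an illustration, not code posted
in the thread: it assumes Python 3, the names SPAN_RE, ATTR_RE and
extract_spans are made up for the example, and the attribute pattern only
covers the name=value and name="value" forms seen in the sample data.

```python
import re

# Sample data copied from the message above.
text = '''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'''

# Step 1: the negated classes [^>]* and [^<]* also match newlines, so
# neither re.DOTALL nor re.MULTILINE is needed for spans split over lines.
SPAN_RE = re.compile(r'<SPAN ([^>]*)>([^<]*)</SPAN>', re.IGNORECASE)

# Step 2: name=value pairs where the value is quoted or a bare word.
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def extract_spans(html):
    """Return a list of (attribute dict, text) pairs, one per <SPAN>."""
    results = []
    for attr_text, body in SPAN_RE.findall(html):
        attrs = {}
        for name, quoted, bare in ATTR_RE.findall(attr_text):
            attrs[name] = quoted if quoted else bare
        # Collapse line breaks and runs of whitespace inside the text.
        results.append((attrs, ' '.join(body.split())))
    return results


for attrs, body in extract_spans(text):
    print(attrs.get('title'), '->', body)
```

Because the whole page is held in one string, a span broken over two or
three lines is matched the same way as one that sits on a single line.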

Thanks for everyone's help. I have now worked out a way that works for me;
your input has helped me immensely.
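
The working solution itself is not shown in this message. As one
possibility only, the same data can be collected with HTMLParser alone by
accumulating text between the start and end tag, which does not care how
many lines the tag spans. The sketch below assumes Python 3 (where the
module is html.parser rather than HTMLParser), and the class name
SpanCollector is hypothetical.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (title, text) for every <span> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.results = []     # finished (title, text) pairs
        self._title = None    # title attribute of the currently open span
        self._chunks = None   # text pieces seen inside the open span

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value).
        if tag == 'span':
            self._title = dict(attrs).get('title')
            self._chunks = []

    def handle_data(self, data):
        # May be called more than once per element, so accumulate.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._chunks is not None:
            # Join the pieces and collapse embedded line breaks.
            text = ' '.join(''.join(self._chunks).split())
            self.results.append((self._title, text))
            self._chunks = None


parser = SpanCollector()
parser.feed('''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR>''')
parser.close()
print(parser.results)
```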

many thanks

R


