[Tutor] Searching in a file

Dave Angel davea at ieee.org
Fri Jan 15 16:05:09 CET 2010


Paul Melvin wrote:
> Hi,
>
> Thanks very much to all your suggestions, I am looking into the suggestions
> of Hugo and Alan.
>
> The file is not very big, only 700KB (~20000 lines), which I think should be
> fine to be loaded into memory?
>
> I have two further questions though please, the lines are like this:
>
> 				<img width="13" height="15" alt="NEW"
> src="/m/I/I/star.png" />
> 			<strong><a href="/browse/post/5354361/">Revenge
> (2011)</a></strong>
>
> </td>
> <td class="final">
> 			<span title="Exact date/time: 05-01-2011 23:08"
> class="ageVeryNew">5 days </span>
> </td>
> <td class="final">
> 			<span title="Exact date/time: 18-01-2011 16:06"
> class="ageVeryNew">65 minutes </span>
>
> Etc with a chunk (between each NEW) being about 60 lines, I need to extract
> info from these lines, e.g. /browse/post/5354361/ and Revenge (2011) to pass
> back to the output, is re the best option to get all these various bits,
> maybe a generic function that I pass the search strings to?
>
> And if I use the split suggestion of Alan's I assume the last one would be
> the rest of the file, would the next() option just let me search for the
> next /browse/post/5354361/ etc after the NEW? (maybe putting this info into
> a list)
>
>   
One way to handle "the rest of the file" is to add a marker at the end 
of the data.  So if you read the whole thing with readlines(), you can 
append another "NEW" so that all matches are between one NEW and the next.
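
A rough sketch of that sentinel idea, in case it helps. The function name
split_chunks and the default marker 'alt="NEW"' are my own choices for
illustration, not anything from your file or from Alan's suggestion, and
the sample lines are made up to resemble yours.

```python
def split_chunks(lines, marker='alt="NEW"'):
    """Group lines into chunks, each starting at a line containing marker.

    A copy of the marker is appended as a sentinel, so the last chunk is
    closed the same way as every other one. Lines before the first
    marker are ignored.
    """
    lines = list(lines) + [marker]      # sentinel at the end of the data
    chunks = []
    current = None                      # None until the first marker is seen
    for line in lines:
        if marker in line:
            if current is not None:
                chunks.append(current)  # close the previous chunk
            current = [line]            # start a new one
        elif current is not None:
            current.append(line)
    return chunks


# Made-up sample; with a real file you would pass f.readlines() instead.
sample = [
    'header\n',
    '<img alt="NEW" src="/m/I/I/star.png" />\n',
    '<a href="/browse/post/1/">First</a>\n',
    '<img alt="NEW" src="/m/I/I/star.png" />\n',
    '<a href="/browse/post/2/">Second</a>\n',
    '</td>\n',
]
chunks = split_chunks(sample)
print(len(chunks))
```

Because of the sentinel there is no special case after the loop: the final
chunk gets appended when the loop reaches the extra marker.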
> Thanks again
>
> paul
> <snip>
>   
If this file is valid html or xml, then perhaps you should use one of
the html or xml parsing tools rather than anything so esoteric as
regex.  In any case, it now appears that NEW won't necessarily be
unique, so you might want to search for  'alt="NEW"'  or something like
that instead.  A key question is whether this data was automatically
generated, or whether it might vary from one sample to the next (for
example,  alt =    "NEW"  with different spacing, or  ALT="NEW"), and
whether it's definitely valid html or just close.
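
To make the parser suggestion concrete, here is a minimal sketch using the
standard library's html.parser module (the Python 3 name; in Python 2 the
module is HTMLParser). The class name NewPostParser and the rule "take the
first <a> after an <img> whose alt is NEW" are assumptions on my part,
based only on the sample lines quoted above. The parser lowercases tag and
attribute names and tolerates spacing around the = sign, so the variations
mentioned above are handled without extra work.

```python
from html.parser import HTMLParser


class NewPostParser(HTMLParser):
    """Collect (href, title) for the first link after each NEW image."""

    def __init__(self):
        super().__init__()
        self.posts = []          # list of (href, title) tuples
        self._armed = False      # saw a NEW <img>, waiting for the next <a>
        self._capturing = False  # currently inside the <a> being recorded
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)      # names arrive already lowercased
        if tag == "img" and attrs.get("alt") == "NEW":
            self._armed = True
        elif tag == "a" and self._armed:
            self._armed = False
            self._capturing = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._capturing:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._capturing:
            # Collapse line breaks and runs of spaces inside the link text.
            title = " ".join("".join(self._text).split())
            self.posts.append((self._href, title))
            self._capturing = False


sample = '''
<img width="13" height="15" alt="NEW"
src="/m/I/I/star.png" />
<strong><a href="/browse/post/5354361/">Revenge
(2011)</a></strong>
'''
parser = NewPostParser()
parser.feed(sample)
parser.close()
print(parser.posts)   # [('/browse/post/5354361/', 'Revenge (2011)')]
```

If the file turns out to be only "close" to valid html, this parser is
fairly forgiving, but it is worth checking the results against a few
chunks by hand.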

DaveA


