[Tutor] search/match file position q

Peter Otten __peter__ at web.de
Tue Oct 7 20:10:11 CEST 2014


Clayton Kirkwood wrote:

> 
> 
> !-----Original Message-----
> !From: Tutor [mailto:tutor-bounces+crk=godblessthe.us at python.org] On
> !Behalf Of Peter Otten
> !Sent: Tuesday, October 07, 2014 3:50 AM
> !To: tutor at python.org
> !Subject: Re: [Tutor] search/match file position q
> !
> !Clayton Kirkwood wrote:
> !
> !> I was trying to keep it generic.
> !> Wrapped data file:
> !>                    <tr data-row-symbol="SWKS"><td class="col-symbol
> !>                    txt"><span class="wrapper "
> !>                    data-model="name:DatumModel;id:null;" data-
> !tmpl=""><a
> !>                    data-ylk="cat:portfolio;cpos:1"
> !>                    href="http://finance.yahoo.com/q?s=SWKS"
> !>                    data-rapid_p="18">SWKS</a></span></td><td
> !>                    class="col-fiftytwo_week_low cell-
> !raw:23.270000"><span
> !>                    class="wrapper "
> !>                    data-model="name:DatumModel;id:SWKS:qsi:wk52:low;"
> !>                    data-tmpl="change:yfin.datum">23.27</span></td><td
> !>                    class="col-prev_close cell-raw:58.049999"><span
> !>                    class="wrapper " data-model="name:DatumMo
> !
> !Doesn't Yahoo make the data available as CSV? That would be the way to
> !go then.
> 
> 
> Yes, Yahoo has a few columns that are csv, but I have maybe 15 fields that
> aren't provided. Besides, what fun would that be, I try to find tasks that
> allow me to expand my knowledge"<)))
> 
> !
> !Anyway, regular expressions are definitely the wrong tool here, and
> !reading the file one line at a time only makes it worse.
> 
> 
> Why is it making it only worse? I don't think a char by char would be
> helpful, the line happens to be very long, and I don't have a way of
> peeking around the corner to the next line so to speak. If I broke it into
> shorter strings, it would be much more onerous to jump over the end of the
> current to potentially many next strings.

I meant you should slurp in the whole file instead of reading it line by
line. That way you'd at least have a chance to find elements that are
spread over more than one line, like

<a 
href="example.com">Example</a>
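
To illustrate only the point about slurping, here is a minimal sketch. The
io.StringIO object stands in for the real file, and the pattern and variable
names are made up for this example; it is not a recommendation to parse HTML
with a regex.

```python
import io
import re

# Stand-in for the real file; the anchor tag is split over two lines.
html_file = io.StringIO('<p>\n<a \nhref="example.com">Example</a>\n</p>\n')

# \s+ matches any whitespace, including the newline inside the tag.
link = re.compile(r'<a\s+href="([^"]*)">([^<]*)</a>')

# Line by line: no single line contains the complete element.
line_hits = [m.groups() for line in html_file for m in link.finditer(line)]

# Whole file at once: the element is found.
html_file.seek(0)
whole_hits = [m.groups() for m in link.finditer(html_file.read())]

print(line_hits)   # []
print(whole_hits)  # [('example.com', 'Example')]
```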


> !> import re, os
> !>     line_in = file.readline()  # read in humongous html line
> !>         stock = re.search('\s*<tr data-row-symbol="([A-Z]+)">', line_in)
> !>         # scan to SWKS"> in data line, stock should be SWKS
> !>         low_52 = re.search('.+wk52:low.+([\d\.]+)<', line_in)
> !>         # want to pick up from SWKS">, low_52 should be 23.27
> !>
> !> I am trying to figure out if each re.match starts scanning at the
> !> beginning of the same line over and over or does each scan start at
> !> the end of the last match. It appears to start over??
> !>
> !> This is stock:
> !> <_sre.SRE_Match object; span=(0, 47), match='                    <tr
> !> data-row-symbol="SWKS">'> This is low_52:
> !> <_sre.SRE_Match object; span=(0, 502875), match='
> !<tr
> !> data-row-symbol="SWKS"><t>
> !> If necessary, how do I pick up and move forward to the point right
> !> after the previous match?  File.tell() and file.__sizeof__(), don't
> !> seem to play a useful role.
> !
> !You should try BeautifulSoup. Let's play:
> !
> !>>> from bs4 import BeautifulSoup
> !>>> soup = BeautifulSoup("""<tr data-row-symbol="SWKS"><td

> !>>> span.text
> !'23.27'
> 
> 
> So, what makes regex wrong for this job? 

A regex doesn't understand the structure of an HTML document. For example,
to find the cells of the inner of two nested tables you would have to keep
track of the nesting level manually.
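
A rough sketch of what that bookkeeping looks like, using the standard
library's html.parser rather than BeautifulSoup so that it runs without
extra installs. The sample markup and the InnerCells class are invented for
this example; a real page will need more care.

```python
import re
from html.parser import HTMLParser

# Made-up sample: a table nested inside a cell of another table.
html = (
    "<table><tr><td>outer"
    "<table><tr><td>inner 1</td><td>inner 2</td></tr></table>"
    "</td></tr></table>"
)

# A non-greedy regex stops at the first </table>, which closes the inner
# table, so the match for the outer table is cut short.
naive = re.search(r"<table>.*?</table>", html, re.DOTALL).group(0)


class InnerCells(HTMLParser):
    """Collect the text of <td> cells that belong to a nested table."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <table> tags are currently open
        self.in_cell = False  # True while inside a cell at depth 2
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "td" and self.depth == 2:
            self.cells.append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag == "td" and self.depth == 2:
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


parser = InnerCells()
parser.feed(html)
parser.close()
print(parser.cells)  # ['inner 1', 'inner 2']
```

A parser hands you the tags in document order, so counting depth is a
couple of lines; with a regex you have to reinvent that counting yourself.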

> question still remains: does the
> search start at the beginning of the line each time or does it step
> forward from the last search? 

re.search() doesn't keep track of prior searches; whatever string you feed
it (in your case a line cut out of an HTML document) is searched from its
beginning every time.
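
If you do want to continue after a previous match, one option is the pos
argument that the search() method of a compiled pattern accepts (the
module-level re.search() has no such argument); another is finditer(). A
small sketch follows; the shortened line_in, the second symbol "ABC" and
both patterns are made up for illustration.

```python
import re

# Shortened, made-up stand-in for the long HTML line in the original post.
line_in = (
    '<tr data-row-symbol="SWKS"><td><span '
    'data-model="id:SWKS:qsi:wk52:low;">23.27</span></td></tr>'
    '<tr data-row-symbol="ABC"><td><span '
    'data-model="id:ABC:qsi:wk52:low;">9.50</span></td></tr>'
)

symbol = re.compile(r'<tr data-row-symbol="([A-Z]+)">')
low = re.compile(r'wk52:low[^>]*>([\d.]+)<')

# Every search on the same string starts at index 0 again.
first = symbol.search(line_in)
again = symbol.search(line_in)
print(first.span() == again.span())  # True

# To continue after a match, pass its end position as pos.
low_52 = low.search(line_in, first.end())
next_symbol = symbol.search(line_in, first.end())
print(low_52.group(1))       # 23.27
print(next_symbol.group(1))  # ABC

# finditer() does the position bookkeeping for you.
symbols = [m.group(1) for m in symbol.finditer(line_in)]
print(symbols)  # ['SWKS', 'ABC']
```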

> I will check out beautiful soup as suggested
> in a subsequent mail; I'd still like to finish this process:<}}

Do you say that when someone points out that you are eating your shoe?


