Elementary string-parsing

Mon Feb 4 23:03:04 EST 2008

In article <13qeo77n2p7nv26 at corp.supernews.com>,
 Dennis Lee Bieber <wlfraed at ix.netcom.com> wrote:

> On Mon, 04 Feb 2008 09:43:04 GMT, Odysseus
> <odysseus1479-at at yahoo-dot.ca> declaimed the following in
> comp.lang.python:
> 
> > 
> > Thanks, that will be very useful. I was casting about for a replacement 
> > for PostScript's "for" loop, and the "while" loop (which PS lacks -- and 
> > which I've never missed there) was all I could come up with.
> >
> 	Have you read the language reference manual yet? It is a rather
> short document given that the language syntactic elements are not that
> complex -- but would have exposed you to the "for" statement (along with
> "return" and passing arguments).

Sorry, translation problem: I am acquainted with Python's "for" -- if 
far from fluent with it, so to speak -- but the PS operator that's most 
similar (traversing a compound object, element by element, without any 
explicit indexing or counting) is called "forall". PS's "for" loop is 
similar to BASIC's (and ISTR Fortran's):

start_value increment end_value {procedure} for

I don't know the proper generic term -- "indexed loop"? -- but at any 
rate it provides a counter, unlike Python's command of the same name.
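
For comparison, a counted loop of that kind can be approximated in Python with the built-in range(), and enumerate() supplies a counter while traversing a sequence element by element. The following is only a sketch: "counted" is a made-up helper name, it assumes integer arguments and a non-zero increment, and it adjusts for the fact that the PostScript end value is inclusive whereas range() stops before its stop argument.

```python
def counted(start_value, increment, end_value):
    """Return the values a PostScript-style 'for' would produce.

    Hypothetical helper: assumes integers and a non-zero increment.
    The end value is inclusive, as in PostScript.
    """
    if increment > 0:
        return list(range(start_value, end_value + 1, increment))
    return list(range(start_value, end_value - 1, increment))


# Counted loop: the loop variable is the counter itself.
for i in counted(0, 2, 10):
    print(i)

# Traversal with a counter alongside each element.
for index, name in enumerate(["alpha", "beta", "gamma"]):
    print(index, name)
```

In everyday Python one would normally write range() directly in the "for" statement; the helper is there only to make the inclusive end value explicit.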

> If your only other programming experience is base PostScript you 
> wouldn't really be familiar with passing arguments or returning 
> values -- as an RPN stack-based language, argument passing is just 
> listing the arguments before a function call (putting a copy of them 
> on the stack), and returns are whatever the function left on the 
> stack at the end; hence they appear sort of global.

Working directly in the operand stack is efficient, but can make 
interpretation by humans -- and debugging -- very difficult. So for the 
sake of coder-friendliness it's generally advisable to use variables 
(i.e. assign values to keys in a dictionary) in most cases instead of 
passing values 'silently' via the stack. I'm beginning to realize that 
for Python the situation is just about the opposite ...

Anyway, I have been reading the documentation on the website, but much 
of the terminology is unfamiliar to me. When looking things up I seem to 
get an inordinate number of 404 errors from links returned by the search 
function, and often the language-reference or tutorial entries (if any) 
are buried several pages down. In general I'm finding the docs rather 
frustrating to navigate.

> 	After the language reference manual, the library reference manual
> chapter on built-ins and data types would be next for study -- the rest
> can usually be handled via search functions (working with time
> conversions, look for modules with date or time <G>).

As I mentioned elsethread, I did look at the "time" documentation; it 
was there that I found a reference to the "calendar.timegm" function I 
used in my first attempt.

> 	It looked a bit like you were using a SAX-style parser to collect
> "names" and "cells" -- and then passing the "bunch" to another function
> to trim out and convert data... It would take me a bit to restudy the
> SAX parsing scheme (I did it once, back in the days of v1.5 or so) but
> the way I'd /try/ to do it is to have the stream handler keep track of
> which cell (<td> tag) is currently being parsed, and convert the string
> data at that level. You'd initialize the record dictionary to {} (and
> cell position to 0) on the <tr> tag, and return the populated record on
> the </tr> tag.

This is what my setup looks like -- mostly cribbed from _Dive Into 
Python_ -- where "PageParser" is a class based on "SGMLParser":

from sgmllib import SGMLParser
from urllib import urlopen

# ...

def parse_page(url):
    usock = urlopen(url)
    parser = PageParser()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    return parser

# ...

captured = parse_page(base_url + suffix)

I only use "parse_page" the once at this stage, but my plan was to call 
it repeatedly while varying "suffix" (depending on the data found by the 
previous pass). On each pass the class will initialize itself, which is 
why I was collecting the data into a 'standing' (global) dictionary. Are 
you suggesting essentially that I'd do better to make the text-parsing 
function into a method of "PageParser"? Can one add, to such a derived 
class, methods that don't have prototypes in the parent?

> 	Might want to check into making a class/instance of the parser so
> you can make the record dictionary and column (cell) position instance
> attributes (avoiding globals).

AFAICT my "captured" is an instance of "PageParser", but I'm unclear on 
how I would add attributes to it -- and as things stand it will get 
rebuilt from scratch each time a page is read in.
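
As a rough illustration of the suggestion quoted above, a derived class may define its own "__init__" to create instance attributes and may add methods that the parent knows nothing about. This sketch makes several assumptions: it uses "html.parser.HTMLParser" from current Python as a stand-in for "SGMLParser", and the names "RowParser", "column_names", "records" and "store_cell" are invented for the example, as is the sample table.

```python
from html.parser import HTMLParser  # stand-in for sgmllib.SGMLParser (assumption)


class RowParser(HTMLParser):
    """Hypothetical parser that builds one dictionary per table row."""

    def __init__(self, column_names):
        super().__init__()
        # Instance attributes: each parser object carries its own state.
        self.column_names = column_names
        self.records = []
        self.record = None
        self.column = 0
        self.in_cell = False
        self.cell_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # New row: fresh record, cell position back to 0.
            self.record = {}
            self.column = 0
        elif tag == "td" and self.record is not None:
            self.in_cell = True
            self.cell_text = ""

    def handle_data(self, data):
        if self.in_cell:
            self.cell_text += data

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.store_cell(self.cell_text.strip())
            self.in_cell = False
        elif tag == "tr" and self.record is not None:
            self.records.append(self.record)
            self.record = None

    def store_cell(self, text):
        """A method with no counterpart in the parent class."""
        if self.column < len(self.column_names):
            self.record[self.column_names[self.column]] = text
        self.column += 1


parser = RowParser(["name", "value"])
parser.feed("<table><tr><td>Sirius</td><td>-1.46</td></tr>"
            "<tr><td>Vega</td><td>0.03</td></tr></table>")
parser.close()
print(parser.records)
```

Because the records live on the instance, a fresh parser per page does not lose anything: the caller can gather results across pages with something like "all_records.extend(parser.records)" after each pass, without a global dictionary.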

> > [...] I'm somewhat intimidated by the whole concept of 
> > exception-handling (among others). How do you know to expect a 
> > "ValueError" if the string isn't a representation of a number?
> 
> 	Read the library reference for the function in question? Though it
> appears the reference doesn't list the error raised for an invalid
> string representation -- in which case just try one in the interactive
> shell...

Under "2.1 Built-in Functions"

<http://docs.python.org/lib/built-in-funcs.html>

"""float([x])
Convert a string or a number to floating point. If the argument is a 
string, it must contain a possibly signed decimal or floating point 
number, possibly embedded in whitespace. Otherwise, the argument may be 
a plain or long integer or a floating point number, and a floating point 
number with the same value (within Python's floating point precision) is 
returned. If no argument is given, returns 0.0.
Note: When passing in a string, values for NaN and Infinity may be 
returned, depending on the underlying C library. The specific set of 
strings accepted which cause these values to be returned depends 
entirely on the C library and is known to vary.
"""

Not a word about errors. If I understand the note correctly, 
"float('---')" might cause an error or might happily return "NaN", so it 
appears experimentation is the only way to go. (On my system I get the 
error, but if I wanted to run the program elsewhere, or share it, I 
suppose the code would have to be tested in each environment.)
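
One way to cope with both outcomes is to catch the "ValueError" and, separately, to reject NaN and infinity if they should not count as valid data. A minimal sketch, assuming current Python; "to_float" and its "default" parameter are made-up names for illustration.

```python
import math


def to_float(text, default=None):
    """Return float(text), or default if text is not a usable number.

    Hypothetical helper: treats NaN and infinity as unusable too.
    """
    try:
        value = float(text)
    except ValueError:
        # Raised when the string is not a representation of a number.
        return default
    if math.isnan(value) or math.isinf(value):
        return default
    return value


print(to_float("3.5"))
print(to_float(" -2e3 "))
print(to_float("---"))
```

With the explicit NaN and infinity check, the result no longer depends on which of the two behaviours a given platform exhibits for odd strings.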

> > Is there a list of common exceptions somewhere? (Searching for 
> > "ValueError" turned up hundreds of passing mentions, but I couldn't 
> > find a definition or explanation.)
> 
> 	Library reference -- under "Exception" (note the cap). {Section 2.4
> Built-in Exceptions in my version of Python -- same chapter I mentioned
> above about data types}

Thanks -- one would think my search should have directed me there ...

> 	Tuples are "read-only" whereas a list can be modified in-place...
> But many look on tuples as a data "record" where each position has a
> different meaning; lists are collections where each position has a
> different value but the same meaning...
> 
> 	Tuple:	(name, street, city, state, zip)

Why wouldn't one use a dictionary for that?

> 	List:	[name1, name2, name3, name...] 
> 
> 	for name in List:
> 		#makes sense as you would do the same process on each name
> 
> 	for field in Tuple:
> 		#does NOT make sense; why would you do the same 
> 		#process on street and zip, for example

I see the distinction. Thanks again ...
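
The record-versus-collection distinction, and the dictionary alternative raised above, might be sketched as follows. The address data is invented sample data, and "collections.namedtuple" (available in later Python versions) is shown only as one possible middle ground between a plain tuple and a dictionary.

```python
from collections import namedtuple

# Tuple as a record: each position has its own meaning.
record = ("Ann Example", "1 Main St", "Springfield", "IL", "62701")
name, street, city, state, zip_code = record  # unpack by position

# The same record as a dictionary: fields are looked up by key.
fields = ["name", "street", "city", "state", "zip_code"]
as_dict = dict(zip(fields, record))

# A named tuple keeps tuple behaviour but adds field names.
Address = namedtuple("Address", fields)
addr = Address(*record)

# List as a collection: every element gets the same treatment.
names = ["Ann", "Bob", "Cy"]
greetings = ["Hello, " + n for n in names]

print(city, as_dict["zip_code"], addr.state, greetings)
```

A dictionary is a reasonable choice for such a record; the tuple is simply lighter, and unlike a dictionary or a list it cannot be modified in place.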

-- 
Odysseus