Elementary string-parsing

Mon Feb 4 07:25:24 EST 2008

In article <13qd6ec9vv1qv9a at corp.supernews.com>,
 Dennis Lee Bieber <wlfraed at ix.netcom.com> wrote:

<snip>

> 	Rather complicated description... A sample of the real/actual input
> /file/ would be useful.

Sorry, I didn't want to go on too long about the background, but I guess 
more context would have helped. The data actually come from a web page; 
I use a class based on SGMLParser to do the initial collection. The 
items in the "names" list were originally "title" attributes of anchor 
tags and are obtained with a "start_a" method, while "cells" holds the 
contents of the <td> tags, obtained by a "handle_data" method according 
to the state of a flag that's set to True by a "start_td" method and to 
False by an "end_td". I don't care about anything else on the page, so I 
didn't define most of the tag-specific methods available.
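
For anyone following along, the collection scheme described above might look roughly like the sketch below. It is not my actual class: it swaps in HTMLParser from Python 3's html.parser (sgmllib exists only in Python 2), so the tag-specific "start_a"/"start_td"/"end_td" methods become branches inside "handle_starttag" and "handle_endtag". The class name "CellCollector" and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical stand-in for the SGMLParser subclass described above."""

    def __init__(self):
        super().__init__()
        self.names = []      # "title" attributes of anchor tags
        self.cells = []      # text found inside <td> tags
        self.in_td = False   # flag set by the <td> start/end handlers

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.names.append(title)
        elif tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside a <td>.
        if self.in_td:
            self.cells.append(data)


# Made-up sample markup, not the real page.
sample = ('<table><tr><td><a title="Alpha" href="#">A</a></td>'
          '<td>1,234</td></tr></table>')
captured = CellCollector()
captured.feed(sample)
captured.close()
```

Note that the anchor's own text ("A" here) also lands in "cells", because it sits inside a <td>.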

<snip>

> 		cellRoot = 10 * i + na	#where did na come from?
> 								#heck, where do names and cells
> 								#come from? Globals? Not recommended..

The variable "na" is the number of 'not applicable' items (headings and 
whatnot) preceding the data I'm interested in.

I'm not clear on what makes an object global, other than appearing as an 
operand of a "global" statement, which I don't use anywhere. But "na" is 
assigned its value in the program body, not within any function: does 
that make it global? Why is this not recommended? If I wrap the 
assignment in a function, making "na" a local variable, how can 
"extract_data" then access it?

The lists of data are attributes (?) of my SGMLParser class; in my 
misguided attempt to pare irrelevant details from "extract_data" I 
obfuscated this aspect. I have a "parse_page(url)" function that returns 
an instance of the class, as "captured", and the lists in question are 
actually called "captured.names" and "captured.cells". The 
"parse_page(url)" function is called in the program body; does that make 
its output global as well?
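
To make the question concrete, here is a small sketch of the three arrangements as I understand them. A name assigned in the program body is a module-level (global) name, and a function can read it without any "global" statement; the alternatives are to pass the value in as a parameter or to return it from the function that computes it. The value 3 and the function names are placeholders, not taken from my program.

```python
na = 3  # assigned in the program body: a module-level (global) name


def uses_global(i):
    # Reading a module-level name needs no "global" statement;
    # the statement is only required to rebind the name from inside.
    return 10 * i + na


def uses_parameter(i, na):
    # Here "na" is a local name whose value is supplied by the caller,
    # so the function no longer depends on the surrounding module.
    return 10 * i + na


def make_offset():
    offset = 3       # local to make_offset
    return offset    # returning it is how other code gets the value
```

The same reasoning would seem to apply to the object returned by "parse_page(url)": the name it is bound to in the program body is global, but the object can just as well be passed to functions as an argument.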

> 	use
> 
> def extract_data(names, na, cells):
> 
> 	and 
> 
> 	return <something>

What should it return? A Boolean indicating success or failure? All the 
data I want should have been stored in the "found" dictionary by the 
time the function finishes traversing the list of names.
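
One reading of the suggestion is that the function builds the dictionary itself and returns it, rather than filling in a global. A minimal sketch under that assumption follows. The stride of ten cells per name and the offset of 5 for "time" come from the quoted snippet; the assumption that each entry in "names" lines up with one block of ten cells is mine, and the other fields are left out because their offsets aren't shown here.

```python
def extract_data(names, na, cells):
    """Return a dictionary of per-name data instead of filling a global."""
    found = {}
    for i, name in enumerate(names):
        cell_root = 10 * i + na  # first cell belonging to this name
        found[name] = {
            "time": cells[cell_root + 5],
            # other fields would be picked out the same way
        }
    return found


# Made-up data: two names, two "not applicable" cells, ten cells per name.
demo_cells = [str(n) for n in range(22)]
demo_found = extract_data(["Alpha", "Beta"], 2, demo_cells)
```

The caller then decides what to do with the result, and an empty dictionary can stand for "nothing found" without a separate Boolean.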

> >         for k in ('time', 'score1', 'score2'):
> >             v = found[name][k]
> >             if v != "---" and v != "n/a": # skip non-numeric data
> >                 v = ''.join(v.split(",")) # remove commas between 000s
> >                 found[name][k] = float(v)
> 
> 	I'd suggest splitting this into a short function, and invoking it in
> the preceding... say it is called "parsed"
> 
> 			"time" : parsed(cells[cellRoot + 5]),

Will do. I guess part of my problem is that, being unsure of myself, I'm 
reluctant to attempt too much in a single complex statement, finding it 
easier to take small and simple (but inefficient) steps. I'll have to 
learn to consolidate things as I go.
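
As a first attempt, the helper might be as small as the sketch below. It does the same job as the loop quoted above: placeholder strings are passed through untouched, and anything else has its thousands separators removed before conversion. The name "parsed" is the one suggested; the use of "replace" instead of split-and-join is my own substitution.

```python
def parsed(text):
    """Convert a cell such as "1,234.5" to a float; leave placeholders alone."""
    if text in ("---", "n/a"):        # skip non-numeric data
        return text
    return float(text.replace(",", ""))  # remove commas between 000s
```

For example, parsed("1,234,567") gives 1234567.0, while parsed("n/a") gives back "n/a" unchanged.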

> 	Did you check the library for time/date parsing/formatting
> operations?
> 
> >>> import time
> >>> aTime = "03 Feb 2008 20:35:46 UTC"	#DD Mth YYYY HH:MM:SS UTC
> >>> time.strptime(aTime, "%d %b %Y %H:%M:%S %Z")
> (2008, 2, 3, 20, 35, 46, 6, 34, 0)

I looked at the documentation for the "time" module, including 
"strptime", but I didn't realize the "%b" directive would match the 
month abbreviations I'm dealing with. It's described as "Locale's 
abbreviated month name"; if someone were to run my program on a French 
system, for example, wouldn't it try to find a match among "jan", "fév", 
..., "déc" (or whatever) and fail? Is there a way to declare a "locale" that 
will override the user's settings? Are the locale-specific strings 
documented anywhere? Can one assume them to be identical in all 
English-speaking countries, at least? Now it's pretty unlikely in this 
case that such an 'international situation' will arise, but I didn't 
want to burn any bridges ...
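
One possibility I can see, sketched below, is the standard "locale" module: switching the LC_TIME category to the "C" locale around the call should make "%b" match the English abbreviations whatever the user's settings are. The wrapper name "parse_c_locale" is invented for the example, and since "setlocale" changes process-wide state I assume this is only safe in a single-threaded program.

```python
import locale
import time


def parse_c_locale(text, fmt):
    """Parse text with time.strptime while LC_TIME is forced to "C"."""
    saved = locale.setlocale(locale.LC_TIME)   # query the current setting
    try:
        locale.setlocale(locale.LC_TIME, "C")  # English month names
        return time.strptime(text, fmt)
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore what was there


stamp = parse_c_locale("03 Feb 2008 20:35:46", "%d %b %Y %H:%M:%S")
```

The "try"/"finally" pair restores the previous setting even if the string fails to parse.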

I was also somewhat put off "strptime" on reading the caveat "Note: This 
function relies entirely on the underlying platform's C library for the 
date parsing, and some of these libraries are buggy. There's nothing to 
be done about this short of a new, portable implementation of 
strptime()." If it works, however, it'll be a lot tidier than what I was 
doing. I'll make a point of testing it on its own, with a variety of 
inputs.

> Note that the %Z is a problematic entry...

> ValueError: time data did not match format:  data=03 Feb 2008 
> 20:35:46 PST  fmt=%d %b %Y %H:%M:%S %Z

All the times are UTC, so fortunately this is a non-issue for my 
present purposes. May I assume that leaving the zone out will 
cause the time to be treated as UTC?
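
As far as I can tell, the tuple that "strptime" returns carries no zone at all, so what matters is the function used to convert it afterwards: "calendar.timegm" reads the fields as UTC, whereas "time.mktime" reads them as local time. A short sketch, assuming the zone has simply been dropped from the input string:

```python
import calendar
import time

# Same sample time as above, with the trailing "UTC" left off.
parts = time.strptime("03 Feb 2008 20:35:46", "%d %b %Y %H:%M:%S")

# calendar.timegm interprets the fields as UTC and returns seconds
# since the epoch; time.mktime would interpret them as local time.
seconds_utc = calendar.timegm(parts)

# Converting back with gmtime should reproduce the original fields.
round_trip = time.gmtime(seconds_utc)
```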

Thanks for your help, and for bearing with my elementary questions and 
my fumbling about.

-- 
Odysseus