Elementary string-parsing

Steve Holden steve at holdenweb.com
Tue Feb 5 08:07:46 EST 2008


Dennis Lee Bieber wrote:
> On Tue, 05 Feb 2008 04:03:04 GMT, Odysseus
> <odysseus1479-at at yahoo-dot.ca> declaimed the following in
> comp.lang.python:
> 
>> Sorry, translation problem: I am acquainted with Python's "for" -- if 
>> far from fluent with it, so to speak -- but the PS operator that's most 
>> similar (traversing a compound object, element by element, without any 
>> explicit indexing or counting) is called "forall". PS's "for" loop is 
>> similar to BASIC's (and ISTR Fortran's):
>>
>> start_value increment end_value {procedure} for
>>
>> I don't know the proper generic term -- "indexed loop"? -- but at any 
>> rate it provides a counter, unlike Python's command of the same name.
>>
> 	The convention in Python is to use range() (or xrange()) to
> generate a sequence of "index" values for the for statement to loop
> over:
> 
> 	for i in range([start], end, [step]):
> 
> with the caveat that "end" will not be one of the values. Start defaults
> to 0, so if you supply range(4) the values become 0, 1, 2, 3 [i.e., 4
> values starting at 0].
>  
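To make the range() convention concrete, here is a small sketch. It is
written in Python 3 syntax, where range() is lazy in the way xrange() is,
so list() is used to show the values. The name "words" is only an
illustration.

```python
# range(stop): counts from 0 up to, but not including, stop
first_four = list(range(4))
print(first_four)

# range(start, stop, step): the nearest thing to an indexed loop
odds = list(range(1, 10, 2))
print(odds)

# Walking a sequence by position with an explicit counter
words = ['this', 'is', 'a', 'list']
indexed = [(i, words[i]) for i in range(len(words))]
print(indexed)
```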
If you have a sequence of values s and you want to associate each with 
its index value as you loop over the sequence, the easiest way to do so 
is with the enumerate built-in function:

 >>> for x in enumerate(['this', 'is', 'a', 'list']):
...   print x
...
(0, 'this')
(1, 'is')
(2, 'a')
(3, 'list')

It's usually (though not always) much more convenient to bind the index 
and the value to separate names, as in

 >>> for i, v in enumerate(['this', 'is', 'a', 'list']):
...   print i, v
...
0 this
1 is
2 a
3 list

[...]
> 	The whole idea behind the SGML parser is that YOU add methods to
> handle each tag type you need... Also, FYI, there IS an HTML parser (in
> module htmllib) that is already derived from sgmllib.
> 
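> from sgmllib import SGMLParser
> 
> class PageParser(SGMLParser):
> 	def __init__(self):
> 		#need to call the parent __init__, and then
> 		#initialize any needed attributes -- like someplace to collect
> 		#the parsed out cell data
> 		SGMLParser.__init__(self)
> 		#state flags tested by the handlers below
> 		self.inTable = self.inRow = self.inCell = False
> 		self.cellCount = 0
> 		self.row = {}
> 		self.all_data = []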
> 
> 	def start_table(self, attrs):
> 		self.inTable = True
> 		.....
> 
> 	def end_table(self):
> 		self.inTable = False
> 		.....
> 
> 	def start_tr(self, attrs):
> 		if self.inRow:
> 			#unclosed row!
> 			self.end_tr()
> 		self.inRow = True
> 		self.cellCount = 0
> 		...
> 
> 	def end_tr(self):
> 		self.inRow = False
> 		# add/append collected row data to master stuff
> 		self.all_data.append(self.row)
> 		...
> 
> 	def start_td(self, attrs):
> 		if self.inCell:
> 			self.end_td()
> 		self.inCell = True
> 		...
> 
> 	def end_td(self):
> 		#leave the cell so start_td() does not close it a second time
> 		self.inCell = False
> 		self.cellCount = self.cellCount + 1
> 		...
> 
> 	def handle_data(self, text):
> 		if self.inTable and self.inRow and self.inCell:
> 			if self.cellCount == 0:
> 				#first column stuff
> 				self.row["Epoch1"] = convert_if_needed(text)
> 			elif self.cellCount == 1:
> 				#second column stuff
> 		...
> 
> 
> 	Hope you don't have nested tables -- it could get ugly as this style
> of parser requires the start_tag()/end_tag() methods to set instance
> attributes for the purpose of tracking state needed in later methods
> (notice the complexity of the handle_data() method just to ensure that
> the text is from a table cell, and not some random text).
> 
There is, of course, nothing to stop you building a recursive data 
structure, so that encountering a new opening tag such as <table> adds 
another level to some stack-like object, and the corresponding closing 
tag pops it off again, but this *does* add to the complexity somewhat.

It seems natural that more complex input possibilities lead to more 
complex parsers.
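
As a rough sketch of the stack idea, and not a drop-in replacement for
the sgmllib-based class quoted above, the following uses html.parser from
the Python 3 standard library instead. The class name NestedTableParser,
the helper parse_tables() and the frame layout are all my own invention
for illustration. Each <table> pushes a frame, and each </table> pops it
and stores the finished rows in the enclosing cell, if there is one.

```python
from html.parser import HTMLParser


class NestedTableParser(HTMLParser):
    """Build nested lists: table -> rows -> cells -> text or sub-tables."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables that are not inside a cell
        self.stack = []    # one frame per open <table>, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.stack.append({'rows': [], 'cell': None})
        elif not self.stack:
            return                      # ignore markup outside any table
        elif tag == 'tr':
            self.stack[-1]['rows'].append([])
        elif tag in ('td', 'th'):
            frame = self.stack[-1]
            if not frame['rows']:       # cell with no enclosing <tr>
                frame['rows'].append([])
            frame['cell'] = []
            frame['rows'][-1].append(frame['cell'])

    def handle_endtag(self, tag):
        if not self.stack:
            return
        if tag in ('td', 'th'):
            self.stack[-1]['cell'] = None
        elif tag == 'table':
            rows = self.stack.pop()['rows']
            parent = self.stack[-1]['cell'] if self.stack else None
            if parent is not None:
                parent.append(rows)     # nested: keep it in the parent cell
            else:
                self.tables.append(rows)

    def handle_data(self, text):
        text = text.strip()
        if text and self.stack and self.stack[-1]['cell'] is not None:
            self.stack[-1]['cell'].append(text)


def parse_tables(html):
    parser = NestedTableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Because each frame carries its own "current cell", the handlers never
need separate inTable/inRow/inCell flags for every nesting depth, which
is where most of the extra complexity would otherwise end up.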

> 	And somewhere before you close the parser, get a handle on the
> collected data...
> 
> 
> 	parsed_data = parser.all_data
> 	parser.close()
> 	return parsed_data
> 
> 
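
For a flat (non-nested) table, the whole feed/collect/close cycle might
look like the sketch below. It assumes Python 3's html.parser rather than
sgmllib, and FlatTableParser and parse_rows() are hypothetical names.
With html.parser, close() forces any buffered input to be processed, so
here the collected rows are read after closing; the attribute remains
available on the parser object.

```python
from html.parser import HTMLParser


class FlatTableParser(HTMLParser):
    """Collect the text of each <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.all_data = []     # finished rows
        self.row = None        # row being built, or None outside a row
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._end_row()    # tolerate an unclosed previous row
            self.row = []
        elif tag == 'td' and self.row is not None:
            self.in_cell = True
            self.row.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'tr':
            self._end_row()

    def _end_row(self):
        if self.row is not None:
            self.all_data.append(self.row)
        self.row = None
        self.in_cell = False

    def handle_data(self, text):
        if self.in_cell:
            self.row[-1] += text.strip()


def parse_rows(html):
    parser = FlatTableParser()
    parser.feed(html)
    parser.close()             # process anything still buffered
    return parser.all_data
```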
>> Why wouldn't one use a dictionary for that?
>>
> 	The overhead may not be needed... Tuples can also be used as the
> keys /in/ a dictionary.
>  
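To illustrate the point about tuples as keys, here is a minimal sketch
in Python 3 syntax. The name "grid" and its values are made up for the
example. A tuple of hashable items can be a dictionary key, whereas a
list cannot.

```python
# A dictionary keyed by (row, column) tuples
grid = {}
grid[(0, 0)] = 'top-left'
grid[0, 1] = 'top-right'       # parentheses are optional in a subscript

print(grid[(0, 1)])
print((1, 1) in grid)

# Lists are mutable and therefore unhashable, so they are rejected as keys
try:
    grid[[1, 1]] = 'oops'
    list_key_ok = True
except TypeError:
    list_key_ok = False
```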
regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



