How to efficiently extract information from structured text file

Rhodri James rhodri at wildebst.demon.co.uk
Tue Feb 16 19:29:44 EST 2010


On Tue, 16 Feb 2010 23:48:17 -0000, Imaginationworks <xiajunyi at gmail.com>  
wrote:

> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the following format, each line corresponds to a
> line in the text file.  Currently, the whole file was read into a
> string list using readlines(), then use for loop to search the "= {"
> and "};" to determine the Object, SubObject,and SubSubObject. My
> questions are
>
> 1) Is there any efficient method that I can search the whole string
> list to find the location of the tokens(such as '= {' or '};'

The usual idiom is to process a line at a time, which avoids the memory  
overhead of reading the entire file in, creating the list, and so on.   
Assuming your input file is laid out as neatly as you said, that's  
straightforward to do:

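# myfile is an open file object; the three helper functions are
# placeholders for whatever your program needs to do at each step.
for line in myfile:
    if "= {" in line:
        start_a_new_object(line)
    elif "};" in line:
        end_current_object(line)
    else:
        add_stuff_to_current_object(line)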

You probably want more robust tests than I used there, but that depends on  
how well-defined your input file is.  If it can be edited by hand, you'll  
need to be more defensive!
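
As a rough sketch of what "more robust" might look like: the version below
uses regular expressions anchored to the whole line, so a stray "= {" in
the middle of some other text doesn't count as an object start.  The exact
layout of your file wasn't in the post, so the patterns, the sample data
and the names classify() and scan() are all assumptions for illustration;
adjust them to match what your file really contains.

```python
import io
import re

# Assumed layout (not shown in the original post): "Name = {" on a line of
# its own opens an object, and "};" on a line of its own closes it.
OPEN_RE = re.compile(r"^\s*(?P<name>[^=]+?)\s*=\s*\{\s*$")
CLOSE_RE = re.compile(r"^\s*\}\s*;\s*$")


def classify(line):
    """Return ("open", name), ("close", None) or ("data", stripped_text)."""
    match = OPEN_RE.match(line)
    if match:
        return ("open", match.group("name"))
    if CLOSE_RE.match(line):
        return ("close", None)
    return ("data", line.strip())


def scan(lines):
    """Yield one classified event per non-blank line, reading lazily."""
    for line in lines:
        if line.strip():
            yield classify(line)


# Hypothetical sample input; a real file object works the same way,
# because iterating over it also produces one line at a time.
sample = io.StringIO("Object1 = {\n  colour = red;\n  Sub1 = {\n  };\n};\n")
events = list(scan(sample))
```

Because scan() is a generator, nothing beyond the current line is held in
memory unless the caller chooses to keep it.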

> 2) Is there any efficient ways to extract the object information you
> may suggest?

That depends on what you mean by "extract the object information".  If you  
mean "get the object name", just split the line at the "=" and strip off  
the whitespace you don't want.  If you mean "track how objects are
connected to one another", have each object keep a list of its immediate
sub-objects (which will have lists of their immediate sub-objects, and so  
on); it's fairly easy to keep track of which objects are current using a  
list as a stack.  If you mean something else, sorry but my crystal ball is  
cloudy tonight.
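
To make the stack idea concrete, here is a minimal sketch.  It assumes the
same layout as before ("Name = {" opens an object, "};" closes it, anything
else belongs to the innermost open object); parse_objects(), the dictionary
keys and the sample data are made up for the example and are not anything
from your file.

```python
def parse_objects(lines):
    """Build a tree of nested objects, using a list as a stack."""
    # A dummy root object means the stack is never empty, so top-level
    # objects can be handled exactly like nested ones.
    root = {"name": None, "attrs": [], "children": []}
    stack = [root]
    for lineno, line in enumerate(lines, 1):
        text = line.strip()
        if not text:
            continue
        if text.endswith("= {"):
            # Object name: split at the first "=" and strip the whitespace.
            name = text.split("=", 1)[0].strip()
            node = {"name": name, "attrs": [], "children": []}
            stack[-1]["children"].append(node)  # attach to current object
            stack.append(node)                  # it becomes the current one
        elif text == "};":
            if len(stack) == 1:
                raise ValueError("line %d: unmatched '};'" % lineno)
            stack.pop()                         # back to the parent object
        else:
            stack[-1]["attrs"].append(text)
    if len(stack) != 1:
        raise ValueError("unclosed object: %s" % stack[-1]["name"])
    return root["children"]


# Hypothetical input in the assumed format.
sample = """\
Object1 = {
    colour = red;
    SubObject1 = {
        SubSubObject1 = {
            size = 3;
        };
    };
};
""".splitlines()

objects = parse_objects(sample)
```

The top of the stack is always the object currently being filled in, so
each object ends up holding only its immediate sub-objects, and unbalanced
braces show up as a ValueError instead of a silently wrong tree.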

-- 
Rhodri James *-* Wildebeeste Herder to the Masses


