[Tutor] parse text file

Dave Angel davea at ieee.org
Tue Feb 2 13:25:35 CET 2010


Norman Khine wrote:
> thanks denis,
>
> On Tue, Feb 2, 2010 at 9:30 AM, spir <denis.spir at free.fr> wrote:
>   
>> On Mon, 1 Feb 2010 16:30:02 +0100
>> Norman Khine <norman at khine.net> wrote:
>>
>>     
>>> On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson <kent37 at tds.net> wrote:
>>>       
>>>> On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine <norman at khine.net> wrote:
>>>>
>>>>         
>>>>> thanks, what about the whitespace problem?
>>>>>           
>>>> \s* will match any amount of whitespace including newlines.
>>>>         
>>> thank you, this worked well.
>>>
>>> here is the code:
>>>
>>> ###
>>> import re
>>> file=open('producers_google_map_code.txt', 'r')
>>> data = repr( file.read().decode('utf-8') )
>>>
>>> block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
>>> b = block.findall(data)
>>> block_list = []
>>> for html in b:
>>>       namespace = {}
>>>       t = re.compile(r"""<strong>(.*)<\/strong>""")
>>>       title = t.findall(html)
>>>       for item in title:
>>>               namespace['title'] = item
>>>       u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
>>>       url = u.findall(html)
>>>       for item in url:
>>>               namespace['url'] = item
>>>       g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>>>       lat = g.findall(html)
>>>       for item in lat:
>>>               namespace['LatLng'] = item
>>>       block_list.append(namespace)
>>>
>>> ###
>>>
>>> can this be made better?
>>>       
>> The 3 regex patterns are constants: they can be put out of the loop.
>>
>> You may also rename b to blocks, and find a more accurate name for block_list; eg block_records, where record = set of (named) fields.
>>
>> A short desc and/or example of the overall and partial data formats can greatly help later review, since regex patterns alone are hard to decode.
>>     
>
> here are the changes:
>
> import re
> file=open('producers_google_map_code.txt', 'r')
> data = repr( file.read().decode('utf-8') )
>
> get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
> get_title = re.compile(r"""<strong>(.*)<\/strong>""")
> get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
> get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>
> records = get_record.findall(data)
> block_record = []
> for record in records:
> 	namespace = {}
> 	titles = get_title.findall(record)
> 	for title in titles:
> 		namespace['title'] = title
> 	urls = get_url.findall(record)
> 	for url in urls:
> 		namespace['url'] = url
> 	latlngs = get_latlng.findall(record)
> 	for latlng in latlngs:
> 		namespace['latlng'] = latlng
> 	block_record.append(namespace)
>
> print block_record
>   
>> The def of "namespace" would be clearer imo in a single line:
>>    namespace = {title:t, url:url, lat:g}
>>     
>
> i am not sure how this will fit into the code!
>
>   
>> This also reveals a kind of name confusion, doesn't it?
>>
>>
>> Denis
>>
>>     
>
Your variable 'file' is hiding a built-in name for the file type.  No 
harm in this example, but it's a bad habit to get into.
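
One way to sidestep the name clash, as a rough sketch: read_text,
path and infile are names I made up for illustration, and io.open
(available from Python 2.6) does the decoding, so no separate
.decode() call is needed. Note it returns the text itself, not the
repr() of it that your patterns currently expect.

```python
import io

def read_text(path, encoding='utf-8'):
    # 'infile' does not shadow the built-in 'file' type of Python 2.
    # The with statement closes the file even if read() raises.
    with io.open(path, 'r', encoding=encoding) as infile:
        return infile.read()
```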

What did you intend to happen if the number of titles, urls, and latlngs 
are not each exactly one?  As you have it now, if there's more than one, 
you spend time adding them all to the dictionary, but only the last one 
survives.  And if there aren't any, you don't make an entry in the 
dictionary.
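
To see both cases, here is a small sketch. The sample strings are
invented, and the title pattern is a simplified, non-greedy variant
of yours (an assumption, so that two titles in one record match
separately); collect_title is a hypothetical name.

```python
import re

# Simplified, non-greedy stand-in for the get_title pattern.
get_title = re.compile(r"<strong>(.*?)</strong>")

def collect_title(record):
    namespace = {}
    for title in get_title.findall(record):
        # Each match overwrites the previous one under the same key.
        namespace['title'] = title
    return namespace

two = collect_title("<strong>first</strong> <strong>second</strong>")
none = collect_title("no markup here")
# two holds only the last title; none has no 'title' key at all.
```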

If that's the exact behavior you want, then you could replace the loop 
with an if statement:   (untested)

	if titles:
		namespace['title'] = titles[-1]


On the other hand, if you want a None in your dictionary for missing 
information, then something like:  (untested)

for record in records:


	titles = get_title.findall(record)
	title = titles[-1] if titles else None
	urls = get_url.findall(record)
	url = urls[-1] if urls else None
	latlngs = get_latlng.findall(record)
	latlng = latlngs[-1] if latlngs else None
	block_record.append( {'title':title, 'url':url, 'latlng':latlng} )
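
The repeated "last match or None" step could also go into a small
helper. This is only a sketch: last_or_none and parse_record are
hypothetical names, the sample record is made up, and the two
patterns are simplified versions of yours. The latlng field would be
handled the same way.

```python
import re

# Simplified stand-ins for the get_title and get_url patterns.
get_title = re.compile(r"<strong>(.*?)</strong>")
get_url = re.compile(r'a href="/(.*?)">En savoir plus')

def last_or_none(pattern, text):
    # Return the last match of pattern in text, or None if there is none.
    matches = pattern.findall(text)
    return matches[-1] if matches else None

def parse_record(record):
    # Every key is always present; missing fields are None.
    return {'title': last_or_none(get_title, record),
            'url': last_or_none(get_url, record)}

sample = '<strong>Acme</strong> <a href="/producers/acme">En savoir plus</a>'
result = parse_record(sample)
empty = parse_record('nothing useful here')
```

With this, every record gets the same set of keys, so later code can
test for None instead of checking whether a key exists.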


DaveA

