[Tutor] parse text file

Norman Khine norman at khine.net
Tue Feb 2 22:56:22 CET 2010


On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson <kent37 at tds.net> wrote:
> On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine <norman at khine.net> wrote:
>> On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson <kent37 at tds.net> wrote:
>>> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine <norman at khine.net> wrote:
>>>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson <kent37 at tds.net> wrote:
>>>>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine <norman at khine.net> wrote:
>
>>>>> Why do you use repr() here?
>
>>>
>>> It smells of programming by guess rather than a correct solution to
>>> some problem. What happens if you take it out?
>>
>> when i take it out, i get an empty list.
>>
>> whereas both
>> data = repr( file.read().decode('latin-1') )
>> and
>> data = repr( file.read().decode('utf-8') )
>>
>> returns the full list.
>
> Try this version:
>
> data = file.read()
>
> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
> get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>
> then as before.
>
> Your repr() call is essentially removing newlines from the input by
> converting them to literal '\n' pairs. This allows your regex to work
> without the DOTALL modifier.
>
> Note you will get slightly different results with my version - it will
> give you correct utf-8 text for the titles whereas yours gives \
> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
> version returns
>
> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>
> Mine gives
> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>
> This is showing the repr() of the title so they both have \ but note
> that yours has two \\ indicating that the \ is in the text; mine has
> only one \.

i am no expert, but there seems to be a bigger difference.

with repr(), i get:
Sat\\xe9re Maw\\xe9

whereas you get

Sat\xc3\xa9re Maw\xc3\xa9

with repr():
é == \\xe9
whereas in your version:
é == \xc3\xa9
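
The two forms above can be compared directly. The following is a minimal sketch, assuming Python 3 (the thread itself uses Python 2) and a hypothetical variable `title` holding the example title from Kent's message. It suggests that \xe9 is the Latin-1 byte for é, that \xc3\xa9 is the UTF-8 byte pair for the same character, and that the doubled backslash appears when the escape sequence itself has become part of the text. Here `ascii()` stands in for what Python 2's repr() does to a unicode string.

```python
# Hypothetical sample: the title from the thread, written with explicit escapes.
title = "Sat\u00e9re Maw\u00e9"

# The same character encoded two ways.
latin1_bytes = title.encode("latin-1")   # one byte per e-acute
utf8_bytes = title.encode("utf-8")       # two bytes per e-acute

print(latin1_bytes)   # b'Sat\xe9re Maw\xe9'
print(utf8_bytes)     # b'Sat\xc3\xa9re Maw\xc3\xa9'

# ascii() replaces each non-ASCII character with a backslash escape,
# so the result is plain text containing real backslash characters.
escaped = ascii(title)
print(escaped)        # 'Sat\xe9re Maw\xe9'
print(repr(escaped))  # the backslashes are now doubled in the display

# Decoding the UTF-8 bytes recovers the original text.
assert utf8_bytes.decode("utf-8") == title
```

Read this way, the difference is between real UTF-8 bytes, which can be decoded back to é, and a literal backslash followed by the characters x, e, 9.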

>
> Kent
>

also, i still get an empty list when i run the code as suggested.
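
Kent's point about repr() and DOTALL can be checked on a small input. This is only a sketch, assuming Python 3 and a made-up `sample` string and pattern rather than the real page data, so it does not by itself explain the empty list on the actual file.

```python
import re

# Hypothetical input spanning several lines, standing in for the real page.
sample = "start(\nline one\nline two\n)end"
pattern = r"start\(.*?\)end"

# Without DOTALL, '.' does not match a newline, so nothing is found.
plain = re.findall(pattern, sample)

# With DOTALL, '.' also matches newlines and the whole span is found.
dotall = re.findall(pattern, sample, re.DOTALL)

# repr() turns each newline into the two characters '\' and 'n',
# so the text has no real newlines left and the plain pattern matches.
via_repr = re.findall(pattern, repr(sample))

print(plain)     # []
print(dotall)    # one match containing real newlines
print(via_repr)  # one match containing literal backslash-n pairs
```

If the pattern is pasted from the email, it may also be worth checking that it sits on a single line, since a line break inside a triple-quoted pattern becomes a literal newline that the input must then contain.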

