Best way to clean up list items?

DFS nospam at dfs.com
Mon May 2 15:04:10 EDT 2016


On 5/2/2016 2:27 PM, Jussi Piitulainen wrote:
> DFS writes:
>
>> On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
>>> DFS writes:
>>>
>>>> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>>> Want: list1 = ['Item 1','Item 2']
>
> . .
>
>>> Funny-looking data you have.
>>
>> I know - sadly, it's actual data:
>>
>> --------------------------------------------------------------------
>> from lxml import html
>> import requests
>>
>> webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"
>>
>> page  = requests.get(webpage)
>> tree  = html.fromstring(page.content)
>> addr1 = tree.xpath('//span[@class="text3"]/text()')
>> print 'Addresses: ', addr1
>> --------------------------------------------------------------------
>>
>> I couldn't figure out a better way to extract it from the HTML (maybe
>> XML and DOM?)
>
> I should have guessed :) But now I'm a bit worried about those spaces
> inside your items. Can it happen that item text is split into strings in
> the middle?

Do you mean split by me, or arriving 'malformed' from the data source?


> Then the above sanitation does the wrong thing.
>
> If someone has the right solution, I'm watching, too.


Here's the raw data as stored in the tree:

---------------------------------------------------------------------------
1st page

['\r\n                        ', '\r\n                        1918 W End 
Ave, Nashville, TN 37203', '\r\n
               ', '\r\n                        1806 Hayes St, Nashville, 
TN 37203', '\r\n                        ', '\r\n 
1701 Broadway, Nashville, TN 37203', '\r\n                        ', '\r\n
             209 10th Ave S, Nashville, TN 37203', '\r\n 
        ', '\r\n                        907 20th Ave S, Nashville, TN 
37212', '\r\n                        ', '\r\n                        911 
20th Ave S, Nashville, TN 37212', '\r\n                        ', '\r\n 
                       1722 W End Ave, Nashville, TN 37203', '\r\n 
                  ', '\r\n                        1905 Hayes St, 
Nashville, TN 37203', '\r\n
               ', '\r\n                        2000 W End Ave, 
Nashville, TN 37203']

---------------------------------------------------------------------------

Next page

['\r\n                        ', '\r\n                        120 19th 
Ave N, Nashville, TN 37203', '\r\n
               ', '\r\n                        1719 W End Ave Ste 101, 
Nashville, TN 37203', '\r\n
       ', '\r\n                        1922 W End Ave, Nashville, TN 
37203', '\r\n                        ', '\r\n
                       909 20th Ave S, Nashville, TN 37212', '\r\n 
                  ', '\r\n
       1807 Church St, Nashville, TN 37203', '\r\n 
  ', '\r\n                        1721 Church St, Nashville, TN 37203', 
'\r\n                        ', '\r\n                        718 
Division St, Nashville, TN 37203', '\r\n                        ', '\r\n 
                        907 12th Ave S, Nashville, TN 37203', '\r\n 
                   ', '\r\n                        204 21st Ave S, 
Nashville, TN 37203', '\r\n
           ', '\r\n                        1811 Division St, Nashville, 
TN 37203', '\r\n                        ', '\r\n 
903 Gleaves St, Nashville, TN 37203', '\r\n                        ', '\r\n
             1720 W End Ave Ste 530, Nashville, TN 37203', '\r\n 
                ', '\r\n
     1200 Division St Ste 100-A, Nashville, TN 37203', '\r\n 
            ', '\r\n
422 7th Ave S, Nashville, TN 37203', '\r\n                        ', 
'\r\n                        605 8th Ave S, Nashville, TN 37203']

and so on
---------------------------------------------------------------------------

I've checked a couple hundred addresses visually, and so far I've only 
seen 2 formats:

1. '\r\n            '
2. '\r\n   address  '
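
If those really are the only two formats, a strip-and-filter pass should be enough. Here is a minimal sketch, assuming every wanted item is a single string with only leading/trailing whitespace; the helper name clean_items and the shortened sample list are just for illustration, and this does not address the case where one address is split across several strings:

```python
def clean_items(items):
    """Strip surrounding whitespace and drop items that end up empty."""
    stripped = (item.strip() for item in items)
    return [item for item in stripped if item]


# Shortened sample in the two formats listed above (illustrative only).
raw = [
    '\r\n                        ',
    '\r\n                        1918 W End Ave, Nashville, TN 37203',
    '\r\n                        ',
    '\r\n                        1806 Hayes St, Nashville, TN 37203',
]

cleaned = clean_items(raw)
print(cleaned)
```

str.strip() with no argument removes '\r', '\n' and spaces from both ends and leaves the spaces inside the address alone, so the whitespace-only strings become '' and are filtered out.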




