Excess whitespace in my soup

John Machin sjmachin at lexicon.net
Sun Jan 20 05:37:41 EST 2008


Stefan Behnel wrote:
> John Machin wrote:
>   
>> On Jan 19, 11:00 pm, Fredrik Lundh <fred... at pythonware.com> wrote:
>>     
>>> John Machin wrote:
>>>       
>>>> I'm happy enough with reassembling the second item. The problem is in
>>>> reliably and correctly collapsing the whitespace in each of the above
>>>> five elements. The standard Python idiom of u' '.join(text.split())
>>>> won't work because the text is Unicode and u'\xa0' is whitespace
>>>> and would be converted to a space.
>>>>         
>>> would this (or some variation of it) work?
>>>
>>>  >>> re.sub("[ \n\r\t]+", " ", u"foo\n  frab\xa0farn")
>>> u'foo frab\xa0farn'
>>>
>>> </F>
>>>       
>> Yes, partially. Leading and trailing whitespace has to be removed
>> entirely, not replaced by one space.
>>     
>
> Sounds like adding a .strip() to me ...
>
>
>   

Sounds like adding a .strip(u' ') to me; otherwise any leading/trailing
u'\xa0' gets blown away, and this must not happen.
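
Putting the two suggestions together, a minimal sketch might look like the
following. The function name collapse_ascii_whitespace and the sample string
are made up for illustration, and the set of characters to collapse is assumed
to be exactly the four in the regex quoted above (space, \n, \r, \t).

```python
import re

# Only these four ASCII characters count as collapsible whitespace here.
# u'\xa0' (no-break space) is deliberately left out of the class.
_ASCII_WS = re.compile(u"[ \n\r\t]+")


def collapse_ascii_whitespace(text):
    """Collapse runs of ASCII whitespace to one space and trim the ends.

    Hypothetical helper: combines the re.sub() suggestion with
    .strip(u' ') so that u'\xa0' survives in every position.
    """
    collapsed = _ASCII_WS.sub(u" ", text)
    # After the substitution, any leading/trailing ASCII whitespace is a
    # single space, so stripping only u' ' is enough. A bare .strip()
    # would also remove a leading or trailing u'\xa0'.
    return collapsed.strip(u" ")


sample = u" \xa0foo\n  frab\xa0farn \t"
result = collapse_ascii_whitespace(sample)
print(repr(result))

# For contrast, the usual idiom splits on u'\xa0' as well:
print(repr(u" ".join(sample.split())))
```

With the sample above, the helper keeps both u'\xa0' characters, including
the leading one, while the join/split idiom turns them into plain spaces or
drops them.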


