character encoding conversion

Mon Dec 13 09:55:57 EST 2004

Forgot a part... You need the encoding list:

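encodings = [
     'utf-8',
     'ascii',
     'cp1252',   # before latin-1, so Word's curly quotes decode correctly
     'latin-1',  # accepts every byte and never fails, so it goes last
     ]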
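
For anyone who wants to try the list on its own, here is a rough standalone sketch of the fallback loop. It is written in Python 3 syntax (bytes in, str out) rather than the Python 2 of the quoted code, and the name decode_first_match and the ValueError at the end are my own assumptions, not part of the original class. The sketch puts cp1252 ahead of latin-1 because latin-1 maps every possible byte and so never raises; anything listed after it is never tried.

```python
# Sketch only: decode_first_match is a hypothetical helper name.
ENCODINGS = ['utf-8', 'cp1252', 'latin-1']

def decode_first_match(raw, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeError:
            # This encoding rejected the bytes; try the next one.
            continue
    raise ValueError('no encoding in the list could decode the input')

# b'\x93hi\x94' is "hi" wrapped in cp1252 curly quotes; it is not valid UTF-8.
text, used = decode_first_match(b'\x93hi\x94')

# With latin-1 first, the same bytes decode "successfully" to control characters.
wrong, wrong_used = decode_first_match(b'\x93hi\x94', ['latin-1', 'cp1252'])
```

The second call shows why order matters: latin-1 wins and the curly quotes come out as the C1 control characters U+0093 and U+0094 instead of U+201C and U+201D.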

Christian Ergh wrote:
> Dylan wrote:
> 
>> Here's what I'm trying to do:
>>
>> - scrape some html content from various sources
>>
>> The issue I'm running into:
>>
>> - some of the sources have incorrectly encoded characters... for
>> example, cp1252 curly quotes that were likely the result of the author
>> copying and pasting content from Word
>>
> Finally: this works for me. It all lives inside my own class, and the
> module has a logger, so for reuse you would need to adapt those parts.
> In case someone wonders about the __setattr__: I am updating a
> PostgreSQL database, and my class inherits from SQLObject.
> 
>     def doDecode(self, st):
>         "Returns st decoded with the first encoding that doesn't fail"
>         for encoding in encodings:
>             try:
>                 stEncoded = st.decode(encoding)
>                 return stEncoded
>             except UnicodeError:
>                 pass
> 
>     def setAttribute(self, name, data):
>         import HTMLFilter
>         data = self.doDecode(data)
>         try:
>             data = data.encode('ascii', "xmlcharrefreplace")
>         except:
>             log.warn('new method did not fit')
> 
>         try:
>             if '&#' in data:
>                 data = HTMLFilter.HTMLDecode(data)
>         except UnicodeDecodeError:
>             log.debug('HTML decoding failed!!!')
> 
>         try:
>             data = data.encode('utf-8')
>         except:
>             log.warn('new utf 8 method did not fit')
> 
>         try:
>             self.__setattr__(name, data)
>         except:
>             log.debug('1. try failed: ')
>             log.warning(type(data))
>             log.debug(data)
>             log.warning('Some unicode error while updating')
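
The middle of setAttribute is a round trip: non-ASCII characters become numeric character references, the references are turned back into characters, and the result is stored as UTF-8. A small sketch of that idea in Python 3 syntax follows. Note the assumption: html.unescape from the standard library stands in for HTMLFilter.HTMLDecode, which I am only guessing behaves similarly.

```python
import html

# A string with cp1252-style curly quotes, already decoded to Unicode.
text = '\u201cquoted\u201d'

# Step 1: force ASCII, replacing anything else with &#NNNN; references.
ascii_bytes = text.encode('ascii', 'xmlcharrefreplace')

# Step 2: turn the references back into real characters.
# html.unescape is a stand-in for HTMLFilter.HTMLDecode (an assumption).
restored = html.unescape(ascii_bytes.decode('ascii'))

# Step 3: encode as UTF-8 for storage.
utf8_bytes = restored.encode('utf-8')
```

After step 1 the data is plain ASCII, which is why the later steps cannot trip over stray high bytes.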