character encoding conversion

Mon Dec 13 09:50:06 EST 2004

Dylan wrote:
> Here's what I'm trying to do:
> 
> - scrape some html content from various sources
> 
> The issue I'm running into:
> 
> - some of the sources have incorrectly encoded characters... for
> example, cp1252 curly quotes that were likely the result of the author
> copying and pasting content from Word
> 
Finally: this works for me. It all lives inside my own class, and the
module has a logger, so for reuse you would need to adapt those parts.
In case someone wonders about the __setattr__: I am updating a
PostgreSQL database, and my class inherits from SQLObject.

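     def doDecode(self, st):
         "Returns st decoded with the first encoding that doesn't fail"
         # 'encodings' is not defined in this snippet; it is assumed to be
         # a list of candidate codec names defined elsewhere in the module
         for encoding in encodings:
             try:
                 stEncoded = st.decode(encoding)
                 return stEncoded
             except UnicodeError:
                 pass
         # falls through and returns None if no encoding fits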

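     def setAttribute(self, name, data):
         import HTMLFilter
         data = self.doDecode(data)
         try:
             # replace every non-ASCII character with a numeric character
             # reference, e.g. a cp1252 curly quote becomes &#8220;
             data = data.encode('ascii', "xmlcharrefreplace")
         except:
             log.warn('new method did not fit')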

         try:
             if '&#' in data:
                 data = HTMLFilter.HTMLDecode(data)
         except UnicodeDecodeError:
             log.debug('HTML decoding failed!!!')

         try:
             data = data.encode('utf-8')
         except:
             log.warn('new utf 8 method did not fit')

         try:
             self.__setattr__(name, data)
         except:
             log.debug('1. try failed: ')
             log.warning(type(data))
             log.debug(data)
             log.warning('Some unicode error while updating')
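
The doDecode loop above relies on Python 2's str.decode. For anyone trying
the same idea on Python 3, here is a rough standalone sketch, not the code
from this post. The names ENCODINGS, decode_first_fit and
to_ascii_with_charrefs are made up for illustration, and the candidate list
and its order are assumptions. Order matters: latin-1 accepts any byte
sequence, so it has to come last.

```python
# Python 3 sketch. ENCODINGS and both function names are hypothetical.
ENCODINGS = ("ascii", "utf-8", "cp1252", "latin-1")


def decode_first_fit(raw, encodings=ENCODINGS):
    """Return (text, encoding) for the first codec that decodes raw bytes."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Unlike doDecode above, fail loudly instead of returning None.
    raise ValueError("none of the candidate encodings fit")


def to_ascii_with_charrefs(text):
    """Replace non-ASCII characters with numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# cp1252 curly quotes, as pasted from Word
text, used = decode_first_fit(b"\x93quoted\x94")
print(used, to_ascii_with_charrefs(text))
```

The bytes 0x93 and 0x94 are not valid ASCII or UTF-8 on their own, so the
loop only succeeds once it reaches cp1252.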