character encoding conversion
Christian Ergh
christian.ergh at gmail.com
Mon Dec 13 09:55:57 EST 2004
Forgot a part... You need the encoding list:
encodings = [
    'utf-8',
    'latin-1',
    'ascii',
    'cp1252',
]
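
A note on ordering, since the list is tried front to back: latin-1 maps every possible byte to a character, so it never raises, and any entry listed after it is never reached. The sketch below is Python 3 and only illustrative; decode_first_match and the reordered list are assumptions for the example, not part of the original code.

```python
# Python 3 sketch. decode_first_match is a hypothetical helper name.
def decode_first_match(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeError:
            continue
    raise ValueError("no encoding in the list could decode the data")

# cp1252 curly quotes, e.g. pasted from Word
raw = b"\x93quoted\x94"

# Order from the post: latin-1 accepts any bytes, so cp1252 is never tried.
original_order = ['utf-8', 'latin-1', 'ascii', 'cp1252']
# Assumed alternative: strict codecs first, latin-1 as the last resort.
reordered = ['utf-8', 'ascii', 'cp1252', 'latin-1']

text_a, enc_a = decode_first_match(raw, original_order)
text_b, enc_b = decode_first_match(raw, reordered)

print(enc_a, repr(text_a))
print(enc_b, repr(text_b))
```

With the original order the curly quotes come back as C1 control characters via latin-1; with cp1252 ahead of latin-1 they decode to the intended quotation marks.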
Christian Ergh wrote:
> Dylan wrote:
>
>> Here's what I'm trying to do:
>>
>> - scrape some html content from various sources
>>
>> The issue I'm running into:
>>
>> - some of the sources have incorrectly encoded characters... for
>> example, cp1252 curly quotes that were likely the result of the author
>> copying and pasting content from Word
>>
> Finally: this works for me. It all lives inside my own class, and the
> module has a logger, so you would need to adapt those parts for reuse.
> In case anyone wonders about the __setattr__: I am updating a PostgreSQL
> database, and my class inherits from SQLObject.
>
> def doDecode(self, st):
>     "Returns st decoded with the first encoding that doesn't fail"
>     for encoding in encodings:
>         try:
>             stEncoded = st.decode(encoding)
>             return stEncoded
>         except UnicodeError:
>             pass
>
> def setAttribute(self, name, data):
>     import HTMLFilter
>     data = self.doDecode(data)
>     try:
>         data = data.encode('ascii', "xmlcharrefreplace")
>     except:
>         log.warn('new method did not fit')
>
>     try:
>         if '&#' in data:
>             data = HTMLFilter.HTMLDecode(data)
>     except UnicodeDecodeError:
>         log.debug('HTML decoding failed!!!')
>
>     try:
>         data = data.encode('utf-8')
>     except:
>         log.warn('new utf 8 method did not fit')
>
>     try:
>         self.__setattr__(name, data)
>     except:
>         log.debug('1. try failed: ')
>         log.warning(type(data))
>         log.debug(data)
>         log.warning('Some unicode error while updating')
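
For anyone reading this on Python 3, the same idea (decode with a fallback list, then store UTF-8) can be sketched roughly as follows. This is only an illustration under assumptions: normalize_to_utf8 is a made-up name, the encoding order is reordered so cp1252 is tried before latin-1, and the standard library's html.unescape stands in for HTMLFilter.HTMLDecode, which may behave differently (html.unescape also resolves named entities such as &amp;).

```python
import html

# Assumed order: strict codecs first, latin-1 last because it never fails.
ENCODINGS = ('utf-8', 'ascii', 'cp1252', 'latin-1')

def normalize_to_utf8(raw, encodings=ENCODINGS):
    """Decode raw bytes with the first codec that works, resolve
    character references, and return UTF-8 encoded bytes.
    Hypothetical helper, not the code from the post."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeError:
            continue
    else:
        raise ValueError("no encoding in the list could decode the data")

    # Stand-in for HTMLFilter.HTMLDecode: turns &#8220; into the character.
    if '&' in text:
        text = html.unescape(text)
    return text.encode('utf-8')

# The xmlcharrefreplace step from the post, shown on its own:
# characters outside ASCII become numeric character references.
ascii_safe = '\u201cquoted\u201d'.encode('ascii', 'xmlcharrefreplace')

utf8_bytes = normalize_to_utf8(b"\x93quoted\x94")
from_refs = normalize_to_utf8(b"&#8220;quoted&#8221;")
```

Here both the raw cp1252 bytes and the already-escaped form end up as the same UTF-8 byte string, which is the state the database column would want.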