Trouble with unicode

Charlie Clark charlie at begeistert.org
Tue May 15 04:03:54 EDT 2001


>>they look about 3 characters long but are only 1 really, I already have
>>experience converting Unix characters over.
>
>Sounds like UTF-8. If it is, you can just replace 'latin-1' below with
>'utf-8' :-)
I tried that and got:
UnicodeError: UTF-8 decoding error: invalid data
Hmm, how can I check what encoding it actually is?
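
One rough way to answer that question is to try a few candidate encodings in turn and see which one decodes without an error. The sketch below assumes a modern Python 3 with bytes objects, not the 2001 interpreter used in this thread, and the helper name guess_encoding is made up for illustration. It can only rule encodings out: latin-1 accepts any byte sequence, so it is listed last as a fallback rather than as proof.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate encoding that decodes data without error.

    Hypothetical helper: this is a process of elimination, not real detection.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
        return name
    return None


print(guess_encoding(b"plain text"))
print(guess_encoding("D\u00fcsseldorf".encode("utf-8")))
print(guess_encoding(b"D\xfcsseldorf"))
```

The order of the candidates matters: the strictest encodings go first, because anything that is valid ASCII is also valid UTF-8 and valid Latin-1.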

>> But I'm no closer, am I? I don't quite understand what the
>> codecs module is
>> and how it works.
>
>You're closer :-)
Thanx. There's probably a good reason for all the smoke and mirrors, but I
don't see why I can't just do simple encodes and decodes, and it's going to
take a while to work out how this lookup works. I would call it cryptic, even
though I realise it's being very OO.
>
>OK, it looks like you are starting with a string containing Latin-1
>characters. If I understand correctly, you want to remove the characters
>that are not in the ASCII set (i.e. > 127). There are two ways to do that:
>
>1. Fancy (change 'latin-1' to the actual encoding):
>
>>>> from codecs import lookup
>>>> fromLatin1 = lookup( 'latin-1' )[1]
>>>> toASCII = lookup( 'ASCII' )[0]
>>>> asLatin1, dummy = fromLatin1( '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf' )
>>>> toASCII( asLatin1, 'replace' )
>('?, ?, ?, ?, ?, ?, ?', 19)
It works. But I don't want a load of question marks! I want the special 
characters. I particularly want to be able to replace them all at once.
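
The question marks come from the ASCII encoding step with 'replace', not from the Latin-1 decoding. A minimal sketch of keeping the characters instead, assuming a modern Python 3, that the input really is Latin-1, and that UTF-8 is the wanted target (the function name latin1_to_utf8 is hypothetical):

```python
def latin1_to_utf8(raw):
    """Decode Latin-1 bytes to text, then re-encode that text as UTF-8."""
    text = raw.decode("latin-1")   # every Latin-1 byte maps to one character
    return text, text.encode("utf-8")


raw = b"\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf"
text, utf8 = latin1_to_utf8(raw)
print(text)
print(len(raw), len(utf8))
```

Every special character is converted in the single decode call, so no per-character table is needed for this direction.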

This is how I've previously done this:
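def unix_to_unicode(text):
    # Map each Latin-1 byte to the character it stands for.
    special = {"\xc4": "Ä", "\xe4": "ä", "\xd6": "Ö", "\xf6": "ö",
               "\xdc": "Ü", "\xfc": "ü", "\xdf": "ß"}
    for key in special.keys():
        text = text.replace(key, special[key])
    # Return only after every replacement has been applied.
    return text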

This would work fine with single character entities so I will be able to work 
around this.
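
For single-character replacements like these, a translation table can apply the whole mapping in one pass instead of looping over replace calls. This is only a sketch for a modern Python 3. The table name UMLAUTS, the function to_ascii and the ASCII spellings it produces are assumptions chosen for illustration; they are not taken from the thread.

```python
# Hypothetical transliteration table; the ASCII spellings are an assumption.
UMLAUTS = str.maketrans({
    "\u00c4": "Ae", "\u00e4": "ae",
    "\u00d6": "Oe", "\u00f6": "oe",
    "\u00dc": "Ue", "\u00fc": "ue",
    "\u00df": "ss",
})


def to_ascii(text):
    """Replace every German special character in a single pass."""
    return text.translate(UMLAUTS)


print(to_ascii("D\u00fcsseldorf"))
```

Each key must be a single character, but the replacement can be any length, which is why the table can expand one character into two.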

Maybe I should provide the context for my work. I've written a script which 
reads orders which come via e-mail, and writes the significant data to file 
attributes generating a pseudo database in the file system. This is all BeOS 
specific but I'm using Python for it all 'cos it's the only way I'll ever 
understand what I'm doing!

Thanx again for your help!

Charlie
-- 
Charlie Clark
Helmholtzstr. 20
Düsseldorf
D- 40215
Tel: +49-211-938-5360
http://www.begeistert.org




