Handling text lines from files with some (few) strange chars

MRAB python at mrabarnett.plus.com
Sat Jun 5 22:14:49 EDT 2010


Paulo da Silva wrote:
> On 06-06-2010 00:41, Chris Rebert wrote:
>> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
>> <psdasilva.nospam at netcabonospam.pt> wrote:
> ...
> 
>> Specify the encoding of the text when opening the file using the
>> `encoding` parameter. For Windows-1252 for example:
>>
>> your_file = open("path/to/file.ext", 'r', encoding='cp1252')
>>
> 
> OK! This fixes my current problem. I used encoding="iso-8859-15". This
> is how my text files are encoded.
> But what about a more general case where the encoding of the text file
> is unknown? Is there anything like "autodetect"?
>
An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
How could you tell which was the correct encoding?
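
To make that concrete, here is a small sketch. The sample word is just an
illustration I picked, not something from the original files. The same
three bytes decode without any error under both encodings, so Python alone
cannot tell you which one was intended:

```python
# Sample word chosen purely for illustration.
raw = "não".encode("cp1252")          # three bytes: 'n', 0xE3, 'o'

as_cp1252 = raw.decode("cp1252")      # 0xE3 is 'a with tilde' in cp1252
as_cp1250 = raw.decode("cp1250")      # 0xE3 is 'a with breve' in cp1250

# Both decodes succeed; only a human (or a dictionary) can say which is right.
print(repr(as_cp1252), repr(as_cp1250))
```

Neither call raises an exception; the results simply differ in the one
accented character.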

Well, if the file contained words in a certain language and some of the
characters were wrong, then you'd know that the encoding was wrong. This
does imply, though, that you'd need to know what the language should
look like!

You could try different encodings, and for each one try to identify what
could be words, then look them up in dictionaries for various languages
to see whether they are real words...
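
A rough sketch of that idea might look like the following. The function
name guess_encoding, the candidate list and the tiny word set are all
hypothetical, made up for this example; a real attempt would need proper
word lists for each language and far more text to work with:

```python
import re

def guess_encoding(raw, candidates, known_words):
    """Return (encoding, score) for the candidate that yields the most known words."""
    best = (None, -1)
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding at all
        # Split into word-like runs and count those found in the word list.
        words = re.findall(r"\w+", text.lower())
        score = sum(1 for w in words if w in known_words)
        if score > best[1]:
            best = (enc, score)
    return best

# Hypothetical stand-in for a real Portuguese dictionary.
known_words = {"não", "é", "coração", "ação"}
raw = "O coração não é uma ação".encode("iso-8859-15")

best = guess_encoding(raw, ["utf-8", "cp1250", "iso-8859-15"], known_words)
print(best)
```

Note that two encodings which agree on every byte present in the sample
will get the same score, and this sketch simply keeps whichever came first
in the list, so the result is only ever a guess.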


