encoding of a file

Fri Aug 4 05:03:42 EDT 2006

[Thomas Thomas]

| how can I find the encoding to use to open a file.. I have a 
| file with "£" character..
| is there some utility function in python that I can use
|  
| how can I know which encoding to use

[This is going to be a longer answer than you really
want. The short answer is "probably iso-8859-1 but
there's no way of being certain without trying it out".]

The general answer to "how can I know which encoding 
to use for an arbitrary text file?" is: you can't. 
The more helpful answer is that there are various heuristics 
(polite term for "good guessing algorithms") which will help 
you out. I believe that the latest BeautifulSoup has one:

http://www.crummy.com/software/BeautifulSoup/

and I'm sure there are others. To be certain, though,
you need to be told -- somehow -- what encoding was
in use when the file was saved.
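
As a rough illustration of what such a heuristic looks like (a sketch
only, and certainly not how BeautifulSoup does it): try a short list of
candidate encodings in order and keep the first one that decodes without
error. The name guess_encoding and the candidate list are my own
assumptions, and the code is written for Python 3. Note that iso-8859-1
accepts any byte sequence at all, so it only makes sense as the last resort.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes `data` cleanly.

    A clean decode only proves the bytes are *valid* in that encoding,
    not that it is the encoding the file was actually written with.
    """
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return encoding
    return None


print(guess_encoding(b"plain text"))        # ascii
print(guess_encoding(b"cost: \xc2\xa3 5"))  # utf-8
print(guess_encoding(b"cost: \xa3 5"))      # iso-8859-1
```

The order matters: the strictest encodings go first, because the looser
ones will happily "succeed" on data they were never meant for.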

However, that's not quite what you're asking. You
say you have a file with a "£" character. But what
does that mean? Ultimately, that you have some text
in a file, one character of which you expect to display
as a pound sign (that's a British pound sign, not
the # which Americans bizarrely call a pound sign ;).

Someone, somewhere, got this pound sign into a file.
Maybe it was from a text editor, maybe through a
database. However it happened, the application
saved its data to disk using some encoding. If it
was a naive tool (non-unicode-aware) then it was
probably ASCII with some kind of extension above
the 7-bit mark. iso-8859-1 / latin-1 (same thing)
often copes with that. If the app was unicode-aware,
it'll be a specific unicode encoding. Quite possibly
utf-8.
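
To make that concrete, here is how a pound sign ends up on disk under a
few common encodings. This is Python 3 syntax, and the particular list of
encodings is just my pick for illustration:

```python
POUND = "\u00a3"  # POUND SIGN

# Show the byte(s) each encoding writes for the same character
for encoding in ("iso-8859-1", "cp1252", "cp437", "utf-8"):
    print(encoding, POUND.encode(encoding))

# ascii has no pound sign at all, so encoding fails outright
try:
    POUND.encode("ascii")
except UnicodeEncodeError as err:
    print("ascii:", err.reason)
```

iso-8859-1 and cp1252 both use the single byte 0xa3, cp437 uses a
different single byte (0x9c), and utf-8 uses two bytes (0xc2 0xa3). That
is why the same file can look right in one tool and wrong in another.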

To experiment, pick the necessary byte/bytes out of
your text stream and compare with a few encodings:

<dump>
import sys
from unicodedata import name

#
# This is, for example, your original "pound sign"
#
bytes = "\x9c"

#
# This is what we're aiming for: what unicode 
# thinks of as a pound sign
#
print name (u"£")
# -> POUND SIGN

#
# Let's try ascii
#
print name (bytes.decode ("ascii"))
#
# Whoops!
# -> UnicodeDecodeError: 'ascii' codec can't decode byte 0x9c in position 0: ordinal not in range(128)

#
# iso-8859-1 / latin-1
#
print name (bytes.decode ("iso-8859-1"))
#
# Still not right
# -> ValueError: no such name

#
# Cheating, slightly...
#
print name (bytes.decode (sys.stdin.encoding))
#
# Bingo!
# -> POUND SIGN

print sys.stdin.encoding
# -> cp437
print sys.stdout.encoding
# -> cp437

</dump>

So in this case it was cp437 (since I got the bytes from
typing "£" into the interpreter, something I can do on
my keyboard). You might well find it was some other encoding.
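
The same experiment can be automated: given the byte(s) you found in the
file and the character you expected to see, try a list of encodings and
report the ones that agree. Again a sketch for Python 3; the name
encodings_matching and the candidate list are my own choices, not
anything standard.

```python
from unicodedata import name


def encodings_matching(raw, expected, candidates):
    """Return the candidate encodings in which `raw` decodes to `expected`."""
    matches = []
    for encoding in candidates:
        try:
            decoded = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        if decoded == expected:
            matches.append(encoding)
    return matches


candidates = ["ascii", "utf-8", "iso-8859-1", "cp1252", "cp437", "cp850"]
pound = "\u00a3"
print(name(pound))                                     # POUND SIGN
print(encodings_matching(b"\x9c", pound, candidates))  # ['cp437', 'cp850']
print(encodings_matching(b"\xa3", pound, candidates))  # ['iso-8859-1', 'cp1252']
```

Notice that more than one encoding can match a single character, so it
is worth checking a few different non-ASCII characters from the same
file before settling on an answer.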

If this doesn't take you anywhere -- or you don't understand it --
try dumping a bit of your data into an email and posting it. If
nothing else, someone will probably be able to tell you what
encoding you need!
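
If you do post a sample, a hex dump is safer than pasted text, since
mail clients tend to re-encode characters along the way. A minimal
sketch (Python 3; hex_context is a hypothetical helper and "data.txt" a
placeholder file name) which shows the bytes around the first non-ASCII
byte:

```python
def hex_context(data, width=8):
    """Return hex for the bytes around the first non-ASCII byte, or None."""
    for index, byte in enumerate(data):
        if byte > 0x7F:
            start = max(0, index - width)
            chunk = data[start:index + width + 1]
            return " ".join("%02x" % b for b in chunk)
    return None  # pure ASCII: nothing interesting to show


# With a real file: data = open("data.txt", "rb").read()
data = b"Price: \x9c5.00"
print(hex_context(data))
# -> 50 72 69 63 65 3a 20 9c 35 2e 30 30
```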

TJG
