Latest approach to controlling non-printable / multi-byte characters

Thu Feb 8 15:20:28 EST 2007

There is no end to the number of frantic pleas for help with
characters in the realm beyond ASCII.

However, in searching through them, I do not see a workable approach
to changing them into other things.

I am dealing with a file, and in my Emacs editor I see
"MASSACHUSETTS-AMHERST"; in other words, there is a dash between
MASSACHUSETTS and AMHERST.

However, if I grep for the text, the shell returns this:

MASSACHUSETTSâ€“AMHERST

and od -tc returns this:

0000540        O   F       M   A   S   S   A   C   H   U   S   E   T   T
0000560    S 342 200 223   A   M   H   E   R   S   T   ;       U   N   I
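
The three octal values after the S can be checked directly. The
following is only a quick sketch in Python 3, and it assumes the file
is UTF-8 encoded (the dump does not say so, but the byte pattern is
consistent with it). The names raw and ch are made up for the example.

```python
import unicodedata

# The three octal values from the od dump, as raw bytes.
raw = bytes([0o342, 0o200, 0o223])

# Assumption: the file is UTF-8. If it is, these three bytes
# decode to a single character.
ch = raw.decode('utf-8')

# Show how many characters we got, the code point, and its Unicode name.
print(len(ch), hex(ord(ch)), unicodedata.name(ch))
# prints: 1 0x2013 EN DASH
```

Under that assumption the bytes are one character, U+2013 EN DASH,
rather than three separate characters.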

So the conclusion is that the "dash" is actually 3 bytes (octal 342
200 223). My goal is to take those 3 bytes and convert them to an
ASCII dash. Any idea how I might write such a filter? The closest I
have got is:

unicodedata.normalize('NFKD', s).encode('ASCII', 'replace')

but that puts a question mark there.
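
One possible way to write such a filter is sketched below in Python 3.
It assumes the input is UTF-8, and it relies on the fact that U+2013
EN DASH has no Unicode decomposition, so NFKD leaves it alone and the
ASCII encoder then replaces it. Mapping the dash characters explicitly
before encoding avoids that. The table DASHES and the function
to_ascii are hypothetical names for this example, and the list of
dashes is not meant to be complete.

```python
import unicodedata

# Hypothetical table of dash-like code points to fold to '-'.
# Extend it with whatever else turns up in the data.
DASHES = {
    0x2010: '-',  # HYPHEN
    0x2012: '-',  # FIGURE DASH
    0x2013: '-',  # EN DASH
    0x2014: '-',  # EM DASH
    0x2212: '-',  # MINUS SIGN
}

def to_ascii(data, encoding='utf-8'):
    """Decode bytes, fold known dashes to '-', return ASCII bytes."""
    # Assumption: the input bytes are in the given encoding.
    text = data.decode(encoding)
    # NFKD first, so compatibility forms are decomposed where possible.
    text = unicodedata.normalize('NFKD', text)
    # Then map the dash characters that NFKD does not turn into ASCII.
    text = text.translate(DASHES)
    # Anything still outside ASCII becomes '?', as in the original attempt.
    return text.encode('ascii', 'replace')

sample = b'MASSACHUSETTS\xe2\x80\x93AMHERST'
print(to_ascii(sample))
# prints: b'MASSACHUSETTS-AMHERST'
```

The same function could be applied to a whole file by reading it in
binary mode and writing the returned bytes back out.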