Latest approach to controlling non-printable / multi-byte characters

Fri Feb 9 04:52:29 EST 2007

metaperl wrote:

> There is no end to the number of frantic pleas for help with
> characters in the realm beyond ASCII.

And the answer is "first decode to Unicode, then modify" in nine out of ten
cases.

> However, in searching thru them, I do not see a workable approach to
> changing them into other things.
> 
> I am dealing with a file and in my Emacs editor, I see
> "MASSACHUSETTS-AMHERST" ... in other words, there is a dash between
> MASSACHUSETTS and AMHERST.
> 
> However, if I do a grep for the text the shell returns this:
> 
> MASSACHUSETTSâ€“AMHERST
> 
> and od -tc returns this:
> 
> 0000540        O   F       M   A   S   S   A   C   H   U   S   E   T   T
> 0000560    S 342 200 223   A   M   H   E   R   S   T   ;       U   N   I
> 
> 
> So, the conclusion is the "dash" is actually 3 octal characters. My
> goal is to take those 3 octal characters and convert them to an ascii
> dash. Any idea how I might write such a filter? The closest I have got
> it:
> 
> unicodedata.normalize('NFKD', s).encode('ASCII', 'replace')
> 
> but that puts a question mark there.

No idea where the character references come from, but the dump suggests that
your text is in UTF-8: octal 342 200 223 is hex E2 80 93, which is the UTF-8
encoding of U+2013.

>>> "MASSACHUSETTS\342\200\223AMHERST".decode("utf8")
u'MASSACHUSETTS\u2013AMHERST'
>>> "MASSACHUSETTS\342\200\223AMHERST".decode("utf8").replace(u"\u2013", "-")
u'MASSACHUSETTS-AMHERST'
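
If you want this as a reusable filter, something along the following lines
might do. It is only a sketch, written for Python 3, where bytes and str are
separate types. The names DASH_MAP and ascii_dashes are made up for
illustration, and the table contains only the one character from your dump;
add entries as you meet other characters.

```python
# Hypothetical lookup table: code point -> ASCII replacement.
DASH_MAP = {
    0x2013: "-",  # EN DASH
}


def ascii_dashes(raw: bytes) -> str:
    """Decode UTF-8 bytes first, then replace the mapped characters."""
    text = raw.decode("utf-8")
    return text.translate(DASH_MAP)


# The three octal bytes from the od dump, written as a bytes literal.
print(ascii_dashes(b"MASSACHUSETTS\342\200\223AMHERST"))
# prints: MASSACHUSETTS-AMHERST
```

str.translate() is used here instead of chained replace() calls so that the
table can grow without changing the function.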

u"\u2013" is indeed a dash, by the way:
>>> import unicodedata
>>> unicodedata.name(u"\u2013")
'EN DASH'
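
As for the question mark from your normalize() attempt: EN DASH has no
decomposition in the Unicode database, so NFKD leaves it untouched and the
'replace' error handler then substitutes "?". A small check, again written as
a Python 3 sketch:

```python
import unicodedata

en_dash = "\u2013"

# An empty string means the character has no decomposition mapping.
print(repr(unicodedata.decomposition(en_dash)))            # ''

# NFKD therefore returns the character unchanged.
print(unicodedata.normalize("NFKD", en_dash) == en_dash)   # True

# Encoding to ASCII with 'replace' is what produces the question mark.
print(en_dash.encode("ascii", "replace"))                  # b'?'
```

So normalization alone cannot turn this character into an ASCII dash; an
explicit replacement is needed.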

Peter