cleaning up an ASCII file?

Thu Jun 11 00:58:28 EDT 2009

Nick Matzke <matzke <at> berkeley.edu> writes:

> 
> 
> Looks like this was a solution:
> 
> 1. Use this guy's unescape function to convert from HTML/XML Entities to 
> unicode
> http://effbot.org/zone/re-sub.htm#unescape-html

Looks like you didn't notice "this guy"'s unaccent.py :-)
http://effbot.org/zone/unicode-convert.htm

[Aside: Has anyone sighted the effbot recently? He's been very quiet.]

> 2. Take the unicode and convert to approximate plain ASCII matches with 
> unicodedata (after import unicodedata)
> 
> ascii_content2 = unescape(line)
> 
> ascii_content = unicodedata.normalize('NFKD', 
> unicode(ascii_content2)).encode('ascii','ignore')
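
For anyone reading this on Python 3, where unicode() no longer exists, the
two quoted steps might look roughly like the sketch below. Assumptions: it
swaps in the standard library's html.unescape for the effbot unescape
function, and to_ascii is just a name made up for illustration.

```python
import html
import unicodedata


def to_ascii(line):
    """Unescape HTML entities, then strip whatever NFKD can decompose."""
    # Step 1: HTML/XML entities -> Unicode text (stdlib stand-in for the
    # effbot unescape function; an assumption, not the original code).
    text = html.unescape(line)
    # Step 2: split accented characters into base letter + combining mark,
    # then drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("caf&eacute; na&iuml;ve"))
```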

The normalize hack gets you only so far. Many Latin-based characters have no
Unicode decomposition (for example ø, ł, đ and ß), so encoding with 'ignore'
silently drops them instead of approximating them. Look for the thread in this
newsgroup with subject "convert unicode characters to visibly similar ascii
characters" around 2008-07-01, or google("hefferon unicode2ascii").
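
To make the limitation concrete, here is a minimal Python 3 sketch. It is not
the code from unaccent.py or from the thread mentioned above: FALLBACK is a
small hand-made table and asciify is a hypothetical helper, both invented
for this example. The idea is simply to consult an explicit table first and
fall back to normalize for everything else.

```python
import unicodedata

# Hypothetical, deliberately incomplete table of characters that NFKD
# leaves untouched because they have no decomposition.
FALLBACK = {
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "ß": "ss",
    "æ": "ae", "Æ": "AE",
}


def asciify(text):
    """Approximate text in ASCII: explicit table first, then NFKD."""
    out = []
    for ch in text:
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
            continue
        # Decomposable characters become base letter + combining marks;
        # the marks (and anything else non-ASCII) are then discarded.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append(decomposed.encode("ascii", "ignore").decode("ascii"))
    return "".join(out)


# normalize alone loses the undecomposable letter entirely
plain = unicodedata.normalize("NFKD", "Łódź").encode("ascii", "ignore")
print(plain.decode("ascii"), asciify("Łódź"))
```

Any character that is neither in the table nor decomposable is still dropped,
so a real table would need to be a good deal longer than this one.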

Alternative: If you told us which platform you are running on, people familiar
with that platform could help you set up your terminal to display non-ASCII
characters correctly.

HTH,
John