Filtering out non-readable characters

Bengt Richter bokr at oz.net
Fri Jul 15 21:13:05 EDT 2005


On 15 Jul 2005 17:33:39 -0700, "MKoool" <mohan at terabolic.com> wrote:

>I have a file with binary and ascii characters in it.  I massage the
>data and convert it to a more readable format, however it still comes
>up with some binary characters mixed in.  I'd like to write something
>to just replace all non-printable characters with '' (I want to delete
>non-printable characters).
>
>I am having trouble figuring out an easy python way to do this... is
>the easiest way to just write some regular expression that does
>something like replace [^\p] with ''?
>
>Or is it better to go through every character and do ord(character),
>check the ascii values?
>
>What's the easiest way to do something like this?
>

 >>> import string
 >>> string.printable
 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
 >>> identity = ''.join([chr(i) for i in xrange(256)])
 >>> unprintable = ''.join([c for c in identity if c not in string.printable])
 >>>
 >>> def remove_unprintable(s):
 ...     return s.translate(identity, unprintable)
 ...
 >>> set(remove_unprintable(identity)) - set(string.printable)
 set([])
 >>> set(remove_unprintable(identity))
 set(['\x0c', ' ', '$', '(', ',', '0', '4', '8', '<', '@', 'D', 'H', 'L', 'P', 'T', 'X', '\\',
 '`', 'd', 'h', 'l', 'p', 't', 'x', '|', '\x0b', '#', "'", '+', '/', '3', '7', ';', '?', 'C', 'G',
 'K', 'O', 'S', 'W', '[', '_', 'c', 'g', 'k', 'o', 's', 'w', '{', '\n', '"', '&', '*', '.', '2',
 '6', ':', '>', 'B', 'F', 'J', 'N', 'R', 'V', 'Z', '^', 'b', 'f', 'j', 'n', 'r', 'v', 'z', '~',
 '\t', '\r', '!', '%', ')', '-', '1', '5', '9', '=', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y', ']', 'a',
 'e', 'i', 'm', 'q', 'u', 'y', '}'])
 >>> sorted(set(remove_unprintable(identity))) == sorted(set(string.printable))
 True
 >>> sorted((remove_unprintable(identity))) == sorted((string.printable))
 True

After that, to get clean file text, something like

  cleantext = remove_unprintable(file('unclean.txt').read())

should do it. Or you should be able to iterate over the file line by line, something like this (untested):

    for uncleanline in file('unclean.txt'):
        cleanline = remove_unprintable(uncleanline)
        # ... do whatever with clean line
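
For anyone trying this on Python 3, where file() and xrange are gone and the
data is best read as bytes, a rough equivalent might look like the sketch below.
It assumes the file is opened in binary mode; clean_file is just a hypothetical
helper name, not anything from the standard library.

```python
import string

# Printable ASCII as bytes; string.printable is pure ASCII, so this encode is safe.
PRINTABLE = string.printable.encode("ascii")

# Every byte value (0..255) that is not in the printable set.
UNPRINTABLE = bytes(b for b in range(256) if b not in PRINTABLE)


def remove_unprintable(data: bytes) -> bytes:
    """Return a copy of data with all non-printable bytes deleted."""
    # With table=None, bytes.translate remaps nothing and only deletes.
    return data.translate(None, UNPRINTABLE)


def clean_file(path):
    """Hypothetical helper: read a file in binary mode and clean its contents."""
    with open(path, "rb") as f:
        return remove_unprintable(f.read())
```

The result is still bytes; if text is wanted, decoding it as ASCII afterwards
should be safe, since only printable ASCII bytes remain.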

If there is something in string.printable that you don't want included, just use your own
string of printables. BTW,

 >>> help(str.translate)
 Help on method_descriptor:

 translate(...)
     S.translate(table [,deletechars]) -> string

     Return a copy of the string S, where all characters occurring
     in the optional argument deletechars are removed, and the
     remaining characters have been mapped through the given
     translation table, which must be a string of length 256.
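
Note that the help text above is the Python 2 signature. In Python 3,
str.translate takes a single table argument, and mapping a code point to None
deletes that character. A minimal sketch using your own string of printables
follows; make_deleter and my_printables are made-up names for illustration, and
the table only covers code points below 256, so anything above that passes
through untouched.

```python
import string


def make_deleter(keep):
    """Build a str.translate table deleting every char below 256 not in keep."""
    return {code: None for code in range(256) if chr(code) not in keep}


# Hypothetical custom set: string.printable minus vertical tab and form feed.
my_printables = string.printable.replace("\x0b", "").replace("\x0c", "")
table = make_deleter(my_printables)


def remove_unwanted(s):
    """Return s with every character that the table maps to None removed."""
    return s.translate(table)
```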

Regards,
Bengt Richter


