Filtering out non-readable characters
Bengt Richter
bokr at oz.net
Fri Jul 15 21:13:05 EDT 2005
On 15 Jul 2005 17:33:39 -0700, "MKoool" <mohan at terabolic.com> wrote:
>I have a file with binary and ascii characters in it. I massage the
>data and convert it to a more readable format, however it still comes
>up with some binary characters mixed in. I'd like to write something
>to just replace all non-printable characters with '' (I want to delete
>non-printable characters).
>
>I am having trouble figuring out an easy python way to do this... is
>the easiest way to just write some regular expression that does
>something like replace [^\p] with ''?
>
>Or is it better to go through every character and do ord(character),
>check the ascii values?
>
>What's the easiest way to do something like this?
>
>>> import string
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>> identity = ''.join([chr(i) for i in xrange(256)])
>>> unprintable = ''.join([c for c in identity if c not in string.printable])
>>>
>>> def remove_unprintable(s):
...     return s.translate(identity, unprintable)
...
>>> set(remove_unprintable(identity)) - set(string.printable)
set([])
>>> set(remove_unprintable(identity))
set(['\x0c', ' ', '$', '(', ',', '0', '4', '8', '<', '@', 'D', 'H', 'L', 'P', 'T', 'X', '\\', '`',
'd', 'h', 'l', 'p', 't', 'x', '|', '\x0b', '#', "'", '+', '/', '3', '7', ';', '?', 'C', 'G',
'K', 'O', 'S', 'W', '[', '_', 'c', 'g', 'k', 'o', 's', 'w', '{', '\n', '"', '&', '*', '.', '2',
'6', ':', '>', 'B', 'F', 'J', 'N', 'R', 'V', 'Z', '^', 'b', 'f', 'j', 'n', 'r', 'v', 'z', '~', '\t',
'\r', '!', '%', ')', '-', '1', '5', '9', '=', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y', ']', 'a',
'e', 'i', 'm', 'q', 'u', 'y', '}'])
>>> sorted(set(remove_unprintable(identity))) == sorted(set(string.printable))
True
>>> sorted((remove_unprintable(identity))) == sorted((string.printable))
True
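For anyone trying this on Python 3, where xrange is gone and str.translate no longer accepts a deletechars argument, a rough equivalent can be sketched with bytes, whose translate method still takes a delete argument. This is only a sketch under that assumption and is not part of the session above; the names simply mirror the ones used there.

```python
import string

# All 256 byte values, the Python 3 analogue of the `identity` string.
identity = bytes(range(256))

# string.printable is pure ASCII, so it encodes to bytes without loss.
printable = string.printable.encode("ascii")

# Every byte value that is not in the printable set.
unprintable = bytes(b for b in identity if b not in printable)

def remove_unprintable(data):
    """Return a copy of the bytes object `data` with unprintable bytes deleted."""
    # table=None means "do not remap anything, only delete".
    return data.translate(None, unprintable)
```

Since only printable ASCII survives, the result can be turned into text afterwards with .decode('ascii') if a str is wanted.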
After that, to get clean file text, something like

    cleantext = remove_unprintable(file('unclean.txt').read())

should do it. Or you should be able to iterate by lines something like (untested)

    for uncleanline in file('unclean.txt'):
        cleanline = remove_unprintable(uncleanline)
        # ... do whatever with clean line

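A Python 3 flavoured sketch of the same two approaches (whole file at once, or line by line) might look like the following. It assumes the file is opened in binary mode so the stray bytes arrive untouched. The helper names read_clean and clean_lines are made up for illustration.

```python
import string

# Byte values to keep, and the complementary set to delete.
PRINTABLE = string.printable.encode("ascii")
UNPRINTABLE = bytes(b for b in range(256) if b not in PRINTABLE)

def read_clean(path):
    """Return the whole file as bytes with unprintable bytes removed."""
    with open(path, "rb") as f:
        return f.read().translate(None, UNPRINTABLE)

def clean_lines(path):
    """Yield the file's lines one at a time, each with unprintable bytes removed."""
    with open(path, "rb") as f:
        for line in f:
            yield line.translate(None, UNPRINTABLE)
```

The line-by-line version avoids holding the whole file in memory, which may matter if the input is large. Both return bytes; call .decode('ascii') on the result if text is needed.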
If there is something in string.printable that you don't want included, just use your own
string of printables. BTW,
>>> help(str.translate)
Help on method_descriptor:

translate(...)
    S.translate(table [,deletechars]) -> string

    Return a copy of the string S, where all characters occurring
    in the optional argument deletechars are removed, and the
    remaining characters have been mapped through the given
    translation table, which must be a string of length 256.

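To tie the deletechars argument back to the "use your own string of printables" suggestion, here is a small Python 3 sketch using bytes.translate, which keeps the table-plus-delete form described in the help text. The make_cleaner helper, the particular keep set, and the tab-to-space table are all made up for illustration.

```python
import string

def make_cleaner(keep, table=None):
    """Build a function that deletes every byte not in `keep`,
    then maps the surviving bytes through `table` (None = no remapping)."""
    delete = bytes(b for b in range(256) if b not in keep)

    def clean(data):
        return data.translate(table, delete)

    return clean

# Illustrative choice: printable ASCII without \r, \x0b and \x0c.
keep = (string.ascii_letters + string.digits +
        string.punctuation + " \t\n").encode("ascii")

# Illustrative 256-byte table that turns tabs into spaces.
tab_to_space = bytes.maketrans(b"\t", b" ")

clean = make_cleaner(keep, tab_to_space)
```

With these choices, clean(b"a\tb\r\n") drops the carriage return and converts the tab, since deletion happens first and the remaining bytes are then mapped through the table.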
Regards,
Bengt Richter