Finding nonprintable characters?

Gustavo Cordova gcordova at hebmex.com
Tue Feb 19 14:13:40 EST 2002


> 
> Hello,
> 

Gweepnings.

>
> I have a function
> 
> isBinary(filehandle)
> 

Shades of Perl's "if (-B $filename)" !!!!

>
> that I'm not sure how to implement.
>

damn. :-(

> I've decided to define binary as 
> containing characters above \x80.  But  what is the best way 
> to do this?
> 
> 1. iterate through xreadline, so the whole thing doesn't get 
> loaded into 
> memory?
>

def isBinary(filehandle):
  # Save current position.
  lastPos = filehandle.tell()
  # Search for binary chars.
  line = filehandle.readline()
  while line:
    if .... (find a char > \x7F)  how??
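
For what it's worth, here is a rough sketch of how that line-by-line loop could be finished on a current Python. It assumes the file was opened in binary mode ('rb'), so each line is a bytes object; the name has_high_byte is made up for illustration.

```python
def has_high_byte(filehandle):
    """Return True if any byte >= 0x80 appears from the current position on."""
    # Save the current position so the caller's file pointer is left alone.
    last_pos = filehandle.tell()
    try:
        # Iterating over a binary file yields one line (bytes) at a time,
        # so the whole file is not loaded into memory at once.
        for line in filehandle:
            # Iterating over bytes yields integers in Python 3.
            if any(byte >= 0x80 for byte in line):
                return True
        return False
    finally:
        # Restore the position whether or not a high byte was found.
        filehandle.seek(last_pos)
```

One catch: a real binary file may have no newlines at all, in which case the first "line" is the entire file, so this doesn't really save memory where it matters most.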

> 
> 2. String searching?  If so, for what string?  Searching for anything 
> greater  than \x7f?
> 

But where??

In the first line?

Char by char?

In the first n Kb of text?

>
> 3. Re searching?  for what class?
> 

A class like [\x80-\xFF], I'd think.
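
To see what a class covering the bytes above \x7F picks up, here is a small sketch. It assumes Python 3 and bytes data; first_high_byte and the sample strings are invented for the example.

```python
import re

# Bytes pattern: matches any single byte in the range 0x80-0xFF.
HIGH_BYTE = re.compile(rb'[\x80-\xFF]')

def first_high_byte(data):
    """Return the offset of the first byte >= 0x80 in data, or -1 if none."""
    match = HIGH_BYTE.search(data)
    return match.start() if match else -1

print(first_high_byte(b"plain ASCII text\n"))   # -1
print(first_high_byte(b"caf\xc3\xa9\n"))        # 3
```

The second sample is UTF-8 encoded text, which shows the cost of this definition: any non-ASCII text counts as "binary".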

> 
> Thanks in advance,
> 
> Van
> 

My suggestion:
	1. read a block of the file, say, the first 2Kb.
	2. Scan with a regex like r'[\x80-\xFF]'.
	3. If no chars found, then it's text, else it's binary.


import re
def isBinary(filehandle, blockSize=2048):
  # Save current position, read a block, then put things back.
  # Assumes read() returns a plain (byte) string.
  lastPos = filehandle.tell()
  start = filehandle.read(blockSize)
  filehandle.seek(lastPos)
  
  # Check for "binary" chars.
  if re.search(r'[\x80-\xFF]', start):
    # Sure enough, it's one of them dastardly BINARY files!
    return 1
  
  # Wait! Is there at least ONE \n in the text?
  if not re.search(r'\n', start):
    # Shuks, seem'd decent enough, but no newline: call it binary.
    return 1
  
  # OK, you're good, I guess: it's text.
  return 0


More or less. I added the '\n' requirement for "textyness",
because 2Kb without a single new-line doesn't seem quite
texty to me. Of course, you might think differently.
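
On a current Python the same idea might look something like this. It's only a sketch: it assumes the file is opened in binary mode ('rb'), and the names looks_binary and require_newline are made up. It also only applies the newline test when a full block was read, which is my own assumption, so that short files aren't flagged just for lacking a newline.

```python
import re

# Bytes pattern: any single byte in the range 0x80-0xFF.
HIGH_BYTE = re.compile(rb'[\x80-\xFF]')

def looks_binary(filehandle, block_size=2048, require_newline=True):
    """Guess whether a file opened in binary mode holds binary data."""
    # Remember where the caller was, read one block, then go back.
    pos = filehandle.tell()
    start = filehandle.read(block_size)
    filehandle.seek(pos)

    # Any byte >= 0x80 counts as "binary" under this definition.
    if HIGH_BYTE.search(start):
        return True

    # Optional "textyness" check: a full block with no newline is suspicious.
    if require_newline and len(start) == block_size and b'\n' not in start:
        return True

    return False
```

Pass require_newline=False if you think differently about the newline rule.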

Good luck :-)

-gustavo



-gus



