[Tutor] Re: test if file is not ascii

Paul Tremblay phthenry at earthlink.net
Fri Oct 31 14:19:35 EST 2003


On Fri, Oct 31, 2003 at 07:37:05AM +0100, Abel Daniel wrote:
> 
> You could try decoding it and catch the exception:
> 
> >>> '\xe1'.decode('us-ascii')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 0: ordinal not in range(128)
> >>> 'asdfsdfg'.decode('us-ascii')
> u'asdfsdfg'
> >>> 
> 

Thanks. This is fast. For a huge 2 MB file, checking each line takes only
around 19 seconds on my 350 MHz machine.
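
A minimal sketch of that line-by-line check, written for a current Python
where bytes.decode() takes the place of the str.decode() call quoted above.
The function name first_non_ascii_line is made up for illustration and is
not part of any library:

```python
def first_non_ascii_line(lines):
    """Return the 1-based number of the first line that is not pure
    ASCII, or None if every line decodes cleanly.

    `lines` is any iterable of bytes objects, for example a file
    opened in binary mode. (Hypothetical helper, for illustration.)
    """
    for number, line in enumerate(lines, 1):
        try:
            # Same idea as the quoted example: let the codec do the test.
            line.decode('us-ascii')
        except UnicodeDecodeError:
            return number
    return None
```

It could be used as first_non_ascii_line(open(path, 'rb')), where path is
whatever file is being tested.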

But I realize there is another problem. Bogus RTF files might contain
characters with a value less than 20, which will also result in
invalid XML. But characters with values less than 20 are valid
ASCII, so the decode test above will not catch them.

I have tried this code:

# fragment from inside a function; needs "import sys"
for letter in line:
    char = ord(letter)
    if char < 20:
        sys.stderr.write('File contains illegal characters.\n')
        return 101

But unfortunately, this code seems to take way too long. It takes about
1 minute and 45 seconds to process the same file.

I guess I won't do any checking for values less than 20. Doing the check
would add quite a bit of time every time a file is processed.
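
One possible way to cut that cost, sketched here and not timed, is to let a
compiled regular expression scan each line instead of calling ord() on
every character in a Python loop. The sketch assumes the XML 1.0 rule that
control characters below 0x20 are not allowed except tab, line feed and
carriage return, which is a different cutoff from the "< 20" used above.
The names ILLEGAL_CONTROL and has_illegal_control are made up for
illustration, and the code is written for a current Python working on
bytes:

```python
import re

# Assumption: bytes below 0x20 other than tab (0x09), LF (0x0a) and
# CR (0x0d) are the ones that make the XML invalid.
ILLEGAL_CONTROL = re.compile(br'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def has_illegal_control(line):
    """Return True if the bytes object `line` contains a control
    character matched by ILLEGAL_CONTROL. (Hypothetical helper.)"""
    return ILLEGAL_CONTROL.search(line) is not None
```

Because tab, LF and CR are left out of the pattern, the newline at the end
of each line is not reported as illegal.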

Thanks

Paul 

-- 

************************
*Paul Tremblay         *
*phthenry at earthlink.net*
************************


