Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Dave Angel d at davea.name
Mon Oct 31 22:12:26 EDT 2011


On 10/31/2011 08:32 PM, Patrick Maupin wrote:
> On Mon, Oct 31, 2011 at 4:08 PM, Dave Angel<d... at davea.name>  wrote:
>
> Yes.  Actually, you don't even need the split() -- you can pass an
> optional deletechars parameter to translate().
>
>
> On Oct 31, 5:52 pm, Ian Kelly<ian.g.ke... at gmail.com>  wrote:
>
>> That sounds overly complicated and error-prone.
> Not really.
>
>>   For instance, split() will split on vertical tab,
>> which is not one of the characters the OP wanted.
> That's just the default behavior.  You can explicitly specify the
> separator to split on.  But it's probably more efficient to just use
> translate with deletechars.
>
>> I would probably use a regular expression for this.
> I use 'em all the time, but not for stuff this simple.
>
> Regards,
> Pat

I would claim that a well-written (in C) translate function, without
using the delete option, should be much quicker than any Python loop,
even if it does copy the data.  Incidentally, on the Pentium family,
there's a machine instruction for that, to do the whole loop in one
instruction (with rep prefix).  I don't know whether the library version
is implemented that way.  With the delete option, it wouldn't copy
anything if the data is all legal.  As for processing a gig of data, I
never said to do it all in one pass.  Process it maybe 4k at a time, and
quit the first time you encounter a character not in the table.
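
For illustration only, here is a minimal sketch of the chunked
translate-with-delete idea.  It assumes Python 3 and bytes input, neither
of which the OP specified, and it takes "legal" to mean bytes 32-127 plus
tab, LF, and CR, per the subject line.  The names ALLOWED, chunk_is_clean,
and stream_is_clean are made up for this example.

```python
import io

# Assumption: legal bytes are 32-127 plus tab, LF, and CR (from the subject line).
ALLOWED = bytes(range(32, 128)) + b"\t\n\r"


def chunk_is_clean(chunk: bytes) -> bool:
    """Return True if every byte in chunk is in ALLOWED."""
    # With table=None, translate() only deletes: every allowed byte is
    # removed, so anything left over must be an illegal byte.
    return not chunk.translate(None, ALLOWED)


def stream_is_clean(stream, chunk_size: int = 4096) -> bool:
    """Check a binary stream one chunk at a time, quitting at the first bad chunk."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # end of stream, nothing illegal was found
            return True
        if not chunk_is_clean(chunk):
            return False       # stop early; the rest is never read


if __name__ == "__main__":
    print(stream_is_clean(io.BytesIO(b"plain text\r\n\tindented")))
    print(stream_is_clean(io.BytesIO(b"vertical tab \x0b here")))
```

For text (str) rather than bytes, the data would need to be encoded
first, or a different approach used, since str.translate takes a
different kind of table.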

But I didn't try to post any code, since the OP never specified Python 
version, nor the encoding of the data.  He just said string.  And we all 
know that without measuring, it's all speculation.
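
As a rough sketch of how one might measure instead of speculate, the
following compares the translate approach with a regular expression using
timeit.  Again this assumes Python 3 and bytes data; the sample data, the
repeat count, and all names are arbitrary choices for the example, and
results will vary by machine and Python version.

```python
import re
import timeit

# Assumption: legal bytes are 32-127 plus tab, LF, and CR (from the subject line).
ALLOWED = bytes(range(32, 128)) + b"\t\n\r"
# Matches any single byte outside the allowed set.
ILLEGAL_RE = re.compile(rb"[^\x20-\x7f\t\n\r]")


def clean_translate(data: bytes) -> bool:
    # Delete all allowed bytes; an empty result means the data is clean.
    return not data.translate(None, ALLOWED)


def clean_regex(data: bytes) -> bool:
    # search() stops at the first illegal byte it finds.
    return ILLEGAL_RE.search(data) is None


if __name__ == "__main__":
    sample = b"plain ascii text\r\n" * 5000   # all-legal data, the worst case
    for fn in (clean_translate, clean_regex):
        seconds = timeit.timeit(lambda: fn(sample), number=50)
        print(f"{fn.__name__}: {seconds:.4f}s for 50 runs")
```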

DaveA





