Most direct way to strip unprintable characters out of a string?

Diez B. Roggisch deets at nospam.web.de
Mon Sep 26 08:52:20 EDT 2005


Steve Bergman wrote:
> Fredrik Lundh wrote:
> 
>> ("sanitizing" HTML data by running filters over encoded 8-bit data is
>> hardly ever the right thing to do...)
>>
> I'm very much open to suggestions as to the right way to do this.  I'm 
> working on this primarily as a learning project and security is my 
> motivation for wanting to strip the unprintables.
> 
> Is there a better way? (This is a mod_python app, just for reference.)

Deal with encodings properly. Characters being "unprintable" means 
that you have an encoding mismatch: your output device (usually a 
terminal, but a browser is a sort of device too) can't make sense of 
certain byte values, and pukes on you. But these bytes come from 
somewhere, and aren't "random".
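
To make the mismatch concrete, here is a minimal sketch in current Python 3 
syntax. The sample text and the two codecs are assumptions picked purely for 
illustration; the point is only that the same bytes read fine under the 
encoding that produced them and turn into junk under another one.

```python
# Illustrative only: the sample text and both codecs are assumptions.
text = "caf\u00e9"                  # "cafe" with an acute accent on the e
raw = text.encode("utf-8")          # the bytes as they might arrive over the wire

right = raw.decode("utf-8")         # decoded with the encoding that produced them
wrong = raw.decode("latin-1")       # decoded with a mismatched encoding

print(repr(raw))
print(repr(right))
print(repr(wrong))                  # two junk characters where the accented e was
```

Nothing in the byte string is "random": the two trailing bytes are simply the 
UTF-8 form of one character, and only the wrong decoder makes them look odd.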

So I suggest you read up on the subjects of Unicode and encodings - in 
the context of Python, of course :)

BTW: if that HTML were XHTML, it wouldn't be valid if the contents didn't 
match the encoding specified in the header - which doesn't mean that 
such mismatches never happen because of misunderstandings on the 
programmer's side.
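
One way to catch such a mismatch early is to decode strictly with the 
declared encoding and treat a failure as invalid input. The following is a 
rough sketch under assumptions, not a full validator: decode_declared is a 
hypothetical helper name, and the sample bytes are made up. Note also that 
single-byte codecs such as latin-1 accept every byte value, so this check 
cannot detect every mislabelled document.

```python
def decode_declared(data: bytes, declared: str) -> str:
    """Decode data strictly with the declared encoding.

    Hypothetical helper: raises ValueError if the bytes are not valid
    in that encoding, instead of silently producing junk characters.
    """
    try:
        return data.decode(declared, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"content is not valid {declared}: {exc}") from exc


# Bytes that really are UTF-8 decode cleanly.
ok = decode_declared(b"caf\xc3\xa9", "utf-8")

# Latin-1 bytes mislabelled as UTF-8 are rejected.
try:
    decode_declared(b"caf\xe9", "utf-8")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Once the input is decoded this way, any further filtering works on 
characters rather than on raw encoded bytes.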

Diez


