Most direct way to strip unprintable characters out of a string?
Diez B. Roggisch
deets at nospam.web.de
Mon Sep 26 08:52:20 EDT 2005
Steve Bergman wrote:
> Fredrik Lundh wrote:
>
>> ("sanitizing" HTML data by running filters over encoded 8-bit data is
>> hardly ever the right thing to do...)
>>
> I'm very much open to suggestions as to the right way to do this. I'm
> working on this primarily as a learning project and security is my
> motivation for wanting to strip the unprintables.
>
> Is there a better way? (This is a mod_python app, just for reference.)
Deal with encodings properly. Characters being "unprintable" means that
you have an encoding mismatch: your output device (usually a terminal,
but a browser is a sort of device too) can't make sense of certain byte
values, and pukes on you. But these bytes come from somewhere, and
aren't "random".
So I suggest you read up on the subjects of Unicode and encodings, in
the context of Python, of course :)
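To make the point about mismatches concrete, here is a minimal sketch in
present-day Python 3 (not the Python of this thread). It assumes you know
which encoding the client used; decode_input and strip_control_chars are
hypothetical helper names, not part of mod_python or any library. The idea
is to decode the bytes first and only then filter on real characters,
rather than filtering encoded 8-bit data.

```python
import unicodedata


def decode_input(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode raw bytes with the encoding the sender declared.

    Hypothetical helper: raises UnicodeDecodeError when the bytes are
    not valid in that encoding, which is the mismatch described above.
    """
    return raw.decode(declared_encoding)


def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category Cc), keeping newline and tab."""
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text if ch in keep or unicodedata.category(ch) != "Cc"
    )


# Bytes that were produced as Latin-1, with a stray BEL control character.
raw = "caf\u00e9\x07".encode("latin-1")

try:
    decode_input(raw, "utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    # Wrong encoding assumed: the byte 0xE9 is not valid UTF-8 here.
    mismatch_detected = True

# With the right encoding the text decodes cleanly and can be filtered.
clean = strip_control_chars(decode_input(raw, "latin-1"))
```

Note that the accented character survives the filtering; only the actual
control character is removed, because the decision is made on decoded
characters and not on raw byte values.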
BTW: if that HTML were XHTML, it wouldn't be valid if the contents didn't
match the encoding specified in the header. That doesn't mean such
mismatches never happen, though; they sometimes do, because of
misunderstandings on the programmer's side.
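
As a rough illustration of the XHTML remark, an XML parser will refuse a
document whose bytes don't fit the encoding it declares. This is only a
sketch using the standard library's xml.etree.ElementTree in Python 3;
parses_with_declared_encoding is a made-up name. Also note the check is
one-sided: an encoding like ISO-8859-1 accepts any byte sequence, so a
successful parse does not prove the declaration is right.

```python
import xml.etree.ElementTree as ET


def parses_with_declared_encoding(document: bytes) -> bool:
    """Return True if the XML parser accepts the bytes as declared.

    Hypothetical helper: the parser reads the encoding from the XML
    declaration and fails if the content is not valid in it.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# The same Latin-1 encoded body, once declared correctly and once not.
body = "<p>caf\u00e9</p>".encode("latin-1")
good = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body
bad = b'<?xml version="1.0" encoding="utf-8"?>' + body

good_ok = parses_with_declared_encoding(good)
bad_ok = parses_with_declared_encoding(bad)
```
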
Diez