How do I automate the removal of all non-ascii characters from my code?

Vlastimil Brom vlastimil.brom at gmail.com
Tue Sep 13 14:13:44 EDT 2011


2011/9/13 Alec Taylor <alec.taylor6 at gmail.com>:
> Hmm, nothing mentioned so far works for me...
>
> Here's a very small test case:
>
>>>> python -u "Convert to Creole.py"
>  File "Convert to Creole.py", line 1
> SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
> on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>>>> Exit Code: 1
>
> Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")
>
> On Tue, Sep 13, 2011 at 11:33 PM, Vlastimil Brom
> <vlastimil.brom at gmail.com> wrote:
>> 2011/9/13 ron <vacorama at gmail.com>:
>>>
>>> Depending on the load, you can do something like:
>>>
>>> "".join([x for x in string if ord(x) < 128])
>>>
>>> It's worked great for me in cleaning input on webapps where there's a
>>> lot of copy/paste from varied sources.
>>> --
>>> http://mail.python.org/mailman/listinfo/python-list
>>>
>> Well, for this kind of dirty "data cleaning" you may as well use e.g.
>>
>>>>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
>> u'text  with non ASCII in between ...'
>>>>>
>>
>> vbr
>>
>

OK, in that case the file is probably saved as UTF-8; \xe2 is just
the first of the three bytes that encode this character:

>>> u'≤'.encode("utf-8")
'\xe2\x89\xa4'
>>>
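
As a quick check of that, here is a minimal sketch that should run
unchanged on Python 2 and on Python 3.3+ (the names symbol, encoded and
lead_byte are only for illustration). The character is written as the
escape \u2264 so that the snippet itself stays pure ASCII:

```python
# Sketch: confirm that 0xE2 is the first byte of the UTF-8 form of U+2264.
symbol = u"\u2264"  # LESS-THAN OR EQUAL TO, written as an escape

encoded = symbol.encode("utf-8")

# bytearray yields integers on both Python 2 and Python 3,
# so indexing it avoids the str/bytes difference between the two.
lead_byte = bytearray(encoded)[0]

print(len(encoded))    # 3
print(hex(lead_byte))  # 0xe2
```

So the '\xe2' in the SyntaxError is simply where the parser first hit
a byte outside the ASCII range.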

Declaring this encoding at the top of the file (a
"# -*- coding: utf-8 -*-" comment on the first or second line, as
described in PEP 263 and mentioned before) might solve the problem
while retaining the symbol in question. Depending on other
circumstances, you could also just move from the syntax error to some
Unicode-related error later on.
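
For illustration, a sketch of how the top of the test file might look
with the declaration added. This assumes the editor really does save
the file as UTF-8; the assignment is your line 1, unchanged:

```python
# -*- coding: utf-8 -*-
# The declaration above has to be on the first or second line of the file
# and has to match the encoding the editor actually uses when saving.
a = u'''≤'''.encode("ascii", "ignore").decode("ascii")

# The non-ASCII character is dropped by the "ignore" error handler,
# so only an empty string is left in a.
print(repr(a))
```

With the declaration in place the parser can decode the literal, and
the encode/decode round trip then discards the symbol at run time
instead of failing at compile time.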

vbr


