Stripping ASCII codes when parsing

Tony Nelson *firstname*nlsnews at georgea*lastname*.com
Mon Oct 17 23:11:24 EDT 2005


In article <mailman.2178.1129571437.509.python-list at python.org>,
 David Pratt <fairwinds at eastlink.ca> wrote:

> This is very nice :-)  Thank you Tony.  I think this will be the way to  
> go.  My concern ATM is where it will be best to unicode. The data after  
> this will go into dict and a few processes and into database. Because  
> input source if not explicit encoding, I will have to assume ISO-8859-1  
> I believe but could well be cp1252 for most part ( because it says no  
> ASCII (0-30) but alright ASCII chars 128-254) and because most are  
> Windows users.  Am thinking to unicode after stripping these characters  
> and validating text, then unicoding (utf-8) so it is unicode in dict.  
> Then when I perform these other processes it should be uniform and then  
> it will go into database as unicode.  I think this should be ok.

Definitely "".translate() then unicode().  See the docs for 
"".translate().  As for the charset, if you can't know it in advance, 
you'll want some way to configure it for when the guess is wrong.  Also, 
maybe 255 is not allowed and should be checked for?
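
A minimal sketch of that order of operations (strip first, then decode), 
written for Python 3, where "".translate() and unicode() are spelled 
bytes.translate() and bytes.decode().  The names CONTROL_BYTES, 
clean_and_decode and reject_255 are made up for illustration; the 0-30 
range and the ISO-8859-1 default come from the quoted message, not from 
any spec, so treat both as assumptions to adjust.

```python
# Assumption: bytes 0-30 are the disallowed control codes, as stated in
# the quoted message.  Note this range includes tab, LF and CR.
CONTROL_BYTES = bytes(range(31))


def clean_and_decode(raw, encoding="iso-8859-1", reject_255=True):
    """Delete control bytes from raw, then decode it to text.

    encoding is a parameter so the charset guess can be changed when it
    turns out to be wrong (for example to "cp1252").
    """
    # With a table of None, translate() only deletes the listed bytes.
    stripped = raw.translate(None, CONTROL_BYTES)

    # Assumption: byte 255 may also be invalid in this input, so it is
    # rejected by default instead of being silently decoded.
    if reject_255 and b"\xff" in stripped:
        raise ValueError("byte 255 found in input")

    return stripped.decode(encoding)


text = clean_and_decode(b"caf\xe9\x00\x07 ok")
print(text)  # café ok
```

Deleting the bytes before decoding keeps the stripping step independent 
of the charset guess, so only the final decode() needs to change if the 
data turns out to be cp1252 and not ISO-8859-1.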
________________________________________________________________________
TonyN.:'                        *firstname*nlsnews at georgea*lastname*.com
      '                                  <http://www.georgeanelson.com/>


