[Tutor] Stuck: unicode in regular expressions

Tue Aug 9 17:01:04 CEST 2005

Ron Phillips wrote:
> I am expecting users to cut-and-paste DMS data into an application — 
> like:  +40 30 15   E40 15 34.56, -81 0 0,   81 57 34.27E, W 40° 13’ 
> 27.343”, 40° 13’ 27.343” S, 140° 13’ 27.343”S, S40° 13’ 27.34454,  
> 81:57:34.27E 
>  
> I've been able to write a regex that seems to work in redemo.py, but it 
> doesn't do at all what I want when I try to code it using the re module. 
> The problem seems to be the way I am using unicode — specifically all 
> those punctuation marks that might get pasted in. I anticipate the 
> program getting its input from a browser; maybe that will narrow down 
> the range somewhat. 

I'm guessing a bit here, but you need to know what encoding the browser is sending you. If the input comes from a form, I think the data comes back in the same encoding as the page containing the form. Then I think you can either
- convert the form data to unicode and use unicode in the regex, or
- use the same encoding for the regex as the form data
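
As a rough sketch of both options, assuming the form data arrives as UTF-8 bytes (that is only an assumption, check what your page actually uses). The sample string and the names raw, marks and byte_marks are made up for illustration:

```python
import re

# Assumption: the browser sent this field as UTF-8 bytes.
# It is the text  W 40° 13’ 27.343”  from the examples in the question.
raw = b"W 40\xc2\xb0 13\xe2\x80\x99 27.343\xe2\x80\x9d"

# Option 1: convert the form data to unicode and use a unicode regex.
text = raw.decode("utf-8")
# \u00b0 is the degree sign, \u2019 is the ’ and \u201d is the ” character.
# The u prefix is needed on Python 2 and harmless on Python 3.
marks = re.compile(u"[\u00b0\u2019\u201d]")
found = marks.findall(text)

# Option 2: leave the data as bytes and write the regex in the same encoding.
# Each mark is several bytes long in UTF-8, so use alternation rather than
# a character class.
byte_marks = re.compile(b"\xc2\xb0|\xe2\x80\x99|\xe2\x80\x9d")
found_bytes = byte_marks.findall(raw)
```

One thing to watch: \x takes exactly two hex digits, so \x2019 in a pattern means \x20 (a space) followed by the digits 19, not the ’ character. For code points above FF you need the \u form in a unicode string.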

A good way to start would be to run
print repr(formdata)
which will show you exactly what is in the data.
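
To give an idea of what to look for, here is a small sketch of how the same ’ character shows up under two encodings a browser might plausibly use. The two encodings and the variable names are my assumptions, not something I know about your setup:

```python
# Hypothetical samples: the text 13’ encoded two different ways.
as_utf8 = u"13\u2019".encode("utf-8")      # ’ becomes the three bytes e2 80 99
as_cp1252 = u"13\u2019".encode("cp1252")   # ’ becomes the single byte 92

# repr() shows the non-ASCII bytes as \x escapes, so you can tell them apart.
print(repr(as_utf8))
print(repr(as_cp1252))

# Decoding each with its own encoding gives back the same unicode text.
same = as_utf8.decode("utf-8") == as_cp1252.decode("cp1252")
```

If the repr of your form data shows \x92 where the ’ should be, the data is probably in a Windows code page; if it shows \xe2\x80\x99, it is probably UTF-8.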

Kent

>  
> Anyway, given the string above, what regex will match the  ” and    ’ 
> characters, please? I have tried \x02BC and \x92 and \x2019 for the ’ , 
> but no result. I am sure it's simple; I am sure some other newbie has 
> asked it, but I have Googled my brains out, and can't find it.
>  
> Ron 