Regular expressions and Unicode
Peter Otten
__peter__ at web.de
Thu Oct 2 16:11:05 EDT 2008
Jeffrey Barish wrote:
> I have a regular expression that I use to extract the surname:
>
> surname = r'(?u).+ (\w+)'
>
> However, when I apply it to this Unicode string, I get only the first 3
> letters of the surname:
>
> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
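That's a byte string, not a Unicode string, so \w sees the individual UTF-8
bytes rather than the characters they encode. You can either change the
literal to a Unicode literal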
name = u'Anton\xedn Dvo\u0159\xe1k'
or decode it with the proper encoding
name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
name = name.decode("utf-8")
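As a quick check that both options yield the same string, here is a minimal sketch. It assumes Python 3, where a plain literal is already Unicode and the byte string needs a b prefix; the variable names are illustrative only.

```python
# Assumes Python 3: '...' is a Unicode string, b'...' is a byte string.
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'   # UTF-8 encoded bytes
literal = 'Anton\xedn Dvo\u0159\xe1k'          # the same name as a Unicode literal
decoded = raw.decode('utf-8')                  # bytes -> Unicode string

# Both routes give the same Unicode string.
same = (decoded == literal)

# Three characters take two bytes each in UTF-8, hence the length difference.
byte_length = len(raw)
char_length = len(decoded)
```

The length difference is the root of the problem: the regex only behaves as expected when it works on characters, not on the bytes that encode them.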
> surname_re = re.compile(surname)
> m = surname_re.search(name)
> m.groups()
> ('Dvo\xc5',)
>
> I suppose that there is an encoding problem, but I don't understand
> Unicode well enough to know what to do to digest properly the Unicode
> characters in the surname.
>>> import re
>>> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
>>> re.compile(r"(?u).+ (\w+)").search(name.decode("utf-8")).groups()
(u'Dvo\u0159\xe1k',)
>>> print _[0]
Dvořák
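For readers on Python 3, the same comparison might look like the sketch below. It is an adaptation, not the session above: it assumes Python 3, drops the (?u) flag (Unicode matching is already the default for str patterns there, and the flag is rejected for bytes patterns), and uses illustrative variable names.

```python
import re

# Assumes Python 3. The name as UTF-8 encoded bytes, as in the question.
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# Bytes pattern on bytes: \w matches ASCII word characters only,
# so the match stops at the first non-ASCII byte of the surname.
surname_bytes = re.search(br'.+ (\w+)', raw).group(1)

# Decode first, then match a str pattern: \w now matches Unicode letters.
text = raw.decode('utf-8')
surname_text = re.search(r'.+ (\w+)', text).group(1)
```

Here surname_bytes is the truncated b'Dvo', while surname_text is the complete 'Dvo\u0159\xe1k'. Either way the fix is the same as above: decode before matching.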
Peter