Identifying unicode punctuation characters with Python regex

Mark Tolonen M8R-yfto6h at mailinator.com
Fri Nov 14 05:43:07 EST 2008


"Shiao" <multiseed at gmail.com> wrote in message 
news:3a95a51c-cc4f-45ff-ae4d-c596c7bfab72 at l33g2000pri.googlegroups.com...
> Hello,
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?
>
> Apart from manually including the punctuation character range for each
> and every language, I don't see how this can be done.
>
> Thanks in advance for any suggestions.
>
> John

You can always build your own pattern.  Something like this (Python 3.0rc2), 
which collects the 'Po' (other punctuation) characters below U+10000:

>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+re.escape(Po)+']')  # escape: the backslash is itself in Po
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']
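
The session above only picks up category 'Po' below U+10000.  To get closer 
to what \p{P} means elsewhere, a rough sketch (not tested against every 
Python version) would take every category starting with 'P' over the whole 
code point range.  The names punct and punct_re are just illustrative, and 
re.escape is used so that the backslash, ']' and '-' stay literal inside the 
character class:

```python
import re
import sys
import unicodedata

# Collect every code point whose general category starts with 'P'
# (Pc, Pd, Ps, Pe, Pi, Pf, Po), across the whole Unicode range.
punct = ''.join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
)

# re.escape keeps characters such as '\\', ']' and '-' literal inside the class.
punct_re = re.compile('[' + re.escape(punct) + ']')

print(punct_re.findall('我是美國人。'))    # ['。']
print(punct_re.findall('a-b [c] \\ d!'))  # ['-', '[', ']', '\\', '!']
```

Building the string scans about a million code points, so do it once at 
import time rather than per call.  If a third-party dependency is acceptable, 
the regex module on PyPI accepts \p{P} directly.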

-Mark



