Identifying unicode punctuation characters with Python regex

Mark Tolonen M8R-yfto6h at mailinator.com
Fri Nov 14 06:30:39 EST 2008


"Mark Tolonen" <M8R-yfto6h at mailinator.com> wrote in message 
news:xsydnXWBAriky4DUnZ2dnUVZ_jCdnZ2d at comcast.com...
>
> "Shiao" <multiseed at gmail.com> wrote in message 
> news:3a95a51c-cc4f-45ff-ae4d-c596c7bfab72 at l33g2000pri.googlegroups.com...
>> Hello,
>> I'm trying to build a regex in python to identify punctuation
>> characters in all the languages. Some regex implementations support an
>> extended syntax \p{P} that does just that. As far as I know, python re
>> doesn't. Any idea of a possible alternative?
>>
>> Apart from manually including the punctuation character range for each
>> and every language, I don't see how this can be done.
>>
>> Thanks in advance for any suggestions.
>>
>> John
>
> You can always build your own pattern.  Something like (Python 3.0rc2):
>
>>>> import unicodedata
>>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>>> import re
>>>> r=re.compile('['+Po+']')
>>>> x='我是美國人。'
>>>> x
> '我是美國人。'
>>>> r.findall(x)
> ['。']
>
> -Mark
>

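This was an interesting problem.  Both \ and ] need to be escaped inside the 
character class to find all the punctuation correctly.  It turns out those two 
characters are adjacent in the Unicode character set (U+005C and U+005D), so 
in a class built in code point order the \ coincidentally escapes the ] that 
follows it, and only \ itself goes unmatched.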
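
A quick check makes the coincidence concrete.  This is only an illustrative 
sketch on a three-character sample; the names chars, naive and fixed are made 
up for the example and are not part of the session below.

```python
import re

# '\' is U+005C and ']' is U+005D: neighbours in code point order.
print(hex(ord('\\')), hex(ord(']')))       # 0x5c 0x5d

chars = '!\\]'                              # three characters: ! \ ]
naive = re.compile('[' + chars + ']')       # pattern text is [!\]]
print(naive.findall(chars))                 # ['!', ']']  (the backslash is lost)

# Escape the backslash first, then the closing bracket.
escaped = chars.replace('\\', '\\\\').replace(']', '\\]')
fixed = re.compile('[' + escaped + ']')     # pattern text is [!\\\]]
print(fixed.findall(chars))                 # ['!', '\\', ']']
```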

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A))                     # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A))         # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]')   # escape both of them.
>>> len(re.findall('['+P+']',A))
491
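
The manual escaping can probably be left to re.escape instead.  The following 
is a minimal sketch of that variation, not what the session above does: the 
function name build_punctuation_pattern is hypothetical, and scanning up to 
sys.maxunicode rather than 65536 is my own assumption so that punctuation 
outside the Basic Multilingual Plane is included.

```python
import re
import sys
import unicodedata


def build_punctuation_pattern(limit=sys.maxunicode + 1):
    """Compile a character class matching every code point below `limit`
    whose Unicode general category starts with 'P' (punctuation)."""
    chars = (chr(i) for i in range(limit))
    punct = [c for c in chars if unicodedata.category(c).startswith('P')]
    # re.escape handles '\', ']', '-' and any other character that is
    # special inside a character class.
    return re.compile('[' + ''.join(re.escape(c) for c in punct) + ']')


PUNCT_RE = build_punctuation_pattern()
print(PUNCT_RE.findall('我是美國人。'))    # ['。']
print(PUNCT_RE.findall('a\\b]c'))          # ['\\', ']']
```

Building the pattern walks every code point once, so it is worth compiling it 
a single time and reusing the result.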

-Mark



