Script for finding words of any size that do NOT contain vowels with acute diacritic marks?

Wed Oct 17 11:32:52 EDT 2012

On Wednesday, 17 October 2012 17:00:46 UTC+2, Dave Angel wrote:
> On 10/17/2012 10:31 AM, nwaits wrote:
> > I'm very impressed with python's wordlist script for plain text.  Is there a script for finding words that do NOT have certain diacritic marks, like acute or grave accents (utf-8), over the vowels?
> > Thank you.
>
> if you can construct a list of "illegal" characters, then you can simply
> check each character of the word against the list, and if it succeeds
> for all of the characters, it's a winner.
>
> If that's not fast enough, you can build a translation table from the
> list of illegal characters, and use translate on each word.  Then it
> becomes a question of checking if the translated word is all zeroes.
> More setup time, but much faster looping for each word.
>
> --
> DaveA
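
For reference, the two approaches described above could be sketched roughly as follows. The character list ILLEGAL and the function names are illustrative assumptions, not taken from the quoted message, and the sketch assumes precomposed (NFC) text. In Python 3, str.translate with a mapping to None deletes characters, so instead of testing for "all zeroes" this variant tests whether anything was deleted.

```python
# Hypothetical list of "illegal" characters: vowels with an acute accent.
# Assumes the text is in precomposed (NFC) form.
ILLEGAL = set('áéíóúýÁÉÍÓÚÝ')

def is_clean(word):
    """Return True if no character of word is in ILLEGAL."""
    return not any(ch in ILLEGAL for ch in word)

# Translation-table variant: map every illegal character to None,
# which makes str.translate delete it.
DELETE_ILLEGAL = {ord(ch): None for ch in ILLEGAL}

def is_clean_translate(word):
    """Return True if deleting illegal characters leaves word unchanged."""
    return word.translate(DELETE_ILLEGAL) == word
```

Both functions should agree on any word; which one is faster depends on the word list and is worth measuring.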

A lazy way, in Python 3.2:

>>> import unicodedata
>>> def HasDiacritics(w):
...     w_decomposed = unicodedata.normalize('NFKD', w)
...     return 'no' if len(w) == len(w_decomposed) else 'yes'
...     
>>> HasDiacritics('éléphant')
'yes'
>>> HasDiacritics('elephant')
'no'
>>> HasDiacritics('\N{LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON}')
'yes'
>>> HasDiacritics('U')
'no'
>>>

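This should be OK for precomposed characters whose marks are in the
Combining Diacritical Marks Unicode range (U+0300 to U+036F, the common
diacritics). Because it only compares lengths, a word that is already
in decomposed form reports 'no', and NFKD also expands compatibility
characters such as ligatures, which then report 'yes'.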
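
If only acute accents on vowels matter, as in the original question, a more targeted check is possible. The following is a minimal sketch, not a definitive implementation: the names has_acute_vowel and words_without_acute and the VOWELS set are assumptions made for illustration. It uses NFD so that only canonical decompositions are applied, then looks for U+0301 attached to a vowel.

```python
import unicodedata

ACUTE = '\N{COMBINING ACUTE ACCENT}'  # U+0301
VOWELS = set('aeiouyAEIOUY')          # assumption: adjust for the language

def has_acute_vowel(word):
    """Return True if any vowel in word carries an acute accent."""
    base = ''
    for ch in unicodedata.normalize('NFD', word):
        if unicodedata.combining(ch):
            # A combining mark belongs to the most recent base character.
            if ch == ACUTE and base in VOWELS:
                return True
        else:
            base = ch
    return False

def words_without_acute(text):
    """Return the whitespace-separated words with no acute-accented vowel."""
    return [w for w in text.split() if not has_acute_vowel(w)]
```

Because the word is normalized first, precomposed and already-decomposed input are treated the same way, and grave accents or acute accents on consonants are ignored.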

jmf