Special chars with HTMLParser

Piet van Oostrum piet at cs.uu.nl
Wed Aug 5 14:20:47 EDT 2009


>>>>> Fafounet <fafounet at gmail.com> (F) wrote:

>F> Thank you, now I can get the correct character.
>F> Now when I have the string ab&#xE9;cd I can get ab then é thanks to
>F> your function and then cd. But how is it possible to know that cd is
>F> still the same word ?

That depends on your definition of `word', and that definition is
language-dependent.

What you normally do is collect the text in a (unicode) string variable.
This happens in handle_data, handle_charref and handle_entityref.
Then you check whether the previously collected text was a word (e.g.
consisting only of Unicode letters) and whether the new text also
consists of letters. If both hold, the new text continues the same word.
If your language has additional word constituents like - or ' you have
to allow those as well.
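
A minimal sketch of that idea, assuming Python 3 (where the module is
html.parser and convert_charrefs=False is needed so that handle_charref
and handle_entityref are still called). The class name WordCollector,
the extra_chars parameter and the helper methods are hypothetical names
chosen for illustration, and str.isalpha() stands in for "is a letter":

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect words from data, character refs and entity refs alike."""

    def __init__(self, extra_chars=""):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)
        self.extra_chars = extra_chars  # hypothetical: e.g. "-'" if needed
        self.words = []
        self._current = []  # characters of the word being built

    def _add_text(self, text):
        # A letter (or allowed extra character) continues the current word;
        # anything else ends it.
        for ch in text:
            if ch.isalpha() or ch in self.extra_chars:
                self._current.append(ch)
            else:
                self._flush()

    def _flush(self):
        if self._current:
            self.words.append("".join(self._current))
            self._current = []

    def handle_data(self, data):
        self._add_text(data)

    def handle_charref(self, name):
        # name is e.g. "233" or "xE9"
        if name[0] in "xX":
            code = int(name[1:], 16)
        else:
            code = int(name)
        self._add_text(chr(code))

    def handle_entityref(self, name):
        code = name2codepoint.get(name)
        if code is None:
            self._flush()  # unknown entity: treated as a word break here
        else:
            self._add_text(chr(code))

    def close(self):
        super().close()
        self._flush()  # emit the last word, if any


parser = WordCollector()
parser.feed("ab&#xE9;cd ef")
parser.close()
print(parser.words)
```

Because all three handlers feed the same buffer, ab, the character from
&#xE9; and cd end up in one word. Tags are not treated as word
boundaries in this sketch; whether they should be depends on your
document.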

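You can do this check with unicodedata.category or with a regular
expression. In a regular expression \w may be helpful, provided it is
set up for your text: use the re.UNICODE flag for unicode strings, or
re.LOCALE with a correct locale for byte strings. Note that \w also
matches digits and the underscore.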
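
Both checks could look roughly like this. It is a sketch assuming
Python 3, where str patterns are Unicode-aware by default; the function
names and the extra_chars parameter are hypothetical:

```python
import re
import unicodedata


def is_word_by_category(text, extra_chars=""):
    """True if every character is a Unicode letter (category L*)
    or one of the explicitly allowed extra characters."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("L") or ch in extra_chars
        for ch in text
    )


# \w also matches decimal digits and the underscore, so this class
# subtracts them: "a word character that is neither \d nor _".
WORD_RE = re.compile(r"[^\W\d_]+")


def is_word_by_regex(text):
    """True if the whole string matches the letters-like pattern above."""
    return WORD_RE.fullmatch(text) is not None


print(is_word_by_category("ab\xe9cd"))
print(is_word_by_regex("ab\xe9cd"))
print(is_word_by_category("ab cd"))
```

Neither check accepts combining marks, so text in decomposed form (a
base letter followed by a separate accent) may need
unicodedata.normalize("NFC", text) first.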
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


