[Tutor] A regular expression problem

Tue Nov 30 15:32:21 CET 2010

On Sun, Nov 28, 2010 at 6:14 PM, Steven D'Aprano <steve at pearwood.info> wrote:
<snip>
> Have you considered just using the isalnum() method?
>
> >>> '¿de'.isalnum()
> False

Mmm. No, I didn't consider it because I didn't even know such a method
existed. It could turn out to be very handy, but I don't think it would
help me at this stage because the texts I'm working with also contain
a lot of non-alphanumeric characters that occur in isolation, so I
would get a lot of noise.
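
A minimal sketch of the "noise" concern, assuming Python 3 (where str is
Unicode); the sample strings are made up for illustration. isalnum() on
its own returns False both for a word with symbols attached and for a
bare run of symbols, so it cannot tell the two apart:

```python
# Python 3 strings are Unicode, so isalnum() accepts accented letters.
# \u00bf is the inverted question mark, \u00f3 is o with acute accent.
samples = ["de", "\u00bfde", "canci\u00f3n", "&patre--", "&%+"]

# Map each sample to the result of isalnum().
results = {s: s.isalnum() for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

Here '&patre--' and '&%+' both come out False, which is why an extra
test (for example "contains at least one letter") would still be needed.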

> The first thing to do is to isolate the cause of the problem. In your code
> below, you do four different things. In no particular order:
>
> 1 open and read an input file;
> 2 open and write an output file;
> 3 create a mysterious "RegexpTokenizer" object, whatever that is;
> 4 tokenize the input.
>
> We can't run your code because:
>
> 1 we don't have access to your input file;
> 2 most of us don't have the NLTK package;
> 3 we don't know what RegexTokenizer does;
> 4 we don't know what tokenizing does.

As I said in my answer to Evert, I assumed the problem I was having
had to do exclusively with the regular expression pattern I was using.
The code for RegexpTokenizer seems to be pretty simple
(http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539)
and, according to the module docstring, all it does is:

"""
Tokenizers that divide strings into substrings using regular
expressions that can match either tokens or separators between tokens.
"""

<snip>

> you should write:
>
> r'[^a-zA-Z\s0-9]+\w+\S'

Now you can understand why I didn't use r' '. The methods in the module
already use this internally and I just need to pass the regular
expression as the argument.

> Your regex says to match:
>
> - one or more characters that aren't letters a...z (in either
>  case), space or any digit (note that this is *not* the same as
>  characters that aren't alphanum);
>
> - followed by one or more alphanum character;
>
> - followed by exactly one character that is not whitespace.
>
> I'm guessing the "not whitespace" is troublesome -- it will match characters
> like ¿ because it isn't whitespace.

This was my first attempt to match strings like:

'&patre--' or '&patre'

The "not whitespace" was intended to match non-alphanumeric characters
appearing after "regular" characters. I realize I should have added '*'
after '\S', since I also want to match words that do not have a
non-alphanumeric symbol at the end (i.e. '&patre' as opposed to
'&patre--').
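
A small sketch comparing the original pattern with the '\S*' variant
described above. It uses the standard re module directly rather than
NLTK, and the sample text is made up for illustration:

```python
import re

# Original pattern: exactly one trailing non-whitespace character.
ORIGINAL = re.compile(r'[^a-zA-Z\s0-9]+\w+\S')
# Variant with '*': zero or more trailing non-whitespace characters.
REVISED = re.compile(r'[^a-zA-Z\s0-9]+\w+\S*')

text = "&patre-- or &patre"

original_hits = ORIGINAL.findall(text)
revised_hits = REVISED.findall(text)

print(original_hits)  # ['&patre-', '&patre']
print(revised_hits)   # ['&patre--', '&patre']
```

With the single '\S' the second dash of '&patre--' is lost, and '&patre'
only matches because '\w+' gives back its last letter to '\S'.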

>
> I'd try this:
>
> # untested
> \b.*?\W.*?\b
>
> which should match any word with a non-alphanumeric character in it:
>
> - \b ... \b matches the start and end of the word;
>
> - .*? matches zero or more characters (as few as possible);
>
> - \W matches a single non-alphanumeric character.
>
> So putting it all together, that should match a word with at least one
> non-alphanumeric character in it.
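
A quick check of the untested suggestion above, using only the standard
re module; the sample text is an assumption made for illustration. Since
'.' also matches spaces and symbols, the lazy parts can run across a
whole sequence of symbols before the closing '\b' is satisfied:

```python
import re

# The pattern suggested above, compiled unchanged.
SUGGESTED = re.compile(r'\b.*?\W.*?\b')

# 'ab' and 'cd' are plain words; '&%+' is a run of symbols between them.
hits = SUGGESTED.findall("ab &%+ cd")

print(hits)  # ['ab &%+ ']
```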

But since '.' matches any character except a newline, this would
also yield strings where all the characters are non-alphanumeric. I
should have said this in my initial message, but the texts I'm working
with contain lots of strings made up only of non-alphanumeric
characters (e.g. '&%+' or '&//'). I'm trying to match only strings
that are a mixture of both non-alphanumeric characters and [a-zA-Z].
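
One possible way to express that requirement, sketched without regular
expressions: split on whitespace and keep tokens that contain at least
one letter and at least one non-alphanumeric character. This is a
swapped-in technique, not the NLTK tokenizer; the function name
mixed_tokens and the sample text are hypothetical, and Python 3 is
assumed:

```python
def mixed_tokens(text):
    """Return whitespace-separated tokens that mix letters and symbols.

    Hypothetical helper: a token qualifies if it has at least one
    alphabetic character and at least one non-alphanumeric character.
    """
    return [
        tok for tok in text.split()
        if any(ch.isalpha() for ch in tok)
        and any(not ch.isalnum() for ch in tok)
    ]


# \u00bf is the inverted question mark.
sample = "&patre-- &%+ &// hola \u00bfde 123"
kept = mixed_tokens(sample)
print(kept)  # ['&patre--', '\u00bfde']
```

Pure symbol runs such as '&%+' and plain words such as 'hola' are both
dropped, and accented letters count as letters because str.isalpha() is
Unicode-aware in Python 3.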

> [...]
>>
>> If you notice, there are some words that have an accented character
>> that get treated in a strange way: all the characters that don't have
>> a tilde get deleted and the accented character behaves as if it were a
>> non alpha-numeric symbol.
>
> Your regex matches if the first character isn't a space, a digit, or a
> a-zA-Z. Accented characters aren't a-z or A-Z, and therefore will match.
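
A minimal reproduction of that effect with the standard re module,
assuming Python 3 and a made-up sample word. The accented letter is not
in a-zA-Z, so it is the first character that can start a match, and the
unaccented letters before it are dropped:

```python
import re

ORIGINAL = re.compile(r'[^a-zA-Z\s0-9]+\w+\S')

# 't\u00edtulos' is "titulos" with an acute accent on the first 'i'.
word = "t\u00edtulos"
hits = ORIGINAL.findall(word)

print(hits)  # ['\u00edtulos'], i.e. the leading 't' is lost
```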

I guess this is because the character encoding was not specified, but
accented characters in the languages I'm dealing with should be
treated as a-z or A-Z, shouldn't they? I mean, how do you deal with
languages other than English in regular expressions? I would
assume that as long as you set the right encoding, Python will be able
to determine which specific sequences of bytes count as a-z
or A-Z.
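
A short sketch of how this behaves, assuming Python 3 and UTF-8 input
(the sample bytes are made up). Decoding gives the right characters, but
the class [a-zA-Z] still means only the 52 ASCII letters whatever the
encoding; a Unicode-aware class such as [^\W\d_] ("word character that
is not a digit or underscore") is one way to cover accented letters:

```python
import re

# Decode the bytes first; \xc3\xb3 is the UTF-8 encoding of o-acute.
raw = b"canci\xc3\xb3n"
word = raw.decode("utf-8")

# ASCII-only class: the accented letter splits the word in two.
ascii_hits = re.findall(r'[a-zA-Z]+', word)

# Unicode-aware class: in Python 3, \w covers non-ASCII letters for str.
unicode_hits = re.findall(r'[^\W\d_]+', word)

print(ascii_hits)    # ['canci', 'n']
print(unicode_hits)  # ['canci\u00f3n']
```

In Python 2 the same idea needs unicode strings and the re.UNICODE flag
for \w to cover accented letters.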

Josep M.