[Tutor] Extracting words(quest 2)

Alexandre Ratti alex@gabuzomeu.net
Tue, 26 Mar 2002 17:06:51 +0100


Hi Nicole,


At 16:22 26/03/2002 +0100, Nicole Seitz wrote:
>Thanks, this was very helpful! Though there are some lines (see below) I 
>don't understand. Hope you don't mind explaining.

> > import re, pprint
> >
> > def indexWord(filename):
> >      wordDict = {}
> >      lineCount = 0
> >      file = open(filename)
> >      expr = re.compile(r"\w+", re.LOCALE)
>
>What's exactly the meaning of the flag LOCALE?

With this flag, \w also matches the characters that the current locale 
defines as letters (e.g. French accented letters, German umlauts, etc.). 
Here is a longer explanation:

\w
When the LOCALE and UNICODE flags are not specified, matches any 
alphanumeric character; this is equivalent to the set [a-zA-Z0-9_]. With 
LOCALE, it will match the set [0-9_] plus whatever characters are defined 
as letters for the current locale. If UNICODE is set, this will match the 
characters [0-9_] plus whatever is classified as alphanumeric in the 
Unicode character properties database.
Source: http://www.python.org/doc/current/lib/re-syntax.html
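
To make the difference concrete, here is a small sketch. It assumes 
Python 3, where string patterns follow the Unicode rules by default and 
re.ASCII restricts \w to [a-zA-Z0-9_]. In Python 3, re.LOCALE is only 
accepted for bytes patterns, so the sketch contrasts ASCII and Unicode 
matching instead. The sample text is made up.

```python
import re

# Made-up sample text with a French accent and a German umlaut.
text = "caf\u00e9 M\u00fcller"

# ASCII-only matching: \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Unicode matching (the Python 3 default for str patterns).
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['caf', 'M', 'ller']
print(unicode_words)  # ['café', 'Müller']
```

Without a locale-aware or Unicode-aware \w, accented words are cut into 
pieces at each accented letter.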


> >      while 1:
> >          line = file.readline()
> >          if not line:
> >              break
> >          lineCount = lineCount + 1
> >          resultList  = expr.findall(line)
>
>So I can't use match()???

match() only matches patterns at the beginning of strings. To match 
patterns anywhere in a string, use search() instead.

Besides, match() and search() return a single match object (or None if 
nothing matched), and you need to call the group() method of this object 
to get the matched text.

I find it easier to use findall(), which directly returns a list of all 
the matched strings.
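
Here is a short sketch of the three methods side by side. It assumes 
Python 3 syntax, and the sample line is made up.

```python
import re

expr = re.compile(r"\w+")
line = "  spam and eggs"  # made-up sample line starting with spaces

# match() only tries at position 0, where there is a space: no match.
m = expr.match(line)
print(m)  # None

# search() scans forward and stops at the first match.
s = expr.search(line)
print(s.group())  # spam

# findall() returns every matched string as a list.
words = expr.findall(line)
print(words)  # ['spam', 'and', 'eggs']
```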

This howto is a good reference to understand Python regular expressions:
http://py-howto.sourceforge.net/regex/regex.html
http://py-howto.sourceforge.net/pdf/regex.pdf


> > if __name__ == "__main__":
> >      filename = r"c:\foo\bar\baz.txt"
> >      wordDict = indexWord(filename)
>What's happening here?

The if __name__ == "__main__" idiom allows you to easily test a code 
snippet. The code under it is only executed when you run the module 
directly, not when the module is imported into another one.

wordDict = indexWord(filename) just calls the indexWord() function with the 
source file name as a parameter. This function returns a Python dictionary 
of words.
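
A minimal sketch of the idiom follows. The module name wordindex.py and 
the function index_lines() are hypothetical stand-ins, since the full 
indexWord() function is not shown here. I am assuming the dictionary maps 
each word to the line numbers where it occurs, and I am using Python 3 
syntax.

```python
# wordindex.py (hypothetical module name)
import re

def index_lines(lines):
    """Map each word to the list of line numbers where it occurs.

    This is an assumed stand-in for indexWord(); it takes a list of
    strings instead of a file name so it can be tried without a file.
    """
    expr = re.compile(r"\w+")
    word_dict = {}
    for line_count, line in enumerate(lines, start=1):
        for word in expr.findall(line):
            word_dict.setdefault(word, []).append(line_count)
    return word_dict

if __name__ == "__main__":
    # Runs with "python wordindex.py", but not on "import wordindex".
    print(index_lines(["spam and eggs", "more spam"]))
```

Another module can then do "import wordindex" and call 
wordindex.index_lines() without triggering the test code at the bottom.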

>By the way, the program now works pretty well, though the output is 
>sometimes a bit awkward, for example, when there are many occurrences of 
>one word. Doesn't look very nice. I'm trying to change this now.

Yes, pprint is just a quick way to display the content of a dictionary or 
list. For a nicer output, you need to write some more code.
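
As one possible starting point, here is a sketch that prints one word per 
line followed by its line numbers. The function name format_index() is 
hypothetical, and I am assuming each dictionary value is a list of line 
numbers. Python 3 syntax.

```python
def format_index(word_dict):
    """Return the index as text: one word per line, sorted by word."""
    lines = []
    for word in sorted(word_dict):
        # Join the line numbers with commas instead of showing a raw list.
        numbers = ", ".join(str(n) for n in word_dict[word])
        # Left-align the word in a 10-character column.
        lines.append("%-10s %s" % (word, numbers))
    return "\n".join(lines)

sample = {"spam": [1, 2, 5], "eggs": [1]}  # made-up sample data
print(format_index(sample))
```

The column width of 10 is arbitrary; adjust it to the longest word in your 
text.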

>Oh, I nearly forgot :
>How come that the words in the output are alphabetically ordered?

I was surprised too. The pprint function (pretty-print) sorts dictionary 
entries by key before displaying them. If you change the line

         pprint.pprint(wordDict)

to
         print wordDict

you'll see that the output is not sorted (dictionaries are hash tables: 
entries are stored in whatever order makes lookups fast, not 
alphabetically).
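
The difference can be checked with a small sketch. It assumes Python 3.7 
or later, where a plain dictionary displays its entries in insertion 
order, while pprint still sorts them by key by default. The sample data is 
made up.

```python
import pprint

word_dict = {"spam": [1, 2], "eggs": [1], "bacon": [3]}  # made-up sample

plain = str(word_dict)              # the text that print() shows
pretty = pprint.pformat(word_dict)  # the text that pprint.pprint() shows

print(plain)   # {'spam': [1, 2], 'eggs': [1], 'bacon': [3]}
print(pretty)  # {'bacon': [3], 'eggs': [1], 'spam': [1, 2]}

# To control the order yourself, sort the keys explicitly.
for word in sorted(word_dict):
    print(word, word_dict[word])
```

Sorting the keys explicitly gives alphabetical output without depending on 
what pprint happens to do.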


Cheers.

Alexandre