[Tutor] Extracting words(quest 2)
Alexandre Ratti
alex@gabuzomeu.net
Tue, 26 Mar 2002 17:06:51 +0100
Hi Nicole,
At 16:22 26/03/2002 +0100, Nicole Seitz wrote:
>Thanx, this was very helpful! Though, there are some lines (see below) I
>don't understand. Hope you don't mind explaining.
> > import re, pprint
> >
> > def indexWord(filename):
> >     wordDict = {}
> >     lineCount = 0
> >     file = open(filename)
> >     expr = re.compile("\w+", re.LOCALE)
>
>What exactly is the meaning of the LOCALE flag?
With this flag, \w also matches the extended characters (e.g. French
accented letters, German umlauts, etc.) that the current locale defines as
letters. Here is a longer explanation:
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character; this is equivalent to the set [a-zA-Z0-9_]. With
LOCALE, it will match the set [0-9_] plus whatever characters are defined
as letters for the current locale. If UNICODE is set, this will match the
characters [0-9_] plus whatever is classified as alphanumeric in the
Unicode character properties database.
Source: http://www.python.org/doc/current/lib/re-syntax.html
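To make the difference concrete, here is a small sketch. It assumes Python 3,
which behaves differently from the interpreter used in this thread: a str
pattern is matched with Unicode rules by default, and the LOCALE flag is only
accepted for bytes patterns. The re.ASCII flag stands in for the "no flag"
behaviour quoted above, and the sample text is made up for illustration.

```python
import re

# Made-up sample text: "café über", written with escapes to avoid
# any dependence on the source file encoding.
text = "caf\u00e9 \u00fcber"

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Default for str patterns in Python 3: \w follows the Unicode database.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```

With the restricted set the accented letters act as word separators, which is
exactly the problem the LOCALE flag was meant to avoid in the original code.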
> >     while 1:
> >         line = file.readline()
> >         if not line:
> >             break
> >         lineCount = lineCount + 1
> >         resultList = expr.findall(line)
>
>So I can't use match()???
match() only matches a pattern at the beginning of a string. To match a
pattern anywhere in a string, use search() instead; note that it stops at
the first match.
Besides, match() and search() return a match object, and you need to call
the group() method of this object to get the matched text.
I find it easier to use findall(), which directly returns a list of all the
matched strings.
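A short comparison of the three methods may help. This is only a sketch: the
sample line is invented, and the code is written for Python 3.

```python
import re

expr = re.compile(r"\w+")

# Invented sample line; it deliberately starts with two spaces.
line = "  extracting words, quest 2"

# match() only looks at the start of the string, where there is a space.
at_start = expr.match(line)

# search() scans forward and returns a match object for the first hit.
first = expr.search(line)

# findall() returns every non-overlapping match as a plain list of strings.
all_words = expr.findall(line)

print(at_start)
print(first.group())
print(all_words)
```

Here match() finds nothing because the line does not begin with a word
character, search() only reports the first word, and findall() returns all of
them, which is what an index of words needs.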
This howto is a good reference to understand Python regular expressions:
http://py-howto.sourceforge.net/regex/regex.html
http://py-howto.sourceforge.net/pdf/regex.pdf
> > if __name__ == "__main__":
> >     filename = r"c:\foo\bar\baz.txt"
> >     wordDict = indexWord(filename)
>What's happening here?
The if __name__ == "__main__" idiom allows you to easily test a code
snippet. This code is only executed when you run the module directly, not
when the module is imported into another one.
wordDict = indexWord(filename) just calls the indexWord() function with the
source file name as a parameter. This function returns a Python dictionary
of words.
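Here is a minimal sketch of the idiom in a complete module. The full body of
indexWord() is not quoted in this message, so index_words() below is a
hypothetical stand-in: it assumes the dictionary maps each word to the list of
line numbers where it occurs, and it takes a list of lines rather than a file
name so that it runs without a file on disk. It is written for Python 3.

```python
import re


def index_words(lines):
    """Map each word to the list of line numbers where it occurs.

    Hypothetical stand-in for indexWord(); the result layout is an assumption.
    """
    expr = re.compile(r"\w+")
    word_dict = {}
    for line_count, line in enumerate(lines, start=1):
        for word in expr.findall(line):
            # Create the list on first sight of a word, then record the line.
            word_dict.setdefault(word, []).append(line_count)
    return word_dict


if __name__ == "__main__":
    # This part runs only when the file is executed directly,
    # not when another module does "import" on it.
    sample = ["the cat", "the dog"]
    print(index_words(sample))
```

Importing this module from another script gives access to index_words()
without triggering the test run at the bottom.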
>By the way, the program now works pretty well, though the output is
>sometimes a bit awkward, for example, when there are many occurrences of
>one word. Doesn't look very nice. I'm trying to change this now.
Yes, pprint is just a quick way to display the content of a dictionary or
list. For a nicer output, you need to write some more code.
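One possible way to write that extra code is sketched below. It assumes the
dictionary maps each word to a list of line numbers (the exact layout is not
shown in this message), and format_index() is a made-up helper name. The code
targets Python 3.

```python
def format_index(word_dict):
    """Return one 'word  line, line, ...' row per word, sorted by word."""
    rows = []
    for word in sorted(word_dict):
        # Join the line numbers so that long lists stay on a single row.
        numbers = ", ".join(str(n) for n in word_dict[word])
        # Left-align the word in a 12-character column.
        rows.append("%-12s %s" % (word, numbers))
    return "\n".join(rows)


# Made-up sample data in the assumed layout.
word_dict = {"the": [1, 2], "cat": [1], "dog": [2]}
print(format_index(word_dict))
```

The column width of 12 is arbitrary; adjust it to the longest word you expect.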
>Oh, I nearly forgot :
>How come that the words in the output are alphabetically ordered?
I was surprised too. The pprint function (pretty-print) sorts dictionary
keys before displaying them. If you change the line
    pprint.pprint(wordDict)
to
    print wordDict
you'll see that the output is no longer sorted (a dictionary is a hash
table, so it does not keep its entries in alphabetical order).
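The two displays can be compared side by side. This sketch assumes Python 3.7
or later, where print is a function and a dictionary keeps its entries in
insertion order; on the interpreter used in this thread the plain display
would follow the internal hash order instead. The sample data is made up.

```python
import pprint

# Made-up sample data, deliberately inserted out of alphabetical order.
word_dict = {"pear": [3], "apple": [1, 2], "mango": [2]}

# pformat() builds the same text that pprint() would print: keys are sorted.
sorted_view = pprint.pformat(word_dict)

# str() is what a plain print shows: no sorting is applied.
plain_view = str(word_dict)

print(sorted_view)
print(plain_view)
```

If you need a sorted display without pprint, loop over sorted(word_dict)
yourself.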
Cheers.
Alexandre