[Tutor] about python + regular expression

Michael Janssen Janssen@rz.uni-frankfurt.de
Thu Mar 20 11:42:10 2003


On Wed, 19 Mar 2003, Abdirizak abdi wrote:

>
> Hi everyone,
>
> thanks gregor and Michael for your contribution:
>
> While
>
>  >>> buf = re.compile("[a-zA-Z]+")
>  >>> buf.findall(str)
> ['Data', 'sparseness', 'is', 'an', 'inherent', 'problem', 'in',
> 'statistical', 'methods', 'for', 'natural', 'language', 'processing']
>  >>>
> this is the result that I want:
>
> ['Data', 'sparseness', 'is', 'an', 'inherent', 'problem', 'in',
> 'statistical', 'methods', 'for', 'natural', 'language', 'processing', '.']

exp_token = re.compile(r"""
([-a-zA-Z0-9_]+|    # a token can be a word
[\"\'.,;:!\?])      # *or* a single character of this character set
""", re.VERBOSE)    # VERBOSE ignores all this whitespace and comments

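The character sets will need fine-tuning for your own text (for example,
add any other punctuation marks you want to keep as tokens).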
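
As a quick check, here is a minimal sketch that runs the pattern above with
findall(). The input sentence is an assumption, reconstructed from the output
quoted earlier in this thread, and the variable names text and tokens are only
illustrative (text is used instead of str so the builtin is not shadowed).

```python
import re

# Same pattern as above: a token is either a "word" or one punctuation mark.
exp_token = re.compile(r"""
([-a-zA-Z0-9_]+|    # a token can be a word
[\"\'.,;:!\?])      # *or* a single character of this character set
""", re.VERBOSE)    # VERBOSE ignores the whitespace and comments in the pattern

# Assumed input, reconstructed from the output quoted in the thread.
text = ("Data sparseness is an inherent problem in statistical "
        "methods for natural language processing.")

# With a single group in the pattern, findall() returns a list of strings.
tokens = exp_token.findall(text)
print(tokens)
```

The final element of tokens is the '.' that the plain "[a-zA-Z]+" pattern
dropped.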

>
> Another question
>
>  When reading text from a file, is it really necessary to scan it with a
> while loop, or is the following enough, followed by a loop to manipulate
> the text? What is the real difference?
>
> infile = open('file.txt')
> buffer = infile.readline()

Neither is needed. In recent versions of Python (in older versions a while
loop is indeed the way to go), you can iterate over the file directly:
for line in open('file.txt'):
     # process line

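# Python 2.2 and later also accept the newer spelling file()
# (note: file() was removed in Python 3, where open() is the only spelling):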
for line in file('file.txt'):
     # process line


readline() reads one line of the file, read() returns the whole file as a
single string, and readlines() returns the whole file as a list of lines.
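
To make the difference concrete, here is a small sketch comparing the three
methods with plain iteration. It assumes a current Python 3 (hence the "with"
statement, which did not exist when this was first written), and the sample
file and its contents are made up purely for illustration.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

with open(path) as f:
    one_line = f.readline()     # just the first line, newline included

with open(path) as f:
    whole_text = f.read()       # the entire file as one string

with open(path) as f:
    all_lines = f.readlines()   # the entire file as a list of lines

with open(path) as f:
    # Iterating over the file object yields one line at a time,
    # without loading the whole file into memory first.
    iterated = [line for line in f]

print(repr(one_line))
print(repr(whole_text))
print(all_lines)
```

Iterating over the file gives the same lines as readlines(), so for
line-by-line processing the for loop is usually all you need.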


Michael
