[Tutor] about python + regular expression
Michael Janssen
Janssen@rz.uni-frankfurt.de
Thu Mar 20 11:42:10 2003
On Wed, 19 Mar 2003, Abdirizak abdi wrote:
>
> Hi everyone,
>
> thanks gregor and Michael for your contribution:
>
> While
>
> >>> buf = re.compile("[a-zA-Z]+")
> >>> buf.findall(str)
> ['Data', 'sparseness', 'is', 'an', 'inherent', 'problem', 'in',
> 'statistical', 'methods', 'for', 'natural', 'language', 'processing']
> >>>
> this is the result that I want:
>
> ['Data', 'sparseness', 'is', 'an', 'inherent', 'problem', 'in',
> 'statistical', 'methods', 'for', 'natural', 'language', 'processing', '.']
import re

exp_token = re.compile(r"""
    ([-a-zA-Z0-9_]+     # a token can be a word
    |[\"\'.,;:!\?])     # *or* a single character from this character set
    """, re.VERBOSE)    # VERBOSE ignores the whitespace and comments in the pattern
The two character sets will need fine-tuning for your own text, for example if
other punctuation marks should count as tokens.
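
As a quick check, here is a minimal sketch that runs the pattern on your
sample sentence. The sentence is reconstructed from the token list you quoted,
so treat it as an assumption, and the name `sentence` is only my choice for
the illustration (it also avoids shadowing the built-in `str`). It is written
for a current Python 3. Because the pattern contains exactly one group,
findall() returns the text matched by that group.

```python
import re

exp_token = re.compile(r"""
    ([-a-zA-Z0-9_]+     # a token can be a word
    |[\"\'.,;:!\?])     # *or* a single punctuation character
    """, re.VERBOSE)

# Sample sentence reconstructed from the tokens in the question (an assumption).
sentence = ("Data sparseness is an inherent problem in statistical "
            "methods for natural language processing.")

tokens = exp_token.findall(sentence)
print(tokens)
```

The final full stop now comes out as a token of its own, which is the
difference from the plain "[a-zA-Z]+" pattern.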
>
> Another question
>
> when you are reading text from a file, is it really necessary to scan it
> with a while loop, or is the following enough, followed by a loop to
> manipulate the result? What is the real difference?
>
> infile = open('file.txt')
> buffer = infile.readline()
Neither is required. In recent versions of Python you can iterate over the
file object directly (in older versions a while loop around readline() is
the correct approach):
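for line in open('file.txt'):
    pass  # process line

# Python 2.2 and later also accept file() as another spelling of open().
# Note that file() was removed in Python 3, so open() is the portable choice.
for line in file('file.txt'):
    pass  # process line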
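
To show that the while loop and the for loop do the same work, here is a
small sketch that reads a scratch file both ways and compares the results.
The file contents and the names `lines_while` and `lines_for` are made up for
the example, and it assumes a current Python 3.

```python
import os
import tempfile

# Create a small scratch file (hypothetical contents, just for the demo).
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\nthird line\n")

# Old style: call readline() until it returns an empty string (end of file).
lines_while = []
infile = open(path)
line = infile.readline()
while line:
    lines_while.append(line)
    line = infile.readline()
infile.close()

# Iterator style: the file object yields one line per pass through the loop.
lines_for = []
with open(path) as infile:
    for line in infile:
        lines_for.append(line)

print(lines_while == lines_for)
```

Both loops see the same lines, each still ending in its newline character.
The for loop is simply shorter and harder to get wrong.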
readline() reads one line of the file, read() returns the whole file as a
single string, and readlines() returns the whole file as a list of lines.
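
A minimal sketch of the three calls side by side, assuming a current Python 3
and a throwaway two-line file whose name and contents are invented for the
illustration:

```python
import os
import tempfile

# Throwaway file with two lines (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("alpha\nbeta\n")

with open(path) as f:
    one_line = f.readline()     # only the first line, newline included

with open(path) as f:
    whole_text = f.read()       # the entire file as one string

with open(path) as f:
    all_lines = f.readlines()   # the entire file as a list of lines

print(repr(one_line))
print(repr(whole_text))
print(all_lines)
```

For a large file, read() and readlines() pull everything into memory at once,
while readline() and the for loop handle one line at a time.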
Michael