Parser Generator?

Paul McGuire ptmcg at austin.rr.com
Mon Aug 27 08:40:09 EDT 2007


On Aug 26, 10:48 pm, Steven Bethard <steven.beth... at gmail.com> wrote:
> Paul McGuire wrote:
> > On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw... at ginstrom.com> wrote:
> >> The only caveat being that since Chinese and Japanese scripts don't
> >> typically delimit "words" with spaces, I think you'd have to pass the text
> >> through a tokenizer (like ChaSen for Japanese) before using PyParsing.
>
> > Did you think pyparsing is so mundane as to require spaces between
> > tokens?  Pyparsing has been doing this type of token-recognition since
> > Day 1.  Looking for tokens without delimiting spaces was one of the
> > first applications for pyparsing.  This issue is not unique to Chinese
> > or Japanese text.  Pyparsing will easily find the tokens in this
> > string:
>
> > y=a*x**2+b*x+c
>
> > as
>
> > ['y','=','a','*','x','**','2','+','b','*','x','+','c']
>
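
For the record, that tokenizer needs nothing but pyparsing's stock
Word and oneOf helpers; here is a minimal sketch (oneOf tries its
longest alternatives first, so '**' wins over '*'):

from pyparsing import Word, alphas, nums, oneOf, OneOrMore

# identifiers, integers, and operators - no whitespace required
identifier = Word(alphas)
integer = Word(nums)
operator = oneOf("** * / + - =")
term = identifier | integer | operator

print OneOrMore(term).parseString("y=a*x**2+b*x+c")
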
> The difference is that in the expression above (and in many other
> tokenization problems) you can determine "word" boundaries by looking at
> the class of character, e.g. alphanumeric vs. punctuation vs. whatever.
>
> In Japanese and Chinese tokenization, word boundaries are not marked by
> different classes of characters. They only exist in the mind of the
> reader who knows which sequences of characters could be words given the
> context, and which sequences of characters couldn't.
>
> The closest analog would be to ask pyparsing to find the words in the
> following sentence:
>
> ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.
>
> Most approaches that have been even marginally successful on these kinds
> of tasks have used statistical machine learning approaches.
>
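The statistical segmenters score alternative segmentations of the
input and keep the best one.  Here is a toy sketch of that idea,
using a unigram model and dynamic programming - every word and
probability below is invented purely for illustration:

import math

# hypothetical unigram probabilities - a real system would estimate
# these from a corpus
unigram = {'the': 0.04, 'in': 0.02, 'to': 0.03, 'into': 0.005}

def segment(text):
    # best[i] = (log-probability, word list) for text[:i]
    best = [(0.0, [])] + [(None, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(i):
            word = text[j:i]
            if word in unigram and best[j][0] is not None:
                score = best[j][0] + math.log(unigram[word])
                if best[i][0] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]

print segment('intothe')    # -> ['into', 'the']
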
> STeVe

Steve -

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
    'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
    'that', 'in', 'python', 'library', 'provides', 'code', 'to']

# caseless=True lets the capitalized "Python" in the input match
# the lowercase vocabulary entry
knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

# the test sentence, with all of its whitespace removed
mush = ("Thepyparsingmoduleprovidesalibraryofclassesthatclient"
        "codeusestoconstructthegrammardirectlyinPythoncode.")

print sentence.parseString( mush )

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost exactly the scheme that Zhpy uses for
extracting Chinese versions of Python keywords and mapping them back
to English/Latin words.  Of course, it is not practical for natural
language processing, since the vocabulary gets too large.  You can
also get ambiguous matches: with a vocabulary containing the words
['in', 'to', 'into'], the run-together "into" will always be read as
"into", and never as "in to".  Fortunately (for pyparsing), your
example was friendly enough to avoid such ambiguities.  But if you
can select a suitable vocabulary, even a run-on mush is parseable.
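
You can see that longest-match behavior in a minimal sketch with just
that troublesome vocabulary:

from pyparsing import oneOf, OneOrMore

# oneOf reorders its alternatives longest-first, so the run-together
# "into" always beats "in" followed by "to"
vocab = OneOrMore( oneOf(['in', 'to', 'into']) )

print vocab.parseString("into")     # -> ['into'], never ['in', 'to']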

-- Paul




