Parser Generator?

Steven Bethard steven.bethard at gmail.com
Mon Aug 27 10:36:52 EDT 2007


Paul McGuire wrote:
> On Aug 26, 10:48 pm, Steven Bethard <steven.beth... at gmail.com> wrote:
>> In Japanese and Chinese tokenization, word boundaries are not marked by
>> different classes of characters. They only exist in the mind of the
>> reader who knows which sequences of characters could be words given the
>> context, and which sequences of characters couldn't.
>>
>> The closest analog would be to ask pyparsing to find the words in the
>> following sentence:
>>
>> ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.
>>
>> Most approaches that have been even marginally successful on these kinds
>> of tasks have used statistical machine learning approaches.
> 
> You mean like this?
> 
> from pyparsing import *
> 
> knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
>     'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
>     'that', 'in', 'python', 'library', 'provides', 'code', 'to']
> 
> knownWord = oneOf( knownWords, caseless=True )
> sentence = OneOrMore( knownWord ) + "."
> 
> mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."
> 
> print sentence.parseString( mush )
> 
> prints:
> 
> ['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
> 'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
> 'the', 'grammar', 'directly', 'in', 'python', 'code', '.']
> 
> In fact, this is almost the exact scheme used by Zhpy for extracting
> Chinese versions of Python keywords, and mapping them back to English/
> Latin words.  Of course, this is not practical for natural language
> processing, as the vocabulary gets too large. And you can get
> ambiguous matches, such as a vocabulary containing the words ['in',
> 'to', 'into'] - the run-together "into" will always be assumed to be
> "into", and never "in to".

Yep, and these kinds of ambiguities occur quite frequently in Chinese and 
Japanese text. The point was not that pyparsing couldn't do it for a small 
subset of characters/words, but that pyparsing is probably not the right 
solution for general-purpose Japanese/Chinese tokenization.
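
For what it's worth, here is a minimal sketch of what the alternative looks
like (this is not from the thread; the vocabulary and counts below are
invented for illustration): enumerate every segmentation the vocabulary
allows, then let a toy unigram model score the candidates, rather than
letting a greedy longest match decree that "into" can never be "in to".

from math import log

# Toy unigram counts standing in for a real corpus; the words and
# numbers are made up purely for illustration.
counts = {'going': 50, 'in': 120, 'to': 150, 'into': 100}
total = sum(counts.values())

def segmentations(text, vocab):
    """Enumerate every way to split `text` into known vocabulary words."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [word] + rest

def score(words):
    """Log-probability of a segmentation under the toy unigram model."""
    return sum(log(counts[w] / float(total)) for w in words)

candidates = list(segmentations("goinginto", set(counts)))
print(candidates)
# [['going', 'in', 'to'], ['going', 'into']]

print(max(candidates, key=score))
# ['going', 'into'] with these toy counts; unlike a greedy longest match,
# the model at least weighs both readings, and a richer statistical model
# can use context to decide between them.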

Steve


