Parser Generator?
Steven Bethard
steven.bethard at gmail.com
Sun Aug 26 23:48:30 EDT 2007
Paul McGuire wrote:
> On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw... at ginstrom.com> wrote:
>> The only caveat being that since Chinese and Japanese scripts don't
>> typically delimit "words" with spaces, I think you'd have to pass the text
>> through a tokenizer (like ChaSen for Japanese) before using PyParsing.
>
> Did you think pyparsing is so mundane as to require spaces between
> tokens? Pyparsing has been doing this type of token-recognition since
> Day 1. Looking for tokens without delimiting spaces was one of the
> first applications for pyparsing. This issue is not unique to Chinese
> or Japanese text. Pyparsing will easily find the tokens in this
> string:
>
> y=a*x**2+b*x+c
>
> as
>
> ['y','=','a','*','x','**','2','+','b','*','x','+','c']
The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.
In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.
The closest analog would be to ask pyparsing to find the words in the
following sentence:
ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.
Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.
STeVe
More information about the Python-list
mailing list