[melbourne-pug] Code review comments for next issue of The Python Papers

Tennessee Leeuwenburg tleeuwenburg at gmail.com
Fri Jun 8 04:29:10 CEST 2007


Thanks Maurice and John for your comments. Let's see if we can turn some of
these into feature requests, and I'll go ahead and try to meet them.

1) As Maurice and John both identified, words are identified only by using
split(). This results in punctuation forming a part of the words under
consideration. Especially in the case of the full stop at the end of a
sentence, this appears to be rather less than ideal. Hyphenation, on the
other hand, is something that I can see pros and cons for. It may be of
interest to note that a word has gained a hyphen, or it may be deemed
irrelevant. Ditto capitalisation.

More flexible tokenising of each sentence looks like a worthwhile
enhancement. This could be done with regular expressions, or some other
mechanism. Does the list have any recommendations? What requirements do we
have to meet?
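
One possible shape for a regular-expression tokeniser is sketched below. It
is not the code under review: the name tokenise() and the fold_case and
split_hyphens options are made up for illustration, and the pattern only
handles ASCII letters and digits. Hyphenation and capitalisation are left as
switches because the thread has not settled whether they matter.

```python
import re

# A word is a run of letters/digits, optionally joined by internal
# hyphens or apostrophes. ASCII only: an assumption for this sketch.
_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenise(sentence, fold_case=False, split_hyphens=False):
    """Return the words in `sentence` without surrounding punctuation."""
    words = _WORD.findall(sentence)
    if split_hyphens:
        # Treat "twenty-three" as the two words "twenty" and "three".
        words = [part for word in words for part in word.split("-")]
    if fold_case:
        # Treat "Envy" and "envy" as the same word.
        words = [word.lower() for word in words]
    return words


print(tokenise("Tom ate an apple."))
print(tokenise("Tom is twenty-three years old", split_hyphens=True))
print(tokenise("Sloth, envy and pride are ...", fold_case=True))
```

With the defaults, the full stop no longer sticks to "apple", while
"twenty-three" stays a single token unless split_hyphens is set.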

2) At the moment, typos are not treated any differently. The system which
actually uses this code doesn't make typos, because its input is generated
automatically. What behaviour is desirable for typos? Should they be
highlighted (as grammatically/syntactically important) or ignored (as
semantically identical)?
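
Whichever behaviour is chosen, the code first has to recognise a likely
typo. A minimal sketch, assuming a similarity ratio from the standard
library's difflib is good enough: the function name looks_like_typo() and
the 0.8 threshold are assumptions, not anything from the code under review,
and the threshold would need tuning on real text.

```python
import difflib


def looks_like_typo(old_word, new_word, threshold=0.8):
    """Guess whether two different words are near-identical spellings.

    The threshold is an arbitrary starting point, not a tested value.
    """
    if old_word == new_word:
        return False
    ratio = difflib.SequenceMatcher(None, old_word, new_word).ratio()
    return ratio >= threshold


print(looks_like_typo("cheese", "cheesw"))  # John's example from the thread
print(looks_like_typo("apple", "orange"))
```

A caller could then either flag such pairs specially or treat them as
unchanged, depending on the answer to the question above.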

3) Cleanups: blank lines before else -- I haven't coded to any particular
style standard. What do people recommend? I believe there is a PEP covering
this, but I am not certain. Unnecessary use of strip() -- probably worth
getting rid of these to make the code clearer. "if not thing is None" -- a
habit I picked up. Something was broken once, and I wondered whether
"is not None" worked differently from my expectations, so I never quite went
back. I should clear this out if it makes no difference.
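
Both cleanups can be checked directly. The short script below uses made-up
sample values and simply asserts that strip() changes nothing after a bare
split(), and that the two spellings of the None test agree.

```python
# str.split() with no arguments splits on runs of whitespace and never
# returns tokens with leading or trailing whitespace, so strip() is a no-op.
tokens = "  Tom   ate an\tapple. \n".split()
assert tokens == [token.strip() for token in tokens]

# "not thing is None" parses as "not (thing is None)", which is the same
# test as "thing is not None" for every value, including falsy ones.
for thing in (None, 0, "", [], "word"):
    assert (not thing is None) == (thing is not None)

print(tokens)
```

So the change to "is not None" is purely stylistic. Note that this only
holds for split() called without a separator argument.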

4) Tree structure -- more comments should be added. isinstance(node, str) --
indeed, what about unicode? In Python 2.5, is a unicode string a str? I'll
have to research this to make sure.
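
For reference: in Python 2.x a unicode object is not an instance of str, so
isinstance(u"word", str) is False. Both types derive from basestring, and
testing against that covers either kind of string. A small sketch follows;
the helper name is_text_node() is hypothetical, and the try/except is only
there so the same snippet also runs on later Python versions where
basestring no longer exists.

```python
try:
    string_types = (basestring,)  # Python 2: common base of str and unicode
except NameError:
    string_types = (str,)         # Later versions: str is the only text type


def is_text_node(node):
    """Return True if `node` is any kind of string (hypothetical helper)."""
    return isinstance(node, string_types)


print(is_text_node("word"))
print(is_text_node(None))
```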

5) Testing. I'm not familiar with unit testing frameworks. The best thing
would probably be to identify some kind of preferred testing framework and
write a better set of formal tests. Any suggestions?
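
As a starting point, the standard library's unittest module needs nothing
extra installed. The sketch below turns some of the sentence pairs from this
thread into test cases. The tokenise() function here is a hypothetical
stand-in for the code under test, not its real interface, and the expected
results encode one possible answer to the hyphenation question.

```python
import re
import unittest


def tokenise(sentence):
    """Hypothetical stand-in for the code under test."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)


class TokeniseTests(unittest.TestCase):

    def test_full_stop_is_not_part_of_word(self):
        self.assertEqual(tokenise("Tom ate an apple."),
                         ["Tom", "ate", "an", "apple"])

    def test_appended_words_leave_original_words_unchanged(self):
        # Maurice's example: "apple." must not look like a substitution.
        original = tokenise("Tom ate an apple.")
        new = tokenise("Tom ate an apple and an orange.")
        self.assertEqual(new[:len(original)], original)

    def test_hyphenated_word_is_one_token(self):
        self.assertEqual(tokenise("twenty-three years"),
                         ["twenty-three", "years"])


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(TokeniseTests)
    return unittest.TextTestRunner(verbosity=1).run(suite)


result = run_tests()
print(result.wasSuccessful())
```

The standard library also ships doctest, which may suit the existing
commented sentence pairs if they are moved into docstrings.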

Cheers,
-T

On 6/8/07, John Machin <sjmachin at lexicon.net> wrote:
>
> On 7/06/2007 11:38 PM, Maurice Ling wrote:
> > Hi Tennessee,
> >
> > Given my background in text analysis, I can't help but wonder 2 main
> > issues which are essentially word tokenization problems:
> >
> > 1. How are the words identified? By whitespaces? If so, then there is a
> > false removal (substitution) in this case:
> > original: Tom ate an apple.
> > new: Tom ate an apple and an orange.
> >
> > 2. Hyphenations etc? For example, "Tom is twenty-three years old this
> > year" and "Tom is twenty three years old this year".
> >
>
> Capitalisation is another problem:
> original: Envy and pride are ...
> new: Sloth, envy and pride are ...
>
> Comments say "words are atomic": what about typos? stuff cheesw?
>
> At the Python level -- based on [possibly incorrect] recollections from
> reading it yesterday; detailed dissection later :-)
>
> 1. tokens produced by str.split() don't need str.strip() applied to them
>
> 2. blank lines in unexpected places e.g. before else:
>
> 3. "if not thing is None" -- syntactically correct but stylistically
> chundrous IMHO; what's wrong with "if thing is not None"?
>
> 4. put in comments that explain your tree structure, or at the very
> least position the tree-creating method(s) before the tree-examining
> method(s) -- save gentle readers the need to nut out the meaning of:
>      node is None
>      node == ""
>      isinstance(node, str) # what about unicode?
>      node is none of the above
>
> 5. Testing/example architecture could be a bit more robust than a
> collection of commented pairs of sentences down the end.
>
> Cheers,
> John