[melbourne-pug] Code review comments for next issue of The Python Papers
John Machin
sjmachin at lexicon.net
Fri Jun 8 00:39:54 CEST 2007
On 7/06/2007 11:38 PM, Maurice Ling wrote:
> Hi Tennessee,
>
> Given my background in text analysis, I can't help but wonder about 2 main
> issues, which are essentially word tokenization problems:
>
> 1. How are the words identified? By whitespace? If so, then there is a
> false removal (substitution) in this case:
> original: Tom ate an apple.
> new: Tom ate an apple and an orange.
>
> 2. Hyphenations etc? For example, "Tom is twenty-three years old this
> year" and "Tom is twenty three years old this year".
>
Capitalisation is another problem:
original: Envy and pride are ...
new: Sloth, envy and pride are ...
Comments say "words are atomic": what about typos? stuff cheesw?
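To make the tokenization cases above concrete, here is a rough sketch. It is not the code under review: whitespace_tokens and normalised_tokens are hypothetical helpers, and lower-casing plus splitting on anything non-alphanumeric is just one possible normalisation, assumed here for illustration.

```python
import re

def whitespace_tokens(text):
    # Hypothetical helper: plain whitespace tokenization.
    return text.split()

def normalised_tokens(text):
    # Hypothetical helper: lower-case, then keep runs of letters/digits,
    # so punctuation and hyphens act as separators.
    return re.findall(r"[a-z0-9]+", text.lower())

old = "Tom ate an apple."
new = "Tom ate an apple and an orange."

# With whitespace tokens, "apple." (with the full stop) is not in the new
# sentence, so a word-level diff would report it as removed.
print([t for t in whitespace_tokens(old) if t not in whitespace_tokens(new)])

# After normalisation every old token is still present in the new sentence.
print([t for t in normalised_tokens(old) if t not in normalised_tokens(new)])

# Hyphenation and capitalisation cases compare equal once normalised.
print(normalised_tokens("Tom is twenty-three years old")
      == normalised_tokens("Tom is twenty three years old"))
print(normalised_tokens("Envy and pride")[0]
      == normalised_tokens("Sloth, envy and pride")[1])
```

Normalising like this loses information (case, punctuation), so whether it is appropriate depends on what the diff is meant to report.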
At the Python level -- based on [possibly incorrect] recollections from
reading it yesterday; detailed dissection later :-)
1. tokens produced by str.split() don't need str.strip() applied to them
2. blank lines in unexpected places e.g. before else:
3. "if not thing is None" -- syntactically correct but stylistically
chundrous IMHO; what's wrong with "if thing is not None"?
4. put in comments that explain your tree structure, or at the very
least position the tree-creating method(s) before the tree-examining
method(s) -- save gentle readers the need to nut out the meaning of:
    node is None
    node == ""
    isinstance(node, str) # what about unicode?
    node is none of the above
5. Testing/example architecture could be a bit more robust than a
collection of commented pairs of sentences down the end.
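A small illustration of points 1 and 3, using made-up data rather than the code under review (describe is a hypothetical function). Note that point 1 only holds for str.split() called with no argument; with an explicit separator such as split(" "), tokens can still contain tabs or newlines.

```python
line = "  Tom ate\tan apple.  \n"

# Point 1: no-argument split() splits on runs of whitespace and discards
# leading/trailing whitespace, so strip() on each token changes nothing.
tokens = line.split()
print(tokens)
print(all(t == t.strip() for t in tokens))

# Point 3: "not thing is None" parses as "not (thing is None)", which is
# the same test as "thing is not None"; the latter just reads better.
def describe(thing):
    if thing is not None:
        return "present"
    return "absent"

for candidate in (None, 0, "", []):
    print(repr(candidate), describe(candidate))
```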
Cheers,
John