[melbourne-pug] Code review comments for next issue of The Python Papers

Fri Jun 8 00:39:54 CEST 2007

On 7/06/2007 11:38 PM, Maurice Ling wrote:
> Hi Tennessee,
> 
> Given my background in text analysis, I can't help but wonder about 2 main 
> issues, which are essentially word tokenization problems:
> 
> 1. How are the words identified? By whitespace? If so, then there is a 
> false removal (substitution) in this case:
> original: Tom ate an apple.
> new: Tom ate an apple and an orange.
> 
> 2. Hyphenation etc.? For example, "Tom is twenty-three years old this 
> year" and "Tom is twenty three years old this year".
> 

Capitalisation is another problem:
original: Envy and pride are ...
new: Sloth, envy and pride are ...

The comments say "words are atomic": what about typos? Stuff like "cheesw"?
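
To make the word-level concerns above concrete, here is a minimal sketch. It does not use the code under review: difflib.SequenceMatcher stands in for the diff step, and whitespace_tokens, word_tokens and diff_tags are hypothetical names made up for illustration. It assumes a current Python 3.

```python
import re
from difflib import SequenceMatcher


def whitespace_tokens(text):
    """Split on runs of whitespace only; punctuation stays glued to words."""
    return text.split()


def word_tokens(text, fold_case=False):
    """Hypothetical alternative: hyphenated words stay whole, and each
    punctuation mark becomes its own token."""
    if fold_case:
        text = text.lower()
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def diff_tags(old_tokens, new_tokens):
    """Edit tags from difflib, used here as a stand-in for the reviewed code."""
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    return [tag for tag, *_ in matcher.get_opcodes()]


old, new = "Tom ate an apple.", "Tom ate an apple and an orange."

# "apple." and "apple" are different tokens, so a pure addition is
# reported as a substitution.
print(diff_tags(whitespace_tokens(old), whitespace_tokens(new)))
# ['equal', 'replace']

# With punctuation split off, the same change shows up as an insertion.
print(diff_tags(word_tokens(old), word_tokens(new)))
# ['equal', 'insert', 'equal']

# Capitalisation: "Envy" and "envy" only match once case is folded.
old2, new2 = "Envy and pride", "Sloth, envy and pride"
print(diff_tags(word_tokens(old2), word_tokens(new2)))
# ['replace', 'equal']
print(diff_tags(word_tokens(old2, fold_case=True),
                word_tokens(new2, fold_case=True)))
# ['insert', 'equal']
```

Neither tokenizer settles the hyphenation question: "twenty-three" is one token and "twenty three" is two under both, and a typo such as "cheesw" is still just a different atomic word.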

At the Python level -- based on [possibly incorrect] recollections from 
reading it yesterday; detailed dissection later :-)

1. tokens produced by str.split() don't need str.strip() applied to them

2. blank lines in unexpected places, e.g. before "else:"

3. "if not thing is None" -- syntactically correct but stylistically 
chundrous IMHO; what's wrong with "if thing is not None"?

4. put in comments that explain your tree structure, or at the very 
least position the tree-creating method(s) before the tree-examining 
method(s) -- save gentle readers the need to nut out the meaning of:
     node is None
     node == ""
     isinstance(node, str) # what about unicode?
     node is none of the above

5. Testing/example architecture could be a bit more robust than a 
collection of commented pairs of sentences down the end.
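
Points 1 and 3 are easy to check directly, and writing the check as a small unittest case also hints at what point 5 is asking for. This is a minimal sketch, not taken from the code under review; the class and method names are hypothetical, and it assumes a current Python 3.

```python
import unittest


class ReviewPointsTest(unittest.TestCase):
    """Hypothetical test case; the names are illustrative only."""

    def test_split_tokens_need_no_strip(self):
        # Point 1: with no argument, str.split() splits on runs of
        # whitespace and discards leading/trailing whitespace, so
        # strip() has nothing left to remove from any token.
        line = "  Tom ate\tan  apple \n"
        tokens = line.split()
        self.assertEqual(tokens, ["Tom", "ate", "an", "apple"])
        for token in tokens:
            self.assertEqual(token, token.strip())

    def test_split_with_separator_is_different(self):
        # Caveat: an explicit separator changes the behaviour, so
        # point 1 applies only to the argument-less form.
        self.assertEqual(" a\n".split(" "), ["", "a\n"])

    def test_not_is_none_equals_is_not_none(self):
        # Point 3: "not thing is None" parses as "not (thing is None)",
        # which always agrees with the clearer "thing is not None".
        for thing in (None, 0, "", [], "word"):
            self.assertEqual(not thing is None, thing is not None)
```

Saved as a module, this can be run with "python -m unittest". The same shape would suit the sentence pairs: one test method per pair, with the expected result asserted rather than left in a comment.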

Cheers,
John