[melbourne-pug] Code review comments for next issue of The Python Papers

Tue Jun 26 03:06:46 CEST 2007

Hi Anthony and others,

I have taken many, but not all of your comments on board. Anthony, I
foolishly started making changes before looking at the line numbers you were
referring to, and now I can't recall. I was hoping you wouldn't mind
reviewing the changes briefly and letting me know what still needs to be
done.

I have committed the additional file LCSTest.py, which uses the unittest
module to add some test cases. In the end, I chose unittest because it ships
with the standard Python interpreter, so readers can minimise the number of
new modules they need to consider.

I have accepted a tokenize function in the initialise method. The regular
expression suggested did not work as hoped, which could be replicated by
testing custom tokenize function against the unit tests (I just tried it in
place then reverted).

One of the ugliest things is that basestring has no split method.
Str.splitdoesn't work on unicode strings, and vice versa. If anyone
can suggest an
elegant way to handle a default constructor which will use the appropriate
split method depending on the string types, that would be much appreciated.

In any case, sorry if some of your changes didn't get in and/or aren't
mentioned here. Please remind me and I will get to them. I'm just so busy
that it's hard to be totally careful about going through everyone's
suggestions.

Cheers,
-T

On 6/8/07, Anthony Briggs <abriggs at westnet.com.au> wrote:
>
> On Fri, Jun 08, 2007 at 12:29:10PM +1000, Tennessee Leeuwenburg wrote:
> > Thanks Maurice and John for your comments. Let's see if we can turn some
> of
> > these into feature requests, and I'll go ahead and try to meet them.
> >
> > 1) As Maurice and John both identified, words are identified only by
> using
> > split(). This results in punctuation forming a part of the words under
> > consideration. Especially in the case of the full stop at the end of a
> > sentence, this appears to be rather less than ideal.
>
> Yep. Removing punctuation (.,!?;: at the least) would be a good idea.
>
> > Hyphenation, on the
> > other hand, is something that I can see pros and cons for. It may be of
> > interest to note that a word has gained a hyphen, or it may be deemed
> > irrelevant. Ditto capitalisation.
>
> Or the hyphenated/capitalised version may be completely different. Two
> examples that I can think of are if you're parsing names (Smith vs.
> Jones vs. Smith-Jones) or computer languages (FOO vs. Foo vs. foo).
>
> > It would appear that it would be a great enchancement to allow more
> > flexibility in the tokenising of each sentence. This could be done with
> > regular expressions, or some other mechanism. Does the list have any
> > recommendations? What are the requirements that we have to meet?
>
> One option is to go for a functional approach, and allow the
> specification of a different tokeniser when initialising your class,
> with a reasonable default. If you really wanted regular expressions,
> then you could do it that way within the function.
>
>
> > 2) At the moment, typos are not treated any differently. The system
> which
> > actually uses this code doesn't make typos, it's generated
> automatically.
>
> So it does make typos, but only if the programmers do ;)
>
> > What behaviour is desireable for typos? Should they be highlighted (as
> > grammatically/syntactically important) or ignored (as semantically
> > identical)?
>
> Hmm, typos. You could spend all day trying to fix them. See
> http://www.google.com/jobs/britney.html for a good idea of the sort of
> input you can expect -- I'd tackle most of your other issues first,
> unless you can find an easy way to tie into an existing spell checker.
>
>
> > 3) Cleanups: blank lines before else -- I haven't coded to any
> particular
> > style standard. What do people recommend? I believe there is a PEP
> covering
> > this, but I am not certain.
>
> Yeah, my coding style is normally PEP 8 + common sense, eg. extra blank
> lines are OK as paragraphs if they help you figure out your code later
> on, ie. you're splitting up separate things that would otherwise run
> together and make reading hard.
>
> > If not is None -- A habit I
> > picked up. Something was broken once, and I had wondered if "is not
> None"
> > worked differently to my expectations, and so I've never quite gone
> back. I
> > should clear this out if it makes no difference.
>
> Yeah, probably a good idea. The easy way to resolve questions like that
> is to fire up the Python interpreter and try it out.
>
> > 4) Tree structure -- more comments should be added. isinstance(node,
> str) --
> > indeed, what about unicode? In Python 2.5, is a unicode string a str?
> I'll
> > have to research this to make sure.
>
> >>> foo = u"foo"
> >>> foo
> u'foo'
> >>> isinstance(foo, str)
> False
> >>> isinstance(foo, unicode)
> True
>
> A few other points - these are stylistic though, which I'm not sure is
> what you want, but anyway:
>
>         if lcs != "":
>             myString = myString + lcs
>
> is a no-op as far as I can tell. Since you only use it the once, you
> probably also don't need the 'lcs = self.lcs' part either.
>
> You've also got a couple of places where you're comparing the left side
> of the tree and then the right side of the tree. For example,
>
>         if not lTree is None:
>             if isinstance(lTree, str):
>                 if lTree is not "":
>                     myString +=  " (added %s) " %  (lTree)
>             else:
>                 myString += lTree.lString()
>         else:
>             if DEBUG: print 'lTree is None'
>
> and the other rTree one could become something like:
>
>         myString += self.parseTree(lTree, 'added')
>         myString += self.parseTree(rTree, 'removed')
>
> with self.parseTree being something like:
>
>     def parseTree(self, tree, mechanism):
>         """ Recursive function for parsing a tree """
>         if tree is None:
>             # no branch
>             return ""
>         if isinstance(tree, str) or isinstance(tree, unicode):
>             # leaf node
>             return " (%s %s) " % (mechanism, tree)
>         # branch
>         return tree.lString()
>
> Similarly for the other tree building/exploring functions (lines 109,
> 116).
>
> Other picky code style type things:
>
> On 76 + 77, you set lTree and rTree, even though all three branches set
> them anyway.
>
> You seem to be using a fair few placeholder methods, and then not using
> them, eg. string1 and string2 on lines 79 and 80.
>
> When you're comparing string1 and string2, you might benefit (in terms
> of clarity of code) from returning early. eg.
>
>         if string1 == "":
>             self.lTree = ""
>             self.rTree = self.string2
>             self.lcs = ""
>             return
>
>         if string2 == "":
>             ...
>
> And you seem to be running if statements onto one line, which I find
> makes things harder to read, eg.
>
>                     if v > longest:longest = v
>                     if v == longest: LCS = words1[i - v+1:i+1]
>
> would (IMO) be better as:
>
>                     if v > longest:
>                         longest = v
>                         LCS = words1[i - v+1:i+1]
>
> > 5) Testing. I'm not familiar with unit testing frameworks. The best
> thing
> > would probably be to identify some kind of preferred testing framework
> and
> > write a better set of formal tests. Any suggestions?
>
> py.test is my personal favorite - it's a bit less class heavy and easier
> to work with than unittest.
>
>
> >
> > Cheers,
> > -T
> >
> > On 6/8/07, John Machin <sjmachin at lexicon.net> wrote:
> > >
> > >On 7/06/2007 11:38 PM, Maurice Ling wrote:
> > >> Hi Tennessee,
> > >>
> > >> Given my background in text analysis, I can't help but wonder 2 main
> > >> issues which are essentially word tokenization problems:
> > >>
> > >> 1. How are the words identified? By whitespaces? If so, then there is
> a
> > >> false removal (substitution) in this case:
> > >> original: Tom ate an apple.
> > >> new: Tom ate an apple and an orange.
> > >>
> > >> 2. Hyphenations etc? For example, "Tom is twenty-three years old this
> > >> year" and "Tom is twenty three years old this year".
> > >>
> > >
> > >Capitalisation is another problem:
> > >original: Envy and pride are ...
> > >new: Sloth, envy and pride are ...
> > >
> > >Comments say "words are atomic": what about typos? stuff cheesw?
> > >
> > >At the Python level -- based on [possibly incorrect] recollections from
> > >reading it yesterday; detailed dissection later :-)
> > >
> > >1. tokens produced by str.split() don't need str.strip() applied to
> them
> > >
> > >2. blank lines in unexpected places e.g. before else:
> > >
> > >3. "if not thing is None" -- syntactically correct but stylistically
> > >chundrous IMHO; what's wrong with "if thing is not None"?
> > >
> > >4. put in comments that explain your tree structure, or at the very
> > >least position the tree-creating method(s) before the tree-examining
> > >method(s) -- save gentle readers the need to nut out the meaning of:
> > >     node is None
> > >     node == ""
> > >     isinstance(node, str) # what about unicode?
> > >     node is none of the above
> > >
> > >5. Testing/example architecture could be a bit more robust than a
> > >collection of commented pairs of sentences down the end.
> > >
> > >Cheers,
> > >John
> > >_______________________________________________
> > >melbourne-pug mailing list
> > >melbourne-pug at python.org
> > >http://mail.python.org/mailman/listinfo/melbourne-pug
> > >
>
> > _______________________________________________
> > melbourne-pug mailing list
> > melbourne-pug at python.org
> > http://mail.python.org/mailman/listinfo/melbourne-pug
>
>
> --
> ------------------------------------------------------
> HyPerACtIVe?! HEY, Who ArE yoU cAllInG HYPERaCTive?!
> abriggs at westnet.com.au
> ------------------------------------------------------
> _______________________________________________
> melbourne-pug mailing list
> melbourne-pug at python.org
> http://mail.python.org/mailman/listinfo/melbourne-pug
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/melbourne-pug/attachments/20070626/8f5ae9b1/attachment.htm