Simple Text Processing Help

Tue Oct 16 04:14:44 EDT 2007

patrick.waldo wrote:

> manipulation?  Also, I conceptually get it, but would you mind walking
> me through

>> for key, group in groupby(instream, unicode.isspace):
>>     if not key:
>>         yield "".join(group)

itertools.groupby() splits a sequence into groups of consecutive items that
share the same key; e.g. to group names by their first letter you'd do the
following:

>>> from itertools import groupby
>>> def first_letter(s): return s[:1]
... 
>>> for key, group in groupby(["Anne", "Andrew", "Bill", "Brett", "Alex"], first_letter):
...     print "--- %s ---" % key
...     for item in group:
...         print item
... 
--- A ---
Anne
Andrew
--- B ---
Bill
Brett
--- A ---
Alex

Note that there are two groups with the initial "A"; groupby() only puts
consecutive items with equal keys into the same group.
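
If you want a single group per initial instead, the usual approach is to
sort by the same key before grouping. A minimal sketch, written in Python 3
syntax (print() as a function) rather than the Python 2 of the sessions
here; the names `names` and `grouped` are made up for illustration:

```python
from itertools import groupby

def first_letter(s):
    return s[:1]

names = ["Anne", "Andrew", "Bill", "Brett", "Alex"]

# sorted() is stable, so names keep their original order within a group
grouped = [(key, list(group))
           for key, group in groupby(sorted(names, key=first_letter), first_letter)]

for key, members in grouped:
    print("--- %s ---" % key)
    for item in members:
        print(item)
```

With the sorted input "Alex" ends up in the same group as "Anne" and
"Andrew".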

In your case the sequence is the lines in the file, converted to unicode
strings, and the key is a boolean indicating whether the line consists
entirely of whitespace,

>>> u"\n".isspace()
True
>>> u"alpha\n".isspace()
False

but I call it slightly differently, as an unbound method:

>>> unicode.isspace(u"alpha\n")
False

This is only possible because all items in the sequence are known to be
unicode instances; a byte string would raise a TypeError. So far we have,
using a list instead of a file:

>>> instream = [u"alpha\n", u"beta\n", u"\n", u"gamma\n", u"\n",  u"\n", u"delta\n"]
>>> for key, group in groupby(instream, unicode.isspace):
...     print "--- %s ---" % key
...     for item in group:
...         print repr(item)
... 
--- False ---
u'alpha\n'
u'beta\n'
--- True ---
u'\n'
--- False ---
u'gamma\n'
--- True ---
u'\n'
u'\n'
--- False ---
u'delta\n'

As you see, groups with real data alternate with groups that contain only
blank lines, and the key for the latter is True, so we can skip them with

if not key:  # it's not a separator group
    yield group
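
One caveat with yielding the group as-is: each group is a lazy iterator that
shares the underlying data with groupby(), so its items are no longer
available once groupby() has moved on to the next group. A small sketch of
the effect, in Python 3 syntax (so str.isspace stands in for
unicode.isspace); `lazy` and `saved` are illustrative names:

```python
from itertools import groupby

instream = ["alpha\n", "beta\n", "\n", "gamma\n"]

# Collect the group iterators first and look at them afterwards: too late,
# groupby() has already advanced past their items.
stale_groups = [group for key, group in groupby(instream, str.isspace)]
lazy = [list(group) for group in stale_groups]

# Turn each group into a list while it is still the current group.
saved = [list(group) for key, group in groupby(instream, str.isspace)]

print(lazy)   # the earlier groups come back empty
print(saved)  # every group keeps its lines
```

So a group has to be consumed, or copied into a list or string, before the
loop asks for the next one.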

As the final refinement we join all lines of the group into a single
string

>>> "".join(group)
u'alpha\nbeta\n'

and that's it.
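
Put together, the whole thing might look like the sketch below. It is
written for Python 3, where the unicode type is simply str; the function
name `paragraphs` is made up for illustration and is not necessarily the
one used earlier in the thread:

```python
from itertools import groupby

def paragraphs(instream):
    """Yield each run of non-blank lines as one string."""
    # str.isspace is called through the class, so every item must be a str
    for key, group in groupby(instream, str.isspace):
        if not key:  # it's not a separator group
            yield "".join(group)

lines = ["alpha\n", "beta\n", "\n", "gamma\n", "\n", "\n", "delta\n"]
result = list(paragraphs(lines))
print(result)
```

With a real file you would pass the open file object instead of the list,
since iterating over it yields the lines one by one.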

Peter