Simple Text Processing Help
Peter Otten
__peter__ at web.de
Tue Oct 16 04:14:44 EDT 2007
patrick.waldo wrote:
> manipulation? Also, I conceptually get it, but would you mind walking
> me through
>> for key, group in groupby(instream, unicode.isspace):
>>     if not key:
>>         yield "".join(group)
itertools.groupby() splits a sequence into groups with the same key; e.g.
to group names by their first letter you'd do the following:
>>> from itertools import groupby
>>> def first_letter(s): return s[:1]
...
>>> for key, group in groupby(["Anne", "Andrew", "Bill", "Brett", "Alex"], first_letter):
...     print "--- %s ---" % key
...     for item in group:
...         print item
...
--- A ---
Anne
Andrew
--- B ---
Bill
Brett
--- A ---
Alex
Note that there are two groups with the same initial; groupby() considers
only consecutive items in the sequence for the same group.
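If you want a single group per initial instead, one common approach is to sort the sequence by the same key function before grouping. The following is only a sketch, written for Python 3 (an assumption; the session above is Python 2), and the name `grouped` is made up for illustration:

```python
from itertools import groupby

def first_letter(s):
    return s[:1]

names = ["Anne", "Andrew", "Bill", "Brett", "Alex"]

# Sorting by the key function makes items with equal keys adjacent,
# so groupby() then yields exactly one group per initial.
grouped = {
    key: list(group)
    for key, group in groupby(sorted(names, key=first_letter), first_letter)
}
print(grouped)
# {'A': ['Anne', 'Andrew', 'Alex'], 'B': ['Bill', 'Brett']}
```

Sorting is not wanted for the blank-line case below, where the order of the lines matters and the consecutive-only behaviour is exactly what is needed.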
In your case the sequence consists of the lines in the file, converted to
unicode strings -- the key is a boolean indicating whether the line consists
entirely of whitespace or not,
>>> u"\n".isspace()
True
>>> u"alpha\n".isspace()
False
but I call it slightly differently, as an unbound method:
>>> unicode.isspace(u"alpha\n")
False
This is only possible because all items in the sequence are known to be
unicode instances. So far we have, using a list instead of a file:
>>> instream = [u"alpha\n", u"beta\n", u"\n", u"gamma\n", u"\n", u"\n", u"delta\n"]
>>> for key, group in groupby(instream, unicode.isspace):
...     print "--- %s ---" % key
...     for item in group:
...         print repr(item)
...
--- False ---
u'alpha\n'
u'beta\n'
--- True ---
u'\n'
--- False ---
u'gamma\n'
--- True ---
u'\n'
u'\n'
--- False ---
u'delta\n'
As you see, groups with real data alternate with groups that contain only
blank lines, and the key for the latter is True, so we can skip them with
if not key:  # it's not a separator group
    yield group
As the final refinement we join all lines of the group into a single
string
>>> "".join(group)
u'alpha\nbeta\n'
and that's it.
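Putting the pieces together, the whole thing might look like the sketch below. It is written for Python 3, so str.isspace stands in for unicode.isspace (an assumption, since the session above is Python 2), and the function name `paragraphs` is made up for illustration:

```python
from itertools import groupby

def paragraphs(instream):
    """Yield each run of consecutive non-blank lines as one string."""
    for key, group in groupby(instream, str.isspace):
        if not key:  # key is False for groups of non-blank lines
            yield "".join(group)

# A list of lines stands in for an open file here.
instream = ["alpha\n", "beta\n", "\n", "gamma\n", "\n", "\n", "delta\n"]
result = list(paragraphs(instream))
print(result)
# ['alpha\nbeta\n', 'gamma\n', 'delta\n']
```

Passing an open text file instead of the list should work the same way, because iterating over a file yields its lines one at a time.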
Peter