Help with arrays of strings

Simon Forman rogue_pedro at yahoo.com
Mon Jul 31 21:33:34 EDT 2006


Jon Smirl wrote:
> I only have a passing acquaintance with Python and I need to modify some
> existing code. This code is going to get called with 10GB of data so it
> needs to be fairly fast.
>
> http://cvs2svn.tigris.org/ is code for converting a CVS repository to
> Subversion. I'm working on changing it to convert from CVS to git.
>
> The existing Python RCS parser provides me with the CVS deltas as
> strings. I need to get these deltas into an array of lines so that I can
> apply the diff commands that add/delete lines (like 10 d20, etc). What is
> the most efficient way to do this? The data structure needs to be
> able to apply the diffs efficiently too.
>
> The strings have embedded @'s doubled as an escape sequence, is there an
> efficient way to convert these back to single @'s?
>
> After each diff is applied I need to convert the array of lines back into
> a string, generate a sha-1 over it and then compress it with zlib and
> finally write it to disk.
>
> The 10GB of data is Mozilla CVS when fully expanded.
>
> Thanks for any tips on how to do this.
>
> Jon Smirl
> jonsmirl at gmail.com

Splitting a string into a list (array) of lines is easy enough. If you
want to discard the line endings,

lines = s.splitlines()

or, if you want to keep them,

lines = s.splitlines(True)

Replacing substrings in a string is also easy,

s = s.replace('@@', '@')

For efficiency, you'll probably want to do the replacement first, then
split:

lines = s.replace('@@', '@').splitlines()
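To make that concrete, here is a minimal sketch of the unescape-then-split
step. The helper name unescape_and_split and the sample text are made up
for illustration; only replace() and splitlines() come from the advice
above.

```python
def unescape_and_split(text, keepends=True):
    """Collapse doubled '@' characters, then split the text into lines."""
    # One replace() over the whole string, then one split, rather than
    # calling replace() on every line separately.
    return text.replace("@@", "@").splitlines(keepends)


raw = "user@@example.com\nsecond line\n"
lines = unescape_and_split(raw)
# Each element keeps its trailing "\n" because keepends is True.
```

One thing to watch if the output has to be byte-exact: on text strings in
current Python, splitlines() also splits on a few less common characters
such as form feed, while on byte strings it only splits on "\n", "\r" and
"\r\n".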


Once you've got your list of lines, Python's list manipulation should
make applying diffs very easy.  For instance, to replace lines 3 to 7
(counting from zero) you can assign a list of replacement lines to a
"slice" of the list of lines; the slice's end index, 8, is exclusive:

lines[3:8] = replacement_lines

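replacement_lines does not have to be the same length as the slice it
replaces, so the same syntax covers insertion (lines[3:3] = new_lines)
and deletion (del lines[3:8]).  There's a lot more to this; read up on
Python's lists.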
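Here is a rough sketch of how slice assignment could drive the add/delete
commands. It assumes the delta is an RCS-style edit script in which
"aN M" adds the next M lines after line N and "dN M" deletes M lines
starting at line N, with 1-based line numbers that refer to the text
before the delta is applied. Check that against what the parser really
hands you. The function name apply_rcs_delta is hypothetical.

```python
def apply_rcs_delta(lines, delta_lines):
    """Return a new list with an RCS-style edit script applied to lines.

    Assumed format: "aN M" (add M lines after line N) and "dN M"
    (delete M lines starting at line N), numbered against the old text.
    """
    # First pass: parse the script into (kind, start, payload) tuples.
    commands = []
    i = 0
    while i < len(delta_lines):
        header = delta_lines[i].strip()
        kind = header[0]
        start, count = [int(n) for n in header[1:].split()]
        i += 1
        if kind == "a":
            # The next `count` lines are the text to insert.
            commands.append(("a", start, delta_lines[i:i + count]))
            i += count
        elif kind == "d":
            commands.append(("d", start, count))
        else:
            raise ValueError("unknown command: %r" % header)

    # Second pass: apply from the bottom up, so that edits near the end
    # do not shift the line numbers used by earlier commands.
    result = list(lines)
    for kind, start, payload in reversed(commands):
        if kind == "a":
            result[start:start] = payload      # insert after line `start`
        else:
            del result[start - 1:start - 1 + payload]
    return result


old = ["one\n", "two\n", "three\n", "four\n"]
delta = ["d2 1\n", "a2 1\n", "TWO\n", "a4 1\n", "five\n"]
new = apply_rcs_delta(old, delta)
```

Inserting into or deleting from the middle of a list costs time
proportional to the list's length, so for very large files with many
edits it may be worth building the result in a single forward pass
instead.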


To convert the list back into one string use the join() method; if you
kept the line endings,

s = "".join(lines)

or, if you threw them away (note that this will not put back a newline
at the very end of the text),

s = "\n".join(lines)
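Since a digest is going to be computed over the rebuilt string, it
matters whether the round trip is exact. A small check, using made-up
sample text:

```python
original = "first\nsecond\n"

kept = original.splitlines(True)    # line endings preserved
stripped = original.splitlines()    # line endings discarded

rebuilt_kept = "".join(kept)
rebuilt_stripped = "\n".join(stripped)

# Keeping the line endings reproduces the text exactly; joining the
# stripped lines with "\n" leaves off the final newline.
same_when_kept = rebuilt_kept == original
same_when_stripped = rebuilt_stripped == original
```

So if the result has to hash to the same value as the original text,
keeping the line endings is probably the safer choice.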

Python has standard modules for SHA-1 digests (sha, or hashlib from
Python 2.5 onward) and for zlib compression (zlib).  See
http://docs.python.org/lib/lib.html

HTH, enjoy,
~Simon



