Help with arrays of strings
Simon Forman
rogue_pedro at yahoo.com
Mon Jul 31 21:33:34 EDT 2006
Jon Smirl wrote:
> I only have a passing acquaintance with Python and I need to modify some
> existing code. This code is going to get called with 10GB of data so it
> needs to be fairly fast.
>
> http://cvs2svn.tigris.org/ is code for converting a CVS repository to
> Subversion. I'm working on changing it to convert from CVS to git.
>
> The existing Python RCS parser provides me with the CVS deltas as
> strings. I need to get these deltas into an array of lines so that I can
> apply the diff commands that add/delete lines (like 10 d20, etc). What is
> the most efficient way to do this? The data structure needs to be
> able to apply the diffs efficiently too.
>
> The strings have embedded @'s doubled as an escape sequence, is there an
> efficient way to convert these back to single @'s?
>
> After each diff is applied I need to convert the array of lines back into
> a string, generate a sha-1 over it and then compress it with zlib and
> finally write it to disk.
>
> The 10GB of data is Mozilla CVS when fully expanded.
>
> Thanks for any tips on how to do this.
>
> Jon Smirl
> jonsmirl at gmail.com
Splitting a string into a list (array) of lines is easy enough. If you
want to discard the line endings,

    lines = s.splitlines()

or, if you want to keep them,

    lines = s.splitlines(True)

Replacing substrings in a string is also easy:

    s = s.replace('@@', '@')

For efficiency, you'll probably want to do the replacement first, then
split:

    lines = s.replace('@@', '@').splitlines()
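
As a quick check of the two steps above, here is a small sketch. The
sample string is made up for illustration and is not taken from a real
RCS file.

```python
# Hypothetical sample: a delta-like string with @ doubled as an escape.
raw = "user@@example.com\nsecond line\n"

# Unescape once on the whole string, then split it into lines.
text = raw.replace("@@", "@")

without_endings = text.splitlines()     # line endings discarded
with_endings = text.splitlines(True)    # keepends=True, endings kept

print(without_endings)
print(with_endings)
```

Keeping the endings means "".join(with_endings) gives back exactly the
unescaped text, which matters if you hash the result later.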
Once you've got your list of lines, Python's list manipulation should
make applying diffs very easy. For instance, to replace lines 3 to 7
(counting from zero) you could assign a list of replacement lines to a
"slice" of the list of lines:

    lines[3:8] = replacement_lines

The replacement list doesn't have to be the same length as the slice,
so the same idiom handles insertions and deletions. There's a lot more
to this; read up on Python's lists.
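
To make the slice idea concrete, here is a rough sketch of applying
add/delete commands to a list of lines. The command format is an
assumption on my part (RCS-style "dN M" and "aN M" with 1-based line
numbers that refer to the original text); check it against what the
parser actually hands you. The function name apply_rcs_diff is made up
for this example.

```python
def apply_rcs_diff(lines, diff_lines):
    """Return a new list with RCS-style edit commands applied.

    Assumed command format (not confirmed by the thread):
      dN M   delete M lines starting at line N (1-based)
      aN M   add the next M lines of the diff after line N
    Line numbers refer to the original text, so `offset` tracks how
    far earlier edits have shifted the list.
    """
    result = list(lines)   # work on a copy, leave the input alone
    offset = 0
    i = 0
    while i < len(diff_lines):
        command = diff_lines[i].rstrip("\r\n")
        i += 1
        start, count = (int(x) for x in command[1:].split())
        if command[0] == "d":
            begin = start - 1 + offset
            del result[begin:begin + count]
            offset -= count
        elif command[0] == "a":
            begin = start + offset
            # Assigning to an empty slice inserts without replacing.
            result[begin:begin] = diff_lines[i:i + count]
            i += count
            offset += count
        else:
            raise ValueError("unknown diff command: %r" % command)
    return result


old = ["a\n", "b\n", "c\n", "d\n"]
diff = ["d2 1\n", "a3 2\n", "x\n", "y\n"]
new = apply_rcs_diff(old, diff)
```

Each slice operation shifts the tail of the list, so for files with a
great many edits it may be worth building the output list in a single
pass instead.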
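To convert the list back into one string use the join() method. If you
kept the line endings,

    s = "".join(lines)

or if you threw them away,

    s = "\n".join(lines)

Note that the second form does not put back a newline after the last
line, so keep the line endings if the string has to match the original
exactly.

Python has standard modules for both remaining steps: sha for SHA-1
digests (hashlib from Python 2.5 on) and zlib for compression. See
http://docs.python.org/lib/lib.html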
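
The last few steps might look something like the sketch below. It
assumes a current Python, where hashlib replaces the old sha module,
and it assumes the lines are byte strings with their line endings
kept. The function name store_revision and the path argument are
made up for this example.

```python
import hashlib
import zlib


def store_revision(lines, path):
    """Join, hash, compress, and write one revision.

    Assumes `lines` is a list of bytes objects that still carry their
    line endings. Returns the SHA-1 of the uncompressed data as a hex
    string.
    """
    data = b"".join(lines)
    digest = hashlib.sha1(data).hexdigest()
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return digest
```

The digest here is taken over the uncompressed text and the file holds
only the compressed bytes; adjust to whatever the target format wants.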
HTH, enjoy,
~Simon