"".join(string_generator()) fails to be magic

thebjorn BjornSteinarFjeldPettersen at gmail.com
Thu Oct 11 03:02:10 EDT 2007


On Oct 11, 8:53 am, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Thu, 11 Oct 2007 01:26:04 -0500, Matt Mackal wrote:
> > I have an application that occasionally is called upon to process
> > strings that are a substantial portion of the size of memory. For
> > various reasons, the resultant strings must fit completely in RAM.
> > Occasionally, I need to join some large strings to build some even
> > larger strings.
>
> > Unfortunately, there's no good way of doing this without using 2x the
> > amount of memory as the result. You can get most of the way there with
> > things like cStringIO or mmap objects, but when you want to actually
> > get the result as a Python string, you run into the copy again.
>
> > Thus, it would be nice if there was a way to join the output of a
> > string generator so that I didn't need to keep the partial strings in
> > memory. <subject> would be the obvious way to do this, but it of
> > course converts the generator output to a list first.
>
> Even if `str.join()` would not convert the generator into a list first,
> you would have overallocation.  You don't know the final string size
> beforehand so intermediate strings must get moved around in memory while
> concatenating.  Worst case: all but the last string are already
> concatenated and the last one does not fit into the allocated memory
> anymore, so new memory is allocated that can hold both strings ->
> double amount of memory needed.
>
> Ciao,
>         Marc 'BlackJack' Rintsch

Perhaps realloc() could be used to avoid this?  I'm guessing that's
what cStringIO does, although I'm too lazy to check (I don't have
source on this box). Perhaps a cStringIO.getvalue() implementation
that doesn't copy memory would solve the problem?
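
A minimal sketch of the "read the buffer without copying" idea, assuming
Python 3's io.BytesIO rather than the cStringIO discussed above (that
substitution is an assumption, not something checked against cStringIO's
source). string_generator() here is a hypothetical stand-in for the
generator in the subject line. BytesIO.getbuffer() is documented to return
a view over the buffer's contents without copying them:

```python
import io


def string_generator():
    # Hypothetical stand-in for the generator in the subject line.
    for i in range(5):
        yield b"chunk%d;" % i


buf = io.BytesIO()
for piece in string_generator():
    # Each piece is appended to the growing buffer and can then be freed,
    # so the pieces never have to sit together in a list.
    buf.write(piece)

view = buf.getbuffer()    # memoryview over the buffer's contents, no copy
total = len(view)         # total number of bytes written
first = bytes(view[:7])   # copies only this 7-byte slice
view.release()            # the BytesIO cannot be resized while a view exists

print(total, first)
```

Note that the result is a memoryview, not a string object, so this only
helps where the consumer accepts the buffer interface. Growing the buffer
may still move it in memory, which is Marc's worst case.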

-- bjorn



