[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

"Martin v. Löwis" martin at v.loewis.de
Wed Aug 24 16:33:50 CEST 2005


Walter Dörwald wrote:
> Martin v. Löwis wrote:
> 
>> Walter Dörwald wrote:
>>
>>> I think a maxsplit argument (just as for unicode.split()) would help
>>> too.
>>
>>
>> Correct - that would allow to get rid of the quadratic part.
> 
> 
> OK, such a patch should be rather simple. I'll give it a try.

Actually, on a second thought - it would not remove the quadratic
aspect. You would still copy the rest string completely on each
split. So on the first split, you copy N lines (one result line,
and N-1 lines into the rest string), on the second split, N-2
lines, and so on, totalling N*N/2 line copies again. The only
thing you save is the join (as the rest is already joined), and
the IsLineBreak calls (which are necessary only for the first
line).

Please see python.org/sf/1268314; it solves the problem by
keeping the splitlines result. It only invokes IsLineBreak
once per character, and also copies each character only once,
and allocates each line only once, totalling in O(N) for
these operations. It still does contain a quadratic operation:
the lines are stored in a list, and the result list is
removed from the list with del lines[0]. This copies N-1
pointers, result in N*N/2 pointer copies. That should still
be much faster than the current code.

> unicodelinebreaks = u"".join(unichr(c) for c in xrange(0,
> sys.maxunicode) if unichr(c).islinebreak())

That is very inefficient. I would rather add a static list
to the string module, and have a test that says

assert str.unicodelinebreaks == u"".join(ch for ch in (unichr(c) for c
in xrange(0, sys.maxunicode)) if unicodedata.bidirectional(ch)=='B' or
unicodedata.category(ch)=='Zl')

unicodelinebreaks could then be defined as

# u"\r\n\x1c\x1d\x1e\x85\u2028\u2029
'\n\r\x1c\x1d\x1e\xc2\x85\xe2\x80\xa8\xe2\x80\xa9'.decode("utf-8")

> OK, this would mean we'd have to distinguish between a direct call to
> read() and one done by readline() (which we do anyway through the
> firstline argument).

See my patch. If we have cached lines, we don't need to call .read
at all.

Regards,
Martin


More information about the Python-Dev mailing list