[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 22:14:32 CEST 2005

Am 24.08.2005 um 21:15 schrieb Martin v. Löwis:

> Walter Dörwald wrote:
>
>
>>> Right. Not sure what people think whether this should still be
>>> supported, but I keep supporting it whenever I think of it.
>>>
>>
>> OK, so should we add this for 2.4.2 or only for 2.5?
>>
>
> You mean, string.unicodelinebreaks?
>

Yes.

> I think something needs to be
> done to fix the performance problem. In doing so, API changes
> might occur. We should not add API changes in 2.4.2 unless they
> contribute to the bug fix, and even then, the release manager
> probably needs to approve them (in any case, they certainly
> need to be backwards compatible)
>

OK. Your version of the patch (without replacing line =  
line.splitlines(False)[0] with something better) might be enough for  
2.4.2.

>> Should this really be put into string.py, or should it be a class
>> attribute of unicode? (At least that's what was proposed for the  
>> other
>> strings in string.py (string.whitespace etc.) too.
>>
>
> If the 2.4.2 fix is based on this kind of data, I think it should go
> into a private attribute of codecs.py.
>

I think codecs.unicodelinebreaks has one big problem: it will not  
work for codecs that do str->str decoding.

> For 2.5, I would put it
> into strings for tradition. There is no point in having some of these
> constants in strings and others as class attributes (unless we also
> add them as class attributes in 2.5, in which case adding
> unicodelinebreaks into strings would be pointless).
>
> So I think in 2.5, I would like to see
>
> # string.py
> ascii_letters = str.ascii_letters
>
> in which case unicode.linebreaks would be the right spelling.
>

And it would have the advantage, that it could work both with str and  
unicode if we had both str.linebreaks and unicode.linebreaks

>>> I'm not so sure anymore. It is good for consistency, but I doubt  
>>> there
>>> are actual use cases: how often do you want only the first n lines
>>> of some string? Reading the first n lines of a file might be an
>>> application, but then, you would rather use .readline() directly.
>>>
>>
>> Not every unicode string is read from a StreamReader.
>>
>
> Sure: but how often do you want to fetch the first line of a Unicode
> string you happen to have in memory, without iterating over all lines
> eventually?
>

I don't know. The only obvious spot in the standard library (apart  
from codecs.py) seems to be
    def shortdescription(self): return self.description().splitlines() 
[0]
in Lib/plat-mac/pimp.py

>> Another solution would be to have a unicode.itersplitlines() and  
>> store
>> the iterator. Then we wouldn't need a maxsplit because you simply can
>> stop iterating once you have what you want.
>>
>
> That might work. I would then ask for itersplitlines to return pairs
> of (line, truncated) so you can easily know whether you merely ran
> into the end of the string, or whether you got a complete line
> (although it might be a bit too specific for the readlines() case)
>

Or maybe (line, terminatorlength) which gives you the same info  
(terminatorlength == 0 means truncated) and makes it easy to strip  
the terminator.

>> So reverting to the 2.3 behaviour for simple codecs is out?
>>
>
> I'm -1, atleast. It would also fix the problem at hand, for the  
> reported
> case. However, it does leave some codecs in the cold, most notably
> UTF-8 (which, in turn, isn't an issue for PEP 262, since UTF-8 is
> built-in in the parser).
>

You meant PEP 263, right?

> I think the UTF-8 stream reader should support
> all Unicode line breaks, so it should continue to use the Python
> approach.
>

OK.

> However, UTF-8 is fairly common, so that reading an
> UTF-8-encoded file line-by-line shouldn't suck.
>

OK, so what's missing is a solution for str->str codecs (or we keep  
line = line.splitlines(False)[0] and test, whether this is fast enough).

Bye,
    Walter Dörwald