[Python-Dev] Re: PEP 262: Unicode Indexing Helper Module

Fri, 13 Jul 2001 15:44:55 +0200

This is a multi-part message in MIME format.
--------------4273B7E264E4649CF795A2CF
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

> Paul Moore (in private mail):
>
> You have methods for finding
> the start and end of various <indextypes>, but you don't have a method for
> finding the length of an <indextype>. In the case of words (which is the one
> I understand :-), the length of a word is not the same as the difference
> between the starts of consecutive words - the intervening whitespace should
> be excluded (at least for some applications). I would suggest
> 
> length_<indextype>(u, index) -> integer
> Returns the length in Unicode objects of the <indextype> found at u[index]
> or -1 in case u[index] is not in an element of this type (for example, in
> the whitespace between words). [XXX Should this be the number of Unicode
> objects between index and the end of the element, or should it be the length
> from start to end even if you are in the middle?]
> 
> or maybe better
> 
> nextend_<indextype>(u, index) -> integer
> Returns the Unicode object index for the end of the next <indextype> found
> after u[index] or -1 in case no next element of this type exists.
> 
> [But that runs into issues when you are in a word - If index is not the
> first Unicode object, nextend is the end of *this* element, whereas next is
> the start of the *next* element. I think I'm starting to show my
> ignorance...]
> 
> Even though I suspect my suggested methods are too simplistic, I'd suggest
> at least a comment in the PEP on how to work out the length of the element
> you're in (or why it's hard, and you'd never want to do it :-)...

The two suggested APIs probe into the Unicode object. I think it would
be more useful to return the slice (as a slice object) which represents
the <indextype> element found at the given index in u, e.g.

<indextype>_slice(u, index) -> slice object or None

    Returns the slice pointing to the <indextype> element found in 
    u at the given index or None in case no such element can be found
    at that position.
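
To make the idea concrete, here is a rough sketch for the word case only.
The name word_slice and the definition of a "word" as a maximal run of
non-whitespace characters are assumptions made for illustration; the PEP
may well define word boundaries differently.

```python
def word_slice(u, index):
    """Return a slice covering the word at u[index], or None.

    Assumption (not from the PEP): a "word" is a maximal run of
    non-whitespace characters.
    """
    # Out-of-range positions and whitespace are not inside any word.
    if not 0 <= index < len(u) or u[index].isspace():
        return None

    # Walk left to the first character of the word.
    start = index
    while start > 0 and not u[start - 1].isspace():
        start -= 1

    # Walk right to one past the last character of the word.
    end = index + 1
    while end < len(u) and not u[end].isspace():
        end += 1

    return slice(start, end)
```

With such a return value, the length Paul asks about falls out as
stop - start of the slice, regardless of where in the word the index
points, and the "not in an element" case is signalled by None instead
of -1.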

Hmm, I wonder whether slice objects can be "applied" to sequences
somehow...
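
For what it's worth, in current Python versions a slice object can be
used directly as a subscript on built-in sequences, so the following
works as a small illustration (the sample string and offsets are made
up for the example):

```python
u = "hello wide world"

# A slice object describing the element at positions 6..9.
s = slice(6, 10)

# Applying the slice object is the same as writing u[6:10].
word = u[s]

# The element length is available without touching the string.
length = s.stop - s.start

# indices() normalizes a slice against a given sequence length.
start, stop, step = s.indices(len(u))
```

Here word is the substring the slice points at and length is the
number of Unicode objects it spans.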

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/
--------------4273B7E264E4649CF795A2CF
Content-Type: message/rfc822
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Message-ID: <714DFA46B9BBD0119CD000805FC1F53B01B5AEF5@ukrux002.rundc.uk.origin-it.com>
From: "Moore, Paul" <Paul.Moore@atosorigin.com>
To: "'mal@lemburg.com'" <mal@lemburg.com>
Subject: PEP 262: Unicode Indexing Helper Module
Date: Fri, 13 Jul 2001 13:26:52 +0100
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2650.21)
Content-Type: text/plain;
	charset="iso-8859-1"

Excuse me for commenting on an area which I know virtually nothing about,
but one point struck me when I saw this PEP. You have methods for finding
the start and end of various <indextypes>, but you don't have a method for
finding the length of an <indextype>. In the case of words (which is the one
I understand :-), the length of a word is not the same as the difference
between the starts of consecutive words - the intervening whitespace should
be excluded (at least for some applications). I would suggest

length_<indextype>(u, index) -> integer
Returns the length in Unicode objects of the <indextype> found at u[index]
or -1 in case u[index] is not in an element of this type (for example, in
the whitespace between words). [XXX Should this be the number of Unicode
objects between index and the end of the element, or should it be the length
from start to end even if you are in the middle?]

or maybe better

nextend_<indextype>(u, index) -> integer
Returns the Unicode object index for the end of the next <indextype> found
after u[index] or -1 in case no next element of this type exists.

[But that runs into issues when you are in a word - If index is not the
first Unicode object, nextend is the end of *this* element, whereas next is
the start of the *next* element. I think I'm starting to show my
ignorance...]

Even though I suspect my suggested methods are too simplistic, I'd suggest
at least a comment in the PEP on how to work out the length of the element
you're in (or why it's hard, and you'd never want to do it :-)...

Paul.

--------------4273B7E264E4649CF795A2CF--