[issue2636] Regexp 2.6 (modifications to current re 2.2.2)
Jeffrey C. Jacobs
report at bugs.python.org
Tue Jun 17 19:43:40 CEST 2008
Jeffrey C. Jacobs <timehorse at users.sourceforge.net> added the comment:
Well, it's time for another update on my progress...
Some good news first: Atomic Grouping is now completed, tested and
documented, and as stated above, is classified as issue2636-01 and
related patches. Secondly, with caveats listed below, Named Match Group
Attributes on a match object (item 2) is also more or less complete at
issue2636-02 -- it only lacks documentation.
Now, I want to also update my list of items. We left off at 11: Other
Perl-specific modifications. Since that time, I have spawned a number
of other branches, the first of which (issue2636-12) I am happy to
announce is also complete!
12) Implement the changes to the documentation of re as per Jim J.
Jewett's suggestion from 2008-04-24 14:09. Again, this has been done.
13) Implement a grouptuples(...) method as per Mark Summerfield's
suggestion on 2008-05-28 09:38. grouptuples would take the same filtering
parameters as the other group* functions, and would return a list of
3-tuples (unless only 1 group was requested). It should default to all
match groups (1..n, not group 0, the matching string).
14) As per PEP 3131 and the move to Python 3.0, Python will begin to
allow full UNICODE-compliant identifier names. Correspondingly, it
would be the responsibility of this item to allow UNICODE names for
match groups. This would allow retrieval of UNICODE names via the
group* functions or when combined with Item 3, the getitem handler
(m[u'...']) (03+14) and the attribute name itself (e.g. getattr(m,
u'...')) when combined with item 2 (02+14).
15) Change the Pattern_Type, Match_Type and Scanner_Type (experimental)
to become richer Python Types. Specifically, add __doc__ strings to
each of these types' methods and members.
16) Implement various FIXMEs.
16-1) Implement the FIXME such that if m is a MatchObject, del m.string
will disassociate the original matched string from the match object;
string would be the only member that would allow modification or
deletion and you will not be able to modify the m.string value, only
delete it.
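
To make the intended behaviour of Item 16-1 concrete, here is a rough pure-Python sketch. It is not the C implementation from the patches: DetachableMatch is a hypothetical wrapper name, and returning None after deletion is an assumption, since the FIXME only says the string is disassociated.

```python
import re


class DetachableMatch:
    """Hypothetical wrapper (not part of re) sketching Item 16-1."""

    def __init__(self, match):
        self._match = match
        self._string = match.string

    @property
    def string(self):
        # Assumption: a deleted string reads back as None.
        return self._string

    @string.deleter
    def string(self):
        # Deletion is allowed. No setter is defined, so assignment
        # raises AttributeError, as the item requires.
        # Note: the wrapped match still holds its own reference here;
        # only the visible behaviour is sketched.
        self._string = None

    def __getattr__(self, name):
        # Everything else is delegated to the real match object.
        return getattr(self._match, name)


m = DetachableMatch(re.match(r"a+", "aaa"))
del m.string
```

After the del, m.group(0) still returns the matched text, while m.string no longer exposes the original subject string.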
-----
Finally, I want to make a couple of notes about Item 2:
Firstly, as noted in Item 14, I wish to add support for UNICODE match
group names, and the current version of the C-code would not allow that;
it would only make sense to add UNICODE support if 14 is implemented, so
adding support for UNICODE match object attributes would depend on both
items 2 and 14. Thus, that would be implemented in issue2636-02+14.
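
For illustration only: the re module in Python 3 releases that postdate this message accepts non-ASCII identifiers as group names in str patterns. The snippet below shows that behaviour, not the issue2636-02+14 patch; the group names año and mes are made up for the example.

```python
import re

# Group names only need to be valid Python 3 identifiers,
# so PEP 3131-style non-ASCII names work in str patterns.
m = re.match(r"(?P<año>\d{4})-(?P<mes>\d{2})", "2008-06")

year = m.group("año")         # retrieval via the group* functions
names = sorted(m.groupdict()) # the names as seen by groupdict()
```

Attribute-style access (Item 2) would then amount to getattr(m, 'año') on top of the same name table.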
Secondly, there is a FIXME which I discussed in Item 16; I gave that
problem its own item and branch. Also, as stated in Item 15, I would
like to add more robust help code to the Match object and bind __doc__
strings to the fixed attributes. Although this would not directly
affect the Item 2 implementation, it would probably involve moving some
code around in its vicinity.
Finally, I would like suggestions on how to handle name collisions when
match group names are provided as attributes. For instance, an
expression like '(?P<pos>.*)' would match more or less any string and
assign it to the name "pos". But "pos" is already an attribute of the
Match object, and therefore pos cannot be exposed as a named match group
attribute, since match.pos will return the usual meaning of pos for a
match object, not the value of the capture group named "pos".
I have 3 proposals as to how to handle this:
a) Simply disallow the exposure of match group name attributes if the
names collide with an existing member of the basic Match Object
interface.
b) Expose the reserved names through a special prefix notation, and for
forward compatibility, expose all names via this prefix notation. In
other words, if the prefix was 'k', match.kpos could be used to access
pos; if it was '_', match._pos would be used. If Item 3 is implemented,
it may be sufficient to allow access via match['pos'] as the canonical
way of handling match group names using reserved words.
c) Don't expose the names directly; only expose them through a prefixed
name, e.g. match._pos or match.kpos.
Personally, I like a) because, if Item 3 is implemented, it provides a
fairly useful shorthand for retrieving a group when a keyword is used
for its name. Also, we could put in a deprecation warning to inform
users that match group names that are keywords in the Match Object
will eventually be disallowed. However, I don't support restricting the
match group names any more than they already are (they must only be
valid Python identifiers), so again I would go with a) and nothing more,
and that's what's implemented in issue2636-02.patch.
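
A rough pure-Python sketch of proposal a) may help picture the behaviour. The AttrMatch wrapper below is hypothetical and is not the C code in issue2636-02.patch; its __getitem__ stands in for the Item 3 access path.

```python
import re


class AttrMatch:
    """Hypothetical wrapper illustrating proposal a)."""

    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Existing Match members always win, so a group named
        # "pos" is never exposed as an attribute.
        if hasattr(self._match, name):
            return getattr(self._match, name)
        # Only non-colliding group names become attributes.
        if name in self._match.re.groupindex:
            return self._match.group(name)
        raise AttributeError(name)

    def __getitem__(self, key):
        # Item 3 style access: works for every group name.
        return self._match.group(key)


m = AttrMatch(re.match(r"(?P<pos>\w+) (?P<word>\w+)", "hello world"))
start = m.pos          # the usual Match.pos, not the group
word = m.word          # non-colliding name exposed as attribute
pos_group = m["pos"]   # the colliding group, reached by subscript
```

Here m.pos keeps its usual meaning, m.word returns the group, and the colliding group is still reachable as m["pos"].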
-----
Now, rather than posting umpteen patch files, I am posting one
bz2-compressed tar of ALL patch files for all threads, where each file is
of the form:
issue2636(-\d\d|\+\d\d)*(-only)?\.patch
For instance,
issue2636-01.patch is the p1 patch that is a difference between the
current Python trunk and all that would need to be implemented to
support Atomic Grouping / Possessive Qualifiers. Combined branches are
joined with a PLUS ('+') and sub-branches are concatenated with a DASH
('-'). Thus, "issue2636-01+09-01-01+10.patch" is a patch which combines
the work from Item 1: Atomic Grouping / Possessive Qualifiers, the
sub-sub-branch of Item 9: Engine Cleanups and Item 10: Shared Constants.
Item 9 has both a child and a grandchild. The Child (09-01) is my
proposed engine redesign with the single loop; the grandchild (09-01-01)
is the redesign with the triple loop. Finally the optional "-only" flag
means that the diff is against the core SRE modifications branch and
thus does not include the core branch changes.
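
As a sketch of how the naming scheme can be read mechanically, the helper below splits a patch file name into its combined branches. parse_patch_name is a hypothetical name, and the pattern is a version of the form above with the '+' and '.' escaped so that it compiles as a Python regular expression.

```python
import re

# Assumed pattern: the naming form described above, with '+' and '.' escaped.
PATCH_NAME = re.compile(r"issue2636((?:-\d\d|\+\d\d)*)(-only)?\.patch")


def parse_patch_name(filename):
    """Return (branches, only) for a patch file name.

    branches is a list of tuples; each tuple is one combined branch,
    given as its item number followed by any sub-branch numbers.
    """
    m = PATCH_NAME.fullmatch(filename)
    if m is None:
        raise ValueError("not an issue2636 patch name: %r" % filename)
    # '+' separates combined branches; '-' separates sub-branch levels.
    parts = m.group(1).lstrip("-").split("+")
    branches = [tuple(part.split("-")) for part in parts if part]
    return branches, m.group(2) is not None


branches, only = parse_patch_name("issue2636-01+09-01-01+10.patch")
```

For the example name, branches holds Item 01, the 09-01-01 grandchild branch and Item 10, and only is False.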
As noted above, Items 01, 02, 05, 07 and 12 should be considered more or
less complete and ready for merging, assuming that implementing the
other items does not reveal something I neglected in these.
The rest, including the combined items, are all provided in the given
tarball.
Added file: http://bugs.python.org/file10645/issue2636-patches.tar.bz2
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________