[Tutor] regex grouping/capturing

Tue Jun 18 10:27:32 CEST 2013

----- Original Message -----
> From: Andreas Perstinger <andipersti at gmail.com>
> To: "tutor at python.org" <tutor at python.org>
> Cc: 
> Sent: Friday, June 14, 2013 2:23 PM
> Subject: Re: [Tutor] regex grouping/capturing
> 
> On 14.06.2013 10:48, Albert-Jan Roskam wrote:
>> I am trying to create a pygments  regex lexer.
> 
> Well, writing a lexer is a little bit more complex than your original 
> example suggested.

Hi Andreas, sorry for the late reply. You are right that creating a lexer is not that simple, and I did oversimplify my original example.

<snip>

> I'm not sure if a single regex can capture this.
> But looking at the pygments docs I think you need something along the 
> lines of (adapt the token names to your need):
> 
> class ExampleLexer(RegexLexer):
>     tokens = {
>         'root': [
>             (r'\s+', Text),
>             (r'set', Keyword),
>             (r'workspace|header', Name),
>             (r'\S+', Text),
>         ]
>     }
> 
> Does this help?

In my original regex example I used groups because I wanted to use pygments.lexer.bygroups (see the IniLexer example below) to disentangle commands, subcommands, keywords and values. Finding a command is relatively easy, but the other three elements are not. A command is always preceded by a newline and a dot. A subcommand is preceded by a forward slash. A value is *optionally* preceded by an equals sign. A keyword precedes a value. Each command has its own subset of subcommands, keywords and values (e.g. a 'median' keyword is valid only in an 'aggregate' command, not in, say, a 'set' command).

I have an xml representation of each of the 1000+ commands. My plan is to parse these and regexify them (one regex per command, all to be stored in a dictionary/shelve). Oh, and if that's not challenging enough: regexes in pygments lexers may not contain nested groups (I am not sure whether that also applies to non-capturing groups). I think I have to get the xml part right first and then see if this can be done. Thanks again Andreas!
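
To make the plan above a bit more concrete, here is a minimal sketch using only the standard re module (no pygments). The KEYWORDS dictionary is a hypothetical, hand-written stand-in for what would eventually be generated from the xml, and the function names and sample strings are made up for illustration. The alternation of keywords is the entire content of group 1, so the pattern stays flat. In plain re, non-capturing groups do not add to the group count either; whether pygments treats them the same way is something I have not checked.

```python
import re

# Hypothetical stand-in for data that would be generated from the xml files:
# each command maps to the keywords that are valid for it.
KEYWORDS = {
    "set": ["workspace", "header"],
    "aggregate": ["median"],
}

# A subcommand is preceded by a forward slash: (slash)(subcommand name).
SUBCOMMAND_RE = re.compile(r"(/)(\w+)")


def build_keyword_regex(keywords):
    """Build one flat regex: (keyword)(spaces)(optional '=')(spaces)(value)."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    # \b is an assertion, not a group, so there are exactly five groups
    # and none of them is nested inside another.
    pattern = r"\b(%s)\b(\s*)(=?)(\s*)(\S+)" % alternatives
    return re.compile(pattern, re.IGNORECASE)


# One compiled regex per command, stored in a dictionary.
COMMAND_REGEXES = {cmd: build_keyword_regex(kws) for cmd, kws in KEYWORDS.items()}


def find_keyword_values(command, text):
    """Return (keyword, value) pairs that are valid for the given command."""
    regex = COMMAND_REGEXES[command.lower()]
    return [(m.group(1).lower(), m.group(5)) for m in regex.finditer(text)]


if __name__ == "__main__":
    print(find_keyword_values("set", "set workspace=24000 header no"))
    print(find_keyword_values("set", "median=income"))
    print(SUBCOMMAND_RE.findall("aggregate /break=region"))
```

Because each command has its own regex, a keyword such as 'median' simply does not match when the 'set' regex is used. The value pattern \S+ is deliberately crude and would need refining for real syntax.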
from pygments.lexer import RegexLexer, bygroups
from pygments.token import *

class IniLexer(RegexLexer):
    name = 'INI'
    aliases = ['ini', 'cfg']
    filenames = ['*.ini', '*.cfg']

    tokens = {
        'root': [
            (r'\s+', Text),
            (r';.*?$', Comment),
            (r'\[.*?\]$', Keyword),
            (r'(.*?)(\s*)(=)(\s*)(.*?)$',
             bygroups(Name.Attribute, Text, Operator, Text, String))
        ]
    }
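
To see why the groups must be flat, it may help to look at how the five groups of the last rule line up with the five token types. The following is only an emulation with plain re, not pygments itself: the token types are represented by their names as strings, split_by_groups is a made-up helper, and compiling with re.MULTILINE is an assumption meant to mirror the default flags of RegexLexer.

```python
import re

# The last rule of the IniLexer, compiled with re.MULTILINE so that "$"
# matches at the end of each line (assumed to mirror pygments' default).
RULE = re.compile(r'(.*?)(\s*)(=)(\s*)(.*?)$', re.MULTILINE)

# Token types in the same order as the arguments to bygroups, as plain strings.
TOKEN_NAMES = ["Name.Attribute", "Text", "Operator", "Text", "String"]


def split_by_groups(line):
    """Pair each non-empty regex group with the token name at the same position."""
    match = RULE.match(line)
    if match is None:
        return []
    # Group n is paired with token n; a nested capturing group would add an
    # extra group number and shift this positional pairing.
    return [(name, text) for name, text in zip(TOKEN_NAMES, match.groups()) if text]


if __name__ == "__main__":
    for pair in split_by_groups("key = value"):
        print(pair)
```

The pairing is purely positional, which is the reason I want one group per element (command, subcommand, keyword, value) and nothing nested.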