[Tutor] regex grouping/capturing
Albert-Jan Roskam
fomcl at yahoo.com
Tue Jun 18 10:27:32 CEST 2013
----- Original Message -----
> From: Andreas Perstinger <andipersti at gmail.com>
> To: "tutor at python.org" <tutor at python.org>
> Cc:
> Sent: Friday, June 14, 2013 2:23 PM
> Subject: Re: [Tutor] regex grouping/capturing
>
> On 14.06.2013 10:48, Albert-Jan Roskam wrote:
>> I am trying to create a pygments regex lexer.
>
> Well, writing a lexer is a little bit more complex than your original
> example suggested.
Hi Andreas, sorry for the late reply. It is true that creating a lexer is not that simple; I did indeed oversimplify my original example.
<snip>
> I'm not sure if a single regex can capture this.
> But looking at the pygments docs I think you need something along the
> lines of (adapt the token names to your need):
>
> class ExampleLexer(RegexLexer):
>     tokens = {
>         'root': [
>             (r'\s+', Text),
>             (r'set', Keyword),
>             (r'workspace|header', Name),
>             (r'\S+', Text),
>         ]
>     }
>
> Does this help?
In my original regex example I used groups because I wanted to use pygments.lexer.bygroups (see below) to disentangle commands, subcommands, keywords and values. Finding a command is relatively easy, but the other three elements are not:

- A command is always preceded by a newline and a dot.
- A subcommand is preceded by a forward slash.
- A value is *optionally* preceded by an equals sign.
- A keyword precedes a value.

Each command has its own subset of subcommands, keywords and values (e.g. a 'median' keyword is valid only in an 'aggregate' command, not in, say, a 'set' command). I have an xml representation of each of the 1000+ commands. My plan is to parse these and regexify them (one regex per command, all to be stored in a dictionary/shelve). Oh, and if that's not challenging enough: regexes in pygments lexers may not contain nested groups (not sure if that also applies to non-capturing groups). I think I have to get the xml part right first and then see if this can be done. Thanks again Andreas!
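To make the plan above a bit more concrete, here is a rough sketch of "one regex per command, stored in a dictionary", using only the re module. The COMMAND_SPECS dictionary, the subcommand names in it, and the helpers build_command_regex and classify are made up for illustration; the real data would come from the xml files. It only uses flat capturing groups (no nesting), and it ignores the "newline and a dot" rule for commands, so treat it as a starting point rather than a working lexer.

```python
import re

# Hypothetical stand-in for what would be parsed from the xml command files.
COMMAND_SPECS = {
    "aggregate": {"subcommands": ["outfile", "break"], "keywords": ["median", "mean"]},
    "set": {"subcommands": [], "keywords": ["workspace", "header"]},
}


def build_command_regex(name, spec):
    """Build one regex for a command, using only flat (non-nested) groups."""
    parts = [r"\b(?P<command>%s)\b" % re.escape(name)]
    if spec["subcommands"]:
        subs = "|".join(re.escape(s) for s in spec["subcommands"])
        # A subcommand is preceded by a forward slash.
        parts.append(r"/(?P<subcommand>%s)\b" % subs)
    if spec["keywords"]:
        kws = "|".join(re.escape(k) for k in spec["keywords"])
        parts.append(r"\b(?P<keyword>%s)\b" % kws)
    # A value is optionally preceded by an equals sign. This is an assumed,
    # very loose definition of "value": anything without whitespace, '/' or '='.
    parts.append(r"=?(?P<value>[^\s/=]+)")
    return re.compile("|".join(parts))


# One compiled regex per command, keyed by command name.
REGEXES = {name: build_command_regex(name, spec)
           for name, spec in COMMAND_SPECS.items()}


def classify(command_name, text):
    """Return (element type, matched text) pairs for one command's text."""
    regex = REGEXES[command_name]
    return [(m.lastgroup, m.group(m.lastgroup)) for m in regex.finditer(text)]


print(classify("set", "set workspace=24576 header=no"))
print(classify("aggregate", "aggregate /outfile=result /break=region median=income"))
```

Because each command has its own regex, 'median' is only recognised as a keyword for 'aggregate'; in a 'set' command it falls through to the generic value pattern.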
from pygments.lexer import RegexLexer, bygroups
from pygments.token import *

class IniLexer(RegexLexer):
    name = 'INI'
    aliases = ['ini', 'cfg']
    filenames = ['*.ini', '*.cfg']

    tokens = {
        'root': [
            (r'\s+', Text),
            (r';.*?$', Comment),
            (r'\[.*?\]$', Keyword),
            (r'(.*?)(\s*)(=)(\s*)(.*?)$',
             bygroups(Name.Attribute, Text, Operator, Text, String))
        ]
    }
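
To show why the groups matter for bygroups, here is a small sketch using only the re module: it applies the same pattern as the last rule of IniLexer and pairs each group with the token type that bygroups would assign to it. The token types are written as plain strings so that the snippet runs without pygments, and pair_groups is a made-up helper name. The re.MULTILINE flag is used on the assumption that it matches the default flags of RegexLexer.

```python
import re

# Same pattern as the last rule in IniLexer above.
PATTERN = re.compile(r'(.*?)(\s*)(=)(\s*)(.*?)$', re.MULTILINE)

# Token types in the order passed to bygroups, as plain strings for illustration.
TOKEN_NAMES = ["Name.Attribute", "Text", "Operator", "Text", "String"]


def pair_groups(line):
    """Pair each regex group with the token type at the same position."""
    match = PATTERN.match(line)
    if match is None:
        return []
    return list(zip(TOKEN_NAMES, match.groups()))


for token_name, text in pair_groups("name = value"):
    print("%-15s %r" % (token_name, text))
```

Each group maps to exactly one token type by position, which is presumably why nested groups are a problem: the positions would no longer line up one to one.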