[Tutor] regex advice

Tue Jan 6 13:31:04 CET 2015

On Tue, Jan 06, 2015 at 11:43:01AM +0000, Norman Khine wrote:
> hello,
> i have the following code:
> 
> import os
> import sys
> import re
> 
> walk_dir = ["app", "email", "views"]
> #t(" ")
> gettext_re = re.compile(r"""[t]\((.*)\)""").findall
> 
> for x in walk_dir:
[...]

The first step in effectively asking for help is to focus on the actual 
problem you are having. If you were going to the car mechanic to report 
some strange noises from the engine, you would tell him about the 
noises, not give him a long and tedious blow-by-blow account of where 
you were going when the noise started, why you were going there, who you 
were planning on seeing, and what clothes you were wearing that day. At 
least, for your mechanic's sake, I hope you are not the sort of person 
who does that. 

That reminds me, I need to give my Dad a call...

And so it is with code. Your directory walker code appears to work 
correctly, so there is no need to dump it in our laps. It is just a 
distraction and an annoyance. To demonstrate the problem, you need the 
regex, a description of what result you want, a small sample of data 
where the regex fails, and a description of what you get instead.
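As a rough sketch of what such a report could look like, here is the
regex from the question run against two lines of the quoted template and
nothing else. The names `unwanted` and `wanted` are mine, purely for
illustration.

```python
import re

# The regex from the question, unchanged.
gettext_re = re.compile(r"""[t]\((.*)\)""").findall

# Two lines copied from the quoted Blade template.
unwanted = 'input( type="hidden" name="_csrf" value=csrf_token )'
wanted = 'h3 #{t("Generate Product for")} #{tt(style.name)}'

# "input(" ends in "t(", so the pattern matches it as well.
print(gettext_re(unwanted))

# The greedy .* runs on to the last ")" on the line, so the captured
# group contains far more than the translated string.
print(gettext_re(wanted))
```

Half a dozen lines like these show both the false match on input( and
the over-long match on the line we do care about.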

So, let's skip all the irrelevant directory-walking code and move on to 
the important part:

> which traverses a directory and tries to extract all strings that are
> within
> 
> t(" ")
> 
> for example:
> 
> i have a blade template file, as
> 
> replace page
>   .row
>     .large-8.columns
>       form( method="POST", action="/product/saveall/#{style._id}" )
>         input( type="hidden" name="_csrf" value=csrf_token )
>         h3 #{t("Generate Product for")} #{tt(style.name)}
[...]

I'm not certain that we need to see an entire Blade template file. 
Perhaps just an extract would do. Or perhaps not. For now, I will assume 
an extract will do, and skip ahead:

> so, gettext_re = re.compile(r"""[t]\((.*)\)""").findall is not correct as
> it includes
> 
> results such as input( type="hidden" name="_csrf" value=csrf_token )
>
> what is the correct way to pull all values that are within t(" ") but
> exclude any tt( ) and input( )
> 
> any advice much appreciated

My first instinct is to quote Jamie Zawinski:

    Some people, when confronted with a problem, think, "I know, 
    I'll use regular expressions." Now they have two problems.

If you have nested parentheses or quotes, you *cannot* in general solve 
this problem with regular expressions. If Blade templates allow nesting, 
then you are in trouble and you will need another solution, perhaps a 
proper parser.
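To give a feel for what "another solution" might look like, here is a
minimal hand-written scanner that counts parentheses and skips over
double-quoted strings. It is only a sketch: I am assuming that string
literals use double quotes with backslash escapes, which may not match
Blade's actual syntax, and `extract_t_calls` is a made-up helper name.

```python
def extract_t_calls(text):
    """Return the raw argument text of each t(...) call in text.

    Hypothetical helper: tracks nested parentheses and double-quoted
    strings, which a single regular expression cannot do in general.
    """
    results = []
    i = 0
    while True:
        start = text.find("t(", i)
        if start == -1:
            break
        # Skip longer names that merely end in "t", e.g. tt( or input(.
        if start > 0 and (text[start - 1].isalnum() or text[start - 1] == "_"):
            i = start + 1
            continue
        depth = 1
        in_string = False
        j = start + 2
        while j < len(text) and depth:
            ch = text[j]
            if in_string:
                if ch == "\\":
                    j += 1              # skip the escaped character
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            j += 1
        if depth == 0:                  # ignore unterminated calls
            results.append(text[start + 2 : j - 1])
        i = j
    return results


sample = 'h3 #{t("Total (incl. tax)")} #{tt(style.name)}'
print(extract_t_calls(sample))
```

The result is the raw argument text, quotes included, so you would still
need to strip the quotes yourself.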

But for now, let's assume that no nesting is allowed. Let's look at the 
data format again:

   blah blah blah don't care input(type="hidden" ...) 
   h3 #{t("blah blah blah")} #{tt(style.name)}

So it looks like the part you care about looks like this:

    ...#{t("spam")}...

where you want to extract the word "spam", nothing else.

This suggests a regex:

r'#{t\("(.*)"\)}'

and then you want the group (.*) not the whole regex. Here it is in 
action:

py> import re
py> text = """blah blah blah don't care input(type="hidden" ...)
...    h3 #{t("we want this")} #{tt(style.name)}
...    more junk
...    #{tt(abcd)} #{t("and this too")}blah blah blah..."""
py> pat = re.compile(r'#{t\("(.*)"\)}')
py> pat.findall(text)
['we want this', 'and this too']

So it appears to be working. Now you can try it on the full Blade file 
and see how it goes.

If the regex still gives false matches, you can try:

- using a non-greedy match instead, so that two t("...") calls on the 
same line are not merged into a single match:

  r'#{t\("(.*?)"\)}'

- rather than matching all possible characters with .* can you limit it 
to a narrower set, such as alphanumerics and spaces?

  r'#{t\("([\w ]*?)"\)}'

- perhaps you need a two-part filter: first use a regex to extract all 
the candidate matches, then eliminate some of them using some other 
test (not necessarily a regex).
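
To make those suggestions concrete, here is a small comparison on a
made-up line with two t("...") calls and a quoted tt("...") call. The
sample line and the candidate pattern in the last step are my own
assumptions, not something taken from your templates.

```python
import re

# Hypothetical sample: two wanted calls and one unwanted tt(...) call.
line = '#{t("first")} and #{t("second one")} #{tt("not this")}'

greedy = re.compile(r'#{t\("(.*)"\)}')
non_greedy = re.compile(r'#{t\("(.*?)"\)}')
words_only = re.compile(r'#{t\("(\w*?)"\)}')

# Greedy: one over-long match running up to the last ")} on the line.
print(greedy.findall(line))

# Non-greedy: each match stops at the first ")} it can reach.
print(non_greedy.findall(line))

# \w alone does not match spaces, so "second one" is silently dropped.
print(words_only.findall(line))

# Two-part filter: collect every name("...") call, then keep only
# those whose name is exactly "t".
candidates = re.findall(r'\b(\w+)\(\s*"([^"]*)"\s*\)', line)
wanted = [arg for name, arg in candidates if name == "t"]
print(wanted)
```

The two-part version is longer, but the second step is ordinary Python,
so you can add whatever extra tests you need without touching the regex.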

Good luck!

-- 
Steve