[Tutor] regex advice
Steven D'Aprano
steve at pearwood.info
Tue Jan 6 13:31:04 CET 2015
On Tue, Jan 06, 2015 at 11:43:01AM +0000, Norman Khine wrote:
> hello,
> i have the following code:
>
> import os
> import sys
> import re
>
> walk_dir = ["app", "email", "views"]
> #t(" ")
> gettext_re = re.compile(r"""[t]\((.*)\)""").findall
>
> for x in walk_dir:
[...]
The first step in effectively asking for help is to focus on the actual
problem you are having. If you were going to the car mechanic to report
some strange noises from the engine, you would tell him about the
noises, not give him a long and tedious blow-by-blow account of where
you were going when the noise started, why you were going there, who you
were planning on seeing, and what clothes you were wearing that day. At
least, for your mechanic's sake, I hope you are not the sort of person
who does that.
That reminds me, I need to give my Dad a call...
And so it is with code. Your directory walker code appears to work
correctly, so there is no need to dump it in our laps. It is just a
distraction and an annoyance. To demonstrate the problem, you need the
regex, a description of what result you want, a small sample of data
where the regex fails, and a description of what you get instead.
So, let's skip all the irrelevant directory-walking code and move on to
the important part:
> which traverses a directory and tries to extract all strings that are
> within
>
> t(" ")
>
> for example:
>
> i have a blade template file, as
>
> replace page
> .row
> .large-8.columns
> form( method="POST", action="/product/saveall/#{style._id}" )
> input( type="hidden" name="_csrf" value=csrf_token )
> h3 #{t("Generate Product for")} #{tt(style.name)}
[...]
I'm not certain that we need to see an entire Blade template file.
Perhaps just an extract would do. Or perhaps not. For now, I will assume
an extract will do, and skip ahead:
> so, gettext_re = re.compile(r"""[t]\((.*)\)""").findall is not correct as
> it includes
>
> results such as input( type="hidden" name="_csrf" value=csrf_token )
>
> what is the correct way to pull all values that are within t(" ") but
> exclude any tt( ) and input( )
>
> any advice much appreciated
My first instinct is to quote Jamie Zawinski:
    Some people, when confronted with a problem, think, "I know,
    I'll use regular expressions." Now they have two problems.
If you have nested parentheses or quotes, you *cannot* in general solve
this problem with regular expressions. If Blade templates allow nesting,
then you are in trouble and you will need another solution, perhaps a
proper parser.
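If it does come to that, one option short of a full parser is a small hand-written scanner. The following is only a sketch: the function name extract_t_strings is made up, and it assumes that quotes inside a string are escaped with a backslash, which I have not checked against Blade's actual syntax.

```python
def extract_t_strings(text):
    """Collect the quoted argument of each #{t("...")} call.

    Assumption: a backslash escapes the next character inside the
    string (hypothetical; check the template language's real rules).
    """
    marker = '#{t("'
    results = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        j = start + len(marker)
        chars = []
        # Walk the string body one character at a time.
        while j < len(text):
            c = text[j]
            if c == '\\' and j + 1 < len(text):
                chars.append(text[j + 1])  # keep the escaped character
                j += 2
            elif c == '"':
                break  # unescaped quote ends the string
            else:
                chars.append(c)
                j += 1
        # Only accept the string if the call is closed properly.
        if text.startswith('")}', j):
            results.append(''.join(chars))
        pos = j + 1
    return results


demo = r'h3 #{t("say \"hi\" (now)")} #{tt(style.name)}'
print(extract_t_strings(demo))
```

Because the scanner tracks escapes itself, parentheses and escaped quotes inside the string do not confuse it, and tt( ) is skipped since it never matches the marker.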
But for now, let's assume that no nesting is allowed. Let's look at the
data format again:
    blah blah blah don't care input(type="hidden" ...)
    h3 #{t("blah blah blah")} #{tt(style.name)}
So it looks like the part you care about looks like this:
    ...#{t("spam")}...
where you want to extract the word "spam", nothing else.
This suggests a regex:
    r'#{t\("(.*)"\)}'
and then you want the group (.*) not the whole regex. Here it is in
action:
py> import re
py> text = """blah blah blah don't care input(type="hidden" ...)
... h3 #{t("we want this")} #{tt(style.name)}
... more junk
... #{tt(abcd)} #{t("and this too")}blah blah blah..."""
py> pat = re.compile(r'#{t\("(.*)"\)}')
py> pat.findall(text)
['we want this', 'and this too']
So it appears to be working. Now you can try it on the full Blade file
and see how it goes.
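For comparison, here is a minimal sketch of why the original pattern overmatches while the anchored one does not. The one-line sample string is made up for illustration, loosely modelled on the template extract; the names loose and strict are mine.

```python
import re

# Hypothetical one-line sample, not taken from the real template file.
sample = 'input( type="hidden" ) #{t("Hello")} #{tt(style.name)}'

# Original pattern: any "t" followed by "(" starts a match, so the final
# "t" of "input(" qualifies, and the greedy .* runs on to the last ")".
loose = re.compile(r'[t]\((.*)\)')

# Anchored pattern: requires the literal #{t(" prefix and ")} suffix.
strict = re.compile(r'#{t\("(.*)"\)}')

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # one long, unwanted match starting inside input( ... )
print(strict_hits)  # just the t("...") string
```

The difference comes entirely from anchoring on the surrounding #{ and "): without them, "input(" and "tt(" both end in a "t(" that the original pattern accepts.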
If the regex still gives false matches, you can try:
- using a non-greedy match instead:
      r'#{t\("(.*?)"\)}'
- rather than matching all possible characters with .* can you limit it
  to only alphanumerics? (Note that \w does not match spaces or
  punctuation, so this version only finds single-word strings.)
      r'#{t\("(\w*?)"\)}'
- perhaps you need a two-part filter: first you use a regex to extract
  all the candidate matches, and then you eliminate some of them using
  some other test (not necessarily a regex).
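To make the three options above concrete, here is a rough sketch run against a single made-up line. The candidate pattern used for the two-part filter is just one possible choice, not the only way to do it.

```python
import re

# Hypothetical line: two t() calls, one tt() call, and an input( ).
line = '#{t("one")} and #{t("two words")} #{tt("skip")} input( type="hidden" )'

# Greedy: .* runs to the LAST ")} on the line, swallowing everything between.
greedy = re.findall(r'#{t\("(.*)"\)}', line)

# Non-greedy: .*? stops at the FIRST ")} after each opening.
lazy = re.findall(r'#{t\("(.*?)"\)}', line)

# \w only: drops any string containing a space or punctuation.
words_only = re.findall(r'#{t\("(\w*?)"\)}', line)

# Two-part filter: grab every name("...") call, then keep only name == "t".
candidates = re.findall(r'(\w+)\(\s*"([^"]*)"\s*\)', line)
filtered = [arg for name, arg in candidates if name == "t"]

print(greedy)
print(lazy)
print(words_only)
print(candidates)
print(filtered)
```

With more than one t( ) call on a line, the greedy version merges them into a single bogus match, which is the main reason to prefer the non-greedy form.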
Good luck!
--
Steve