Help to find a regular expression to parse po file

gialloporpora "sandrodll[remove]" at googlemail.com
Mon Jul 6 12:32:47 EDT 2009


In reply to the message from Hallvard B Furuseth:


>
> I don't know the syntax of a po file, but this works for the
> snippet you posted:
>
> import re
>
> arg_re = r'"[^\\\"]*(?:\\.[^\\\"]*)*"'
> arg_re = r'%s(?:\s+%s)*' % (arg_re, arg_re)
> find_re = re.compile(
>      r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)
>
> However, can \ quote a newline? If so, replace \\. with \\[\s\S] or
> something.
> Can there be other keywords between msgid and msgstr?  If so,
> add something like (?:\w+\s+<arg_re>\s*\n)*? between them.
> Can msgstr come before msgid? If so, forget using a single regexp.
> Anything else to the syntax to look out for?  Single quotes, maybe?
>
> Is it a problem if the regexp isn't quite right and doesn't match all
> cases, yet doesn't report an error when that happens?
>
> All in all, it may be a bad idea to squeeze this into a single regexp.
> It gets ugly real fast.  Might be better to parse the file in a more
> regular way, maybe using regexps just to extract each (keyword, "value")
> pair.
>
Thank you very much, Hallvard, it seems to work. There is a strange
match in the file header, but I can skip the first match.


The po files have this structure:
http://bit.ly/18qbVc

msgid "string to translate"
"   second string to match"
"   n string to match"
msgstr "translated string"
"   second translated string"
"  n translated string"
One or more newlines come before the next group.
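
For what it's worth, here is a minimal sketch of how the regexp quoted above could be run against this structure and the pieces of each string glued back together. The names string_re, entry_re and join_pieces are made up for the illustration, and escape sequences inside the strings are left untouched, so take it as a starting point rather than a complete PO reader.

```python
import re

# One double-quoted string; a backslash escapes the character after it.
string_re = r'"[^\\"]*(?:\\.[^\\"]*)*"'
# One or more quoted strings separated by whitespace (newlines included).
arg_re = r'%s(?:\s+%s)*' % (string_re, string_re)
# msgid with its strings, immediately followed by msgstr with its strings.
entry_re = re.compile(
    r'^msgid\s+(' + arg_re + r')\s*\nmsgstr\s+(' + arg_re + r')\s*\n', re.M)


def join_pieces(arg):
    """Concatenate the contents of adjacent quoted strings (quotes removed)."""
    return ''.join(piece[1:-1] for piece in re.findall(string_re, arg))


sample = '''msgid "string to translate"
"   second string to match"
msgstr "translated string"
"   second translated string"
'''

# Each match yields the raw msgid and msgstr blocks; join their pieces.
pairs = [(join_pieces(a), join_pieces(b)) for a, b in entry_re.findall(sample)]
```

With the sample above, pairs holds a single (msgid, msgstr) tuple in which the continuation lines have been appended to the first line of each string.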

In the past I wrote a Python script to parse PO files where msgid
and msgstr are on two consecutive lines, for example:

msgid "string to translate"
msgstr "translated string"

Now the problem is how to also match the (optional) continuation strings
that follow msgid and msgstr.
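
One possible way to handle the optional strings without a single big regexp, along the lines Hallvard suggests, is to read the file line by line and append each bare quoted line to the last keyword seen. This is only a sketch under some assumptions: entries are separated by blank lines, every keyword is a plain word such as msgid or msgstr (indexed forms like msgstr[0] are not handled), and anything else, such as comment lines, is silently ignored. The names parse_po, keyword_re and cont_re are hypothetical.

```python
import re

# A keyword line: a word, whitespace, then one double-quoted string.
keyword_re = re.compile(r'^(\w+)\s+"((?:[^"\\]|\\.)*)"\s*$')
# A continuation line: one double-quoted string on its own.
cont_re = re.compile(r'^"((?:[^"\\]|\\.)*)"\s*$')


def parse_po(text):
    """Return one dict per entry, mapping keyword -> joined string."""
    entries, current, last_key = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the entry collected so far.
            if current:
                entries.append(current)
            current, last_key = {}, None
            continue
        m = keyword_re.match(line)
        if m:
            last_key = m.group(1)
            current[last_key] = m.group(2)
            continue
        m = cont_re.match(line)
        if m and last_key is not None:
            # Continuation string: append to the most recent keyword.
            current[last_key] += m.group(1)
        # Any other line (for example a comment) is ignored in this sketch.
    if current:
        entries.append(current)
    return entries


sample = 'msgid "a"\n" b"\nmsgstr "x"\n" y"\n\n\nmsgid "c"\nmsgstr "z"\n'
entries = parse_po(sample)
```

Here entries contains two dicts, the first with the continuation strings joined onto msgid and msgstr, the second built from the simple two-line form.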

Sandro







