regex pattern to extract repeating groups

Mon Aug 27 14:15:15 EDT 2018

On 08/25/2018 04:55 PM, Malcolm wrote:
> I am trying to understand why regex is not extracting all of the 
> characters between two delimiters.
> 
> The complete string is the xmp IFD data extracted from a .CR2 image file.
> 
> I do have a workaround, but it's messy and possibly not future-proof.
> 
> Any insight greatly appreciated.
> 
> Malcolm
> 
> My test code is
> 
> import re
> 
> # environment
> # Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
> 
> # extract of real data for test purposes. This extract is repeated
> # the delimiters are <dc: and </dc:
> extract ='''
>  <dc:creator>
>  <rdf:Seq>
>  <rdf:li>abcdef zxcvb</rdf:li>
>  </rdf:Seq>
>  </dc:creator>
> '''
> 
> # modify the test data
> modified_extract_1 ='''
> <dc:creator>
> <rdf:Seq>
> <rdf:li>abcdef zxcvb</rdf:li>
> </rdf:Seq>
> </dc:creator>
> '''
> 
> # modify test data version 2 this works
> modified_extract_2 ='''
> <dc:creator>
> <rdf:li>abcdef zxcvb</rdf:li>
> </dc:creator>
> '''
> 
> re_pattern =r'( *<dc:.*</dc:)'
> 
> print('extract', re.search(re_pattern, extract, re.DOTALL))
> # >>> s1 <_sre.SRE_Match object; span=(1, 89), match=' <dc:creator>\n <rdf:Seq>\n <rdf:li>abcd>
> 
> print('modified_extract_1', re.search(re_pattern, modified_extract_1, re.DOTALL))
> # >>> sre.SRE_Match object; span=(1, 70), match='<dc:creator>\n<rdf:Seq>\n<rdf:li>abcdef zxcvb</rd>
> 
> print('modified_extract_2', re.search(re_pattern, modified_extract_2, re.DOTALL))
> # >>> s <_sre.SRE_Match object; span=(1, 49), match='<dc:creator>\n<rdf:li>abcdef zxcvb</rdf:li>\n</dc>
> # NOTE the missing ':' from the </dc I
> 

Regexes are generally regarded as a bad way to parse XML data, and I 
believe they are provably unsuited to it in the general case (though they 
can be beaten into sufficiency in specific ones).
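
For what it's worth, the pattern in the question does seem to match all the way through the final '</dc:'. As far as I can tell, the repr of a match object shows only the first 50 characters of the repr of the matched text, which would explain the apparently missing ':'. A small check, reusing modified_extract_2 and re_pattern from the question (the names m and full_match are mine):

```python
import re

# Test data copied from the question (version 2).
modified_extract_2 = '''
<dc:creator>
<rdf:li>abcdef zxcvb</rdf:li>
</dc:creator>
'''
re_pattern = r'( *<dc:.*</dc:)'

m = re.search(re_pattern, modified_extract_2, re.DOTALL)

# group(0) returns the complete matched text, unlike the shortened repr
# that print() shows for the match object itself.
full_match = m.group(0)

print(m.span())
print(repr(full_match))
```

Printing m.group(0) rather than the match object itself should show the whole match, trailing ':' included.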

Use https://docs.python.org/3.6/library/xml.etree.elementtree.html 
instead.  Everything will just work.  You'll be happier and more 
productive, with a brighter smile and glossier coat.
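
A minimal sketch of what that might look like, not a definitive implementation. It assumes the XMP packet declares the rdf: and dc: prefixes with the usual RDF and Dublin Core namespace URIs on enclosing elements. The bare extract in the question has no such declarations, and ElementTree will refuse to parse undeclared prefixes. XMP_SAMPLE, NS and dc_values are hypothetical names for illustration, and the sample packet is made up from the question's test data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down XMP packet built from the question's test data.
# The xmlns declarations are an assumption about the enclosing elements.
XMP_SAMPLE = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>
   <rdf:Seq>
    <rdf:li>abcdef zxcvb</rdf:li>
   </rdf:Seq>
  </dc:creator>
 </rdf:Description>
</rdf:RDF>'''

# Prefix-to-URI map used when resolving prefixed names in search paths.
NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc': 'http://purl.org/dc/elements/1.1/',
}


def dc_values(xmp_text, tag):
    """Return the text of every rdf:li found under any dc:<tag> element."""
    root = ET.fromstring(xmp_text)
    path = './/dc:%s//rdf:li' % tag
    return [li.text for li in root.findall(path, NS)]


creators = dc_values(XMP_SAMPLE, 'creator')
print(creators)
```

The same helper can be pointed at other dc: elements by changing the tag argument; it returns an empty list when the element is absent.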

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.