Parsing text

Wed May 6 23:43:49 EDT 2009

> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
> 
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)
>     if textNum:
>         print textNum
>     else:
>         print "No scene number"
>     m = '<div type="scene>'
>     m += scene
>     m += '<\div>'
>     print m
> 
> Thanks, Iain
> 

Don't forget that when you split the text, the first piece you get is what came *before* the thing you split on, so there won't be a scene number in the first piece.

###
>>> print 'this foo 1 and that foo 2 and the end'.split('foo')
['this ', ' 1 and that ', ' 2 and the end']
###

If you have material before the first occurrence of the word 'Scene' you will want to print that out without decoration.
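A minimal sketch of handling that first piece separately. The sample string and the helper name split_front_matter are made up for illustration, and the print calls are written so that they also run under Python 3:

```python
def split_front_matter(text, marker='Scene'):
    """Return (front_matter, scene_pieces) for text split on marker."""
    pieces = text.split(marker)
    # pieces[0] is whatever came before the first marker (possibly '').
    return pieces[0], pieces[1:]

# Hypothetical sample text, not taken from the original play.
sample = 'Dramatis personae. Scene 1 Enter A. Scene 2 Enter B.'
front, scenes = split_front_matter(sample)

print(front)  # front matter goes out without decoration
for piece in scenes:
    print('<div type="scene">' + piece + '</div>')
```

Only the pieces after the first one get wrapped in a scene div.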

Also, it looks like you are trying to say with your regex that the scene number will come after some space and be a digit followed by a roman numeral of some kind(?). If the number looks like 1iii or 2iv, then you could split your text with a regex rather than with the string split method:

###
>>> import re
>>> scene=re.compile(r'Scene\s+([0-9iIvV]+)')
>>> scene.split('The front matter Scene 1i The beginning was the best. Scene  1ii And then came the next act.')
['The front matter ', '1i', ' The beginning was the best. ', '1ii', ' And then came the next act.']
>>> 
###

The \s+ matches one or more whitespace characters. Sooner or later someone will type more than one space after the word Scene, so \s+ allows for that possibility.

The 0-9iIvV inside the square brackets lists the characters that might be part of your scene number. Since it's unlikely that any ordinary word appearing after Scene will match that pattern, it isn't written to be exact about what should come next. [1] The parentheses tell split what (besides the pieces left by removing the split target) should be returned. In this case, the parentheses were put around the pattern that (maybe) represents your scene number, so the scene numbers are interspersed with the pieces of text in the list.
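Since the list alternates between scene numbers and scene text after the first piece, the two can be paired up again. This is only a sketch: the function name scenes_to_xml and the n attribute on the div are assumptions for illustration, and no XML escaping is done on the text.

```python
import re

scene_re = re.compile(r'Scene\s+([0-9iIvV]+)')

def scenes_to_xml(text):
    """Return the front matter followed by one div string per scene."""
    parts = scene_re.split(text)
    out = [parts[0]]  # front matter, left undecorated
    # After parts[0] the items alternate: number, text, number, text, ...
    for num, body in zip(parts[1::2], parts[2::2]):
        # The n attribute is a made-up place to store the scene number.
        out.append('<div type="scene" n="%s">%s</div>' % (num, body))
    return out

for chunk in scenes_to_xml('The front matter Scene 1i The beginning was '
                           'the best. Scene  1ii And then came the next act.'):
    print(chunk)
```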

/chris

[1] If it were more precise it might be '([1-9][0-9]*(iv|v?i{0,3}))', which says that the number should start with a digit from 1 to 9, perhaps followed by more digits (now including 0), and then come the roman numeral possibilities (up to viii) [2]. The "|" means "or", and the inner parentheses go around the roman numeral part so that the "or" doesn't extend back to the decimal digits. That extra set of parentheses also means that the split will now contain TWO captured pieces between each piece of script. If you put a ? after the scene number part, meaning that it may or may not be there, None will be returned for each group that did not match:

###
>>> scene=re.compile(r'Scene\s+([1-9][0-9]*(iv|v?i{0,3}))?')
>>> scene.split('The front matter Scene 1i The beginning was the best. Scene  1ii And then came the next act. Scene The last one has no number.')
['The front matter ', '1i', 'i', ' The beginning was the best. ', '1ii', 'ii', ' And then came the next act. ', None, None, 'The last one has no number.']
>>> 
###
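With two groups in the pattern, the list comes in threes after the first piece (whole number, roman part, text), so a loop can step through it three items at a time and check for None. A sketch only; the name numbered_scenes is made up, and the 'No scene number' text is borrowed from the original question:

```python
import re

scene_re = re.compile(r'Scene\s+([1-9][0-9]*(iv|v?i{0,3}))?')

def numbered_scenes(text):
    """Return (front_matter, [(scene_number, scene_text), ...])."""
    parts = scene_re.split(text)
    result = []
    # parts[0] is the front matter; after it come groups of three:
    # the whole scene number, its roman numeral part, and the scene text.
    for i in range(1, len(parts), 3):
        number, roman, body = parts[i:i + 3]
        # number is None when the optional group did not match.
        result.append((number or 'No scene number', body))
    return parts[0], result

front, scenes = numbered_scenes(
    'The front matter Scene 1i The beginning was the best. '
    'Scene  1ii And then came the next act. '
    'Scene The last one has no number.')
for number, body in scenes:
    print('%s: %s' % (number, body))
```

If the roman part is not wanted as a separate piece, writing the inner group as (?:iv|v?i{0,3}) makes it non-capturing, and the list goes back to alternating between number and text.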

[2] http://diveintopython.org/regular_expressions/roman_numerals.html