Parsing text

Wed May 6 17:15:03 EDT 2009

On Wed, 06 May 2009 19:32:28 +0100, iainemsley <iainemsley at googlemail.com>  
wrote:

> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.

With neither a sample of your data nor the traceback you get, this is
going to require some crystal ball work.  Assuming that all you've got
is running text, I should warn you now that getting this right is a
hard task.  Getting it apparently right and having it fall over in a
heap or badly mangle the text is, unfortunately, very easy.

> for scene in text.split('Scene'):

Not a safe start.  This will split on the word "Scenery" as well, for
example, and doesn't guarantee you the start of a scene by a long way.
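
One way to make the split less trigger-happy is to split on a regular
expression that only accepts "Scene" at the start of a line, followed
by a number.  A rough sketch, assuming (and it is only an assumption,
since I haven't seen your data) that headings look like "Scene I" or
"SCENE 2" on a line of their own.  SCENE_HEADING and split_scenes are
names I've made up for illustration:

```python
import re

# Assumption: a scene heading starts a line, e.g. "Scene I" or "SCENE 2".
# Real play texts vary a lot, so adjust this to match your data.
SCENE_HEADING = re.compile(r"^[ \t]*Scene[ \t]+([0-9]+|[ivxlc]+)\b.*$",
                           re.IGNORECASE | re.MULTILINE)

def split_scenes(text):
    """Return a list of (scene_number, body) pairs."""
    # Splitting on a pattern with one capturing group yields
    # [preamble, number, body, number, body, ...].
    parts = SCENE_HEADING.split(text)
    return list(zip(parts[1::2], parts[2::2]))

sample = "Act I\nScene I\nThe scenery is bleak.\nScene II\nEnter Ghost.\n"
scenes = split_scenes(sample)
```

Here "scenery" in the middle of a line is left alone, and the scene
number comes back alongside each block instead of being thrown away by
the split.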

>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)

This is fragile.  Backslashes in an ordinary string literal are
processed as escape characters before the string is ever passed to
re.compile.  You get away with it here only because "\s" and "\["
are not escapes that Python recognises, so it leaves them alone;
"\b", for instance, would quietly become a backspace.  Make it a raw
string (or double the backslashes).  Even then, I sincerely doubt that
this is the regular expression you want.  As written, it matches in
sequence:

   * the start of the string
   * a space, tab, newline or other whitespace character.  Just the one.
   * the literal string "[0-9, "
   * either "i" or "I" repeated between one and four times
   * the literal string ", "
   * either "v" or "V"
   * the literal string "]"

Assuming you didn't mean to escape the open square bracket doesn't help,
because the brackets then form a character class:

   * the start of the string
   * one whitespace character
   * one of the following characters: 0123456789,iI{}vV or a space
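
Both readings are easy to check at the interpreter.  A small sketch,
using the pattern from the post written as a raw string; the variable
names and test strings are mine:

```python
import re

# The pattern from the post, as a raw string.
as_written = re.compile(r"^\s\[0-9, i{1,4}, v]", re.I)
# The same pattern without the backslash before "[".
unescaped = re.compile(r"^\s[0-9, i{1,4}, v]", re.I)

# The escaped version matches the literal text, not a scene number.
literal_hit = as_written.match(" [0-9, ii, v]") is not None
number_hit = as_written.match(" iv") is not None

# The unescaped version accepts a stray brace, and only consumes
# one character of a roman numeral.
brace_hit = unescaped.match(" {") is not None
consumed = unescaped.match(" iv").group()
```

With these inputs literal_hit and brace_hit are True, number_hit is
False, and consumed is " i".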

Also, what the heck is this doing *inside* the for loop?
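
Compiling once, outside the loop, looks something like this.  The
pattern is purely my guess at what was intended (optional whitespace,
then digits or a short roman numeral), and SCENE_NUM and scene_numbers
are hypothetical names:

```python
import re

# Hypothetical guess at the intended pattern: optional whitespace, then
# either digits or a roman numeral built from i, v and x.
SCENE_NUM = re.compile(r"\s*([0-9]+|[ivx]+)\b", re.I)   # compiled once

def scene_numbers(blocks):
    """Return the leading scene number of each block, or None."""
    numbers = []
    for block in blocks:
        found = SCENE_NUM.match(block)   # reuses the compiled pattern
        numbers.append(found.group(1) if found else None)
    return numbers

result = scene_numbers([" IV. A heath.", " 2\nEnter", "ry is bleak"])
```

The last block, a leftover from "Scenery", has no number, so it
comes back as None.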

>     textNum = num.match(scene)

Since match() only ever tries to match at the very start of the
string, the "^" on the regular expression is redundant.
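
The difference only shows up with search().  A quick illustration,
with a throwaway digits-only pattern of my own:

```python
import re

pattern = re.compile(r"\s[0-9]+")
anchored = re.compile(r"^\s[0-9]+")

text = "Scene 12"
# match() only tries position 0, so both patterns behave alike.
match_results = (pattern.match(text), anchored.match(text))
# search() scans forward, so here the "^" makes a difference.
search_results = (pattern.search(text), anchored.search(text))
```

Both match() calls return None because the text starts with "S";
only the unanchored search() finds " 12".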

>     if textNum:
>         print textNum

textNum is the match object, so printing it won't tell you much.  In
particular, it isn't going to produce well-formed XML.
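
To get at the matched text you ask the match object for it.  A
minimal sketch, again with a made-up digits-only pattern and sample
line:

```python
import re

num = re.compile(r"\s*([0-9]+)")   # hypothetical pattern, digits only

text_num = num.match(" 3 A room in the castle.")
if text_num:
    scene_number = text_num.group(1)   # text captured by the parentheses
    whole_match = text_num.group()     # everything the pattern consumed
    where = text_num.span(1)           # (start, end) offsets of group 1
else:
    scene_number = None                # match() returns None on failure
```

Note that a failed match() gives you None rather than a match object,
which is why the test before calling group() matters.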

>     else:
>         print "No scene number"

Nor will this.

>     m = '<div type="scene>'

Missing close double quotes after 'scene'.
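
Building markup by hand invites this sort of slip.  A sketch that lets
the standard library write the quotes, tags and escaping instead; the
"div" and "type" names come from your code, while the "n" attribute
and the scene_element function are my own inventions:

```python
import xml.etree.ElementTree as ET

def scene_element(number, body):
    """Build a <div type="scene"> element holding the scene text."""
    # "n" is an assumed attribute name for the scene number.
    div = ET.Element("div", {"type": "scene", "n": number})
    div.text = body   # special characters are escaped on output
    return div

xml_text = ET.tostring(scene_element("I", "Thunder & lightning."),
                       encoding="unicode")
```

The ampersand in the body comes out as "&amp;", which string
concatenation would not have done for you.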

>     m += scene
>     m += '<\div>'
>     print m

That closing tag should be '</div>', not '<\div>'.  Beyond that, I'm
seeing nothing here that should produce an error message that has
anything to do with NoneType.  Any chance of (a) a more accurate
code sample, (b) the traceback, or (c) sample data?

-- 
Rhodri James *-* Wildebeeste Herder to the Masses