searching through a string and pulling characters

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Mon Aug 18 19:01:26 EDT 2008


On Mon, 18 Aug 2008 13:40:13 -0700, Alexnb wrote:

> Lets say I have a text file. The contents look like this, only there is
> A LOT of the same thing.
> 
> () A registry mark given by underwriters (as at Lloyd's) to ships in
> first-class condition. Inferior grades are indicated by A 2 and A 3. ()
> The first three letters of the alphabet, used for the whole alphabet. ()
> In church or chapel style; -- said of compositions sung in the old
> church style, without instrumental accompaniment; as, a mass a capella,
> i. e., a mass purely vocal.
> () Astride; with a part on each side; -- used specif. in designating the
> position of an army with the wings separated by some line of
> demarcation, as a river or road.
> 
> Now, I am talking 1000's of these. I need to do something like this. I
> will have a number, and what I want to do is go through this text file,
> just like the example. The trick is this, those "()'s" are what I need
> to match, so if the number is 245 I need to find the 245th () and then
> get the all the text from after it until the next (). If you have an
> idea about the best way to do this I would love your help. If you made
> it all the way through thanks! ;)


If I take your description of the problem literally, then the solution is:

text = "() A registry mark given ..."  # lots and lots of text
blocks = text.split("()")  # use a literal "()" as a delimiter
# blocks[0] is whatever comes before the first "()" (an empty string here),
# so the text following the n-th "()" is blocks[n], counting from 1
answer = blocks[n]
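
To make the indexing concrete, here is a minimal sketch of the literal approach wrapped in a function. The name nth_block, the marker parameter and the shortened sample text are only illustrative; they are not from the original question.

```python
def nth_block(text, n, marker="()"):
    """Return the text that follows the n-th marker, counting from 1."""
    blocks = text.split(marker)
    # blocks[0] is whatever precedes the first marker, so the text after
    # the n-th marker sits at index n.
    if not 1 <= n < len(blocks):
        raise IndexError("text has only %d markers" % (len(blocks) - 1))
    return blocks[n].strip()


# Shortened, made-up sample in the same shape as the quoted data.
sample = ("() A registry mark given by underwriters. "
          "() The first three letters of the alphabet. "
          "() In church or chapel style.")
print(nth_block(sample, 2))
```

The strip() call only trims the whitespace left around each block; drop it if the exact spacing matters.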


I suspect that the problem is more complicated than you are saying. I 
guess that in your actual data, the brackets () probably have something 
inside them. It looks like you are quoting definitions from a dictionary.

Alex, a word of advice for you: we really don't like playing guessing 
games. If you get a reputation for describing your problem inaccurately, 
incompletely or cryptically, you will find fewer and fewer people willing 
to answer your questions. I recommend that you spend a few minutes now 
reading this page and save yourself a lot of grief later:

http://www.catb.org/~esr/faqs/smart-questions.html

Now, back to your problem. If my guess is right, and the brackets 
actually have text inside them, then my simple solution above will not 
work. You will need a more complicated solution using a regular 
expression or a parser. That solution will depend on whether or not you 
can get nested brackets "(ab (123 (fee fi fum) 456) cd ef)" or arbitrary 
single brackets without the matching pair.
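
For the simplest of those cases, here is a rough sketch using a regular expression. It assumes the brackets may contain text but never nest and are always properly paired; the sample markers "(n.)" and "(adv.)" and the name nth_block_re are made up for illustration.

```python
import re

# Matches one bracket pair and whatever is inside it, as long as the
# contents contain no further brackets (so nesting is NOT handled).
BRACKETED = re.compile(r"\([^()]*\)")


def nth_block_re(text, n):
    """Return the text that follows the n-th bracketed group, from 1."""
    blocks = BRACKETED.split(text)
    # blocks[0] is the text before the first bracketed group.
    if not 1 <= n < len(blocks):
        raise IndexError("text has only %d groups" % (len(blocks) - 1))
    return blocks[n].strip()


# Hypothetical data: the real file may look quite different.
sample = "(n.) A registry mark. (adv.) Astride; with a part on each side. () Empty."
print(nth_block_re(sample, 2))
```

Because the pattern excludes brackets inside the group, nested input would be split in the wrong places; that case needs a counter or a parser instead.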

Your question also sounds suspiciously like homework. I don't do people's 
homework, but here's something to get you started. It's not a solution, 
but it can be used as the first step towards a solution.

text = "() A registry mark given ..."  # lots and lots of text
level = 0  # how many brackets deep we currently are
blocks = []  # characters found outside any brackets
for c in text:  # process text one character at a time
    if c == '(':
        print("Found an opening bracket")
        level += 1  # one deeper in brackets
    elif c == ')':
        level -= 1
        if level < 0:
            print("Found a close bracket without matching open bracket")
            level = 0  # ignore the stray bracket so later text is still kept
        else:
            print("Found a closing bracket")
    else:  # any other character
        # here's where you do the real work
        if level == 0:
            print("Not inside a bracket")
            blocks.append(c)
        else:
            print("Inside a bracket")
if level > 0:
    print("Missing close bracket")
text_minus_bracketed_words = ''.join(blocks)
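
One possible next step, sketched under the assumption that brackets may nest but must balance: use the same level counter, and start a new block each time a top-level bracket group closes. The function name split_on_brackets and the sample string are hypothetical, and raising ValueError on unbalanced input is just one choice among several.

```python
def split_on_brackets(text):
    """Split text at top-level bracket groups, dropping the groups."""
    blocks = [[]]  # blocks[0] collects the text before the first group
    level = 0
    for c in text:
        if c == '(':
            level += 1
        elif c == ')':
            if level == 0:
                raise ValueError("close bracket without matching open bracket")
            level -= 1
            if level == 0:
                blocks.append([])  # a top-level group just ended
        elif level == 0:
            blocks[-1].append(c)  # keep only text outside all brackets
    if level > 0:
        raise ValueError("missing close bracket")
    return [''.join(chars) for chars in blocks]


sample = "(ab (123 (fee fi fum) 456) cd ef) first entry () second entry"
print(split_on_brackets(sample))
```

As with the literal split, index 0 holds whatever precedes the first bracket group, so the text after the n-th group is at index n.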



-- 
Steven


