Re: a little parsing challenge ☺

Tue Jul 19 13:32:22 EDT 2011

On Jul 18, 2:59 pm, Thomas 'PointedEars' Lahn <PointedE... at web.de>
wrote:
> Ian Kelly wrote:
> > Billy Mays wrote:
> >> I gave it a shot.  It doesn't do any of the Unicode delims, because let's
> >> face it, Unicode is for goobers.
>
> > Uh, okay...
>
> > Your script also misses the requirement of outputting the index or row
> > and column of the first mismatched bracket.
>
> Thanks to Python's expressiveness, this can be easily remedied (see below).  
>
> I also do not follow Billy's comment about Unicode.  Unicode and the fact
> that Python supports it *natively* cannot be appreciated enough in a
> globalized world.
>
> However, I have learned a lot about being pythonic from his posting (take
> those generator expressions, for example!), and the idea of looking at the
> top of a stack for reference is a really good one.  Thank you, Billy!
>
> Here is my improvement of his code, which should fill the mentioned gaps.
> I have also reversed the order in the report line as I think it is more
> natural this way.  I have tested the code superficially with a directory
> containing a single text file.  Watch for word-wrap:
>
> # encoding: utf-8
> '''
> Created on 2011-07-18
>
> @author: Thomas 'PointedEars' Lahn <PointedE... at web.de>, based on an idea of
> Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9... at myhashismyemail.com>
> in <news:j01ph6$knt$1 at speranza.aioe.org>
> '''
> import sys, os
>
> pairs = {u'}': u'{', u')': u'(', u']': u'[',
>          u'”': u'“', u'›': u'‹', u'»': u'«',
>          u'】': u'【', u'〉': u'〈', u'》': u'《',
>          u'」': u'「', u'』': u'『'}
> valid = set(v for pair in pairs.items() for v in pair)
>
> if __name__ == '__main__':
>     for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
>         for name in filenames:
>             stack = [' ']
>
>             # you can use chardet etc. instead
>             encoding = 'utf-8'
>
>             with open(os.path.join(dirpath, name), 'r') as f:
>                 reported = False
>                 chars = ((c, line_no, col)
>                          for line_no, line in enumerate(f)
>                          for col, c in enumerate(line.decode(encoding))
>                          if c in valid)
>                 for c, line_no, col in chars:
>                     if c in pairs:
>                         if stack[-1] == pairs[c]:
>                             stack.pop()
>                         else:
>                             if not reported:
>                                 first_bad = (c, line_no + 1, col + 1)
>                                 reported = True
>                     else:
>                         stack.append(c)
>
>             print '%s: %s' % (name, ("good" if len(stack) == 1
>                                      else "bad '%s' at %s:%s" % first_bad))

Thanks for the fix.
It still seems wrong, though.

On the file http://xahlee.org/p/time_machine/tm-ch04.html

there is a mismatched curly double quote at 28319.

The script reports:
tm-ch04.html: bad ')' at 68:2

That doesn't seem right. Line 68 is empty. There's no opening or
closing round bracket anywhere close. The nearest are on lines 11 and 127.

Maybe Billy Mays's algorithm is wrong.
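
For comparison, here is a minimal sketch of the same stack idea that
also remembers where each opener was seen, so a delimiter left open at
end of file gets a position too (the quoted script only sets first_bad
on a closing delimiter). Assumptions: this is Python 3 rather than the
Python 2 of the script above, the names PAIRS, OPENERS and
first_mismatch are made up for illustration, and it has not been run
against tm-ch04.html.

```python
# Closing delimiter -> matching opening delimiter
# (same table as the quoted script).
PAIRS = {')': '(', ']': '[', '}': '{',
         '”': '“', '›': '‹', '»': '«',
         '】': '【', '〉': '〈', '》': '《',
         '」': '「', '』': '『'}
OPENERS = set(PAIRS.values())


def first_mismatch(text):
    """Return (char, offset, line, column) for the first mismatched
    delimiter in text, or None if everything balances.

    offset is a 0-based character index; line and column are 1-based.
    """
    stack = []            # (char, offset, line, col) per unclosed opener
    line, col = 1, 0
    for offset, c in enumerate(text):
        if c == '\n':
            line, col = line + 1, 0
            continue
        col += 1
        if c in OPENERS:
            stack.append((c, offset, line, col))
        elif c in PAIRS:
            if stack and stack[-1][0] == PAIRS[c]:
                stack.pop()
            else:
                # Closer with no opener, or with the wrong opener on top.
                return (c, offset, line, col)
    # Assumption: when openers are left over, report the outermost one.
    return stack[0] if stack else None
```

For example, first_mismatch("(]") returns (']', 1, 1, 2). To check a
file, read it as text first, e.g.
first_mismatch(open(path, encoding='utf-8').read()), where path is
whatever file you want to test.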

 Xah (fairly discouraged now, after running 3 Python scripts that all
failed)