Re: a little parsing challenge ☺

Xah Lee xahlee at gmail.com
Wed Jul 20 06:31:24 EDT 2011


i've just cleaned up my elisp code and wrote a short elisp tutorial.

Here:

〈Emacs Lisp: Batch Script to Validate Matching Brackets〉
http://xahlee.org/emacs/elisp_validate_matching_brackets.html

plain text version follows. Please let me know what you think.

am still working on going thru all code in other langs. Will get to
the ruby one, and that perl regex, and the other fixed python ones.
(possibly also the 2 common lisp codes but am not sure they are
runnable as is or just some non-working showoff. lol)

===============================================
Emacs Lisp: Batch Script to Validate Matching Brackets

Xah Lee, 2011-07-19

This page shows you how to write a elisp script that checks thousands
of files for mismatched brackets.

----------------------------------------------------------------
The Problem

------------------------------------------------
Summary

I have 5 thousands files containing many matching pairs. I want to to
know if any of them contains mismatched brackets.

------------------------------------------------
Detail

The matching pairs includes these: () {} [] “” ‹› «» 〈〉 《》 【】 〖〗 「」
『』.

The program should be able to check all files in a dir, and report any
file that has mismatched bracket, and also indicate the line number or
positon where a mismatch occurs.

For those curious, if you want to know what these brackets are, see:

    • Syntax Design: Use of Unicode Matching Brackets as Specialized
Delimiters
    • Intro to Chinese Punctuation with Computer Language Syntax
Perspectives

For other notes and conveniences about dealing with brackets in emacs,
see:

    • Emacs: Defining Keys to Navigate Brackets
    • “extend-selection” at A Text Editor Feature: Extend Selection by
Semantic Unit
    • “select-text-in-quote” at Suggestions on Emacs's mark-word
Command

----------------------------------------------------------------
Solution

Here's outline of steps.

    • Go thru the file char by char, find a bracket char.
    • Check if the one on stack is a matching opening char. If so
remove it. Else, push the current onto the stack.
    • Repeat the above till no more bracket char in the file.
    • If the stack is not empty, then the file got mismatched
brackets. Report it.
    • Do the above on all files.

Here's some interesting use of lisp features to implement the above.

------------------------------------------------
Define Matching Pair Chars as “alist”

We begin by defining the chars we want to check, as a “association
list” (aka “alist”). Like this:

(setq matchPairs '(
                   ("(" . ")")
                   ("{" . "}")
                   ("[" . "]")
                   ("“" . "”")
                   ("‹" . "›")
                   ("«" . "»")
                   ("【" . "】")
                   ("〖" . "〗")
                   ("〈" . "〉")
                   ("《" . "》")
                   ("「" . "」")
                   ("『" . "』")
                   )
      )

If you care only to check for curly quotes, you can remove elements
above. This is convenient because some files necessarily have
mismatched pairs such as the parenthesis, because that char is used
for many non-bracketing purposes (e.g. ASCII smiley).

A “alist” in lisp is basically a list of pairs (called key and value),
with the ability to search for a key or a value. The first element of
a pair is called its key, the second element is its value. Each pair
is a “cons”, like this: (cons mykey myvalue), which can also be
written using this syntax: (mykey . myvalue) for more easy reading.

The purpose of lisp's “alist” is similar to Python's dictionary or
Pretty Home Page's array. It is also similar to hashmap, except that
alist can have duplicate keys, can search by values, maintains order,
and alist is not intended for massive number of elements. Elisp has a
hashmap datatype if you need that. (See: Emacs Lisp Tutorial: Hash
Table.)

(info "(elisp) Association Lists")

------------------------------------------------
Generate Regex String from alist

To search for a set of chars in emacs, we can read the buffer char-by-
char, or, we can simply use “search-forward-regexp”. To use that,
first we need to generate a regex string from our matchPairs alist.

First, we defines/declare the string. Not a necessary step, but we do
it for clarity.

(setq searchRegex "")

Then we go thru the matchPairs alist. For each pair, we use “car” and
“cdr” to get the chars and “concat” it to the string. Like this:

(mapc
 (lambda (mypair) ""
   (setq searchRegex (concat searchRegex (regexp-quote (car mypair))
"|" (regexp-quote (cdr mypair)) "|") )
   )
 matchPairs)

Then we remove the ending “|”.

(setq searchRegex (substring searchRegex 0 -1)) ; remove the ending
“|”

Then, change | it to \\|. In elisp regex, the | is literal. The “regex
or” is \|. And if you are using regex in elisp, elisp does not have a
special regex string syntax, it only understands normal strings. So,
to feed to regex \|, you need to espace the first backslash. So, your
regex needs to have \\|. Here's how we do it:

(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t
t)) ; change | to \\| for regex “or” operation

You could shorten the above into just 2 lines by using \\| in the
“mapc” step and not as a extra step of replacing | by \\|.

See also: emacs regex tutorial.

------------------------------------------------
Implement Stack Using Lisp List

Stack is done using lisp's list. e.g. '(1 2 3). The bottom of stack is
the first element. To add to the stack, do it like this: (setq mystack
(cons newitem mystack)). To remove a item from stack is this: (setq
mystack (cdr mystack)). The stack begin as a empty list: '().

For each element in the stack, we need the char and also its position,
so that we can report the position if the file does have mismatched
pairs.

We use a vector as entries for the stack. Each entry is like this:
(vector char pos). (See: Emacs Lisp Tutorial: List & Vector.)

Here's how to fetch a char from stack bottom, check if current char
matches, push to stack, pop from stack.

; check if current char is a closing char and is in our match pairs
alist.
; use “rassoc” to check alist's set of “values”.
; It returns the first key/value pair found, or nil
(rassoc char matchPairs)

; add to stack
(setq myStack (cons (vector char pos) myStack) )

; pop stack
(setq myStack (cdr myStack) )

------------------------------------------------
Complete Code

Here's the complete code.

;; -*- coding: utf-8 -*-
;; 2011-07-15
;; go thru a file, check if all brackets are properly matched.
;; e.g. good: (…{…}… “…”…)
;; bad: ( [)]
;; bad: ( ( )

(setq inputFile "xx_test_file.txt" ) ; a test file.
(setq inputDir "~/web/xahlee_org/") ; must end in slash

(defvar matchPairs '() "a alist. For each pair, the car is opening
char, cdr is closing char.")
(setq matchPairs '(
                   ("(" . ")")
                   ("{" . "}")
                   ("[" . "]")
                   ("“" . "”")
                   ("‹" . "›")
                   ("«" . "»")
                   ("【" . "】")
                   ("〖" . "〗")
                   ("" . "")
                   ("" . "")
                   ("「" . "」")
                   ("『" . "』")
                   )
      )

(defvar searchRegex "" "regex string of all pairs to search.")
(setq searchRegex "")
(mapc
 (lambda (mypair) ""
   (setq searchRegex (concat searchRegex (regexp-quote (car mypair))
"|" (regexp-quote (cdr mypair)) "|") )
   )
 matchPairs)

(setq searchRegex (replace-regexp-in-string "|$" "" searchRegex t
t)) ; remove the ending “|”

(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t
t)) ; change | to \\| for regex “or” operation

(defun my-process-file (fpath)
  "process the file at fullpath fpath ..."
  (let (myBuffer myStack ξchar ξpos)

    (setq myStack '() ) ; each element is a vector [char position]
    (setq ξchar "") ; the current char found

    (when t
      ;; (not (string-match "/xx" fpath)) ; in case you want to skip
certain files

      (setq myBuffer (get-buffer-create " myTemp"))
      (set-buffer myBuffer)
      (insert-file-contents fpath nil nil nil t)

      (goto-char 1)
      (setq case-fold-search t)
      (while (search-forward-regexp searchRegex nil t)
        (setq ξpos (point)  )
        (setq ξchar (buffer-substring-no-properties ξpos (- ξpos
1))  )

        ;; (princ (format "-----------------------------\nfound char:
%s\n" ξchar) )

        (let ((isClosingCharQ nil) (matchedOpeningChar nil) )
          (setq isClosingCharQ (rassoc ξchar matchPairs))
          (when isClosingCharQ (setq matchedOpeningChar (car
isClosingCharQ) ) )

          ;; (princ (format "isClosingCharQ is: %s\n"
isClosingCharQ) )
          ;; (princ (format "matchedOpeningChar is: %s\n"
matchedOpeningChar) )

          (if
              (and
               (car myStack) ; not empty
               (equal (elt (car myStack) 0) matchedOpeningChar )
               )
              (progn
                ;; (princ (format "matched this bottom item on stack:
%s\n" (car myStack)) )
                (setq myStack (cdr myStack) )
                )
            (progn
              ;; (princ (format "did not match this bottom item on
stack: %s\n" (car myStack)) )
              (setq myStack (cons (vector ξchar ξpos) myStack) ) )
            )
          )
        ;; (princ "current stack: " )
        ;; (princ myStack )
        ;; (terpri )
        )

      (when (not (equal myStack nil))
        (princ "Error file: ")
        (princ fpath)
        (print (car myStack) )
        )
      (kill-buffer myBuffer)
      )
    ))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah match pair output*" )
  (with-output-to-temp-buffer outputBuffer
    ;; (my-process-file inputFile)
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.txt$"))
    (princ "Done deal!")
    )
  )

I added many comments and debug code for easy understanding. If you
are not familiar with the many elisp idioms such as opening file,
buffers, printing to output, see: Emacs Lisp Idioms (for writing
interactive commands) ◇ Text Processing with Emacs Lisp Batch Style.

To run the code, simply open it in emacs. Edit the line at the top for
“inputDir”. Then call “eval-buffer”.

Here's a sample output:

Error file: c:/Users/h3/web/xahlee_org/p/time_machine/
Hettie_Potter_orig.txt
[")" 3625]
Error file: c:/Users/h3/web/xahlee_org/p/time_machine/
Hettie_Potter.txt
[")" 2338]
Error file: c:/Users/h3/web/xahlee_org/p/arabian_nights/xx/v1fn.txt
["”" 185795]
Done deal!

The weird ξ you see in my code is greek x. I use unicode char in
variable name for experimental purposes. You can just ignore it. (See:
Programing Style: Variable Naming: English Words Considered Harmful.)

------------------------------------------------
Advantages of Emacs Lisp

Note that the great advantage of using elisp for text processing,
instead of {perl, python, ruby, …} is that many things are taken care
by the emacs environment.

I don't need to write code to declare file's encoding (emacs
automatically detects). No reading file is involved. Just open, save,
or move thru characters. No code needed for doing safety backup. (the
var “make-backup-files” controls that). You can easily open the files
by its path with a click or key press. I can add just 2 lines so that
clicking on the error char in the output jumps to the location in the
file.

Any elisp script you write inside emacs automatically become extension
of emacs and can be used in a interactive way.

This problem is posted to a few comp.lang newsgroups as a fun
challenge. You can see several solutions in python, ruby, perl, common
lisp, at: a little parsing challenge ☺ (2011-07-17) @ Source
groups.google.com.

 Xah



More information about the Python-list mailing list