Elisp Tutorial: HTML Syntax Coloring Code Block

Fri Oct 19 03:46:13 EDT 2007

On Wed, 2007-10-17 at 21:15 -0700, Xah Lee wrote:
> Elisp Tutorial: HTML Syntax Coloring Code Block
> 
> Xah Lee, 2007-10
> 
> This page shows an example of writing an emacs lisp function that
> processes a block of text to syntax color it with HTML tags. If you don't
> know elisp, first take a gander at Emacs Lisp Basics.
> 
> HTML version with color and links is at:
> http://xahlee.org/emacs/elisp_htmlize.html
> 
> ---------------------------------------
> THE PROBLEM
> 
> SUMMARY
> 
> I want to write an elisp function such that, when invoked, the block of
> text the cursor is on will have various HTML style tags wrapped
> around it. This is for the purpose of publishing programming language
> code in HTML on the web.
> 
> DETAIL
> 
> I write a lot of computer programming tutorials for several computer
> languages. For example: Perl and Python tutorial, Java tutorial, Emacs
> Lisp tutorial, Javascript tutorial. In these tutorials, there are often
> code snippets. This code needs to be syntax colored in HTML.
> 
> For example, here's an elisp code snippet:
> 
> (if (< 3 2)  (message "yes") )
> 
> Here's what i actually want as raw HTML:
> 
> (<span class="keyword">if</span> (&lt; 3 2)  (message <span class="string">"yes"</span>) )
> 
> Which should look like this in a web browser:
> 
> (if (< 3 2)  (message "yes") )
> 
> There is an emacs package that turns syntax-colored text in emacs into
> HTML form. This is extremely nice. The package is called htmlize.el
> and is written (1997,...,2006) by Hrvoje Niksic, available at
> http://fly.srk.fer.hr/~hniksic/emacs/htmlize.el.
> 
> This program provides you with a few new emacs commands. Primarily, it
> has htmlize-region, htmlize-buffer, htmlize-file. The region and
> buffer commands will output HTML code in a new buffer, and the
> htmlize-file version will take an input file name and output into a file.
> 
> When i need to include a code snippet in my tutorial, typically, i
> write the code in a separate file (e.g. “temp.java”, “temp.py”), run
> it to make sure the code is correct (compile, if necessary), then,
> copy the file into the HTML tutorial page, inside a «pre» block. In
> this scheme, the best way for me to utilize the htmlize.el program is to
> use the “htmlize-buffer” command on my temp.java, then copy the htmlized
> output and paste that into my HTML tutorial file inside a «pre» block.
> Since many of my tutorials are written haphazardly over the years
> before seeing the need for syntax coloration, most exist inside «pre»
> tags already without a temp code file. So, in most cases, what i do is
> to select the text inside the «pre» tag, paste into a temp buffer and
> invoke the right mode for the language (so the text will be fontified
> correctly), then do htmlize-buffer, then copy the html output, then
> paste back to replace the selected text.
> 
> This process is tedious. A tutorial page will have several code
> blocks. For each, i will need to select text, create a buffer, switch
> mode, do htmlize, select again, switch buffer, then paste. Many of the
> steps are not pure push-button operations but involve eye-balling.
> There are a few hundred such pages.
> 
> It would be better if i could place the cursor on a code block in an
> existing HTML page, then press a button, and have emacs magically
> replace the code block with a htmlized version colorized for the code
> block's language. We proceed to write this function.
> 
> ---------------------------------------
> SOLUTION
> 
> For an elisp expert who knows how fontification works in emacs, the
> solution would be writing elisp code that maps the fontification info
> of emacs's strings into html tags. This is exactly what htmlize.el does.
> Since it is already written, an elisp expert might find the essential
> code in htmlize.el. (The code is licensed under GPL.)
> 
> Unfortunately, my lisp experience isn't so great. I spent maybe 30
> minutes trying to look in htmlize.el in hope of finding a function,
> something like htmlize-str, that is the essence, but wasn't successful.
> I figured it would actually be faster if i took the dumb and inefficient
> approach of writing elisp code that extracts the output from
> htmlize-buffer. Here's the outline of the plan of my function:
> 
>     * 1. Grab the text inside a <pre class="«lang»">...</pre> tag.
>     * 2. Create a new buffer. Paste the code in.
>     * 3. Make the new buffer «lang» mode (and fontify it)
>     * 4. Call htmlize-buffer
>     * 5. Grab the (htmlized) text inside «pre» tag in the htmlize
> created output buffer.
>     * 6. Kill the htmlize buffer and my temp buffer.
>     * 7. Delete the original text, paste in the new text.
> 
> To achieve the above, i decided on 2 steps. A: Write a function
> “htmlize-string” that takes a string and mode name, and returns the
> htmlized string. B: Write a function “htmlize-block” that does the
> steps of grabbing text and pasting, and calls “htmlize-string” for the
> actual htmlization.
> 
> Here's the code of my htmlize-string function:
> 
> (defun htmlize-string (ccode mn)
> "Take string CCODE and return htmlized code, using mode MN.
> MN is the name of a major mode command, given as a string.
> This function requires htmlize.el by Hrvoje Niksic, 2006."
> (let (cur-buf temp-buf temp-buf2 x1 x2 resultS)
>     (setq cur-buf (buffer-name))
>     (setq temp-buf "xout-weewee")
>     ;; the buffer that htmlize-buffer creates
>     (setq temp-buf2 "*html*")
> 
>     ; put the code in a new buffer, set the mode
>     (switch-to-buffer temp-buf)
>     (insert ccode)
>     (funcall (intern mn))
> 
>     (htmlize-buffer temp-buf)
>     (kill-buffer temp-buf)
>     (switch-to-buffer temp-buf2)
> 
>     ; extract the core code
>     (setq x1 (re-search-forward "<pre>"))
>     (setq x1 (+ x1 1))
>     (re-search-forward "</pre>")
>     (setq x2 (re-search-backward "</pre>"))
>     (setq resultS (buffer-substring-no-properties x1 x2))
>     (kill-buffer temp-buf2)
> 
>     (switch-to-buffer cur-buf)
>     resultS
> )
> )
> 
> The major part in this code is knowing how to create, switch, kill
> buffers. Then, how to set a mode. Lastly, how to grab text in a
> buffer.
> 
> The current buffer's name is given by “buffer-name”. Creating or
> switching to a buffer is done by “switch-to-buffer”. Killing a buffer
> is done by “kill-buffer”. To activate a mode, the code is
> “(funcall (intern my-mode-name))”. I don't know why this is so in
> detail, but it is interesting to know.
> 
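A minimal sketch of the “(funcall (intern my-mode-name))” step, using only built-in functions. “intern” turns the mode name string into a symbol, and “funcall” calls the function that symbol names, which for a major mode is the command that turns the mode on. The helper name my-call-mode-by-name is hypothetical and not part of the tutorial's code.

```elisp
;; Sketch: activate a major mode given its name as a string.
;; `my-call-mode-by-name' is a hypothetical helper, not from the tutorial.
(defun my-call-mode-by-name (mode-name)
  "Activate the major mode named by the string MODE-NAME in the current buffer."
  (let ((mode-fn (intern mode-name)))   ; string -> symbol
    (unless (fboundp mode-fn)           ; fail early if no such command exists
      (error "No such mode command: %s" mode-name))
    (funcall mode-fn)))                 ; call the function the symbol names

;; Usage: the temporary buffer ends up in emacs-lisp-mode.
(with-temp-buffer
  (my-call-mode-by-name "emacs-lisp-mode")
  major-mode)   ; evaluates to the symbol emacs-lisp-mode
```

The fboundp check is an addition for safety. Without it, a misspelled mode name only fails inside funcall with a less readable error.
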
> Grabbing the text is done by locating the desired beginning and ending
> locations using the re-search functions, and using
> buffer-substring-no-properties for actually extracting the string.
> 
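The search-and-extract step can be tried in isolation, without htmlize. This is a sketch under the assumption that the buffer holds exactly one “<pre>...</pre>” pair. The function name my-text-between-pre is hypothetical. It uses match-beginning instead of a second backward search to find where “</pre>” starts.

```elisp
;; Sketch: return the text between <pre> and </pre> in the current buffer.
;; `my-text-between-pre' is a hypothetical name, not from the tutorial.
(defun my-text-between-pre ()
  "Return the text between the first <pre> and the following </pre>."
  (save-excursion
    (goto-char (point-min))
    (let* ((start (re-search-forward "<pre>"))    ; position just after <pre>
           (end (progn (re-search-forward "</pre>")
                       (match-beginning 0))))     ; position where </pre> starts
      (buffer-substring-no-properties start end))))

;; Usage: the result keeps the newline that follows <pre>.
(with-temp-buffer
  (insert "<html><pre>\n(message \"hi\")\n</pre></html>")
  (my-text-between-pre))
```

re-search-forward returns the position at the end of the match and signals an error when nothing matches, so a buffer without a «pre» block stops the function instead of returning garbage.
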
> Here, note the “no-properties” in “buffer-substring-no-properties”.
> Emacs's strings can carry information called text properties, which
> include the fontification information.
> 
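To see the difference the “no-properties” variant makes, here is a small sketch that fakes fontified text with “propertize” and then copies it out both ways. The function name my-properties-demo is hypothetical, and the face is attached by hand here rather than by a real major mode.

```elisp
;; Sketch: compare buffer-substring with buffer-substring-no-properties.
;; `my-properties-demo' is a hypothetical name, not from the tutorial.
(defun my-properties-demo ()
  "Return the text properties of a copied string, with and without properties."
  (with-temp-buffer
    ;; Simulate fontified text: the word "if" carrying a keyword face.
    (insert (propertize "if" 'face 'font-lock-keyword-face))
    (list
     ;; This copy keeps the face property.
     (text-properties-at 0 (buffer-substring (point-min) (point-max)))
     ;; This copy is a plain string.
     (text-properties-at 0 (buffer-substring-no-properties
                            (point-min) (point-max))))))

(my-properties-demo)
;; => ((face font-lock-keyword-face) nil)
```
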
> Reference: Elisp Manual: Buffers.
> 
> Reference: Elisp Manual: Text-Properties.
> 
> Here's the code of my htmlize-block function:
> 
> (defun htmlize-block ()
>   "Replace the region enclosed by <pre> tag with htmlized code.
> For example, if the cursor is somewhere inside the tag:
> 
> <pre class=\"code\">
> codeXYZ...
> </pre>
> 
> after calling, the “codeXYZ...” block of text will be htmlized.
> That is, wrapped with many <span> tags.
> 
> The opening tag must be of the form <pre class=\"lang-str\">.
> The “lang-str” determines what emacs mode is used to colorize
> the code.
> This function requires htmlize.el by Hrvoje Niksic."
> 
> (interactive)
> (let (mycode tag-begin styclass code-begin code-end tag-end mymode)
>   (progn
>     (setq tag-begin (re-search-backward
>                      "<pre class=\"\\([A-Za-z-]+\\)\""))
>     (setq styclass (match-string 1))
>     (setq code-begin (re-search-forward ">"))
>     (re-search-forward "</pre>")
>     (setq code-end (re-search-backward "<"))
>     (setq tag-end (re-search-forward "</pre>"))
>     (setq mycode (buffer-substring-no-properties code-begin code-end))
>     )
>   (cond
>    ((equal styclass "elisp") (setq mymode "emacs-lisp-mode"))
>    ((equal styclass "perl") (setq mymode "cperl-mode"))
>    ((equal styclass "python") (setq mymode "python-mode"))
>    ((equal styclass "java") (setq mymode "java-mode"))
>    ((equal styclass "html") (setq mymode "html-mode"))
>    ((equal styclass "haskell") (setq mymode "haskell-mode"))
>    )
>   (save-excursion
>     (delete-region code-begin code-end)
>     (goto-char code-begin)
>     (insert (htmlize-string mycode mymode))
>     )
>   )
> )
> 
> The steps of this function are to grab the text inside a «pre» block,
> call htmlize-string, then insert the result, replacing the text.
> 
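The first of those steps, finding the enclosing tag and reading its class, can be sketched on its own with built-in functions. The name my-pre-class-at-point is hypothetical. Unlike the code above, this sketch passes the optional NOERROR argument so that it returns nil when no tag is found, and it restores point with save-excursion.

```elisp
;; Sketch: read the class of the nearest <pre class="..."> tag before point.
;; `my-pre-class-at-point' is a hypothetical name, not from the tutorial.
(defun my-pre-class-at-point ()
  "Return the class of the nearest <pre class=\"...\"> tag before point, or nil."
  (save-excursion
    ;; The third argument t makes the search return nil instead of erroring.
    (when (re-search-backward "<pre class=\"\\([A-Za-z-]+\\)\"" nil t)
      (match-string-no-properties 1))))   ; text matched by the \\(...\\) group

;; Usage: with point inside the code, the class of the enclosing tag is found.
(with-temp-buffer
  (insert "<pre class=\"python\">\nprint(1)\n</pre>")
  (goto-char (point-min))
  (search-forward "print")
  (my-pre-class-at-point))   ; => "python"
```
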
> Originally, i wrote the code to grab the text inside plain
> “<pre>...</pre>” tags, then use some heuristics to determine what
> language it is, then call htmlize-string with the mode-name passed to
> it. However, since my html pages already have the language information
> in the form of “<pre class="«lang»">...</pre>” (for CSS reasons), i now
> search text by that form, and use the “lang” part to determine a mode.
> 
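One possible variation on the “cond” that maps the “lang” part to a mode is a lookup table. This is only a sketch: the names my-class-to-mode-alist and my-mode-for-class are hypothetical, the table just repeats the pairs from htmlize-block, and it stores mode symbols rather than strings.

```elisp
;; Sketch: look up the major mode for a <pre> class in an alist.
;; Both names below are hypothetical, not from the tutorial.
(defvar my-class-to-mode-alist
  '(("elisp"   . emacs-lisp-mode)
    ("perl"    . cperl-mode)
    ("python"  . python-mode)
    ("java"    . java-mode)
    ("html"    . html-mode)
    ("haskell" . haskell-mode))
  "Alist mapping a <pre> class string to a major mode symbol.")

(defun my-mode-for-class (class)
  "Return the mode symbol for the string CLASS, or signal an error."
  ;; `assoc' compares keys with `equal', so string keys work.
  (or (cdr (assoc class my-class-to-mode-alist))
      (error "No mode known for class: %s" class)))

(my-mode-for-class "perl")   ; => cperl-mode
```

Since htmlize-string above expects the mode name as a string, the result would need “symbol-name” before being passed to it. An unknown class raises an error here, where the “cond” version leaves mymode as nil.
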
> Emacs is beautiful.
> 
> Postscript:
> 
> The story given above is slightly simplified. For example, when i
> began my language notes and commentaries, they were not planned to be
> some systematic or sizable tutorial. As the pages grew, more quality
> was added in the editorial process. So, plain un-colored code inside
> «pre» started to have “language comment” strings colorized (e.g.
> “<span class="cmt">#...</span>”), by using simple elisp code that
> wraps a tag on them, and this function is mapped to a shortcut key for
> easy execution. As pages and languages grew, i found that colorizing
> comments wasn't enough, and i started to look for a syntax-coloring html
> solution. There are solutions in Perl, Python, PHP, but i find the emacs
> solution best suits my needs, in particular because it integrates with
> emacs's interactive nature, and my writing work is done in an
> accumulative, editorial process.
> 
> In the beginning i used htmlize-region and htmlize-buffer as they are
> for new code. Note that this is still a laborious process. Gradually i
> needed to colorize my old code. The problem is that many already
> contain my own «span class="cmt"» tags, and strings common in computer
> languages such as “<=” have already been transformed into the required
> html encoding “&lt;=”. So, the elisp code would first “un-htmlize”
> these in my htmlize-block code. But once all my existing code had been
> so newly colorized, the part of the code that transforms strings for
> un-htmlize was no longer necessary, so it was taken out of htmlize-block,
> which resumed a cleaner state. Also, htmlize-block went thru many
> revisions over the year. Sometime in the recent past, i had one code
> wrapper for each language. For example, i had htmlize-me-perl,
> htmlize-me-python, htmlize-me-java, etc. The need for unification into a
> single coherent wrapper code didn't materialize. In general, it is my
> experience, in particular in writing elisp customization for emacs,
> that tweaking code periodically thru the year is practical, because it
> adapts to the constant changes of requirements, environment, work
> process. For example, eventually i might write my own htmlize.el, if i
> happen to need more flexibility, or if my elisp experience
> sufficiently makes the job relatively easy.
> 
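The “un-htmlize” code mentioned above is not shown in the post. A rough sketch of what such a step could look like follows, assuming it only has to strip «span» tags and decode the three entities for “<”, “>” and “&”. The function name my-unhtmlize-string is hypothetical.

```elisp
;; Sketch: undo simple htmlization of a code string.
;; `my-unhtmlize-string' is a hypothetical name, not the author's code.
(defun my-unhtmlize-string (s)
  "Strip <span ...> and </span> tags from S and decode &lt;, &gt; and &amp;."
  (let ((out (replace-regexp-in-string "</?span[^>]*>" "" s)))
    ;; The trailing t t mean: keep case as is, treat the replacement literally.
    (setq out (replace-regexp-in-string "&lt;" "<" out t t))
    (setq out (replace-regexp-in-string "&gt;" ">" out t t))
    ;; Decode &amp; last, so that "&amp;lt;" becomes "&lt;" and not "<".
    (replace-regexp-in-string "&amp;" "&" out t t)))

(my-unhtmlize-string "<span class=\"cmt\"># a &lt;= b &amp;&amp; c</span>")
;; => "# a <= b && c"
```
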
> Also note: a whole-sale solution is to write a program, in say,
> Python, that processes html files and replaces the proper sections with
> htmlized strings. This is perhaps more efficient if all the existing
> html files are in some uniform format. However, i need to work on my
> tutorials on a case-by-case basis. In part, because some pages
> contain multiple languages or contain pseudo-code that i do not wish
> colorized. (For example, some pages contain code in the Mathematica
> language. Mathematica code is normally done in Mathematica's
> mathematical typesetting capable “front-end” IDE called “Notebook” and
> is not “syntax-colored” as such.)

+1 ;;
BTW, what is G2/1.0? Is that an Emacs-like editor?

-- 
Byung-Hee HWANG <bh at izb.knu.ac.kr>
InZealBomb, Kyungpook National University, KOREA

"Godfather, Godfather, save me from death, I beg of you."
		-- Genco Abbandando, "Chapter 1", page 46