HTML6 proposal (Re: sexp xml syntax transformation)

Wed Sep 22 12:42:50 EDT 2010

cleaned up the previous post.

• 〈HTML6, Your HTML/XML Simplified〉
http://xahlee.org/comp/html6.html

plain text version follows
--------------------------------------------------
HTML6, Your HTML/XML Simplified

Xah Lee, 2010-09-21

Tired of the standards bodies telling us what to do and then changing
their attitude? Tired of the SGML/HTML/XML/XHTML/HTML5 changes? Tire no
more: here's a new proposal that will make life easier.

Introducing HTML6

HTML6 is based on HTML5, XML, and a rectified LISP syntax. More
specifically, it is derived from existing work on this, SXML
(http://okmij.org/ftp/Scheme/SXML.html), except that it is completely
regular at the syntax level, and it is not meant to be compatible with
lisp readers. The syntax can be specified by 3 short lines of parsing
expression grammar.

The aim is a far simpler syntax, 100% regular and leaner, in a far
stricter format.

First of all, no error is accepted, ever. If the source code has
incorrect syntax, the page is not displayed.

Example

Here's a standard Atom webfeed XML file.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://xahlee.org/emacs/">

 <title>Xah's Emacs Blog</title>
 <subtitle>Emacs, Emacs, Emacs</subtitle>
 <link rel="self" href="http://xahlee.org/emacs/blog.xml"/>
 <link rel="alternate" href="http://xahlee.org/emacs/blog.html"/>
 <updated>2010-09-19T14:53:08-07:00</updated>
 <author>
   <name>Xah Lee</name>
   <uri>http://xahlee.org/</uri>
 </author>
 <id>http://xahlee.org/emacs/blog.html</id>
 <icon>http://xahlee.org/ics/sum.png</icon>
 <rights>© 2009, 2010 Xah Lee</rights>

 <entry>
   <title>Using Emacs's Abbrev Mode for Abbreviation</title>
   <id>tag:xahlee.org,2010-09-19:215308</id>
   <updated>2010-09-19T14:53:08-07:00</updated>
   <summary>tutorial</summary>
   <link rel="alternate" href="http://xahlee.org/emacs/emacs_abbrev_mode.html"/>
 </entry>

</feed>

Here's what it looks like in html6:

〔?xml 「version “1.0” encoding “utf-8”」〕
〔feed 「xmlns “http://www.w3.org/2005/Atom” xml:base “http://xahlee.org/emacs/”」

  〔title Xah's Emacs Blog〕
  〔subtitle Emacs, Emacs, Emacs〕
  〔link 「rel “self” href “http://xahlee.org/emacs/blog.xml”」〕
  〔link 「rel “alternate” href “http://xahlee.org/emacs/blog.html”」〕
  〔updated 2010-09-19T14:53:08-07:00〕
  〔author
   〔name Xah Lee〕
   〔uri http://xahlee.org/〕
  〕

  〔id http://xahlee.org/emacs/blog.html〕
  〔icon http://xahlee.org/ics/sum.png〕
  〔rights © 2009, 2010 Xah Lee〕

  〔entry
   〔title Using Emacs's Abbrev Mode for Abbreviation〕
   〔id tag:xahlee.org,2010-09-19:215308〕
   〔updated 2010-09-19T14:53:08-07:00〕
   〔summary tutorial〕
   〔link 「rel “alternate” href “http://xahlee.org/emacs/emacs_abbrev_mode.html”」〕
  〕
〕

Simple Matching Pairs For Tag Delimiters

The standard xml angle-bracket tags are replaced by a simple lisp
style matching pair. For example, this code:

<h1>HTML6</h1>

is written as:

〔h1 HTML6〕

The delimiters used are:

Character  Unicode Code Point  Unicode Name
〔         U+3014              LEFT TORTOISE SHELL BRACKET
〕         U+3015              RIGHT TORTOISE SHELL BRACKET

XML Properties and Attributes Syntax

In xml:

<h1 id="xyz" class="abc">HTML6</h1>

In html6:

〔h1「id “xyz” class “abc”」HTML6〕

The attributes are enclosed in matching corner brackets 「」 (U+300C,
U+300D). Inside is a sequence of name and value pairs. Each value must
be quoted with curly double quotes “” (U+201C, U+201D).
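
To make the mapping concrete, here is a minimal Python sketch that
converts an XML element into the html6 notation. The names to_html6 and
escape are made up for illustration. The spacing rule (a space after
the tag name only when there are no attributes) is inferred from the
examples, and namespaces, comments, and the ?xml declaration are
ignored.

```python
import xml.etree.ElementTree as ET

# Assumption: only the four bracket characters need escaping in data.
ESCAPES = str.maketrans({c: f"&#x{ord(c):X};" for c in "〔〕「」"})


def escape(text):
    """Replace html6 delimiter characters with hexadecimal references."""
    return text.translate(ESCAPES)


def to_html6(elem):
    """Serialize an ElementTree element as an html6 string (hypothetical helper)."""
    out = "〔" + elem.tag
    if elem.attrib:
        pairs = " ".join(
            f"{name} “{escape(value)}”" for name, value in elem.attrib.items()
        )
        out += "「" + pairs + "」"
    # Text before the first child, then each child followed by its tail text.
    body = escape(elem.text or "")
    for child in elem:
        body += to_html6(child) + escape(child.tail or "")
    # Without attributes, a space separates the tag name from the content.
    if body and not elem.attrib:
        out += " "
    return out + body + "〕"


print(to_html6(ET.fromstring('<h1 id="xyz" class="abc">HTML6</h1>')))
# → 〔h1「id “xyz” class “abc”」HTML6〕
```

A real converter would also have to decide what to do with curly
quotes inside attribute values, which the proposal does not cover.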

Escape Mechanisms

To include the 〔tortoise shell〕 delimiters in data, use “&#x3014;” and
“&#x3015;”. Similarly, use “&#x300C;” and “&#x300D;” for the 「corner
brackets」.
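
A small Python sketch of this escaping rule follows. The function name
escape_html6 is hypothetical, and it assumes the four bracket
characters are the only ones that need escaping in data.

```python
# The four delimiter characters of the proposal.
DELIMITERS = "〔〕「」"


def escape_html6(text):
    """Replace each delimiter character with its &#x...; reference."""
    return "".join(
        f"&#x{ord(ch):X};" if ch in DELIMITERS else ch for ch in text
    )


print(escape_html6("a 〔b〕 and 「c」"))
# → a &#x3014;b&#x3015; and &#x300C;c&#x300D;
```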

Unicode; No More CDATA and Entities “&”

There are no entities, except Unicode character references in
hexadecimal format, “&#x‹unicode code point hexadecimal›;”.

For example, “&amp;” is not allowed.
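
As a sketch of how a reader might enforce this, the Python below
expands hexadecimal references and rejects everything else that starts
with “&”. The name decode_references is made up, and treating a bare
ampersand as an error is an assumption, since the proposal does not
say how a literal ampersand is written.

```python
import re

# One to six hex digits is enough for any Unicode code point.
HEX_REF = re.compile(r"&#x([0-9A-Fa-f]{1,6});")


def decode_references(text):
    """Expand &#x...; references; raise ValueError for any other '&'."""
    # Any ampersand left after removing valid references is an error.
    if "&" in HEX_REF.sub("", text):
        raise ValueError("only hexadecimal character references are allowed")
    # chr() itself raises ValueError for values above U+10FFFF.
    return HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_references("&#x3014;h1&#x3015;"))
# → 〔h1〕
```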

Treatment of Whitespace

Basically identical to XML.

Char Encoding; UTF8 and UTF16 Only

Source code must be UTF8 or UTF16, only. Nothing else.
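
One way to apply this rule, sketched in Python: treat the source as
UTF16 when it starts with a UTF16 byte order mark, and as strict UTF8
otherwise. Using the byte order mark to tell the two apart is an
assumption, and decode_source is a hypothetical name.

```python
import codecs


def decode_source(data):
    """Decode bytes as UTF-16 (when a BOM is present) or strict UTF-8."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # the codec consumes the BOM
    # "utf-8-sig" drops an optional UTF-8 BOM; invalid bytes raise
    # UnicodeDecodeError, so no other encoding slips through.
    return data.decode("utf-8-sig")


print(decode_source("〔h1 HTML6〕".encode("utf-16")))
# → 〔h1 HTML6〕
```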

File Name Extension

File name extension is “.xml6” or “.html6”.

Semantics

The semantics should follow xhtml5.

Questions and Answers

What's wrong with xhtml/html5 exactly?

The politics of the standards bodies change, and their attitude about
what is correct also changes whimsically. Around 2000, we were told
that XML and XHTML would change society, or, at least, make the web
correct and valid, and far easier and more flexible to develop for.
Now it's a decade later. Sure, the web has improved, but as far as
html/xhtml and browser rendering go, it's still a syntax soup with
extreme complexities. 99.99% of web pages are still not valid, and
nobody cares. Major browsers still don't agree on their rendering
behavior. Web dev is actually far more complex, involving tens or
hundreds of technologies that hardly any one person knows in full
(Ajax, JSON, lots of XML variations). It's hard to say if it is better
at all than the HTML3 days with “font” and “table” tags and a
gazillion tricks. The best practical approach is still trial and error
with browsers.

And now HTML5 comes along, from a newfangled hip group primarily from
the current big corporations Google and Apple, with an attitude that
validation is overrated. That is an insult to the XML mantra from the
W3C, coming just when there are starting to be more and more sites
with correct XHTML, and Microsoft's Internet Explorer is getting on
track about correctness.

XML is a break from SGML, with many justifications for why it needed
to be, and with some trade-offs for backward compatibility. And now
HTML5 is a break from both SGML and XML.

See also:

• (Google Earth) KML Validation Fuckup
• Google's 「rel="nofollow"」 Rule
• HTML Correctness and Validators

Why not just adopt SXML from the lisp world?

Lisp's SXML is not a stand-alone syntax made for the needs of the web.
Lisp formats are typically made to follow lisp's traditions, and often
have quirks of their own. The syntax is not 100% regular nested
parens. SXML is easy for lispers to adopt, but harder for other
languages and communities.

For lisp's syntax irregularities, see: Fundamental Problems of Lisp.

For example, XML as a textual representation of a tree has a quirk,
in that each node has this special thing called “attributes” (aka
“properties”). An “attribute” is not a node of the tree, but rather
info attached to a node.

A standard lisp syntax (aka sexp) for representing attributes is
this:

(h1 :id "xyz" :class "abc" ...)

Syntactically, “:id”, “"xyz"”, etc. are not distinguishable from a
node/branch in the tree. Only semantically, after the lisp reader has
parsed the special character “:” in a node's name, is it considered a
property name, with the next element in the expression taken as the
value for that property.

Another way to represent xml's attribute is this:

(h1 ((id . "xyz") (class . "abc")) ...)

This too has syntactic ambiguity.

The whole “((id . "xyz") (class . "abc"))” can be interpreted as a
node by itself, where the first element is again a node. But also
here, it uses lisp's special “cons” syntax “(id . "xyz")” which is
itself ambiguous at the syntax level. e.g. it can be considered as a
node named “id” with 2 branches “.” and “"xyz"”, or it can be
considered as a node named “cons” with 2 branches “id” and “"xyz"”.

Another common lisp syntax for attributes is this:

(h1 (@ (id . "xyz") (class . "abc")) ...)

Again, this whole “(@ ...)” part is, at the syntax level, simply a node
named “@”. Only at the semantic level is it taken as the properties of
a node, due to the semantics attached to the head string “@”.

So, in conceiving html6, I thought a solution for getting rid of the
syntax ambiguity of node vs attributes is to use a special bracket for
the properties/attributes of a node, e.g. “〔h1「id “xyz” class
“abc”」...〕”.
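
Because nodes and attributes use different brackets, a reader can tell
them apart by looking at a single character. The Python sketch below
shows this with a small recursive descent parser. The grammar in the
comment is only a guess at what the 3 lines of parsing expression
grammar could look like, and all names (parse, Html6SyntaxError, the
(tag, attrs, children) tuple) are hypothetical. It is strict: any
syntax error raises an exception.

```python
# Assumed grammar (a guess, not from the proposal):
#   node  <- "〔" name attrs? (node / text)* "〕"
#   attrs <- "「" (name "“" value "”")* "」"
#   text  <- [^〔〕「」]+
BRACKETS = "〔〕「」"


class Html6SyntaxError(ValueError):
    pass


def _skip_ws(src, pos):
    while pos < len(src) and src[pos].isspace():
        pos += 1
    return pos


def _attrs(src, pos):
    """Parse name “value” pairs up to the closing corner bracket."""
    attrs = {}
    while True:
        pos = _skip_ws(src, pos)
        if src.startswith("」", pos):
            return attrs, pos + 1
        open_q = src.find("“", pos)
        close_q = src.find("”", open_q + 1) if open_q >= 0 else -1
        name = src[pos:open_q].strip() if open_q >= 0 else ""
        if close_q < 0 or not name or any(
            c.isspace() or c in BRACKETS for c in name
        ):
            raise Html6SyntaxError(f"malformed attribute at offset {pos}")
        attrs[name] = src[open_q + 1:close_q]
        pos = close_q + 1


def _node(src, pos):
    if not src.startswith("〔", pos):
        raise Html6SyntaxError(f"expected 〔 at offset {pos}")
    pos += 1
    start = pos
    while pos < len(src) and not src[pos].isspace() and src[pos] not in BRACKETS:
        pos += 1
    tag = src[start:pos]
    if not tag:
        raise Html6SyntaxError(f"missing tag name at offset {start}")
    pos = _skip_ws(src, pos)
    attrs = {}
    if src.startswith("「", pos):  # one character decides: attributes
        attrs, pos = _attrs(src, pos + 1)
    children = []
    while pos < len(src) and src[pos] != "〕":
        if src[pos] == "〔":  # one character decides: child node
            child, pos = _node(src, pos)
        elif src[pos] in "「」":
            raise Html6SyntaxError(f"stray corner bracket at offset {pos}")
        else:  # a run of text up to the next bracket
            start = pos
            while pos < len(src) and src[pos] not in BRACKETS:
                pos += 1
            child = src[start:pos]
        children.append(child)
    if pos >= len(src):
        raise Html6SyntaxError("missing closing 〕")
    return (tag, attrs, children), pos + 1


def parse(src):
    """Parse a single html6 node into a (tag, attrs, children) tuple."""
    node, pos = _node(src, _skip_ws(src, 0))
    if src[pos:].strip():
        raise Html6SyntaxError(f"unexpected text at offset {pos}")
    return node


print(parse("〔h1「id “xyz” class “abc”」HTML6〕"))
# → ('h1', {'id': 'xyz', 'class': 'abc'}, ['HTML6'])
```

The sketch keeps whitespace-only text between child nodes and does not
expand “&#x...;” references. Both would need a decision in a real
reader.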

Why use weird Unicode characters for matching pair?

Unicode has become widely adopted today. (See: Unicode Popularity On
Web.) Unicode also has a lot of proper matching pairs. (See: Matching
Brackets in Unicode.) It seems today is the right time to adopt the
wide range of proper characters instead of continuing to rely on the
very limited number of ASCII characters.

The straight quote character " is not a matching pair, and in code it
presents several problems. For example, it is difficult to know which
quote matches which. Also, it is difficult to recover from a missing
quote. (This problem is especially pronounced in text editors, for
syntax highlighting.) A proper matching pair allows programs and
editors to determine the quoted content correctly and more easily, and
to navigate the tree easily.

The unicode characters 〔〕 and 「」 may be difficult to input. Possibly,
they can be replaced by () and {} for html6. Though, that also means a
lot of ugly escaping will be needed in text, and anything left
unescaped means incorrect syntax.

One thing about this html6 is that it is intentionally separate from
being a valid sexp of the lisp world. The core idea is that the syntax
of html6 is designed specifically as a 2-dimensional textual
representation of a tree, with an attribute bracket that attaches a
limited form of info (a sequence of pairs) to any node, to fit the
existing structure of XML.

The advantage of this is that it should be extremely easy to parse, in
perhaps just 3 lines of parsing expression grammar. It can be done
easily in perl, python, ruby... without entailing lisp quirks, and the
result can be trivially transformed into legal lisp syntax by lisps as
well.
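
As a rough illustration of the last point, suppose a parsed html6 node
is held as a (tag, attrs, children) tuple, where children are strings
or further tuples. That representation and the names to_sexp and quote
are assumptions for this sketch. Turning it into the “(@ ...)” style
of sexp shown earlier takes only a few lines of Python.

```python
def quote(text):
    """Write text as a lisp string literal with \\ and " escaped."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_sexp(node):
    """Render a (tag, attrs, children) tuple as an s-expression string."""
    if isinstance(node, str):
        return quote(node)
    tag, attrs, children = node
    parts = [tag]
    if attrs:
        # Attributes go into a "(@ ...)" list of dotted pairs.
        pairs = " ".join(
            f"({name} . {quote(value)})" for name, value in attrs.items()
        )
        parts.append(f"(@ {pairs})")
    parts.extend(to_sexp(child) for child in children)
    return "(" + " ".join(parts) + ")"


tree = ("h1", {"id": "xyz", "class": "abc"}, ["HTML6"])
print(to_sexp(tree))
# → (h1 (@ (id . "xyz") (class . "abc")) "HTML6")
```

Whether every tag or attribute name is a legal symbol in a given lisp
dialect is not checked here.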

Any thoughts about flaws?

 Xah ∑ xahlee.org ☄