The Cost of Dynamism

Sun Mar 13 16:05:39 EDT 2016

Chris Angelico wrote:

> On Sun, Mar 13, 2016 at 6:24 AM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>> Marko Rauhamaa wrote:
>>> […] HTML markup is all ASCII.
>>
>> Wrong.  I am creating HTML documents whose source code contains Unicode
>> characters every day.
>>
>> Also, the two of you fail to differentiate between US-ASCII, a 7-bit
>> character encoding, and 8-bit or longer encodings which can *also* encode
>> characters that can be *encoded with* US-ASCII.
> 
> Where are the non-ASCII characters in your HTML documents? Are they in
> the *markup* of HTML, or in the *text*? This is the difference.

The misconception is on your part instead.  The text content of an
HTML/Web document (the part between the [HTML] tags) is *part* of the (HTML)
markup, as it is (at least) *a part* of the content of (HTML) elements.
[1a][1b]

Besides, even if one were to unwisely adopt your private definition of
“markup”, Unicode characters that cannot be encoded with US-ASCII are of
course allowed verbatim in attribute values and, to a lesser degree (not in
HTML 4.01 and below), in element type names and attribute names as well.
Therefore they are, even according to your *wrong* private definition of
“markup”, “*in* the markup of HTML”. [2][3]
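
To illustrate the point about attribute values, here is a minimal sketch
using the Python standard library's html.parser.  The sample markup and the
class name AttrCollector are made up for this example; it only shows that a
parser hands back non-ASCII attribute values unchanged, not how any
particular browser behaves.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute name, attribute value) triples."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.attrs.append((tag, name, value))


# Hypothetical sample: non-ASCII characters written verbatim in attributes.
SAMPLE = '<a href="/übersicht" title="Größe ändern">Überblick</a>'

collector = AttrCollector()
collector.feed(SAMPLE)
collector.close()

# Keep only the attributes whose value has a character beyond U+007F.
non_ascii = [
    (tag, name, value)
    for tag, name, value in collector.attrs
    if any(ord(ch) > 0x7F for ch in value)
]
print(non_ascii)
```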

Bottom line:

If one declares the character encoding that one uses in an SGML-based (HTML
up to and including version 4.01, XML and all XML-based document types) or
SGML-related (HTML5) markup document (there are several ways to do that)¹,
there is no need to use character entity references instead of plain Unicode
characters.  And if you avoid spaghetti code, the probability of needing
numeric character references in HTML is also quite low.  (The same
applies to lightweight markup languages like Markdown, but let us not get
there now.)
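
A small sketch of the bottom line, assuming an HTML5 document that declares
UTF-8 with a meta element (one of the several declaration mechanisms).  The
document text and the variable names are invented for the example.  It
checks that the verbatim characters and the same text written with
character references decode to the same string.

```python
import html

TEXT = "Grüße aus Köln: 5 €"

# Hypothetical minimal document that declares its encoding in the markup.
DOCUMENT = (
    "<!DOCTYPE html>\n"
    '<html lang="de">\n'
    '<head><meta charset="UTF-8"><title>Beispiel</title></head>\n'
    "<body><p>" + TEXT + "</p></body>\n"
    "</html>\n"
)

# Serialize with the declared encoding; the characters appear verbatim.
encoded = DOCUMENT.encode("utf-8")
round_trip = encoded.decode("utf-8")

# The same text written with character entity and numeric references.
WITH_REFERENCES = "Gr&uuml;&szlig;e aus K&ouml;ln: 5 &#8364;"
resolved = html.unescape(WITH_REFERENCES)

print(round_trip == DOCUMENT)  # the declared encoding round-trips
print(resolved == TEXT)        # the references resolve to the same text
```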

[In fact, the ability to use verbatim characters other than those that
can be encoded with US-ASCII applies to all Internet messages, including
e-mail and Usenet postings, and to a lesser degree (because there are fewer
declaration mechanisms available) to all forms of electronically
stored/readable text.  As of RFC 5536, standards-compliant Network News
client software is even required to support MIME. [4]]
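
For Internet messages, the declaration mechanism is MIME.  The following is
only a sketch using Python's email package, with an invented subject and
body: it builds a message whose body contains non-ASCII characters,
serializes it, parses it again, and reads back the declared charset.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

BODY = "Grüße aus Köln"

msg = EmailMessage()
msg["Subject"] = "Encoding test"  # hypothetical header value
msg.set_content(BODY)             # adds the Content-Type/charset declaration

wire = msg.as_bytes()             # what would be sent over the network
parsed = message_from_bytes(wire, policy=policy.default)

charset = parsed.get_content_charset()
body = parsed.get_content().strip()

print(parsed["Content-Type"])
print(charset, body == BODY)
```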

  [This was a professional Web author/developer with more than a decade of
   continuing work experience clarifying your misconception.  I recommend
   that you subscribe to the newsgroups in the
   comp.infosystems.www.authoring.* hierarchy, where this discussion would
   have been on-topic, and to <news:comp.lang.javascript>, to clarify some
   of the other misconceptions that you may have acquired about
   Web(-related) authoring/development.]

________
¹  This is only to be reasonably safe from surprises; several of those
   markup languages require parsers to assume a default character encoding
   and/or to implement character encoding detection, but not all parsers
   are conforming, and it stands to reason that parser efficiency can be
   increased if the encoding does not have to be detected/inferred first.

[1a] <https://en.wikipedia.org/wiki/Markup_language#Etymology_and_origin>
[1b] <https://www.w3.org/TR/1999/REC-html401-19991224
      /intro/sgmltut.html#h-3.2.1>
     <http://www.w3.org/TR/2014/REC-html5-20141028/dom.html#elements>
[2]  <http://www.w3.org/TR/2014/REC-html5-20141028
      /infrastructure.html#encoding-terminology>
[3]  <https://www.w3.org/TR/1999/REC-html401-19991224
      /charset.html#doc-char-set>
     <http://www.w3.org/TR/2014/REC-html5-20141028/syntax.html#parsing>
[4]  <http://tools.ietf.org/html/rfc5536#section-2.3>

> And I'm not conflating those two. When I say ASCII, I am referring to
> the 128 characters that have Unicode codepoints U+0000 through U+007F.

That is only your private definition of ASCII.  The commonly accepted
definition is along these lines instead:

<https://en.wikipedia.org/wiki/ASCII> pp.

(See also the Specification references above.)
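
The difference between US-ASCII as a 7-bit encoding and longer encodings
that can *also* encode the same characters can be sketched in Python as
follows.  The sample strings are arbitrary.

```python
ascii_text = "HTML"

# Characters in the US-ASCII repertoire get the same bytes in all three
# encodings, and every byte fits into 7 bits.
as_ascii = ascii_text.encode("ascii")
as_latin1 = ascii_text.encode("latin-1")
as_utf8 = ascii_text.encode("utf-8")
same_bytes = as_ascii == as_latin1 == as_utf8
seven_bit = all(byte < 0x80 for byte in as_ascii)

# A character outside that repertoire cannot be encoded with US-ASCII ...
try:
    "ü".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# ... but it can be with 8-bit or longer encodings, with differing bytes.
u_latin1 = "ü".encode("latin-1")
u_utf8 = "ü".encode("utf-8")

print(same_bytes, seven_bit, ascii_failed, u_latin1, u_utf8)
```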

HTH

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.