HTML Fragment Parser with Substitution and Syntactic Sugar

This is a little off my usual beaten path, but what the heck.

These are two related proposals: one for a new DOM feature, document.parseDocumentFragment, and one for JS syntactic sugar for that feature. They are a response to Ian Hickson’s E4H Strawman, and are partially inspired by the general quasi-literal proposal for ES-Harmony.

Compared to Hixie’s proposal, this avoids embedding a subset of the HTML grammar in the JS grammar, while at the same time being more likely to conform with author expectations, since the HTML actually gets parsed by the HTML parser. It should have at least equivalent expressivity and power.

Motivating Example

function addUserBox(userlist, username, icon, attrs) {
  var section = h`<section class="user" {attrs}>
                    <h1>{username}</h1>
                  </section>`;
  if (icon)
    section.append(h`<img src="{icon}" alt="">`);
  userlist.append(section);
}

document.parseDocumentFragment

This is a new method on the DOM Document interface:

DocumentFragment parseDocumentFragment(in DOMString htmlText,
                                       optional object substitutions)
                     throws DOMException;

The htmlText is parsed as-if by the HTML fragment parsing algorithm, with no context element; the resulting sequence of DOM nodes is collected into a DocumentFragment, which is returned.

The substitutions argument cannot be described in WebIDL as far as I can tell. It must be a dictionary, but it can have any number of arbitrarily-named keys and arbitrary values, except that all keys must be valid JavaScript identifiers. The values in the dictionary are constrained as described below.

If the substitutions argument is supplied, then the HTML5 tokenizer uses it as follows:

  • A substitution reference consists syntactically of a U+007B LEFT CURLY BRACKET ({) character, followed immediately by a valid JavaScript identifier, followed immediately by a U+007D RIGHT CURLY BRACKET (}) character. A live substitution reference is a substitution reference for which the substitutions dictionary contains a key which is character-by-character identical to the JavaScript identifier, and the corresponding value for that substitution reference is the dictionary value corresponding to that key.

  • In all tokenizer states, a substitution reference which is not live (i.e. which has no corresponding value in the substitutions dictionary) is processed as its constituent sequence of characters would have been in the absence of this specification.

  • If the tokenizer is in data state and it encounters a live substitution reference, then:

    • If the corresponding value is a DOMString, then it is inserted into the document in place of the substitution reference, as-if the tokenizer had emitted its contents as a sequence of character tokens.

    • If the corresponding value is a DOM Node of any kind, then it is inserted into the document in place of the substitution reference, as-if by appendChild (not as-if by the parser’s tree construction stage) applied to the current node. The current node pointer remains the same.

    • Otherwise, a DOMException is thrown.

  • If the tokenizer is in RCDATA state, RAWTEXT state, or script data state, and it encounters a live substitution reference, then: if the corresponding value is a DOMString, then it is inserted into the document in place of the substitution reference, as-if the tokenizer had emitted its contents as a sequence of character tokens; otherwise, a DOMException is thrown.

  • If the complete text of any quoted attribute value is a single live substitution reference, then: if the corresponding value is a DOMString, then the tokenizer changes the value of that attribute to be the contents of the DOMString; otherwise, a DOMException is thrown.

  • If the tokenizer is in before attribute name state and it encounters a live substitution reference: if the corresponding value is a dictionary, then each key-value pair in that dictionary becomes an attribute name and value for the new tag token under construction; otherwise, a DOMException is thrown.

In the latter two cases, if the replacement would cause any attribute to take on an invalid value, it is handled in the same way that it would be if DOM manipulation had been used to set that attribute to that value.
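To make the "live" versus "not live" distinction concrete, here is a minimal sketch in plain JavaScript. It is not part of the proposal: findReferences and classifyDataStateValue are hypothetical helper names, the identifier pattern is simplified to ASCII, the Node test is duck-typed so the code runs outside a browser, and TypeError stands in for the DOMException the rules call for.

```javascript
// Simplified identifier pattern (assumption): ASCII only. Real JavaScript
// identifiers also allow many Unicode characters and exclude reserved words.
const REFERENCE = /\{([A-Za-z_$][A-Za-z0-9_$]*)\}/g;

// Hypothetical helper: list every substitution reference in htmlText and
// report whether it is live with respect to the substitutions dictionary.
function findReferences(htmlText, substitutions) {
  const found = [];
  for (const match of htmlText.matchAll(REFERENCE)) {
    const name = match[1];
    // Live means the dictionary has a key identical to the identifier.
    const live = Object.prototype.hasOwnProperty.call(substitutions, name);
    found.push({ name, index: match.index, live });
  }
  return found;
}

// Hypothetical helper: which branch of the data-state rule applies to a value.
function classifyDataStateValue(value) {
  if (typeof value === "string") return "character tokens";
  // Duck-typed stand-in for "is a DOM Node" so the sketch runs without a DOM.
  if (value !== null && typeof value === "object" &&
      typeof value.nodeType === "number") {
    return "appendChild";
  }
  // The proposal throws a DOMException here; TypeError is a stand-in.
  throw new TypeError("substitution value must be a string or a Node");
}
```

With a dictionary that defines only a, the text `<p>{a} {b}</p>` contains one live reference ({a}) and one that is not live ({b}); the latter would reach the tokenizer as three ordinary characters.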

Rationale

In some implementations, DOM method invocation is so expensive that constructing even a relatively short document fragment with create<node> and appendChild is slower than setting innerHTML. Therefore, whatever new syntactic sugar we invent for HTML fragment literals needs to expand to a single library call rather than a sequence of operations. This in turn requires a mechanism for carrying out substitutions within the HTML parser.

I have deliberately written this up as an abstract set of new tokenizer rules rather than a detailed change proposal; the latter would have to be written in the same prose-algorithmese as the tokenizer spec itself, and that would obscure what is being proposed. I have been as liberal as I dare about which tokenizer states can make use of substitutions, but it is unclear what many of them are for, so I may have missed some places where substitutions should be allowed.

The substitution mechanism should automatically quote all syntactically significant characters; I believe that emitting string contents as a sequence of character tokens is the correct way to do this within the HTML5 parser algorithm. Note that I do allow insertion of arbitrary text into inline <script> and <style> elements and a few others (RCDATA/RAWTEXT/script data states); this can’t change where the element ends but it could conceivably change the meaning of a script or style sheet in an unsafe way.
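For comparison, this is roughly what an author has to do by hand today to get the same quoting guarantee when building markup by string concatenation. It is only a sketch of the data-state case: escapeForDataState is a hypothetical name, and the proposal itself would not escape anything, since character tokens never pass back through the tokenizer.

```javascript
// Hypothetical helper: escape the characters that are syntactically
// significant in the data state, so the text survives a trip through
// the HTML tokenizer as plain characters.
function escapeForDataState(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;" };
  return text.replace(/[&<>]/g, (ch) => entities[ch]);
}

// A hostile user name that would inject an element if concatenated as-is.
const username = '<img src=x onerror="alert(1)">';

// Unsafe: the markup in username is parsed as markup.
const unsafe = "<h1>" + username + "</h1>";

// Safe: the markup in username is parsed as text.
const safe = "<h1>" + escapeForDataState(username) + "</h1>";
```

Under the proposed rules, writing {username} in data state and passing the string in the substitutions dictionary should give the effect of the safe variant without any explicit escaping step.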

We do not want to have to call back into the JS interpreter to evaluate substitutions, as this can be just as expensive as JS-to-native method invocation, so we accept only strings in most contexts. However, it may be useful to be able to supply entire document fragments as substituents in a quoting-safe way, and the obvious tactic is to allow Nodes as substituents in data state.

h`…` syntactic sugar

The JS parser scans each h-prefixed backquoted string for occurrences of U+007B LEFT CURLY BRACKET not immediately preceded by an odd number of U+005C REVERSE SOLIDUS characters. From each such LEFT CURLY BRACKET, it scans forward for a subsequent U+007D RIGHT CURLY BRACKET at the same nesting level, counting parentheses (U+0028, U+0029) and square brackets (U+005B, U+005D) as well as curly brackets for nesting. It is a syntax error if the grouping characters are misnested.

The text in between the matching curly brackets is extracted and parsed according to the standard JavaScript Expression production (note that this production is not used to determine the end of the extraction). It is a semantic error if it can be determined at compile time that the expression has side effects. The parser substitutes a gensym (a generated identifier) for each extracted expression; it SHOULD (in the RFC 2119 sense) merge the expressions into equivalence classes and use the same gensym for all occurrences of the same equivalence class.

The overall literal is then replaced by a call to document.parseDocumentFragment whose htmlText argument is the backquoted string after {...} replacement, and whose substitutions argument is a dictionary whose keys are the gensyms and whose values are the results of evaluating the extracted expressions. Each expression, after evaluation, is processed through an abstract operation which I shall call [[primToString]]. This converts all primitive and boxed primitive JavaScript types to string as-if by the standard ToString abstract operation, but leaves objects of any other type alone.

Thus, the motivating example above might get rewritten as

function addUserBox(userlist, username, icon, attrs) {
  var section = document.parseDocumentFragment(
      '<section class="user" {A}>\n' +
      '  <h1>{B}</h1>\n' +
      '</section>', { "A": [[primToString]](attrs),
                      "B": [[primToString]](username) });
  if (icon)
    section.append(document.parseDocumentFragment(
        '<img src="{A}" alt="">', { "A": [[primToString]](icon) }));
  userlist.append(section);
}

Note that there is no need to protect the gensyms against collisions with other identifiers in the program, since they will only be used for lookup in the substitutions dictionary, and by construction, nothing else will.
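The scanning and gensym steps described above can be sketched as an ordinary function over the literal's source text. This is an illustration under several assumptions, not a specification: extractSubstitutions and precedingBackslashes are hypothetical names, gensyms are spelled G0, G1, and so on, two expressions count as equivalent only if their trimmed text is identical, and neither the Expression parse nor the side-effect check is attempted. Escaped brackets are copied through unchanged.

```javascript
// Opening grouping characters and the closer each one requires.
const CLOSER_FOR = new Map([["(", ")"], ["[", "]"], ["{", "}"]]);
const CLOSERS = new Set(CLOSER_FOR.values());

// Count the backslashes immediately before text[index].
function precedingBackslashes(text, index) {
  let count = 0;
  while (index - count - 1 >= 0 && text[index - count - 1] === "\\") count++;
  return count;
}

// Hypothetical helper: replace each {expression} with a {gensym} and return
// the rewritten text plus a map from gensym to expression source.
function extractSubstitutions(source) {
  const gensymFor = new Map(); // expression text -> gensym
  let html = "";
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    // Ordinary character, or a "{" escaped by an odd number of backslashes.
    if (ch !== "{" || precedingBackslashes(source, i) % 2 === 1) {
      html += ch;
      i++;
      continue;
    }
    // Scan forward to the matching "}", tracking (), [] and {} nesting.
    const stack = ["}"];
    let j = i + 1;
    while (j < source.length && stack.length > 0) {
      const c = source[j];
      if (CLOSER_FOR.has(c)) {
        stack.push(CLOSER_FOR.get(c));
      } else if (CLOSERS.has(c) && stack.pop() !== c) {
        throw new SyntaxError("misnested grouping characters");
      }
      j++;
    }
    if (stack.length > 0) throw new SyntaxError("unterminated substitution");
    // j is now one past the closing "}".
    const expression = source.slice(i + 1, j - 1).trim();
    if (!gensymFor.has(expression)) {
      gensymFor.set(expression, "G" + gensymFor.size);
    }
    html += "{" + gensymFor.get(expression) + "}";
    i = j;
  }
  const expressions = new Map();
  for (const [expression, gensym] of gensymFor) expressions.set(gensym, expression);
  return { html, expressions };
}
```

Because identical expressions share a gensym, a literal that mentions the same variable twice produces a substitutions dictionary with a single entry for it.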

Rationale

This syntactic sugar is similar to, but not the same as, the sugar proposed for general JavaScript quasi-literals. The key differences are that our substitutions use {...} instead of ${...}, and we simply match curly brackets rather than using Expression to determine the end of the substitution. I believe both are in keeping with the way similar things have been done in e.g. PHP and Python.

The [[primToString]] conversion improves the readability of simple cases, e.g. one may write

h`Two and two are {2+2}`

instead of

h`Two and two are {(2+2).toString()}`

while staying out of the way of authors who need to substitute attribute dictionaries or Nodes.
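A minimal sketch of how [[primToString]] might behave, written as an ordinary function for illustration. The real thing would be an internal abstract operation; the function name is hypothetical, only String, Number and Boolean wrappers are treated as boxed primitives, and String() is used as an approximation of ToString (the two differ for symbols, which this sketch does not treat specially).

```javascript
// Hypothetical stand-in for the [[primToString]] abstract operation.
function primToString(value) {
  const isPrimitive =
    value === null ||
    (typeof value !== "object" && typeof value !== "function");
  const isBoxedPrimitive =
    value instanceof String ||
    value instanceof Number ||
    value instanceof Boolean;
  // Primitives and boxed primitives become strings; attribute
  // dictionaries, Nodes and other objects pass through untouched.
  return isPrimitive || isBoxedPrimitive ? String(value) : value;
}

const attrs = { id: "user-7" };
const converted = [primToString(2 + 2), primToString(new Number(4)), primToString(attrs)];
// converted[0] and converted[1] are the string "4"; converted[2] is attrs itself.
```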

The h prefix is maybe too short; if it’s to be longer, html would be the logical choice.