This is a little off my usual beaten path, but what the heck.
This is two related proposals: one for a new DOM feature, document.parseDocumentFragment, and one for JS syntactic sugar for that feature. It is a response to Ian Hickson’s E4H Strawman, and is partially inspired by the general quasi-literal proposal for ES-Harmony.
Compared to Hixie’s proposal, this avoids embedding a subset of the HTML grammar in the JS grammar, while at the same time being more likely to conform with author expectations, since the HTML actually gets parsed by the HTML parser. It should have at least equivalent expressivity and power.
Motivating Example
function addUserBox(userlist, username, icon, attrs) {
    var section = h`<section class="user" {attrs}>
                      <h1>{username}</h1>
                    </section>`;
    if (icon)
        section.append(h`<img src="{icon}" alt="">`);
    userlist.append(section);
}
document.parseDocumentFragment
This is a new method on the DOM Document interface:
DocumentFragment parseDocumentFragment(in DOMString htmlText,
optional object substitutions)
throws DOMException;
The htmlText is parsed as-if by the HTML fragment parsing algorithm, with no context element; the resulting sequence of DOM nodes is collected into a DocumentFragment, which is returned.
The substitutions argument cannot be described in WebIDL as far as I can tell. It must be a dictionary, but it can have any number of arbitrarily-named keys and arbitrary values, except that all keys must be valid JavaScript identifiers. The values in the dictionary are constrained as described below.
If the substitutions argument is supplied, then the HTML5 tokenizer uses it as follows:
A substitution reference consists syntactically of a U+007B LEFT CURLY BRACKET ({) character, followed immediately by a valid JavaScript identifier, followed immediately by a U+007D RIGHT CURLY BRACKET (}) character. A live substitution reference is a substitution reference for which the substitutions dictionary contains a key which is character-by-character identical to the JavaScript identifier, and the corresponding value for that substitution reference is the dictionary value corresponding to that key.
In all tokenizer states, a substitution reference which is not live (i.e. which has no corresponding value in the substitutions dictionary) is processed as its constituent sequence of characters would have been in the absence of this specification.
If the tokenizer is in data state and it encounters a live substitution reference, then:
If the corresponding value is a DOMString, then it is inserted into the document in place of the substitution reference, as-if the tokenizer had emitted its contents as a sequence of character tokens.
If the corresponding value is a DOM Node of any kind, then it is inserted into the document in place of the substitution reference, as-if by appendChild (not as-if by the parser’s tree construction stage) applied to the current node. The current node pointer remains the same.
Otherwise, a DOMException is thrown.
If the tokenizer is in RCDATA state, RAWTEXT state, or script data state, and it encounters a live substitution reference, then: if the corresponding value is a DOMString, then it is inserted into the document in place of the substitution reference, as-if the tokenizer had emitted its contents as a sequence of character tokens; otherwise, a DOMException is thrown.
If the complete text of any quoted attribute value is a single live substitution reference, then: if the corresponding value is a DOMString, then the tokenizer changes the value of that attribute to be the contents of the DOMString; otherwise, a DOMException is thrown.
If the tokenizer is in before attribute name state and it encounters a live substitution reference: if the corresponding value is a dictionary, then each key-value pair in that dictionary becomes an attribute name and value for the new tag token under construction; otherwise, a DOMException is thrown.
In the latter two cases, if the replacement would cause any attribute to take on an invalid value, it is handled in the same way that it would be if DOM manipulation had been used to set that attribute to that value.
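To make the distinction between a substitution reference and a live one concrete, here is a minimal sketch in plain JavaScript. It is not part of the proposal: the helper name findSubstitutionReferences is made up for illustration, and identifiers are approximated as ASCII-only names, whereas real JavaScript identifiers also allow Unicode letters and escapes.

```javascript
// Hypothetical helper, not part of the proposal: list every substitution
// reference in htmlText and report whether each one is live.
// Assumption: identifiers are approximated as ASCII-only names.
const SUBSTITUTION_REFERENCE = /\{([A-Za-z_$][A-Za-z0-9_$]*)\}/g;

function findSubstitutionReferences(htmlText, substitutions) {
  const found = [];
  for (const match of htmlText.matchAll(SUBSTITUTION_REFERENCE)) {
    const name = match[1];
    found.push({
      name,
      index: match.index,
      // Live only if the dictionary has a key identical to the identifier.
      live: Object.prototype.hasOwnProperty.call(substitutions, name),
    });
  }
  return found;
}
```

A reference such as `{missing}` with no matching key is reported as not live, so under the rules above it would be left in the text as ordinary characters. Something like `{ spaced }` is not a substitution reference at all, because the identifier must follow the bracket immediately.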
Rationale
In some implementations, DOM method invocation is so expensive that
constructing even a relatively short document fragment with create<node> and appendChild is slower than setting innerHTML. Therefore, whatever new syntactic sugar we invent for HTML fragment literals needs to expand to a
single library call rather than a sequence of operations. This
in turn requires a mechanism for carrying out substitutions within the
HTML parser. I have deliberately written this up as an abstract set of
new tokenizer rules rather than a detailed change proposal; the latter
would have to be written in the same prose-algorithmese as the tokenizer
spec itself, and that would obscure what is being proposed. I have been
as liberal as I dare about which tokenizer states can make use of
substitutions, but it is unclear what many of them are for, so I may
have missed some places where substitutions should be allowed.
The substitution mechanism should automatically quote all
syntactically significant characters; I believe that emitting string contents as a sequence of character tokens is the correct way to do this within the HTML5 parser algorithm. Note that I do allow insertion of arbitrary text into inline <script> and <style> elements and a few others (RCDATA/RAWTEXT/script data states); this can’t change where the element ends but it could conceivably change the meaning of a script or
style sheet in an unsafe way.
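One way to picture what "emitted as a sequence of character tokens" buys us in data state: the substituted string always ends up as text, just as if the author had written it with every markup-significant character already escaped. A rough userland equivalent might look like the sketch below. The name escapeForDataState is hypothetical, and the sketch covers data state only; attribute values and the RCDATA/RAWTEXT/script data states have different rules.

```javascript
// Hypothetical helper: produce markup that the HTML parser would turn back
// into exactly the given text when it is encountered in data state.
// "&" must be replaced first so the other replacements are not re-escaped.
function escapeForDataState(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// A value that looks like markup stays inert text:
// escapeForDataState("<b>Tom & Jerry</b>")
//   returns "&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;"
```

The point of the proposal is that authors never have to call anything like this by hand, since the tokenizer does the equivalent work itself.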
We do not want to have to call back into the JS interpreter to evaluate substitutions, as this can be just as expensive as JS-to-native method invocation, so we accept only strings in most contexts. However, it may be useful to be able to supply entire document fragments as substituents in a quoting-safe way, and the obvious tactic is to allow Nodes as substituents in data state.
h`…` syntactic sugar
The JS parser scans each h-prefixed backquoted string
for occurrences of U+007B LEFT CURLY BRACKET not immediately preceded by
an odd number of U+005C REVERSE SOLIDUS characters. From each such LEFT
CURLY BRACKET, it scans forward for a subsequent U+007D RIGHT CURLY
BRACKET at the same nesting level, counting parentheses (U+0028, U+0029)
and square brackets (U+005B, U+005D) as well as curly brackets for
nesting. It is a syntax error if the grouping characters are
misnested.
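The scan described above can be sketched in plain JavaScript roughly as follows. The function names are made up for illustration. Like the rule itself, the sketch counts brackets blindly, so a bracket inside a string literal within an expression would confuse it.

```javascript
// Hypothetical sketch of the bracket-matching scan; names are illustrative.
const CLOSER_FOR = new Map([["{", "}"], ["(", ")"], ["[", "]"]]);
const CLOSERS = new Set(CLOSER_FOR.values());

// True if the character at `index` is immediately preceded by an odd
// number of backslashes.
function isEscaped(text, index) {
  let backslashes = 0;
  for (let j = index - 1; j >= 0 && text[j] === "\\"; j--) backslashes++;
  return backslashes % 2 === 1;
}

// Given the index of an opening "{", return the index of its matching "}".
function findMatchingBrace(text, open) {
  const expected = ["}"]; // stack of closers we still need to see
  for (let j = open + 1; j < text.length; j++) {
    const ch = text[j];
    if (CLOSER_FOR.has(ch)) {
      expected.push(CLOSER_FOR.get(ch));
    } else if (CLOSERS.has(ch)) {
      if (expected.pop() !== ch) {
        throw new SyntaxError("misnested '" + ch + "' at offset " + j);
      }
      if (expected.length === 0) return j;
    }
  }
  throw new SyntaxError("unterminated substitution at offset " + open);
}

// Return the source text of every {...} substitution in the literal.
function extractSubstitutions(literal) {
  const spans = [];
  let i = 0;
  while (i < literal.length) {
    if (literal[i] === "{" && !isEscaped(literal, i)) {
      const end = findMatchingBrace(literal, i);
      spans.push({ start: i, end, source: literal.slice(i + 1, end) });
      i = end + 1;
    } else {
      i++;
    }
  }
  return spans;
}
```

For an input such as `<h1>{user.names[0]}</h1>` this extracts `user.names[0]`, while `{(a]}` is rejected with a SyntaxError because the parenthesis is closed by the wrong bracket.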
The text in between the matching curly brackets is extracted and
parsed according to the standard JavaScript Expression
production (note that this production is not used to determine
the end of the extraction). It is a semantic error if the expression can
be determined to have side effects at compile time. The parser
substitutes a gensym for each extracted expression; it SHOULD (in the
RFC2119 sense) merge the expressions into equivalence classes and use
the same gensym for all occurrences of the same equivalence class.
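As an illustration of the gensym merging, here is a small sketch that hands out one generated name per distinct expression. It assumes the crudest possible notion of equivalence, namely identical source text after trimming whitespace; a real implementation could be smarter. The function name and the g0, g1, ... naming scheme are invented for this example.

```javascript
// Hypothetical sketch: assign one gensym per equivalence class of
// extracted expressions.
// Assumption: two expressions are equivalent only if their trimmed source
// text is identical.
function assignGensyms(expressionSources) {
  const gensymFor = new Map(); // trimmed source text -> gensym
  const names = expressionSources.map((source) => {
    const key = source.trim();
    if (!gensymFor.has(key)) {
      gensymFor.set(key, "g" + gensymFor.size);
    }
    return gensymFor.get(key);
  });
  return { names, gensymFor };
}

// assignGensyms(["icon", "user.name", " icon "]).names
//   returns ["g0", "g1", "g0"]
```

Reusing a gensym for repeated expressions means each distinct expression only needs to be evaluated once, which is presumably why the merge is recommended.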
The overall literal is then replaced by a call to document.parseDocumentFragment whose htmlText argument is the backquoted string after {...} replacement, and whose substitutions argument is a dictionary whose keys are the gensyms and whose values are the results of evaluating the extracted expressions. Each expression, after evaluation, is processed through an abstract operation which I shall call [[primToString]]. This converts all primitive and boxed primitive JavaScript types to string as-if by the standard ToString abstract operation, but leaves objects of any other type alone.
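Since [[primToString]] is an abstract operation, it has no real spelling in JavaScript, but its intent can be approximated with an ordinary function. The sketch below is one such approximation, not a definition. It uses String(value), which matches ToString for the primitive types this proposal has in mind, and it recognizes boxed primitives with instanceof, which would miss wrappers created in another frame.

```javascript
// Hypothetical approximation of the [[primToString]] abstract operation.
// Assumption: String(value) stands in for ToString; the two differ only
// for Symbol values, which this proposal does not consider.
function primToString(value) {
  const isPrimitive =
    value === null ||
    (typeof value !== "object" && typeof value !== "function");
  const isBoxedPrimitive =
    value instanceof String ||
    value instanceof Number ||
    value instanceof Boolean;
  if (isPrimitive || isBoxedPrimitive) {
    return String(value);
  }
  return value; // Nodes, attribute dictionaries, etc. pass through untouched
}
```

With this, primToString(2 + 2) gives the string "4", while an attribute dictionary or a DOM Node comes back as the same object, ready to be handled by the tokenizer rules for non-string values.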
Thus, the motivating example above might get rewritten as
function addUserBox(userlist, username, icon, attrs) {
    var section = document.parseDocumentFragment(
        '<section class="user" {A}>
           <h1>{B}</h1>
         </section>', { "A": [[primToString]](attrs),
                        "B": [[primToString]](username) });
    if (icon)
        section.append(document.parseDocumentFragment(
            '<img src="{A}" alt="">', { "A": [[primToString]](icon) }));
    userlist.append(section);
}
Note that there is no need to protect the gensyms against collisions
with other identifiers in the program, since they will only be used for
lookup in the substitutions dictionary, and by construction, nothing else will.
Rationale
This syntactic sugar is similar to, but not the same as, the sugar
proposed for general JavaScript quasi-literals. The key differences are that our substitutions use {...} instead of ${...}, and we simply match curly brackets rather than using Expression to determine the end of the substitution. I believe both are in keeping
with the way similar things have been done in e.g. PHP and Python.
The [[primToString]] conversion improves the readability of simple cases, e.g. one may write
h`Two and two are {2+2}`
instead of
h`Two and two are {(2+2).toString()}`
while staying out of the way of authors who need to substitute attribute dictionaries or Nodes.
The h prefix is maybe too short; if it’s to be longer, html would be the logical choice.