Invisible XML Specification

Community Group Editorial Draft, 2024-10-15

This version:: https://invisiblexml.org/pr/276/
Latest consensus draft:: https://invisiblexml.org/current/
Latest published version:: https://www.w3.org/community/reports/ixml/CG-FINAL-ixml-20231212/
Test suite:: https://github.com/invisibleXML/ixml/tree/master/tests
Editor:: Steven Pemberton, CWI, Amsterdam
Feedback:: GitHub: invisiblexml/ixml (pull requests, new issue, open issues)

Please consult the errata page for additional changes to the specification after publication.

The Invisible XML specification grammar described in this version of the specification is available in ixml or XML format.

A version with automatically generated change markup is available. Change markup shows the differences between this version of the specification and the current version (at the time of publication).

1. Status

This document describes ixml version 1.0++, a work in progress. It reflects corrections to errata found in the published text of ixml 1.0, and may in due course be published as an updated specification. The current official version is Invisible XML 1.0.

2. Introduction
3. How it works
4. The Grammar
- 4.1. Prolog
- 4.2. Rules
- 4.3. Nonterminals
- 4.4. Terminals
- 4.5. Character sets
- 4.6. Insertions
5. Parsing
6. Serialization
7. Conformance
- 7.7. Conformance of grammars
- 7.8. Conformance of processors
8. Hints for Implementers
9. Complete Grammar
10. IXML in XML
11. Errors
12. References
13. Informational References
14. Acknowledgements

2. Introduction

Data is an abstraction: there is no essential difference between the JSON

{"temperature": {"scale": "C", "value": 21}}

and an equivalent XML

<temperature scale='C' value='21'/>

or

<temperature>
   <scale>C</scale>
   <value>21</value>
</temperature>

since the underlying abstractions being represented are the same.

We choose which representations of our data to use, CSV, JSON, XML, or whatever, depending on habit, convenience, and the context in which it occurs. On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value. How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content. For example, it can turn CSS code like

body {color: blue; font-weight: bold}

into XML like

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

or, if preferred, as:

<css>
   <rule>
      <simple-selector name='body'/>
      <property name='color' value='blue'/>
      <property name='font-weight' value='bold'/>
   </rule>
</css>

As another example, the expression

pi×(10+b)

can result in the XML

<prod>
   <id>pi</id>
   <sum>
      <number>10</number>
      <id>b</id>
   </sum>
</prod>

or

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>

and the URL

http://www.w3.org/TR/1999/xhtml.html

can give

<url>
   <scheme name='http'/>
   <authority>
      <host>
         <sub name='www'/>
         <sub name='w3'/>
         <sub name='org'/>
      </host>
   </authority>
   <path>
      <seg sname='TR'/>
      <seg sname='1999'/>
      <seg sname='xhtml.html'/>
   </path>
</url>

or

<url scheme='http'>
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>

The JSON value:

{"name": "pi", "value": 3.1415926}

can give

<json>
   <object>
      <pair string='name'>
         <string>pi</string>
      </pair>
      <pair string='value'>
         <number>3.1415926</number>
      </pair>
   </object>
</json>

3. How it works

A grammar is used to describe the input format. An input is parsed using this grammar, and the resulting parse tree is serialized as XML. Special marks in the grammar affect details of this serialization, for example excluding parts of the tree, or serializing parts as attributes instead of elements.

As an example, consider this simplified grammar for URLs:

url: scheme, ":", authority, path.

scheme: letter+.

authority: "//", host.
host: sub++".".
sub: letter+.

path: ("/", seg)+.
seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

This means that a URL consists of a scheme (whatever that is), followed by a colon, followed by an authority, and then a path. A scheme is one or more letters (whatever a letter is). An authority starts with two slashes, followed by a host. A host is one or more subs, separated by points. A sub is one or more letters. A path is a slash followed by a seg, repeated one or more times. A seg is zero or more fletters. A letter is a lowercase letter, an uppercase letter, or a digit. A fletter is a letter or a point.

So, given the input string http://www.w3.org/TR/1999/xhtml.html, this would produce the serialization

<url>
   <scheme>http</scheme>:
   <authority>//
      <host>
         <sub>www</sub>.
         <sub>w3</sub>.
         <sub>org</sub>
      </host>
   </authority>
   <path>
      /<seg>TR</seg>
      /<seg>1999</seg>
      /<seg>xhtml.html</seg>
   </path>
</url>

(Here and in other examples, whitespace has been added to the XML for legibility.)

If the rule for letter had not had a "-" before it, the serialization for scheme, for instance, would have been:

<scheme><letter>h</letter><letter>t</letter><letter>t</letter><letter>p</letter></scheme>

Changing the rule for scheme to

scheme: name.
@name: letter+.

would change the serialization for scheme to:

<scheme name="http"/>:

Changing the rule for scheme instead to:

@scheme: letter+.

would change the serialization for url to:

<url scheme="http">

Changing the definitions of sub and seg from

sub: letter+.
seg: fletter*.

to

-sub: letter+.
-seg: fletter*.

would prevent the sub and seg elements appearing in the serialized result, giving:

<url scheme='http'>:
   <authority>//
      <host>www.w3.org</host>
   </authority>
   <path>/TR/1999/xhtml.html</path>
</url>

Changing the rule

url: scheme, ":", authority, path.

to

url: scheme, -":", authority, path.

and

authority: "//", host.

to

authority: -"//", host.

would remove the spurious characters from the serialization:

<url scheme='http'>
   <authority>
      <host>www.w3.org</host>
   </authority>
   <path>/TR/1999/xhtml.html</path>
</url>
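
As a rough cross-check of the final grammar above, the following Python sketch hard-codes those rules as a regular expression. It is not an ixml processor: the name parse_url and the use of a regular expression are assumptions made for brevity, whereas a real processor interprets the grammar generically. The authority element is kept because its rule carries no mark.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].    fletter: letter; ".".
LETTER = "[A-Za-z0-9]"
FLETTER = "[A-Za-z0-9.]"

URL = re.compile(
    rf"(?P<scheme>{LETTER}+):"                 # url: scheme, -":", authority, path.
    rf"//(?P<host>{LETTER}+(?:\.{LETTER}+)*)"  # authority: -"//", host.  host: sub++".".
    rf"(?P<path>(?:/{FLETTER}*)+)"             # path: ("/", seg)+.  seg: fletter*.
)

def parse_url(text):
    """Hypothetical helper: serialize text the way the final grammar would."""
    m = URL.fullmatch(text)
    if m is None:
        raise ValueError("input is not described by the grammar")
    return (
        f"<url scheme={quoteattr(m['scheme'])}>"   # @scheme becomes an attribute
        f"<authority><host>{escape(m['host'])}</host></authority>"
        f"<path>{escape(m['path'])}</path>"
        "</url>"
    )

print(parse_url("http://www.w3.org/TR/1999/xhtml.html"))
```

No whitespace is added to the output here, so the result is the same serialization on a single line.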

4. The Grammar

Here we describe the format of the grammar used to describe documents. Note that it is in its own format, and therefore describes itself.

A grammar is an optional prolog, followed by a sequence of one or more rules, surrounded and separated by spacing and comments. Spacing and comments are entirely optional, except that rules must be separated by at least one of either (error S01). If an input grammar encoded in UTF-8 begins with a byte order mark (BOM), the BOM must be ignored.

ixml: s, (prolog, RS)?, rule++RS, s.

An s stands for an optional sequence of spacing and comments; RS for at least one space or comment. A comment is enclosed in braces, and can include nested comments, to enable commenting out parts of a grammar:

         -s: (whitespace; comment)*. {Optional spacing}
        -RS: (whitespace; comment)+. {Required spacing}
-whitespace: -[Zs]; tab; lf; cr.
       -tab: -#9.
        -lf: -#a.
        -cr: -#d.
    comment: -"{", (cchar; comment)*, -"}".
     -cchar: ~["{}"].
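
Because comments nest, a scanner cannot simply search for the next closing brace. A minimal Python sketch of the idea, using a depth counter (the function name skip_comment is hypothetical, not part of any ixml API):

```python
def skip_comment(text, pos):
    """Return the index just past the comment that starts at text[pos]."""
    if pos >= len(text) or text[pos] != "{":
        raise ValueError("no comment starts at this position")
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "{":
            depth += 1          # entering a (possibly nested) comment
        elif text[i] == "}":
            depth -= 1          # leaving one level of nesting
            if depth == 0:
                return i + 1
    raise ValueError("unterminated comment")

grammar = "{a {b} c} x: 'y'."
print(grammar[skip_comment(grammar, 0):])
```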

4.1. Prolog

The optional prolog declares the version of ixml being used.

       prolog: version.
      version: -"ixml", RS, -"version", RS, string, s, -'.' .

If a version string is provided and the implementation recognizes the version string, it must process the grammar using the syntax and semantics of that version.

If the version is not provided, or the implementation does not recognize the version string provided, it must nevertheless attempt to process the grammar. In this case, it is implementation-defined which version or versions the implementation uses when it attempts to parse the grammar. If it finds a syntactically valid interpretation of the grammar, it must proceed using the semantics of the version under which it found a valid interpretation, otherwise it must reject the grammar.

A grammar must conform to the syntax and semantics of the version declared or assumed (error S12).

The document element of the serialization should include an attribute named ixml:version that identifies the version of the iXML grammar used for the parse. If the prolog specifies a version that is unrecognized, the document element of the serialization must include an attribute named ixml:state, with the word 'version-mismatch' in its value. The ixml namespace URI is "http://invisiblexml.org/NS".

4.2. Rules

A rule consists of a naming, and one or more alternatives. The grammar here uses colons to define rules; an equals sign is also allowed.

rule: naming, -["=:"], s, -alts, -".".

A naming consists of an optional mark, a name, and an optional alias:

-naming: (mark, s)?, name, s, (">", s, alias, s)?.

A mark is one of ^, @ or -, and indicates whether the item so marked will be serialized as an element with its children (^) which is the default, as an attribute (@), or deleted, so that only its children are serialized (-).

@mark: ["@^-"].

A name starts with a letter or underscore, and continues with a letter, digit, underscore, a small number of punctuation characters, and the Unicode combiner characters; Unicode classes are used to define the sets of characters used, for instance, for letters and digits. This is close to, but not identical to, the XML definition of a name; it is the grammar author's responsibility to ensure that all serialized names match the requirements for an XML name [XML]. Names are case-sensitive.

        @name: namestart, namefollower*.
   -namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
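
The name rules can be checked with Unicode general categories. The following Python sketch is illustrative only: the function names are hypothetical, and the result depends on the Unicode version bundled with the Python interpreter, which may differ from the one a given ixml processor supports.

```python
import unicodedata

def is_namestart(ch):
    # namestart: ["_"; L].  L covers every letter category (Lu, Ll, Lt, Lm, Lo).
    return ch == "_" or unicodedata.category(ch).startswith("L")

def is_namefollower(ch):
    # namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
    return (
        is_namestart(ch)
        or ch in "-.·‿⁀"
        or unicodedata.category(ch) in ("Nd", "Mn")
    )

def is_ixml_name(s):
    # name: namestart, namefollower*.
    return bool(s) and is_namestart(s[0]) and all(is_namefollower(c) for c in s[1:])

for candidate in ("font-weight", "_x1", "1x", "a:b"):
    print(candidate, is_ixml_name(candidate))
```

Note that a colon is rejected here although XML allows it in names, which is one of the places where the two definitions differ.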

An alias is just a substitute name for the rule to be used at serialization:

@alias: name.

Alternatives are separated by a semicolon or a vertical bar. The grammar here uses semicolons.

alts: alt++(-[";|"], s).

An alternative is zero or more terms, separated by commas:

alt: term**(-",", s).

A term is a singleton factor, an optional factor, or a repeated factor, repeated zero or more times, or one or more times.

-term: factor;
       option;
       repeat0;
       repeat1.

A factor is a terminal, a nonterminal, an insertion, or a bracketed series of alternatives:

-factor: terminal;
         nonterminal;
         insertion;
         -"(", s, alts, -")", s.

A factor repeated zero or more times is followed by an asterisk, or followed by a double asterisk and a separator, e.g. abc* and abc**",". For instance "a"**"#" would match the empty string, a, a#a, a#a#a etc.

repeat0: factor, (-"*", s; -"**", s, sep).

Similarly, a factor repeated one or more times is followed by a plus, or a double plus and a separator, e.g. abc+ and abc++",". For instance "a"++"#" would match a, a#a, a#a#a etc., but not the empty string.

repeat1: factor, (-"+", s; -"++", s, sep).

An optional factor is followed by a question mark, e.g. abc?. For instance "a"? would match a or the empty string.

option: factor, -"?", s.

A separator can be any factor, for example abc**def or abc**(","; "."). For instance "a"++("#"; "!") would match a#a, a!a, a#a!a, a!a#a, a#a#a etc.

sep: factor.
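
For readers more familiar with regular expressions, the separated repetitions above can be approximated as follows. This is only an analogy under the assumption that the factor and the separator are themselves expressible as regular-expression fragments; the helper names are hypothetical.

```python
import re

def plus_sep(item, sep):
    """Regex fragment for item++sep: one or more items, separated by sep."""
    return f"{item}(?:{sep}{item})*"

def star_sep(item, sep):
    """Regex fragment for item**sep: zero or more items, separated by sep."""
    return f"(?:{plus_sep(item, sep)})?"

def accepts(pattern, text):
    return re.fullmatch(pattern, text) is not None

print(accepts(star_sep("a", "#"), ""))          # "a"**"#" on the empty string
print(accepts(plus_sep("a", "#"), ""))          # "a"++"#" on the empty string
print(accepts(plus_sep("a", "[#!]"), "a#a!a"))  # "a"++("#"; "!")
```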

4.3. Nonterminals

A nonterminal is a naming:

nonterminal: naming.

The name of the naming (but not the optional alias) refers to the rule that defines this name, which must exist (error S02), and there must only be one such rule (error S03).

4.4. Terminals

A terminal is a literal or a set of characters. It matches characters in the input. A terminal marked as deleted (-) serializes to the empty string.

-terminal: literal; 
           charset.

A literal is either a quoted string, or a hexadecimally encoded character:

  literal: quoted;
           encoded.

A quoted string is an optionally marked string of one or more characters, enclosed with single or double quotes. A string matches only the exact same string in the input. Examples: "yes" 'yes'.

A string cannot contain any characters from the control code character class (Cc), including a line-break (error S11). The enclosing quote is represented in a string by doubling it; these two strings are identical: 'Isn''t it?' and "Isn't it?", as are these: "He said ""Don't!""" and 'He said "Don''t!"'.
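
The quote-doubling convention can be illustrated with a small decoder. This is a sketch, assuming the literal is passed in with its enclosing quotes; decode_string is a hypothetical name, and the control-character check relies on the Unicode data shipped with Python.

```python
import unicodedata

def decode_string(literal):
    """Return the characters denoted by a quoted ixml string literal."""
    quote = literal[:1]
    if quote not in ('"', "'") or len(literal) < 3 or literal[-1] != quote:
        raise ValueError("not a quoted string of one or more characters")
    body = literal[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if unicodedata.category(ch) == "Cc":
            raise ValueError("S11: control character in string")
        if ch == quote:
            # The enclosing quote may only appear doubled.
            if body[i + 1:i + 2] != quote:
                raise ValueError("enclosing quote must be doubled")
            i += 1
        out.append(ch)
        i += 1
    return "".join(out)

print(decode_string("'Isn''t it?'"))
print(decode_string('"Isn\'t it?"'))
```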

 -quoted: (tmark, s)?, string, s.

  @tmark: ["^-"].
 @string: -'"', dchar+, -'"';
          -"'", schar+, -"'".
   dchar: ~['"'; Cc];
          '"', -'"'. {all characters except controls; quotes must be doubled}
   schar: ~["'"; Cc];
          "'", -"'". {all characters except controls; quotes must be doubled}

An encoded character is an optionally marked hexadecimal number. It starts with a hash symbol, followed by one or more hexadecimal digits, for example #a0. The digits are interpreted as a number in hexadecimal (error S06), and the character at that Unicode code-point is used [Unicode]. The number must be within the Unicode code-point range (error S07), and must not denote a Noncharacter or Surrogate code point (error S08). The version of Unicode cited is the one current at the time the initial version of this specification was published. Processors may support any version of Unicode; it is implementation-defined which version(s) they support.

An encoded character matches that one character in the input.

-encoded: (tmark, s)?, -"#", hex, s.
    @hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.
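
The range checks on encoded characters can be sketched as follows. The function name decode_hex is hypothetical, and the mapping of conditions to S07 and S08 follows the paragraph above; the noncharacter ranges used (U+FDD0 to U+FDEF, and the last two code points of every plane) are those defined by Unicode.

```python
import string

def decode_hex(digits):
    """Return the character for the hex digits that follow '#' in a grammar."""
    if not digits or any(c not in string.hexdigits for c in digits):
        raise ValueError("expected one or more hexadecimal digits")
    cp = int(digits, 16)
    if cp > 0x10FFFF:
        raise ValueError("S07: outside the Unicode code-point range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("S08: surrogate code point")
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        raise ValueError("S08: noncharacter code point")
    return chr(cp)

print(repr(decode_hex("a0")))
print(repr(decode_hex("61")))
```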

4.5. Character sets

A character set is an inclusion or an exclusion: an inclusion matches one character in the input that is in the set, an exclusion matches one character not in the set.

An inclusion is enclosed in square brackets, and represents the set of characters defined by any combination of literal characters, a range of characters, hex encoded characters, or Unicode classes. Examples ["a"-"z"], ["xyz"], [Lc], and ["0"-"9"; "!@#"; Lc]. Note that ["abc"], ["a"; "b"; "c"], ["a"-"c"], and [#61-#63] all represent the same set of characters.

An exclusion is an inclusion preceded by a tilde ~. For example, ~["{}"] matches any character that is not an opening or closing brace.

Note that the empty inclusion [] will fail to match any character in the input; on the other hand ~[] will match any one character, whatever it is.

 -charset: inclusion; 
           exclusion.
inclusion: (tmark, s)?,          set.
exclusion: (tmark, s)?, -"~", s, set.
     -set: -"[", s,  (member, s)**(-[";|"], s), -"]", s.
   member: string;
           -"#", hex;
           range;
           class.

A range represents all characters in the range from the from character to the to character, inclusive, using the Unicode ordering. The from character must not be later in the ordering than the to character (error S09).

-range: from, s, -"-", s, to.
 @from: character.
   @to: character.

A character is a string of length one, or a hex encoded character:

-character: -'"', dchar, -'"';
            -"'", schar, -"'";
            "#", hex.

A class is one or two letters, representing any character from the Unicode character category [Categories] of that name, which must exist (error S10). For example, [Ll] matches any lower-case letter, and [Ll; Lu] matches any upper- or lower-case letter.

   -class: code.
    @code: capital, letter?.
 -capital: ["A"-"Z"].
  -letter: ["A"-"Z"; "a"-"z"].

4.6. Insertions

An insertion is a string or hex preceded by a plus +. An insertion matches zero characters in the input, and only appears in the serialization.

insertion: -"+", s, (string; -"#", hex), s.

5. Parsing

The root symbol of the grammar is the name of the first rule in the grammar.

Processors must accept and parse any conforming grammar, and produce at least one parse of supplied input that matches the grammar starting at the root symbol. If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting serialization should include the attribute ixml:state on the document element with a value that includes the word ambiguous. Different processors may vary in whether input is detected as ambiguous or not. Known algorithms that accept and parse any context-free grammar include [Earley], [Unger], [CYK], [GLR], and [GLL]; see also [Grune].

Invisible XML processors normalize line endings when they read grammar and input files. They apply the same rules as [XML]: every occurrence of the two character sequence #d #a, and any #d not immediately followed by a #a, are translated into a single #a in the input. This assures that the single character #a can always be used to match an end of line, irrespective of the conventions of the system where the files were created. It follows that the iXML exclusion ~[#a]* will always match all of the characters to the end of a line.
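
The line-ending rule amounts to two replacements applied in a fixed order. A minimal Python sketch (the function name is hypothetical):

```python
def normalize_line_endings(text):
    # First collapse every #d #a pair, then translate any remaining lone #d.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_line_endings("a\r\nb\rc\nd")))
```

The order matters: replacing lone #d characters first would turn each #d #a pair into two line feeds.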

If an input encoded in UTF-8 begins with a BOM, the BOM should be ignored.

If the parse fails, some XML document must be produced with ixml:state on the document element with a value that includes the word failed. The document should provide helpful information about where and why it failed; it may be a partial parse tree that includes parts of the parse that succeeded.

6. Serialization

If the parse succeeds, the resulting parse-tree is serialized as XML by serializing the root node of the parse tree.

A parse node is one of:

a nonterminal, which has a name, an optional alias, and children,
a terminal, which has a string,
an insertion, which has a string.

A nonterminal can be unmarked, or marked as included (^), as an attribute (@), or as deleted (-). The mark comes from the use of the nonterminal in a rule if present, and otherwise from the definition of the rule for that nonterminal if it has a mark.

Unmarked or included: the node is serialized as an XML element whose:
- name is the alias of the node if present, or the alias of the referred-to rule, if it has one, and otherwise the name of the node,
- attributes are the serializations of all exposed attribute descendants, if any. An attribute node is exposed if it is an attribute child, or an exposed attribute node of a deleted child (note this is recursive).
- content is the serialization of all its non-attribute children in order, if any.
Deleted: all its non-attribute children, if any, are serialized in order.
Attribute: the node is serialized as an XML attribute whose:
- name is the alias of the node if present, or the alias of the referred-to rule, if it has one, and otherwise the name of the node,
- value is the serialization of all non-deleted terminal descendants of the node (regardless of the marking of intermediate nonterminals), if any, in order.

A terminal can be unmarked, or marked as included (^), or as deleted (-).

Unmarked or included: the node is serialized as its string.
Deleted: the node is not serialized.

An insertion is serialized as its string.

An application has some latitude when serializing XML. Some aspects of the serialization are explicitly insignificant, such as the order of attributes on an element, whether single or double quotes are used to delimit attributes, or whether numeric character references use decimal or hexadecimal numbers. Some aspects of the serialization will impact whether or not all characters of the input are retained after the serialized XML is parsed with a conforming XML parser (e.g. #a#d as a line separator, or either of those characters within attribute values). For a more comprehensive discussion, see [XML Serialization]. A conformant Invisible XML processor is required to produce well-formed XML output, but its choices in serializing the selected parse tree are not otherwise constrained.

Grammars must be written so that any serialization of a parse tree produced from the grammar is well-formed XML (error D01).

Note: This requirement means for instance that names of serialized elements and attributes must match the XML requirements (error D03); an element must not contain more than one attribute of a given name (error D02); an element must not contain an attribute named “xmlns” (error D07); the names of all elements and attributes must conform to the requirements for XML names; non-XML characters must not be serialized (error D04); a nonterminal being serialized as root element must not be marked as an attribute (error D05); in order to match the XML requirement of a single-rooted document, if the root rule is marked as hidden, all of its productions must produce exactly one non-hidden non-attribute nonterminal and no non-hidden terminals before or after that nonterminal (error D06).

A (necessarily contrived) example grammar that illustrates serialization rules is:

          expr: open, -arith, @close, -";".
         @open: "(".
         close: ")".
         arith: left, op, ^right>second.
    left>first: operand.
        -right: operand.
      -operand: name; -number.
         @name: ["a"-"z"].
       @number: ["0"-"9"].
           -op: sign.
@sign>operator: "+"; "-".

Applied to the string (a+1); it yields the serialization

<expr open='(' operator='+' close=')'>
   <first name='a'/>
   <second>1</second>
</expr>

Points to note: how the semicolon is suppressed from the serialization; the two ways open and close have been defined as attributes; similarly the two ways left and right have been defined as elements, and the two ways they have been renamed; how number appears as content and not as an attribute; and how sign being an exposed attribute appears on its nearest non-hidden ancestor, and has been renamed. Also of note is how the content of some attributes can appear earlier in the serialization than in the input.
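
The serialization rules can be traced on this example with a short Python sketch. It is not a processor: the parse tree for (a+1); is built by hand, with marks and aliases already resolved as described above, and the class names T and N and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr

@dataclass
class T:                      # a terminal or an insertion
    text: str
    mark: str = "^"

@dataclass
class N:                      # a nonterminal with resolved name and mark
    name: str
    mark: str = "^"
    children: list = field(default_factory=list)

def text_of(node):
    """Attribute value: all non-deleted terminal descendants, in order."""
    if isinstance(node, T):
        return "" if node.mark == "-" else node.text
    return "".join(text_of(c) for c in node.children)

def exposed_attrs(node):
    """Attribute children, plus exposed attributes of deleted children."""
    for c in node.children:
        if isinstance(c, N):
            if c.mark == "@":
                yield c
            elif c.mark == "-":
                yield from exposed_attrs(c)

def serialize(node):
    if isinstance(node, T):
        return "" if node.mark == "-" else escape(node.text)
    if node.mark == "@":
        return ""             # emitted on the nearest element ancestor instead
    content = "".join(serialize(c) for c in node.children)
    if node.mark == "-":
        return content
    attrs = "".join(f" {a.name}={quoteattr(text_of(a))}" for a in exposed_attrs(node))
    if content:
        return f"<{node.name}{attrs}>{content}</{node.name}>"
    return f"<{node.name}{attrs}/>"

tree = N("expr", "^", [
    N("open", "@", [T("(")]),
    N("arith", "-", [
        N("first", "^", [N("operand", "-", [N("name", "@", [T("a")])])]),
        N("op", "-", [N("operator", "@", [T("+")])]),
        N("second", "^", [N("operand", "-", [N("number", "-", [T("1")])])]),
    ]),
    N("close", "@", [T(")")]),
    T(";", "-"),
])
print(serialize(tree))
```

The sketch emits double-quoted attributes and no added whitespace; both are among the serialization details that the specification leaves open.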

Insertions allow characters to be inserted into the serialization that were not present in the input. For instance, the grammar

  data: value++-",", @source.
source: +"ixml".
 value: pos; neg.
  -pos: +"+", digit+.
  -neg: +"-", -"(", digit+, -")".
-digit: ["0"-"9"].

with input:

100,200,(300),400

would produce

<data source='ixml'>
   <value>+100</value>
   <value>+200</value>
   <value>-300</value>
   <value>+400</value>
</data>
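
The effect of the insertions in this example can be imitated directly. The sketch below hard-codes the five rules rather than interpreting the grammar, and serialize_data is a hypothetical name; it is meant only to show where the inserted characters come from.

```python
import re

def serialize_data(text):
    values = []
    for item in text.split(","):              # value++-",": commas are deleted
        if re.fullmatch("[0-9]+", item):      # pos: +"+", digit+.
            values.append("+" + item)
        elif re.fullmatch(r"\([0-9]+\)", item):   # neg: +"-", -"(", digit+, -")".
            values.append("-" + item[1:-1])
        else:
            raise ValueError("input is not described by the grammar")
    body = "".join(f"<value>{v}</value>" for v in values)
    # source: +"ixml". matches no input and contributes only the inserted text.
    return f"<data source='ixml'>{body}</data>"

print(serialize_data("100,200,(300),400"))
```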

7. Conformance

In this specification, the verb "must" expresses unconditional requirements for conformance to the specification; the verb "should" expresses requirements that are encouraged but which are not conditions of conformance; the verb "may" expresses optional features which are neither required nor prohibited.

Conformance to this specification can meaningfully be claimed for grammars and for processors; it cannot be claimed for input streams or input + grammar pairs.

7.7. Conformance of grammars

An ixml grammar in ixml form conforms to this specification if it is described by the grammar given in this specification, and it satisfies all the other requirements specified for ixml grammars.

An ixml grammar in XML form conforms to this specification if, after removal of namespace qualified elements and attributes, it can be derived from an ixml grammar in ixml form by parsing as described in this specification, and it satisfies all the other requirements specified for ixml grammars.

Note: The normative formulations of conformance requirements are those given elsewhere in this specification. For convenience the requirements that go beyond what is expressed in the grammar itself can be summarized as follows. (Reasonable effort has been used to make this list complete, but omission of any conformance requirement from this list does not affect its status as a conformance requirement.)

Every nonterminal used in the right-hand side of any rule must be defined by a single rule.
Any character class used must be one that is listed in the Unicode specification.
The number represented in a hex encoding of a character must be within the Unicode character range, and must not denote a Noncharacter or Surrogate code point.
The from character of a range must not be later in the Unicode ordering than the to character.
Any serialization of a parse tree produced from the grammar must be well-formed XML.

7.8. Conformance of processors

A conforming processor must accept grammars in ixml form, and should accept grammars in XML form; it must not accept non-conforming grammars. Both grammars and input must be accepted in UTF-8 encoding, and may be accepted in other encodings.

For any conforming grammar and any input, under normal operation:

Processors must parse by default the entire input using the grammar, determining in the process whether or not the input is described by the grammar. Processors may provide user options for other behaviors, such as parsing the largest, or smallest, prefix of the input that is described by the grammar, or supporting invocation with input streams of indeterminate length.
If the input is unambiguously described by the grammar, the resulting parse tree must be serialized to an XML document.
If more than one parse tree describes the input, the processor must serialize one of them. It is not defined how this choice is made, but the resulting serialization should by default include the attribute ixml:state on the document element with a value that includes the word ambiguous. Processors may provide a user option to suppress that attribute; they may also provide a user option to produce more than one parse tree.
If the input is not described by the grammar, the processor must produce some XML document with the attribute ixml:state on the document element with a value that includes the word failed, with helpful information about where and why it failed; it may be a partial parse tree that includes parts of the parse that succeeded.
If a prefix of the input is described by the grammar, processors may choose either to produce a failure document as described above, or to serialize the resulting parse tree with the attribute ixml:state containing the word prefix, or if the parse is ambiguous, the words ambiguous prefix.
If the input was processed as a different version of ixml than that required by the prolog, the ixml:state attribute must include the word version-mismatch.
The form in which XML documents are produced is not constrained by this specification; processors should be capable of producing serialized XML as a character stream, but other forms (e.g. DOM instances or XDM instances) may also be used.
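
One possible way to attach the state words listed above to a result document is sketched below with Python's standard ElementTree module. The function mark_state and its keyword parameters are hypothetical; the namespace URI is the one given in the Prolog section, and the word order "ambiguous prefix" follows the wording above.

```python
import xml.etree.ElementTree as ET

IXML_NS = "http://invisiblexml.org/NS"
ET.register_namespace("ixml", IXML_NS)

def mark_state(root, *, ambiguous=False, prefix=False, failed=False,
               version_mismatch=False):
    """Set ixml:state on the document element when any condition holds."""
    flags = (
        ("ambiguous", ambiguous),
        ("prefix", prefix),
        ("failed", failed),
        ("version-mismatch", version_mismatch),
    )
    words = [word for word, on in flags if on]
    if words:
        root.set(f"{{{IXML_NS}}}state", " ".join(words))
    return root

doc = mark_state(ET.Element("data"), ambiguous=True, prefix=True)
print(ET.tostring(doc, encoding="unicode"))
```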

8. Hints for Implementers

Many parsing algorithms only mention terminals and nonterminals, and don't explain how to deal with the repetition constructs used in ixml. However, these can be handled simply by converting them to equivalent simple constructs. In the examples below, f and sep are factors from the grammar above. The other nonterminals are generated nonterminals.

Optional factor:

f? ⇒ f-option
-f-option: f; ().

Zero or more repetitions:

f* ⇒ f-star
-f-star: (f, f-star)?.

One or more repetitions:

f+ ⇒ f-plus
-f-plus: f, f*.

One or more repetitions with separator:

f++sep ⇒ f-plus-sep
-f-plus-sep: f, (sep, f)*.

Zero or more repetitions with separator:

f**sep ⇒ f-star-sep
-f-star-sep: (f++sep)?.
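
The rewrites above still use ? and * on their right-hand sides, so they are meant to be applied repeatedly. The Python sketch below goes one step further and produces rules with plain alternatives only, then checks them with a naive recognizer. All names are hypothetical, rules are represented as a dict from rule name to a list of alternatives, any symbol without a rule is treated as a literal terminal, and the recognizer assumes f and sep do not match the empty string.

```python
def expand(f, op, sep=None):
    """Return (start, rules) with no repetition operators left."""
    star, plus, tail = f"{f}-star", f"{f}-plus-sep", f"{f}-sep-tail"
    if op == "?":
        return f"{f}-option", {f"{f}-option": [[f], []]}
    if op == "*":
        return star, {star: [[f, star], []]}
    if op == "+":
        return f"{f}-plus", {f"{f}-plus": [[f, star]], star: [[f, star], []]}
    sep_rules = {plus: [[f, tail]], tail: [[sep, f, tail], []]}
    if op == "++":
        return plus, sep_rules
    if op == "**":
        return f"{f}-star-sep", {f"{f}-star-sep": [[plus], []], **sep_rules}
    raise ValueError(f"unknown operator {op!r}")

def derive(rules, sym, text, pos):
    """Yield every end position reachable by matching sym at text[pos:]."""
    if sym not in rules:                      # literal terminal
        if text.startswith(sym, pos):
            yield pos + len(sym)
        return
    for alt in rules[sym]:
        ends = {pos}
        for s in alt:
            ends = {e2 for e in ends for e2 in derive(rules, s, text, e)}
        yield from ends

def accepts(start, rules, text):
    return len(text) in set(derive(rules, start, text, 0))

start, rules = expand("a", "**", "#")
for sample in ("", "a", "a#a", "a#"):
    print(repr(sample), accepts(start, rules, sample))
```

Generated nonterminals would be marked as deleted (-) in a real implementation so that they leave no trace in the serialization, as in the rewrites above.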

Implementers should pay particular attention to serializing whitespace and other control characters. Consider, for example, the case where the characters #a or #d appear in a value serialized as an attribute. When that serialized XML is parsed, the XML parser will replace #a and #d characters with spaces when it performs whitespace normalization on the attribute value. Similarly, the sequence #d#a will be translated to a single #a by standard XML parsing. If the user of the grammar expects to see the original characters in the XML output, it will be necessary to encode them using numeric character references when serializing the XML output. If on the other hand the user does not expect to see the original characters in the output, then carefully preserving them using numeric character references is likely to be unhelpful. See [Serialization] for detailed discussions.

9. Complete Grammar

The complete grammar for ixml:

{ Invisible XML specification grammar, 2024-10-15 }
{ Published in https://invisiblexml.org/pr/276/ }
{ Commit hash e76e9bc08ef6 }

         ixml: s, (prolog, RS)?, rule++RS, s.

           -s: (whitespace; comment)*. {Optional spacing}
          -RS: (whitespace; comment)+. {Required spacing}
  -whitespace: -[Zs]; tab; lf; cr.
         -tab: -#9.
          -lf: -#a.
          -cr: -#d.
      comment: -"{", (cchar; comment)*, -"}".
       -cchar: ~["{}"].

       prolog: version.
      version: -"ixml", RS, -"version", RS, string, s, -'.' .

         rule: naming, -["=:"], s, -alts, -".".
      -naming: (mark, s)?, name, s, (">", s, alias, s)?.
        @name: namestart, namefollower*.
   -namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

       @alias: name.
         alts: alt++(-[";|"], s).
          alt: term**(-",", s).
        -term: factor;
               option;
               repeat0;
               repeat1.
      -factor: terminal;
               nonterminal;
               insertion;
               -"(", s, alts, -")", s.
      repeat0: factor, (-"*", s; -"**", s, sep).
      repeat1: factor, (-"+", s; -"++", s, sep).
       option: factor, -"?", s.
        @mark: ["@^-"].
          sep: factor.
  nonterminal: naming.
    -terminal: literal; 
               charset.
      literal: quoted;
               encoded.
      -quoted: (tmark, s)?, string, s.

       @tmark: ["^-"].
      @string: -'"', dchar+, -'"';
               -"'", schar+, -"'".
        dchar: ~['"'; Cc];
               '"', -'"'. {all characters except controls; quotes must be doubled}
        schar: ~["'"; Cc];
               "'", -"'". {all characters except controls; quotes must be doubled}
     -encoded: (tmark, s)?, -"#", hex, s.
         @hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.

     -charset: inclusion; 
               exclusion.
    inclusion: (tmark, s)?,          set.
    exclusion: (tmark, s)?, -"~", s, set.
         -set: -"[", s,  (member, s)**(-[";|"], s), -"]", s.
       member: string;
               -"#", hex;
               range;
               class.
       -range: from, s, -"-", s, to.
        @from: character.
          @to: character.
   -character: -'"', dchar, -'"';
               -"'", schar, -"'";
               "#", hex.
       -class: code.
        @code: capital, letter?.
     -capital: ["A"-"Z"].
      -letter: ["A"-"Z"; "a"-"z"].
    insertion: -"+", s, (string; -"#", hex), s.

10. IXML in XML

Since the ixml grammar is expressed in its own notation, the above grammar can be processed into an XML document by parsing it using itself, and then serializing. Note that all semantically significant terminals are recorded in attributes, and non-significant characters are not serialized. An abbreviated serialization is shown below, but the entire serialization is available:

<ixml>
   <comment> Invisible XML specification grammar, 2024-10-15 </comment>
   <comment> Published in https://invisiblexml.org/pr/276/ </comment>
   <comment> Commit hash e76e9bc08ef6 </comment>
   <rule name='ixml'>
      <alt>
         <nonterminal name='s'/>
         <option>
            <alts>
               <alt>
                  <nonterminal name='prolog'/>
                  <nonterminal name='RS'/>
               </alt>
            </alts>
         </option>
         <repeat1>
            <nonterminal name='rule'/>
            <sep>
               <nonterminal name='RS'/>
            </sep>
         </repeat1>
         <nonterminal name='s'/>
      </alt>
   </rule>
   <rule mark='-' name='s'>
      <alt>
         <repeat0>
            <alts>
               <alt>
                  <nonterminal name='whitespace'/>
               </alt>
               <alt>
                  <nonterminal name='comment'/>
               </alt>
            </alts>
         </repeat0>
      </alt>
   </rule>
   <comment>Optional spacing</comment>
   <rule mark='-' name='RS'>
      <alt>
         <repeat1>
            <alts>
               <alt>
                  <nonterminal name='whitespace'/>
               </alt>
               <alt>
                  <nonterminal name='comment'/>
               </alt>
            </alts>
         </repeat1>
      </alt>
   </rule>
   <comment>Required spacing</comment>
   <rule mark='-' name='whitespace'>
      <alt>
         <inclusion tmark='-'>
            <member code='Zs'/>
         </inclusion>
      </alt>
      <alt>
         <nonterminal name='tab'/>
      </alt>
      <alt>
         <nonterminal name='lf'/>
      </alt>
      <alt>
         <nonterminal name='cr'/>
      </alt>
   </rule>
   <rule mark='-' name='tab'>
      <alt>
         <literal tmark='-' hex='9'/>
      </alt>
   </rule>
   <rule mark='-' name='lf'>
      <alt>
         <literal tmark='-' hex='a'/>
      </alt>
   </rule>
   <rule mark='-' name='cr'>
      <alt>
         <literal tmark='-' hex='d'/>
      </alt>
   </rule>
   <rule name='comment'>
      <alt>
         <literal tmark='-' string='{'/>
         <repeat0>
            <alts>
               <alt>
                  <nonterminal name='cchar'/>
               </alt>
               <alt>
                  <nonterminal name='comment'/>
               </alt>
            </alts>
         </repeat0>
         <literal tmark='-' string='}'/>
      </alt>
   </rule>
   <rule mark='-' name='cchar'>
      <alt>
         <exclusion>
            <member string='{}'/>
         </exclusion>
      </alt>
   </rule>
   <rule name='prolog'>
      <alt>
         <nonterminal name='version'/>
      </alt>
   </rule>
   <!--  Many more rules here… -->
</ixml>
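
Because a grammar in this form is ordinary XML, generic XML tooling can inspect it. The following is a minimal sketch, not part of the specification, that assumes Python's standard xml.etree.ElementTree module; the function name summarize_grammar and the FRAGMENT sample (which reuses three rules from the serialization above) are illustrative only. It lists each rule with its mark and reports nonterminals that are referenced but have no rule in the fragment; for an abbreviated serialization, such names are to be expected.

```python
import xml.etree.ElementTree as ET


def summarize_grammar(xml_text):
    """Return (rules, undefined) for an ixml grammar in its XML form.

    rules     -- list of (name, mark) pairs in document order; mark is
                 None when the rule element has no mark attribute
    undefined -- sorted names used in <nonterminal> that have no <rule>
    """
    root = ET.fromstring(xml_text)
    rules = [(r.get("name"), r.get("mark")) for r in root.iter("rule")]
    defined = {name for name, _ in rules}
    used = {n.get("name") for n in root.iter("nonterminal")}
    return rules, sorted(used - defined)


# Three rules copied from the abbreviated serialization (illustrative input).
FRAGMENT = """
<ixml>
   <rule name='comment'>
      <alt>
         <literal tmark='-' string='{'/>
         <repeat0>
            <alts>
               <alt><nonterminal name='cchar'/></alt>
               <alt><nonterminal name='comment'/></alt>
            </alts>
         </repeat0>
         <literal tmark='-' string='}'/>
      </alt>
   </rule>
   <rule mark='-' name='cchar'>
      <alt>
         <exclusion><member string='{}'/></exclusion>
      </alt>
   </rule>
   <rule name='prolog'>
      <alt><nonterminal name='version'/></alt>
   </rule>
</ixml>
"""

rules, undefined = summarize_grammar(FRAGMENT)
print(rules)      # [('comment', None), ('cchar', '-'), ('prolog', None)]
print(undefined)  # ['version'], because the fragment omits that rule
```

Keeping the marks as attribute values, rather than interpreting them, mirrors the serialization itself, where the mark is recorded as data about the rule.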

11. Errors

This section summarizes errors identified in this specification. Static errors are errors that can be identified by inspecting the grammar.

S01: It is an error if two rules are not separated by at least one whitespace character or comment.
S02: It is an error to use a nonterminal name that is not defined by a rule in the grammar.
S03: It is an error if the grammar contains more than one rule for a given nonterminal name.
S06: It is an error if a hex encoding uses any characters not allowed in hexadecimal.
S07: It is an error if the hexadecimal value is not within the Unicode code-point range.
S08: It is an error if an encoded character denotes a Unicode noncharacter or surrogate code point.
S09: It is an error if the first character in a range has a code point value greater than the second character in the range.
S10: It is an error to use a Unicode character category that is not defined in the Unicode specification.
S11: It is an error if a string contains a C0 or C1 control character, including a line break.
S12: It is an error if the grammar does not conform to the implied or declared version.
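
As an illustration of how some of these checks might be applied, the sketch below covers S06 to S09 for encoded characters and ranges. It is not normative: the function names check_hex and check_range are hypothetical, the empty-digits case is treated as S06 as an assumption (the grammar itself requires at least one hex digit), and the order in which the checks are made is a design choice of this sketch rather than something the specification prescribes.

```python
import string

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def check_hex(digits):
    """Classify the hex digits that follow '#' in an encoded character.

    Returns 'S06', 'S07', 'S08', or None when no error applies.
    """
    # S06: only 0-9, a-f, A-F are allowed (empty input is an assumption).
    if not digits or any(c not in string.hexdigits for c in digits):
        return "S06"
    value = int(digits, 16)
    # S07: the value must lie within the Unicode code-point range.
    if value > MAX_CODE_POINT:
        return "S07"
    # S08: surrogates are U+D800..U+DFFF; noncharacters are
    # U+FDD0..U+FDEF and every code point ending in FFFE or FFFF.
    is_surrogate = 0xD800 <= value <= 0xDFFF
    is_noncharacter = 0xFDD0 <= value <= 0xFDEF or (value & 0xFFFE) == 0xFFFE
    if is_surrogate or is_noncharacter:
        return "S08"
    return None


def check_range(first, last):
    """Return 'S09' when a range runs from a higher to a lower code point."""
    return "S09" if ord(first) > ord(last) else None


print(check_hex("a"))         # None
print(check_hex("g1"))        # S06
print(check_hex("110000"))    # S07
print(check_hex("D800"))      # S08
print(check_range("z", "a"))  # S09
```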

Dynamic errors arise when a particular input is processed with a grammar.

D01: It is an error if the parse tree produced by a grammar cannot be represented as well-formed XML.
D02: It is an error if two or more attributes with the same name would be serialized on the same element.
D03: It is an error if the name of any element or attribute is not a valid XML name.
D04: It is an error to attempt to serialize as XML any characters that are not permitted in XML.
D05: It is an error to attempt to serialize an attribute as the root node of an XML document.
D06: It is an error if the parse tree does not contain exactly one top-level element.
D07: It is an error if an attribute named “xmlns” appears on an element.
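
Several of these conditions can be tested on one element at a time, just before it is serialized. The sketch below is one possible approach under stated assumptions, not a description of any particular processor: check_element and its parameters are hypothetical, and the two regular expressions transcribe the Name and Char productions of XML 1.0 (Fifth Edition) [XML]. It covers D02, D03, D04, and D07 only.

```python
import re

# Name production from XML 1.0 (Fifth Edition), as character classes.
_START = (r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
          r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
          r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF")
_REST = _START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
XML_NAME = re.compile("[" + _START + "][" + _REST + "]*")

# Char production from XML 1.0: the characters permitted in a document.
XML_CHARS = re.compile(
    r"[\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*")


def check_element(name, attributes, text=""):
    """Return a sorted list of dynamic error codes for one element.

    name       -- the element name
    attributes -- list of (name, value) pairs to be serialized on it
    text       -- character content of the element (may be empty)
    """
    errors = set()
    attr_names = [n for n, _ in attributes]
    # D02: the same attribute name may not occur twice on an element.
    if len(attr_names) != len(set(attr_names)):
        errors.add("D02")
    # D03: element and attribute names must be valid XML names.
    if not all(XML_NAME.fullmatch(n) for n in [name] + attr_names):
        errors.add("D03")
    # D04: every serialized character must be permitted in XML.
    values = [text] + [v for _, v in attributes]
    if not all(XML_CHARS.fullmatch(v) for v in values):
        errors.add("D04")
    # D07: an attribute named "xmlns" may not appear.
    if "xmlns" in attr_names:
        errors.add("D07")
    return sorted(errors)


print(check_element("temperature", [("scale", "C"), ("value", "21")]))  # []
print(check_element("a", [("x", "1"), ("x", "2")]))  # ['D02']
print(check_element("1st", []))                      # ['D03']
print(check_element("a", [("xmlns", "urn:x")]))      # ['D07']
```

Collecting all applicable codes, rather than stopping at the first, is a choice made here for illustration; a processor may report errors differently.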

Note: if error codes are reported in a context where it makes sense for them to appear in a namespace, they should be in the Invisible XML namespace.

12. References

[Unicode] The Unicode Consortium (ed.), The Unicode Standard — Version 13.0. Unicode Consortium, 2020, ISBN 978-1-936213-26-9, http://www.unicode.org/versions/Unicode13.0.0/

[Categories] The Unicode Consortium (ed.), Unicode Standard Annex #44: Unicode Character Database, General Category Values, https://unicode.org/reports/tr44/#General_Category_Values (See also http://www.fileformat.info/info/unicode/category/index.htm)

[XML] Tim Bray et al. (eds.), Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C, 2008, https://www.w3.org/TR/xml/

13. Informational References

[CYK] Sakai, Itiroo. Syntax in universal translation. In 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, pages 593–608. https://aclanthology.org/www.mt-archive.info/50/NPL-1961-Sakai.pdf

[Earley] Earley, J. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, February 1970, doi:10.1145/362007.362035

[GLL] Elizabeth Scott and Adrian Johnstone, GLL Parsing. Electronic Notes in Theoretical Computer Science, Volume 253, Issue 7, 17 September 2010, pages 177-189. doi:10.1016/j.entcs.2010.08.041

[GLR] Masaru Tomita. Generalized LR Parsing. Springer Science & Business Media. ISBN 978-1-4615-4034-2. doi:10.1007/978-1-4615-4034-2

[Grune] Grune, D. and Jacobs, C. Parsing techniques : a practical guide (2nd ed.). New York: Springer, 2008. ISBN 978-0-387-20248-8. https://dickgrune.com/Books/PTAPG_2nd_Edition/CompleteList.pdf

[XML Serialization] Andrew Coleman and C. M. Sperberg-McQueen (eds.) XSLT and XQuery Serialization 3.1. W3C, 2017, https://www.w3.org/TR/xslt-xquery-serialization-31/

[Unger] Unger, S. H. A global parser for context-free phrase structure grammars. Communications of the ACM, 11(4):240–247, April 1968, doi:10.1145/362991.363001

[Control] Wikipedia, C0 and C1 control codes, https://en.wikipedia.org/wiki/C0_and_C1_control_codes

14. Acknowledgements

This specification was produced by members of the W3C ixml community group: Tomos Hillman, John Lumley, Steven Pemberton, C. M. Sperberg-McQueen, Bethan Tovey-Walsh, Norman Tovey-Walsh. Other current and former members of the group have also contributed.

Thanks are due to Hans-Dieter Hiep for an early close reading of the specification, and for the many helpful comments that resulted from it.