== Characters and escaping ==
XML documents consist entirely of characters from the [[Unicode]] repertoire. Except for a small number of specifically excluded [[control characters]], any character defined by Unicode may appear within the content of an XML document.

XML includes facilities for identifying the ''encoding'' of the Unicode characters that make up the document, and for expressing characters that, for one reason or another, cannot be used directly.

=== Valid characters ===
{{Main|Valid characters in XML}}
Unicode code points in the following ranges are valid in XML 1.0 documents:{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=section 2.2}}
* U+0009 (Horizontal Tab), U+000A (Line Feed), U+000D (Carriage Return): these are the only [[C0 and C1 control codes|C0]] controls accepted in XML 1.0;
* U+0020–U+D7FF, U+E000–U+FFFD: this excludes some noncharacters in the [[Basic Multilingual Plane|BMP]] (all surrogates, U+FFFE and U+FFFF are forbidden);
* U+10000–U+10FFFF: this includes all code points in supplementary planes, including noncharacters.

XML 1.1 extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F.{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2006|loc=section 2.2}} At the same time, however, it restricts the use of C0 and [[C0 and C1 control codes|C1]] control characters other than U+0009 (Horizontal Tab), U+000A (Line Feed), U+000D (Carriage Return), and U+0085 (Next Line) by requiring them to be written in escaped form (for example U+0001 must be written as <code>&#x01;</code> or its equivalent). In the case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected.

The code point [[U+0000]] (Null) is the only character that is not permitted in any XML 1.1 document.
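
As an illustration only, the ranges listed above can be written as simple predicates. The following is a minimal Python sketch; the function names <code>is_valid_xml10_char</code>, <code>is_valid_xml11_char</code> and <code>needs_escaping_xml11</code> are hypothetical and do not belong to any library.

```python
def is_valid_xml10_char(cp: int) -> bool:
    """Return True if code point cp may appear in an XML 1.0 document."""
    return (
        cp in (0x09, 0x0A, 0x0D)          # the only C0 controls accepted
        or 0x20 <= cp <= 0xD7FF           # BMP below the surrogates
        or 0xE000 <= cp <= 0xFFFD         # BMP above the surrogates, minus U+FFFE/U+FFFF
        or 0x10000 <= cp <= 0x10FFFF      # supplementary planes
    )


def is_valid_xml11_char(cp: int) -> bool:
    """Return True if cp may appear (directly or escaped) in an XML 1.1 document."""
    return (
        0x01 <= cp <= 0xD7FF              # adds U+0001-U+001F; U+0000 stays excluded
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def needs_escaping_xml11(cp: int) -> bool:
    """Return True if XML 1.1 requires cp to be written as a character reference."""
    if cp in (0x09, 0x0A, 0x0D, 0x85):    # Tab, LF, CR and NEL may appear literally
        return False
    # U+007F (Delete) is included because the XML 1.1 RestrictedChar production
    # lists it alongside the C0 and C1 controls described in the text.
    return 0x01 <= cp <= 0x1F or 0x7F <= cp <= 0x9F
```

The predicates take an integer code point, so a whole string can be checked with an expression such as <code>all(is_valid_xml10_char(ord(c)) for c in text)</code>.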
=== Encoding detection ===
The Unicode character set can be encoded into [[byte]]s for storage or transmission in a variety of different ways, called "encodings". Unicode itself defines encodings that cover the entire repertoire; well-known ones include [[UTF-8]] (which the XML standard recommends using, without a [[byte order mark|BOM]]) and [[UTF-16]].<ref>{{cite web|last=Bray|first=T.|url=http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF|title=Characters vs. Bytes|website=Tbray.org |date=April 26, 2003 |access-date=16 November 2017}}</ref> There are many other text encodings that predate Unicode, such as [[ASCII]] and the various [[ISO/IEC 8859]] encodings; their character repertoires are in every case subsets of the Unicode character set.

XML allows the use of any of the Unicode-defined encodings and any other encodings whose characters also appear in Unicode. XML also provides a mechanism whereby an XML processor can reliably, without any prior knowledge, determine which encoding is being used.{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=appendix F}} Encodings other than UTF-8 and UTF-16 are not necessarily recognized by every XML parser; some parsers do not recognize even UTF-16, although the standard requires support for it.

=== Escaping ===
XML provides ''[[Escape sequence|escape]]'' facilities for including characters that are problematic to include directly. For example:
* The characters "<" and "&" are key syntax markers and may never appear in content outside a [[CDATA]] section. It is allowed, but not recommended, to use "<" in XML entity values.{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=section 2.3}}
* Some character encodings support only a subset of Unicode. For example, it is legal to encode an XML document in ASCII, but ASCII lacks code points for Unicode characters such as "é".
* It might not be possible to type the character on the author's machine.
* Some characters have [[homoglyph|glyphs]] that cannot be visually distinguished from other characters, such as the [[nonbreaking space]] (<code>&#xa0;</code>) " " and the [[Space (punctuation)|space]] (<code>&#x20;</code>) " ", and the [[А|Cyrillic capital letter A]] (<code>&#x410;</code>) "А" and the [[A|Latin capital letter A]] (<code>&#x41;</code>) "A".

There are five [[List of XML and HTML character entity references#Predefined entities in XML|predefined entities]]:
* <code>&lt;</code> represents "<";
* <code>&gt;</code> represents ">";
* <code>&amp;</code> represents "&";
* <code>&apos;</code> represents "{{mono|'}}";
* <code>&quot;</code> represents '{{mono|"}}'.

All permitted Unicode characters may be represented with a ''[[numeric character reference]]''. Consider the Chinese character "中", whose numeric code in Unicode is hexadecimal 4E2D, or decimal 20,013. A user whose keyboard offers no method for entering this character could still insert it in an XML document encoded either as <code>&#20013;</code> or <code>&#x4e2d;</code>. Similarly, the string "I <3 Jörg" could be encoded for inclusion in an XML document as <code>I &lt;3 J&#xF6;rg</code>.

<code>&#0;</code> is not permitted because the [[null character]] is one of the control characters excluded from XML, even when using a numeric character reference.<ref>{{cite web|first1=Tex|last1=Texin|first2=François|last2=Yergeau|date=6 September 2003|url=http://www.w3.org/International/questions/qa-controls|title=W3C I18N FAQ: HTML, XHTML, XML and Control Codes|website=W3C Internationalization|publisher=W3C|access-date=16 November 2017}}</ref> An alternative encoding mechanism such as [[Base64]] is needed to represent such characters.

=== Comments ===
Comments may appear anywhere in a document outside other markup. Comments cannot appear before the XML declaration. Comments begin with <code><!--</code> and end with <code>--></code>.
For compatibility with [[SGML]], the string "--" (double-hyphen) is not allowed inside comments;{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=section 2.5}} this means comments cannot be nested. The ampersand has no special significance within comments, so entity and character references are not recognized as such, and there is no way to represent characters outside the character set of the document encoding.

An example of a valid comment: <code><!--no need to escape <code> & such in comments--></code>

=== International use ===
{{Contains special characters|Armenian|example}}
XML 1.0 (Fifth Edition) and XML 1.1 support the direct use of almost any [[Unicode]] character in element names, attributes, comments, character data, and processing instructions (other than the ones that have special symbolic meaning in XML itself, such as the less-than sign, "<"). The following is a well-formed XML document including [[Chinese character|Chinese]], [[Armenian alphabet|Armenian]] and [[Cyrillic]] characters:

<syntaxhighlight lang="xml">
<?xml version="1.0" encoding="UTF-8"?>
<俄语 լեզու="ռուսերեն">данные</俄语>
</syntaxhighlight>
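
The escaping rules and the international example in this section can be tried out with an ordinary XML parser. The following is a minimal sketch that assumes Python's standard-library <code>xml.etree.ElementTree</code> and <code>xml.sax.saxutils</code> modules; the variable names are illustrative, and other parsers may report errors differently.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Predefined entities and numeric character references are resolved on parsing.
msg = ET.fromstring("<msg>I &lt;3 J&#xF6;rg</msg>")
decimal_ref = ET.fromstring("<c>&#20013;</c>")   # decimal form of U+4E2D
hex_ref = ET.fromstring("<c>&#x4e2d;</c>")       # hexadecimal form of U+4E2D

# Going the other way: escape the markup characters first, then replace every
# character the target encoding (ASCII here) lacks with a numeric reference.
ascii_bytes = escape("I <3 Jörg").encode("ascii", "xmlcharrefreplace")

# A character reference to U+0000 is rejected by the parser.
try:
    ET.fromstring("<c>&#0;</c>")
    null_rejected = False
except ET.ParseError:
    null_rejected = True

# The international document shown above, supplied as UTF-8 encoded bytes.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<俄语 լեզու="ռուսերեն">данные</俄语>'
).encode("utf-8")
root = ET.fromstring(doc)

print(msg.text, decimal_ref.text, hex_ref.text)
print(ascii_bytes, null_rejected)
print(root.tag, root.attrib, root.text)
```

The document is passed to the parser as bytes so that the <code>encoding</code> declaration in its first line is what determines how the bytes are decoded.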