Can XML use non-Latin characters?
Answer:
Yes, the XML Specification explicitly says XML uses ISO 10646, the international
standard character repertoire which covers most known languages. Unicode is an
identical repertoire, and the two standards track each other. The spec says (2.2): ‘All
XML processors must accept the UTF-8 and UTF-16 encodings of ISO 10646…’. There
is a Unicode FAQ at http://www.unicode.org/faq/FAQ.
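As a concrete illustration of the point above, here is a minimal sketch of parsing a UTF-8 XML document containing non-Latin characters, using Python's standard library (the element name and its Japanese text are hypothetical, chosen only for illustration):

```python
# Sketch: any conforming XML parser must accept UTF-8 input,
# so non-Latin text works out of the box.
import xml.etree.ElementTree as ET

# A UTF-8 encoded document whose content is Japanese.
doc = '<?xml version="1.0" encoding="UTF-8"?>\n<greeting>こんにちは</greeting>'.encode("utf-8")

root = ET.fromstring(doc)
print(root.text)  # こんにちは
```

The same document could equally be served in UTF-16; only the encoding declaration and the byte serialization change, not the character content.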
UTF-8 is an encoding of Unicode into 8-bit units: the first 128 characters are the same as ASCII, and all other Unicode characters are encoded as sequences of between two and four bytes. UTF-8 in its single-octet form is therefore the same as ISO 646 IRV (ASCII), so you can continue to use ASCII for English or other languages written in the Latin alphabet without diacritics. Note that UTF-8 is incompatible with ISO 8859-1 (ISO Latin-1) above code point 127 decimal (the end of ASCII).
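That divergence above code point 127 is easy to see by encoding the same characters both ways; a short Python sketch (the sample strings are illustrative):

```python
# Plain ASCII text encodes to identical bytes in UTF-8 and Latin-1.
ascii_text = "hello"
assert ascii_text.encode("utf-8") == ascii_text.encode("latin-1")

# Above code point 127 the two encodings disagree:
# é (U+00E9) is one byte in Latin-1 but two bytes in UTF-8.
e_acute = "\u00e9"
print(e_acute.encode("latin-1"))  # b'\xe9'
print(e_acute.encode("utf-8"))    # b'\xc3\xa9'
```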
UTF-16 is an encoding of Unicode into 16-bit units, which lets it represent all 17 planes: characters in the Basic Multilingual Plane take a single 16-bit unit, and characters above U+FFFF take a surrogate pair (four bytes). UTF-16 is incompatible with ASCII because it uses at least two bytes per character.
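The two-byte versus four-byte behaviour can be checked directly; a minimal Python sketch (the sample characters are illustrative):

```python
# UTF-16 uses one 16-bit unit (two bytes) for characters in the
# Basic Multilingual Plane, and a surrogate pair (four bytes) for
# characters above U+FFFF.
bmp_char = "A"            # U+0041, inside the BMP
supp_char = "\U0001F600"  # U+1F600, a supplementary-plane character

print(len(bmp_char.encode("utf-16-be")))   # 2
print(len(supp_char.encode("utf-16-be")))  # 4
```

The big-endian form (`utf-16-be`) is used here so the byte counts are not inflated by a byte-order mark.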