Monday, March 19, 2012

Watch out for the BOM

Consider yourself lucky if you haven't yet had to deal with byte order and character encoding problems in your XML. Encoding can be a major source of headaches, considering all the applications used to create or edit XML. Just because you see <?xml version="1.0" encoding="UTF-8"?> at the beginning of an XML file doesn't guarantee that the file's bytes are actually encoded that way.

Unicode.org describes the Byte Order Mark (BOM) as something that "can be used as a signature defining the byte order and encoding form" of a text file. Many Unicode-aware applications, such as Notepad, will insert a BOM into the XML files they save.

However, not all XML editors properly support Unicode. For example, the popular XML editor Cooktop behaves unexpectedly when editing XML files with different character encodings. Older parsers, such as Apache's Crimson, will throw errors if a BOM is detected. These errors range from misleading messages such as "Document root element is missing" to "Content not allowed in prolog."

One quick way to identify the character encoding is to open the XML file in a hexadecimal editor, such as UltraEdit, and examine the first few bytes of the file.

Byte order mark......Encoding
FE FF................UTF-16, big-endian
FF FE................UTF-16, little-endian
EF BB BF.............UTF-8
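
If you don't have a hex editor handy, the same check can be scripted. The following is only a minimal sketch in Python, not a full encoding detector: the names detect_bom and detect_bom_in_file are made up for this illustration, and it only recognizes the three signatures in the table above (the UTF-8 signature is the three bytes EF BB BF).

```python
import codecs

# BOM signatures from the table above. The constants come from
# Python's standard codecs module.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                      # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),     # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),  # FF FE
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Caveat: a UTF-32 little-endian file also starts with FF FE, so this
    sketch would report it as UTF-16, little-endian.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_bom_in_file(path):
    """Read only the first few bytes of a file and check them for a BOM."""
    with open(path, "rb") as f:  # binary mode, so the bytes are not decoded
        return detect_bom(f.read(4))
```

Calling detect_bom_in_file on a suspect XML file returns one of the encoding names from the table, or None when the file has no BOM. A result of None says nothing about the encoding itself; it only means no signature was found.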

We suggest you upgrade your parser(s) and invest in applications that fully support Unicode. For more information about XML and character encoding, refer to the XML specification or the Unicode.org website.

www.unicode.org/faq/utf_bom.html
www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing
