<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Encoding Conversion</title><meta name="generator" content="DocBook XSL Stylesheets V1.61.2"><link rel="home" href="index.html" title="Libxml Tutorial"><link rel="up" href="index.html" title="Libxml Tutorial"><link rel="previous" href="ar01s08.html" title="Retrieving Attributes"><link rel="next" href="apa.html" title="A.&nbsp;Compilation"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">Encoding Conversion</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="ar01s08.html">Prev</a>&nbsp;</td><th width="60%" align="center">&nbsp;</th><td width="20%" align="right">&nbsp;<a accesskey="n" href="apa.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="xmltutorialconvert"></a>Encoding Conversion</h2></div></div><div></div></div><p><a class="indexterm" name="id2587348"></a>
Data encoding compatibility problems are among the most common
difficulties encountered by programmers new to <span class="acronym">XML</span> in
general and <span class="application">libxml</span> in particular. Thinking
through the design of your application with this issue in mind will help you
avoid difficulties later. Internally, <span class="application">libxml</span>
stores and manipulates data in UTF-8. Data that your program holds
in other encodings, such as the commonly used ISO-8859-1, must be
converted to UTF-8 before it is passed to <span class="application">libxml</span>
functions. If you want your program's output in an encoding other than
UTF-8, you must convert that as well.</p><p><span class="application">Libxml</span> uses
<span class="application">iconv</span>, when it is available, to convert
data. Without <span class="application">iconv</span>, only UTF-8, UTF-16 and
ISO-8859-1 can be used as external encodings. With
<span class="application">iconv</span>, any encoding can be used, provided
<span class="application">iconv</span> is able to convert it to and from
UTF-8. Currently <span class="application">iconv</span> supports about 150
different character encodings and can convert from any one of them to any other. Although
the actual number of supported encodings varies between implementations, nearly every
<span class="application">iconv</span> implementation supports
the encodings in common use.</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td colspan="2" align="left" valign="top"><p>A common mistake is to use different formats for the internal data
in different parts of one's code. The most common case is an application
that assumes ISO-8859-1 to be the internal data format, combined with
<span class="application">libxml</span>, which assumes UTF-8 to be the
internal data format. The result is an application that treats internal
data differently depending on which section of code is executing. One part
of the code or the other will then inevitably misinterpret the data.
</p></td></tr></table></div><p>This example constructs a simple document, then adds content provided
at the command line to the document's root element and writes the result
to <tt class="filename">stdout</tt> in the proper encoding. For this example, we
use the ISO-8859-1 encoding. The string supplied on the command
line is converted from ISO-8859-1 to UTF-8. Full code: <a href="aph.html" title="H.&nbsp;Code for Encoding Conversion Example">Appendix&nbsp;H, <i>Code for Encoding Conversion Example</i></a></p><p>The conversion in the example code relies on
<span class="application">libxml's</span>
<tt class="function">xmlFindCharEncodingHandler</tt> function:
</p><pre class="programlisting">
<a name="handlerdatatype"></a><img src="images/callouts/1.png" alt="1" border="0">xmlCharEncodingHandlerPtr handler;
<a name="calcsize"></a><img src="images/callouts/2.png" alt="2" border="0">size = (int)strlen(in)+1;
out_size = size*2-1;
out = malloc((size_t)out_size);
<a name="findhandlerfunction"></a><img src="images/callouts/3.png" alt="3" border="0">handler = xmlFindCharEncodingHandler(encoding);
<a name="callconversionfunction"></a><img src="images/callouts/4.png" alt="4" border="0">handler-&gt;input(out, &amp;out_size, in, &amp;temp);
<a name="outputencoding"></a><img src="images/callouts/5.png" alt="5" border="0">xmlSaveFormatFileEnc("-", doc, encoding, 1);
</pre><p>
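</p><p>The size calculation in the listing reserves two output bytes for every
input byte. The following is a minimal sketch of why that is enough for
ISO-8859-1 input, written as a hand-rolled converter. The function name
<tt class="function">latin1_to_utf8</tt> is hypothetical and is not part of
<span class="application">libxml</span>; the sketch illustrates the buffer
arithmetic only, not how <span class="application">libxml</span> or
<span class="application">iconv</span> performs the conversion internally.</p><pre class="programlisting">
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* Hypothetical helper for illustration, not a libxml function.
 * Converts a NUL-terminated ISO-8859-1 string to a newly allocated
 * UTF-8 string. Returns NULL if allocation fails; the caller frees
 * the result. */
unsigned char *latin1_to_utf8(const unsigned char *in)
{
    size_t size = strlen((const char *)in) + 1; /* input bytes plus NUL */
    size_t out_size = size * 2 - 1;             /* worst case plus NUL  */
    unsigned char *out = malloc(out_size);
    size_t i, j = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; in[i] != '\0'; i++) {
        if (in[i] &lt; 0x80) {
            /* ASCII range: copied unchanged as one byte */
            out[j++] = in[i];
        } else {
            /* 0x80-0xFF: encoded as a two-byte UTF-8 sequence */
            out[j++] = (unsigned char)(0xC0 | (in[i] &gt;&gt; 6));
            out[j++] = (unsigned char)(0x80 | (in[i] &amp; 0x3F));
        }
    }
    out[j] = '\0';
    return out;
}
</pre><p>Called with the four-byte ISO-8859-1 string <tt class="literal">"caf\xE9"</tt>,
the sketch returns the five-byte UTF-8 string <tt class="literal">"caf\xC3\xA9"</tt>.
Other source encodings can need more than two bytes per input byte, so the
doubling rule should be treated as specific to ISO-8859-1.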
</p><div class="calloutlist"><table border="0" summary="Callout list"><tr><td width="5%" valign="top" align="left"><a href="#handlerdatatype"><img src="images/callouts/1.png" alt="1" border="0"></a> </td><td valign="top" align="left"><p><tt class="varname">handler</tt> is declared as a pointer to an
<tt class="function">xmlCharEncodingHandler</tt> structure, which holds the conversion functions for one encoding.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#calcsize"><img src="images/callouts/2.png" alt="2" border="0"></a> </td><td valign="top" align="left"><p>The conversion function called through <tt class="varname">handler</tt> needs
to be given the sizes of the input and output buffers, which are
calculated here for strings <tt class="varname">in</tt> and
<tt class="varname">out</tt>. Each ISO-8859-1 byte occupies at most two bytes in UTF-8, so the output buffer is sized at twice the input length plus a terminating NUL.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#findhandlerfunction"><img src="images/callouts/3.png" alt="3" border="0"></a> </td><td valign="top" align="left"><p><tt class="function">xmlFindCharEncodingHandler</tt> takes as its
argument the name of the data's initial encoding and searches
<span class="application">libxml's</span> built-in set of conversion
handlers, returning a pointer to the matching handler, or NULL if none is
found.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#callconversionfunction"><img src="images/callouts/4.png" alt="4" border="0"></a> </td><td valign="top" align="left"><p>The conversion function reached through <tt class="varname">handler</tt>
takes as its arguments pointers to the input and output strings,
along with a pointer to the length of each. The lengths must be determined
separately by the application, and are passed by address because the function updates them.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#outputencoding"><img src="images/callouts/5.png" alt="5" border="0"></a> </td><td valign="top" align="left"><p>To output in a specified encoding rather than UTF-8, we use
<tt class="function">xmlSaveFormatFileEnc</tt>, specifying the
encoding.</p></td></tr></table></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="ar01s08.html">Prev</a>&nbsp;</td><td width="20%" align="center"><a accesskey="u" href="index.html">Up</a></td><td width="40%" align="right">&nbsp;<a accesskey="n" href="apa.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Retrieving Attributes&nbsp;</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top">&nbsp;A.&nbsp;Compilation</td></tr></table></div></body></html>