[Expat-discuss] Example of using Expat with Unicode?

Karl Waclawek karl at waclawek.net
Fri Jun 20 10:13:17 EDT 2003


> Hi,
>
> This email is pretty long, so if you don't feel like reading, here's what
> I'm really asking: I am humbly requesting a way to read in XML in
> Expat such that I can handle Unicode stuff, using XML_Char's throughout the
> program rather than char's.

Expat reports all XML data as UTF-8 or UTF-16 encoded Unicode, depending
on how it is compiled.

Expat reads XML documents encoded in US-ASCII, ISO-8859-1, UTF-8
or UTF-16, and more if you supply an UnknownEncodingHandler callback.

I didn't quite understand your problems, to be honest.
Expat, like any conformant parser, processes and outputs Unicode.

> I downloaded Expat from Jclark.com and have been trying to figure out how to
> use it. I looked at the sample program, "elements.c" and basically figured
> out what to do based on that. However, that program prints out the character
> data for some XML files (the one that made me notice this is located at
> http://www.aaronsw.com/weblog/index.xml - what appears as "10% think it’s
> advantageous to be a woman in American society" in Internet Explorer's XML
> viewer appears as "10% think it’s advantageous to be a woman in American
> society".) The file was being read from disk using standard C file i/o
> functions - all I did was open up a FILE* structure with a call to fopen()
> and replaced "fread(buf, 1, sizeof(buf), stdin)" with "fread(buf, 1,
> sizeof(buf), my_file_structure)". This worked fine except for this Unicode
> stuff that my normal functions seemed to read in incorrectly.

Since Expat's output is Unicode, you need a way of displaying it correctly.
Standard string display, which assumes the local code page, does not do that.
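
To illustrate what goes wrong: the right single quote U+2019 is the three
bytes E2 80 99 in UTF-8. Shown byte by byte in the Windows-1252 code page,
those bytes come out as the three characters "’". Below is a minimal sketch
of decoding one UTF-8 sequence into a code point. The function utf8_decode is
a hypothetical helper written for this illustration, not part of Expat, and
it is simplified (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>

/* Hypothetical helper, not an Expat function.
   Decode one UTF-8 sequence starting at s (at most n bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the sequence is truncated or malformed.
   Simplified: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { c = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* sequence runs past the buffer */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}
```

Once you have code points, display is a matter of handing them to something
that understands Unicode, rather than printing the raw bytes as chars.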

>
> (I am using Windows and Visual C++ 6.0, by the way.)

On Windows you should compile Expat with XML_UNICODE_WCHAR_T defined,
or simply build the "expatw" library project. The Unicode encoding native
to Windows is UTF-16.
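
With XML_UNICODE_WCHAR_T defined, XML_Char is wchar_t, and on Windows each
wchar_t holds one 16-bit UTF-16 code unit. As a rough sketch of what that
encoding looks like, the hypothetical helper below (utf16_encode, written for
this illustration and not part of Expat) turns one code point into one or two
code units.

```c
#include <stddef.h>

/* Hypothetical helper, not an Expat function.
   Encode one Unicode code point as UTF-16 code units.
   Writes 1 or 2 units to out and returns the count, or 0 if cp is
   not a valid scalar value (a surrogate, or above 0x10FFFF). */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;

    if (cp < 0x10000) {             /* Basic Multilingual Plane: one unit */
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000;                  /* supplementary plane: surrogate pair */
    out[0] = (unsigned short)(0xD800 + (cp >> 10));
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

Most characters, including the curly apostrophe U+2019, fit in a single unit,
so wide-character strings can usually be passed straight to the wide Windows
APIs.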

>
> Now I figured I should modify my file I/O and string functions to use the
> Unicode equivalents of the ones I was using before -- e.g., _wfopen instead
> of fopen. My element handlers would take XML_Char's as parameters instead of
> char's. I had quite a bit of other stuff in the code that I thought could go
> wrong so I figured I'd make a little separate test program to make sure I
> understood how to do Unicode file I/O. This was the test program:

When you pass data to Expat you should consider them
generic buffers of bytes, no matter what the encoding is.
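
In other words, open the file in binary mode and hand over whatever fread()
returns, untouched. A minimal sketch of such a loop follows. The names
chunk_sink, feed_file and count_bytes are made up for this illustration; the
sink callback stands in for the place where a real program would call
XML_Parse(parser, buf, len, isFinal).

```c
#include <stdio.h>

/* Hypothetical callback type: receives each chunk of raw bytes.
   A real program would call XML_Parse() here.
   Returns nonzero to continue, 0 to abort. */
typedef int (*chunk_sink)(void *user, const char *buf, int len, int is_final);

/* Read a file in binary mode and pass every chunk to sink unchanged.
   Returns 0 on success, -1 on open/read error or if the sink aborts. */
int feed_file(const char *path, chunk_sink sink, void *user)
{
    char buf[4096];
    FILE *fp = fopen(path, "rb");   /* "rb": no newline translation */
    int ok = 1, done = 0;

    if (fp == NULL) return -1;
    while (ok && !done) {
        size_t len = fread(buf, 1, sizeof buf, fp);
        if (ferror(fp)) { ok = 0; break; }
        done = len < sizeof buf;    /* short read means end of file */
        ok = sink(user, buf, (int)len, done);
    }
    fclose(fp);
    return ok ? 0 : -1;
}

/* Example sink: just totals the bytes it is given. */
int count_bytes(void *user, const char *buf, int len, int is_final)
{
    (void)buf;
    (void)is_final;
    *(long *)user += len;
    return 1;
}
```

Note that nothing here uses wide-character I/O. The parser works out the
document's encoding from the bytes themselves, so the reading side does not
need to know it.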

> So I'm basically out of ideas. I tried downloading the latest version for
> Windows -- 1.95.6 -- thinking maybe I should just forget about Unicode file
> I/O and just try modifying the elements.c program and seeing what would
> happen but couldn't get the samples for the new version to build in Visual
> C++ -- I get the following error:
>
> LINK : fatal error LNK1181: cannot open input file "libexpatMT.lib"

Builds fine for me in the VC++ 6.0 IDE.
The older versions (from jclark.com) do not work for UTF-16 output, AFAIK.
1.95.6 is the most stable and conformant version yet.

>
> So what I am humbly requesting is a way to read in XML in Expat such that I
> can handle Unicode stuff like the aaronsw.com feed.

Not needed, it already does it. Just pass it byte-buffers.

> Just basically a
> modified version of the sample program. If possible I would like to also see
> how this would be done for parsing an XML file on disk.

Have a look at the xmlwf tool.


Karl



