From robert.hancock1 at virgin.net  Sun Apr  1 23:38:06 2007
From: robert.hancock1 at virgin.net (Robert Hancock)
Date: Sun, 01 Apr 2007 22:38:06 +0100
Subject: [Expat-discuss] large XML files in python 2.5 (expat 2.0.0,
	XML_LARGE_SIZE)
Message-ID: <4610263E.4080601@virgin.net>

Hi,

I'm processing some very large (>1TB) XML files with python and expat 
(python 2.5 to get expat 2.0.0).

The parser.CurrentByteIndex attribute is useful for me for some 
statistical and debugging purposes, but in the default Python build 
seems to be limited to 2**31 bytes (signed 32 bit int?), which I see as 
the Index wrapping as I work through the file.

I have tried rebuilding python2.5 from source (platform: Linux x86) but 
can't seem to get the XML_LARGE_SIZE option to have any effect. I've run
./configure CFLAGS=-DXML_LARGE_SIZE CPPFLAGS=-DXML_LARGE_SIZE
and rebuilding, but I still see the same wrapping behaviour.

Are there notes anywhere on how to enable this? Any way to test it 
within Python?

Thanks in advance for any help,

Robert Hancock
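
A possible way to check this from within Python is sketched below. It assumes that an Expat built with XML_LARGE_SIZE reports a feature of that name in the feature list that Python exposes as `xml.parsers.expat.features`; the helper names `built_with_large_size` and `index_seen_at` are made up for this illustration, and the code uses Python 3 syntax.

```python
from xml.parsers import expat

def built_with_large_size():
    """Return True if the linked Expat lists an XML_LARGE_SIZE feature.

    Assumption: builds compiled with -DXML_LARGE_SIZE report it here.
    """
    return any(name == "XML_LARGE_SIZE" for name, _value in expat.features)

def index_seen_at(document, element):
    """Return CurrentByteIndex as observed when `element` starts."""
    seen = []
    parser = expat.ParserCreate()

    def start_element(name, attrs):
        if name == element:
            # The index is read inside the handler, while an event is active.
            seen.append(parser.CurrentByteIndex)

    parser.StartElementHandler = start_element
    parser.Parse(document, True)
    return seen[0]

print(expat.EXPAT_VERSION)
print(built_with_large_size())
print(index_seen_at(b"<a><b/></a>", "b"))
```

If the feature is missing, the rebuilt library is probably not the one the interpreter is loading, or the flag never reached the compilation of the pyexpat module.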

From eric.slosser at v-fx.com  Thu Apr  5 16:58:49 2007
From: eric.slosser at v-fx.com (Eric Slosser)
Date: Thu, 5 Apr 2007 10:58:49 -0400
Subject: [Expat-discuss] digitally signing a build of expat
Message-ID: <FB877ED9-7CF6-4E65-9E83-4FC9EED7BE94@v-fx.com>

We ship a build of the expat library and are considering digitally  
signing our build so as to avoid a Windows-Vista warning at install  
time.

Our build guy is wondering if anyone in the expat world cares about  
this?

I don't imagine you do, but it can't hurt to ask (I hope).

From karl at waclawek.net  Fri Apr  6 22:42:50 2007
From: karl at waclawek.net (Karl Waclawek)
Date: Fri, 06 Apr 2007 16:42:50 -0400
Subject: [Expat-discuss] digitally signing a build of expat
In-Reply-To: <FB877ED9-7CF6-4E65-9E83-4FC9EED7BE94@v-fx.com>
References: <FB877ED9-7CF6-4E65-9E83-4FC9EED7BE94@v-fx.com>
Message-ID: <4616B0CA.9010709@waclawek.net>

Eric Slosser wrote:
> We ship a build of the expat library and are considering digitally  
> signing our build so as to avoid a Windows-Vista warning at install  
> time.
>
> Our build guy is wondering if anyone in the expat world cares about  
> this?
>   

I don't know anything about library signing on Vista, but I could
imagine that this requires more than just a signature. I would assume
that the signing identity must somehow be registered or recognized by
someone (Microsoft?) as trustworthy.

Could you enlighten us as to the details and requirements?

Karl

From jeffreyholle at bellsouth.net  Mon Apr  9 04:34:28 2007
From: jeffreyholle at bellsouth.net (Jeffrey Holle)
Date: Sun, 08 Apr 2007 22:34:28 -0400
Subject: [Expat-discuss] [expat] newbie question
Message-ID: <evc8nq$6gm$1@sea.gmane.org>

I'm using a self-built expat 2.0.0 on Ubuntu Linux 5.10.

The program that I'm writing needs to process N XML files.
Note not nested, but in series.

My first attempt initializes the expat parser and feeds the data into it 
like this (pseudo code):

while(!files.eof())
{
   getline(files.filename);
   ifstream file(filename);
   do {
     char buffer[256];
     file.read(buffer,256);
     XML_Parse(parser,buffer,file.gcount(),0);
   } while (!file.eof());
   XML_Parse(parser,NULL,0,1);
}

I have not shown it, but I do have an element handler enabled and it 
works as expected for the first file, but not subsequent ones.

What should I be doing to end one XML file parsing action and start another?


From nickmacd at gmail.com  Mon Apr  9 05:18:21 2007
From: nickmacd at gmail.com (Nick MacDonald)
Date: Sun, 8 Apr 2007 23:18:21 -0400
Subject: [Expat-discuss] [expat] newbie question
In-Reply-To: <evc8nq$6gm$1@sea.gmane.org>
References: <evc8nq$6gm$1@sea.gmane.org>
Message-ID: <bdcd32c90704082018jb111acdkc8c8c9b1e3eebbbf@mail.gmail.com>

Jeffrey:

I think you need to make a new instance of the parser for each file.
What you are doing has the overall effect of concatenating the files
together one after the other, which would not be legal because there
can only be one main  XML element (my terminology may be off here) per
file.  Unfortunately I am not near a machine with my own code and not
on a machine that has (or can have) eXpat installed.  But in
pseudo-code, do this:

while (more files)
{
  parser=createParser()
  open(file)
  while(!eof)
  {
    read databuff
    xmlParse(databuff)
  }
  close(file)
  finalizeParser()
}

Good luck,
  Nick
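
The one-parser-per-file idea above can be sketched with Python's standard Expat binding. This is only an illustration: the documents are in-memory byte strings rather than files, and `count_elements` is a made-up helper.

```python
from xml.parsers import expat

def count_elements(document):
    """Parse one complete document with its own parser; count element names."""
    counts = {}
    parser = expat.ParserCreate()  # a fresh parser for every document

    def start_element(name, attrs):
        counts[name] = counts.get(name, 0) + 1

    parser.StartElementHandler = start_element
    parser.Parse(document, True)  # True marks the final chunk of this document
    return counts

# Stand-ins for the N files that are processed in series.
documents = [b"<a><b/><b/></a>", b"<c><d/></c>"]
results = [count_elements(doc) for doc in documents]
print(results)
```

Because each document gets its own parser, the handlers have to be registered again every time, which is why the registration lives inside the helper.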

-- 
Nick MacDonald
NickMacD at gmail.com

From andrelsm at iname.com  Mon Apr  9 15:37:26 2007
From: andrelsm at iname.com (Andre Luis Monteiro)
Date: Mon, 09 Apr 2007 08:37:26 -0500
Subject: [Expat-discuss] [expat] newbie question
Message-ID: <20070409133726.EE9CA1F5462@ws1-2.us4.outblaze.com>

Jeffrey

you should use

XML_Bool XMLCALL XML_ParserReset(XML_Parser p, const XML_Char *encoding);

Clean up the memory structures maintained by the parser so that it may be used again. After this has been called, parser is ready to start parsing a new document. All handlers are cleared from the parser, except for the unknownEncodingHandler. The parser's external state is re-initialized except for the values of ns and ns_triplets. This function may not be used on a parser created using XML_ExternalEntityParserCreate; it will return XML_FALSE in that case. Returns XML_TRUE on success. Your application is responsible for dealing with any memory associated with user data. 


regards
André Luís


PS: take a glance at your expat-2.0.0/doc/reference.html



From suresh.kumar.j at gmail.com  Tue Apr 10 07:23:08 2007
From: suresh.kumar.j at gmail.com (Suresh Kumar J)
Date: Tue, 10 Apr 2007 10:53:08 +0530
Subject: [Expat-discuss] Clarification on the behavior of the text handler
Message-ID: <88b2f6dd0704092223k1d749609l4bf9c76dfb71ce33@mail.gmail.com>

Hi there!

I would like some clarification on the behavior of the text handler.

Below is the description for the XML_SetCharacterDataHandler API:
------------------------------------------------------------------------
The string your handler receives is NOT zero terminated. You have to
use the length argument to deal with the end of the string. A single
block of contiguous text free of markup may still result in a sequence
of calls to this handler. In other words, if you're searching for a
pattern in the text, it may be split across calls to this handler.
------------------------------------------------------------------------

Let's say that I am passing the complete XML document to the XML_Parse()
API in a single shot. If I register a character data handler for
handling the element data, would I get the complete element
text data in a single call to my registered text handler? In other
words, can I safely assume that the first call to the text handler
routine would contain the complete text data? Even when I pass the
complete XML document to XML_Parse() in a single shot, can the text
data be split across calls to the data handler?

Any inputs in this regard would be helpful.

-- 
Thanks and Regards,
Suresh Kumar J

From nickmacd at gmail.com  Tue Apr 10 17:45:19 2007
From: nickmacd at gmail.com (Nick MacDonald)
Date: Tue, 10 Apr 2007 11:45:19 -0400
Subject: [Expat-discuss] Clarification on the behavior of the text
	handler
In-Reply-To: <88b2f6dd0704092223k1d749609l4bf9c76dfb71ce33@mail.gmail.com>
References: <88b2f6dd0704092223k1d749609l4bf9c76dfb71ce33@mail.gmail.com>
Message-ID: <bdcd32c90704100845p73a96bf0g3adf6c5361a612ca@mail.gmail.com>

If you want robust XML processing then you absolutely *SHOULD NOT*
make any assumptions about what you will receive...  you need to
concatenate them all together.  The most likely reason why you would
get multiple calls is for escaped text, such as &lt; and &amp; .

Try this kind of document if you want to see what I mean:
<Sample>
This is my sample text with escapes
&amp; in the middle of it &lt; which will likely
cause multiple calls to &gt; the handler
</Sample>
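
To see the splitting for yourself, here is a small sketch using Python's standard Expat binding. It records every chunk passed to the character data handler for the sample above; `collect_chunks` is a made-up helper, and the exact chunk boundaries are an implementation detail that should not be relied on.

```python
from xml.parsers import expat

SAMPLE = b"""<Sample>
This is my sample text with escapes
&amp; in the middle of it &lt; which will likely
cause multiple calls to &gt; the handler
</Sample>"""

def collect_chunks(document):
    """Return the list of strings passed to the character data handler."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append  # one list entry per call
    parser.Parse(document, True)
    return chunks

chunks = collect_chunks(SAMPLE)
print(len(chunks))
for chunk in chunks:
    print(repr(chunk))
```

Joining the chunks gives back the full text, which is why concatenating is the safe approach.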


From karl at waclawek.net  Tue Apr 10 18:48:48 2007
From: karl at waclawek.net (Karl Waclawek)
Date: Tue, 10 Apr 2007 12:48:48 -0400
Subject: [Expat-discuss] Clarification on the behavior of the text
	handler
In-Reply-To: <88b2f6dd0704092223k1d749609l4bf9c76dfb71ce33@mail.gmail.com>
References: <88b2f6dd0704092223k1d749609l4bf9c76dfb71ce33@mail.gmail.com>
Message-ID: <461BBFF0.4070004@waclawek.net>

Suresh Kumar J wrote:
> Hi there!
>
> I wanted to clarify on the behavior of the text handler.
>
> Below is the description for the XML_SetCharacterDataHandler API:
> ------------------------------------------------------------------------
> The string your handler receives is NOT zero terminated. You have to
> use the length argument to deal with the end of the string. A single
> block of contiguous text free of markup may still result in a sequence
> of calls to this handler. In other words, if you're searching for a
> pattern in the text, it may be split across calls to this handler.
> ------------------------------------------------------------------------
>
> Lets say that I am passing the complete XML document to the XMLParse()
> API in a single shot. So If I register a character data handler for
> handling the element data then would I be getting the complete element
> text data in a single call to my registered text handler. In other
> words, can I safely assume that the first call to the text handler
> routine would contain the complete text data?. Even when I pass the
> complete XML document to the XMLParse() in a single shot, can the text
> data be split across the calls to the data handler?.
>
> Any inputs in this regard would be helpful.
>   

Nick is correct - you cannot assume single call-backs.
For instance, any line break in element content will cause multiple 
call-backs, IIRC.

Karl

From andrelsm at iname.com  Tue Apr 10 19:00:05 2007
From: andrelsm at iname.com (Andre Luis Monteiro)
Date: Tue, 10 Apr 2007 12:00:05 -0500
Subject: [Expat-discuss] Clarification on the behavior of the text
	handler
Message-ID: <20070410170010.02FF81CE67F@ws1-6.us4.outblaze.com>

Nick, Kumar

Beyond that, there are some whitespace-handling intricacies, as explained in:

http://msdn2.microsoft.com/en-us/library/ms256097.aspx

Expat splits text content (even in CDATA sections) at '\n's, right?


Question: how do I add support for "xml:space" in my app?


[]s
andrelsm


From boris at codesynthesis.com  Tue Apr 10 21:07:31 2007
From: boris at codesynthesis.com (Boris Kolpackov)
Date: Tue, 10 Apr 2007 19:07:31 +0000 (UTC)
Subject: [Expat-discuss] Clarification on the behavior of the text
	handler
References: <88b2f6dd0704092223k1d749609l4bf9c76dfb71ce33@mail.gmail.com>
Message-ID: <evgn9j$rca$1@sea.gmane.org>

Hi Suresh,

"Suresh Kumar J" <suresh.kumar.j at gmail.com> writes:

> Lets say that I am passing the complete XML document to the XMLParse()
> API in a single shot. So If I register a character data handler for
> handling the element data then would I be getting the complete element
> text data in a single call to my registered text handler.

No, it still can be split across several calls. What you may want to
do to emulate the desired behavior is to accumulate the data in a
string buffer and then process it when you know that all the data
has been delivered, e.g., in the "end element" handler.

hth,
-boris
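
The accumulate-then-process approach described above might look like this with Python's standard Expat binding. It is a minimal sketch that assumes elements contain either text or child elements, not both; `TextCollector` and `collect_text` are names invented for the example.

```python
from xml.parsers import expat

class TextCollector:
    """Buffers character data and hands it over in the end-element handler."""

    def __init__(self):
        self.buffer = []
        self.texts = []  # (element name, complete text) pairs

    def start_element(self, name, attrs):
        self.buffer = []  # start collecting afresh for this element

    def character_data(self, data):
        self.buffer.append(data)  # may be called several times per text run

    def end_element(self, name):
        # All character data for this element has been delivered by now.
        self.texts.append((name, "".join(self.buffer)))
        self.buffer = []

def collect_text(document):
    collector = TextCollector()
    parser = expat.ParserCreate()
    parser.StartElementHandler = collector.start_element
    parser.CharacterDataHandler = collector.character_data
    parser.EndElementHandler = collector.end_element
    parser.Parse(document, True)
    return collector.texts

texts = collect_text(
    b"<doc><title>Tom &amp; Jerry</title><note>line one\nline two</note></doc>"
)
print(texts)
```

Mixed content (text interleaved with child elements) would need a stack of buffers instead of a single one.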


-- 
Boris Kolpackov
Code Synthesis Tools CC
http://www.codesynthesis.com
Open-Source, Cross-Platform C++ XML Data Binding


From Saumya.Agarwal at netapp.com  Tue Apr 17 17:13:52 2007
From: Saumya.Agarwal at netapp.com (Agarwal, Saumya)
Date: Tue, 17 Apr 2007 20:43:52 +0530
Subject: [Expat-discuss] How is SJIS encoding handled in expat?
Message-ID: <7026BCCA258BA2438F885772CA0B431307AA3A73@exbtc01.hq.netapp.com>

Hi,
 
I have a scenario in which the encoding of the data on the server is in SJIS format. The client requests this data from the server through an API, the server sends the output in XML parsed by the expat parser.
 
Here is the input and output  -
 
<?xml version='1.0' encoding='SHIFT-JIS' ?>
<!DOCTYPE netapp SYSTEM 'file:/etc/netapp_filer.dtd'>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.0"><file-inode-info><inode-number>1193746</inode-number><volume-name>vol0</volume-name></file-inode-info></netapp> 
 
OUTPUT:
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE netapp SYSTEM '/na_admin/netapp_filer.dtd'>
<netapp version='1.1' xmlns='http://www.netapp.com/filer/admin'>
<results status="passed"><volume-name>vol0</volume-name><volume-fsid>1996999850</volume-fsid><volume-uuid>42a93940-4ed9-11db-ba89-00a098032816</volume-uuid><inode-number>1193746</inode-number><number-of-parents>1</number-of-parents><inode-paths><inode-parent-info><inode-path>/vol/vol0/home/???????? ??????.doc</inode-path></inode-parent-info></inode-paths></results></netapp>

 
As seen above, the client declares the document encoding to be SHIFT-JIS. The server returns the proper data (it seems to be SJIS, as the Japanese characters are represented correctly in the output), but the encoding declared in the output document is UTF-8.
Now, the strange part is that even if the client declares the document encoding to be UTF-8 in the input, the server behavior is just the same!
 
Here are my questions -
1. Does expat support SJIS encoding? 
2. If yes, then how does it know the data is SJIS encoded and when does it call the appropriate handler? 
3. Is the output returned by expat, the SJIS encoded data, or does it convert the data to UTF-8 and return it?
4. Is there a way for expat to declare to the client that the data is actually SJIS and not UTF-8? We have another parser on the client side (libxml2) which fails with a parsing error when the XML output from expat is given to it, as the data is Japanese while the encoding declaration is UTF-8.
 
Thanks,
Saumya

From karl at waclawek.net  Tue Apr 17 18:14:30 2007
From: karl at waclawek.net (Karl Waclawek)
Date: Tue, 17 Apr 2007 12:14:30 -0400
Subject: [Expat-discuss] How is SJIS encoding handled in expat?
In-Reply-To: <7026BCCA258BA2438F885772CA0B431307AA3A73@exbtc01.hq.netapp.com>
References: <7026BCCA258BA2438F885772CA0B431307AA3A73@exbtc01.hq.netapp.com>
Message-ID: <4624F266.1040304@waclawek.net>

Agarwal, Saumya wrote:
> Hi,
>  
> I have a scenario in which the encoding of the data on the server is in SJIS format. The client requests this data from the server through an API, the server sends the output in XML parsed by the expat parser.
>  
> Here is the input and output  -
>  
> <?xml version='1.0' encoding='SHIFT-JIS' ?>
> <!DOCTYPE netapp SYSTEM 'file:/etc/netapp_filer.dtd'>
> <netapp xmlns="http://www.netapp.com/filer/admin" version="1.0"><file-inode-info><inode-number>1193746</inode-number><volume-name>vol0</volume-name></file-inode-info></netapp>
>  
> OUTPUT:
> <?xml version='1.0' encoding='UTF-8' ?>
> <!DOCTYPE netapp SYSTEM '/na_admin/netapp_filer.dtd'>
> <netapp version='1.1' xmlns='http://www.netapp.com/filer/admin'>
> <results status="passed"><volume-name>vol0</volume-name><volume-fsid>1996999850</volume-fsid><volume-uuid>42a93940-4ed9-11db-ba89-00a098032816</volume-uuid><inode-number>1193746</inode-number><number-of-parents>1</number-of-parents><inode-paths><inode-parent-info><inode-path>/vol/vol0/home/???????? ??????.doc</inode-path></inode-parent-info></inode-paths></results></netapp>
>
>  
> As seen above, the client declares the document encoding to be SHIFT-JIS. The server returns the proper data (it seems to be SJIS, as the Japanese characters are represented correctly in the output), but the encoding declared in the output document is UTF-8.
> Now, the strange part is that even if the client declares the document encoding to be UTF-8 in the input, the server behavior is just the same!
>  
> Here are my questions -
> 1. Does expat support SJIS encoding? 
>   

Not by default. You must register an "unknownEncodingHandler" that can
handle SHIFT-JIS.
Out of the box, Expat only supports ASCII, ISO-8859-1, UTF-8 and UTF-16
for input.
For an example, look at patch #888879 on the Expat web site.

> 2. If yes, then how does it know the data is SJIS encoded and when does it call the appropriate handler? 
>   

Normally, Expat would reject the input document. Do you know if an
"unknownEncodingHandler" is registered?
More likely, the XML_ParserCreate(const XML_Char *encoding) function
is called with a recognized encoding (instead of NULL). This would
override the encoding declaration and make Expat treat the document
as if it were encoded that way.

> 3. Is the output returned by expat, the SJIS encoded data, or does it convert the data to UTF-8 and return it?
>   

Expat always returns either UTF-8 or UTF-16, depending on how it was built.
My guess is, the server forces one of the built-in encodings when calling
XML_ParserCreate(const XML_Char *encoding). This can work as long as there
is no sequence of bytes that represents an invalid code point in that
encoding.

> 4. Is there a way for expat to declare to the client that the data is actually SJIS and not UTF-8? We have another parser on the client side (libxml2) which fails with a parsing error when the XML output from expat is given to it, as the data is Japanese while the encoding declaration is UTF-8.
>   

No, Expat always returns UTF-8 or UTF-16. I think there is an error on
the server side.
Since you say the characters returned by Expat are actually SJIS, I
assume that the server forces Expat to treat the input as one of the
built-in encodings (most likely UTF-8).

Karl
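
The two behaviours described above (output is always Unicode, and an encoding passed at parser creation overrides what the document would otherwise be taken as) can be tried out with Python's standard Expat binding. This sketch uses ISO-8859-1 instead of SJIS because it is one of the built-in encodings; `parse_text` is a made-up helper.

```python
from xml.parsers import expat

def parse_text(document, encoding=None):
    """Parse `document`, optionally forcing an encoding; return all text."""
    chunks = []
    parser = expat.ParserCreate(encoding)  # None: detect from the document
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)

# The byte 0xE9 is "e with acute accent" in ISO-8859-1, but invalid as UTF-8 here.
declared = b"<?xml version='1.0' encoding='ISO-8859-1'?><name>caf\xe9</name>"
undeclared = b"<name>caf\xe9</name>"

print(ascii(parse_text(declared)))                  # encoding from declaration
print(ascii(parse_text(undeclared, "ISO-8859-1")))  # encoding forced by caller
try:
    parse_text(undeclared)  # treated as UTF-8 by default
except expat.ExpatError as error:
    print("rejected:", error)
```

In both successful cases the handler receives decoded Unicode text, not the original bytes, so an application that writes XML back out has to choose and declare its own output encoding.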

From omer.anjum at tut.fi  Thu Apr 19 12:46:02 2007
From: omer.anjum at tut.fi (Omer Anjum)
Date: Thu, 19 Apr 2007 13:46:02 +0300
Subject: [Expat-discuss] Expat For Embedded System
Message-ID: <20070419134602.p4t905z54o4s0wsk@webmail.tut.fi>

Dear Expat Forum members

I am working on an embedded system and need a C or C++ based XML  
parser with a size of less than 100KB. Can you tell me whether Expat  
can help me solve my problem?
Regards
Omer


From jeffreyholle at bellsouth.net  Thu Apr 19 18:21:27 2007
From: jeffreyholle at bellsouth.net (Jeffrey Holle)
Date: Thu, 19 Apr 2007 12:21:27 -0400
Subject: [Expat-discuss] Handling include statements in XML files
Message-ID: <f084uu$adg$1@sea.gmane.org>

The XML files which I am attempting to parse with expat 2.0.0 have 
include statements which I presently cannot handle.

I've attempted to create a new parser via the 
XML_ExternalEntityParserCreate function, but I am not sure what the 
"context" parameter needs to be.  What should it be?

The included XML file is of the same type as the original, so I want all 
existing handlers to work with the included file.


From karl at waclawek.net  Thu Apr 19 20:03:38 2007
From: karl at waclawek.net (Karl Waclawek)
Date: Thu, 19 Apr 2007 14:03:38 -0400
Subject: [Expat-discuss] Expat For Embedded System
In-Reply-To: <20070419134602.p4t905z54o4s0wsk@webmail.tut.fi>
References: <20070419134602.p4t905z54o4s0wsk@webmail.tut.fi>
Message-ID: <4627AEFA.1010101@waclawek.net>

Omer Anjum wrote:
> Dear Expat Forum members
>
> I am working on an embedded system and need a C or C++ based XML  
> parser with a size of less than 100KB. Can you tell me whether Expat  
> can help me solve my problem?
> Regards
>   

I don't think Expat's size is < 100KB on any system, but there are
compile-time options to minimize the size of Expat (disabling certain
features, etc.; read the documentation), and there are also compiler
options to keep the size smaller (for a little less performance).
Maybe you can trim it down to about 100KB that way.

Karl

From nickmacd at gmail.com  Thu Apr 19 21:12:25 2007
From: nickmacd at gmail.com (Nick MacDonald)
Date: Thu, 19 Apr 2007 15:12:25 -0400
Subject: [Expat-discuss] Handling include statements in XML files
In-Reply-To: <f084uu$adg$1@sea.gmane.org>
References: <f084uu$adg$1@sea.gmane.org>
Message-ID: <bdcd32c90704191212r7e475ebr8dc8313eb47923f2@mail.gmail.com>

Jeffrey:

I think you need to supply some sample XML here to make it clear what
you're trying to do.  If I understand you properly, you have some sort
of embedded XML tag that you want to process to get another XML file
that should be considered part of the original file?

Like so?

<MyXML>
  <include file="someXMLfileToInclude.xml" />
</MyXML>

If this is the case, then you really want to just create a new parser
to start parsing the included file right at the point that you get the
call back for the <include> end tag.  You would then read that file
for the duration of the file, and then when it concludes successfully,
you would return to parsing the original file.  The only downside of
this approach is that the included XML file would need to be valid,
it couldn't be a partial file.  If you needed to handle partial files,
you'd probably need to generate an interim file which would include
the content from all the files included, and then you would process
that interim file and then presumably delete it as part of the clean
up.

Thus, in that case, you would have two steps:
  1. copy the input file to a new file
      while include statements found in new file
        parse the new file for includes
        when include statement is found merge the two files into one new file
      end
  2. parse the final new file

Good luck with your project...
  Nick
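
The nested-parser approach for complete include files could be sketched like this with Python's standard Expat binding. Everything specific here is an assumption for illustration: the `<include file="..."/>` syntax is the guess from above, the "files" are an in-memory dictionary, and the include is processed at the start tag because that is where the attributes are available.

```python
from xml.parsers import expat

# Hypothetical documents standing in for files on disk.
DOCUMENTS = {
    "main.xml": b"<MyXML><item id='1'/><include file='extra.xml'/>"
                b"<item id='3'/></MyXML>",
    "extra.xml": b"<MyXML><item id='2'/></MyXML>",
}

def parse_with_includes(name, seen_items):
    """Parse DOCUMENTS[name]; recurse with a new parser on every include."""
    parser = expat.ParserCreate()

    def start_element(tag, attrs):
        if tag == "include":
            # A separate parser handles the included document, while the
            # outer parser simply waits inside this callback.
            parse_with_includes(attrs["file"], seen_items)
        elif tag == "item":
            seen_items.append(attrs["id"])

    parser.StartElementHandler = start_element  # same handler at every level
    parser.Parse(DOCUMENTS[name], True)

items = []
parse_with_includes("main.xml", items)
print(items)
```

A real program would also want to guard against include cycles, for example by tracking the names that are currently being parsed.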

-- 
Nick MacDonald
NickMacD at gmail.com

From benski at nullsoft.com  Thu Apr 19 21:12:12 2007
From: benski at nullsoft.com (Ben Allison)
Date: Thu, 19 Apr 2007 15:12:12 -0400 (EDT)
Subject: [Expat-discuss] Expat For Embedded System
In-Reply-To: <4627AEFA.1010101@waclawek.net>
References: <20070419134602.p4t905z54o4s0wsk@webmail.tut.fi>
	<4627AEFA.1010101@waclawek.net>
Message-ID: <1145.75.75.86.10.1177009932.squirrel@mail.winamp.com>

> Omer Anjum wrote:
>> Dear Expat Forum members
>>
>> I am working on an embedded system and need a C or C++ based XML
>> parser with a size of less than 100KB. Can you tell me whether Expat
>> can help me solve my problem?
>> Regards
>>
>
> I don't think Expat's size is < 100KB on any system, but there are
> compile time options
> to minimize the size of Expat (disabling certain features, etc. - read
> the documentation)
> and there are also compiler options to keep the size smaller ( for a
> little less performance).
> Maybe you can trim it down to about 100KB that way.

If you statically link expat and have a reasonably smart compiler, it
should come in at less than 100KB (we get it down to about 75KB even with
UTF-16 and namespaces enabled).  Creating a dynamic library (e.g. a DLL or
.so) tends to be larger because the compiler can't know which code will and
won't be called.  If you need a dynamic library (maybe because multiple
programs use the library), then you should be able to trim some code from
unused areas by tweaking your compiler and linker settings.

From santhoshpremkumar at gmail.com  Fri Apr 20 07:29:46 2007
From: santhoshpremkumar at gmail.com (Santhosh Premkumar)
Date: Fri, 20 Apr 2007 10:59:46 +0530
Subject: [Expat-discuss] Build on x86_64 AMD cross compilation
Message-ID: <54ac2f0b0704192229g4790355fnd5c9ddfe6ee36047@mail.gmail.com>

Hi

  I have a problem building the Expat library with the VC++ 2005 compiler
for AMD x86_64 (64-bit).

I need to build the library using this compiler. I tried to configure using
automake and run make, but it reports that the built library is invalid.

Has anyone built this with the VC++ compiler using a makefile (I am running
in a shell prompt)? Please guide me.

Thanks
Santhosh

From santhoshpremkumar at gmail.com  Fri Apr 20 07:46:03 2007
From: santhoshpremkumar at gmail.com (Santhosh Premkumar)
Date: Fri, 20 Apr 2007 11:16:03 +0530
Subject: [Expat-discuss] Build on x86_64 AMD cross compilation
In-Reply-To: <7C83A8A6B56D3A478333B1DF47E185868E524B@MPBABGEX01.corp.mphasis.com>
References: <54ac2f0b0704192229g4790355fnd5c9ddfe6ee36047@mail.gmail.com>
	<7C83A8A6B56D3A478333B1DF47E185868E524B@MPBABGEX01.corp.mphasis.com>
Message-ID: <54ac2f0b0704192246k38c5997cw64876a11c450b9d4@mail.gmail.com>

Hi

I couldn't find the documents. Could you please provide the link?

I will elaborate on my requirements:

1. I have downloaded the Expat 2 library files
2. Tried to build with the Microsoft Visual Studio compiler
(C:\Program Files\Microsoft Visual Studio 8\VC\bin\x86_amd64)

This is being done in order to build a 64-bit port of the Expat library.

I did these steps:

1. ./configure --host=x86 --target=x86_64 CC=CL
2. make buildlib
3. make install

During installation, xmlwf (in the expat folder, alongside the test files)
shows the warning "Invalid library format: Ignored". So the library built
with the x86_64 compiler appears to be wrong.

I do not know where I went wrong (in the configure step or in the linker).

If you have any ideas, please share them.

Thanks
I Santhosh
Chennai
Driver Testing and Development
Chennai
9884937329


On 4/20/07, Mukesh Kumar <Mukesh.S at mphasis.com> wrote:
>
> To run the Expat library, you should also have ActivePerl installed.
> Please go through the documentation provided with Expat, and if you don't
> understand any step, let us know...
>
> Regards,
> Mukesh Kumar,
> Sr.Software Engineer,
> Bangalore,
> India.
> 9342906419 (M).
>
> -----Original Message-----
> From: expat-discuss-bounces+mukesh.s=mphasis.com at libexpat.org
> [mailto:expat-discuss-bounces+mukesh.s=mphasis.com at libexpat.org] On
> Behalf Of Santhosh Premkumar
> Sent: Friday, April 20, 2007 11:00 AM
> To: expat-discuss at libexpat.org
> Subject: [Expat-discuss] Build on x86_64 AMD cross compilation
>
> Hi
>
> I have a problem building the Expat library with the VC++ 2005 compiler
> for AMD x86_64 (64-bit).
>
> I need to build the library using this compiler. I tried to configure
> using automake and run make, but it reports that the built library is
> invalid.
>
> Has anyone built this with the VC++ compiler using a makefile (I am
> running in a shell prompt)? Please guide me.
>
> Thanks
> Santhosh

From Saumya.Agarwal at netapp.com  Fri Apr 20 10:36:33 2007
From: Saumya.Agarwal at netapp.com (Agarwal, Saumya)
Date: Fri, 20 Apr 2007 14:06:33 +0530
Subject: [Expat-discuss] How is SJIS encoding handled in expat?
Message-ID: <7026BCCA258BA2438F885772CA0B431307C4E5CC@exbtc01.hq.netapp.com>


Thanks Karl. The problem was that the XML_ParserCreate(const XML_Char *encoding) function was being called with "UTF-8", which overrode the encoding declaration, as you suspected.

>Not by default. You must register an "unknownEncodingHandler" that can handle SHIFT-JIS.
>Out of the box, Expat only supports ASCII, ISO-8859-1, UTF-8 and UTF-16 for input.
>For an example, look at patch #888879 on the Expat web site.

Where can I find an encoding handler which can handle SHIFT-JIS?  Will Expat be able to support both UTF-8 and SHIFT-JIS encoding at the same time if I register such a handler?

Thanks,
Saumya

-----Original Message-----
From: Karl Waclawek [mailto:karl at waclawek.net] 
Sent: Tuesday, April 17, 2007 9:45 PM
To: expat-discuss at libexpat.org
Subject: Re: [Expat-discuss] How is SJIS encoding handled in expat?

Agarwal, Saumya wrote:
> Hi,
>  
> I have a scenario in which the encoding of the data on the server is in SJIS format. The client requests this data from the server through an API, the server sends the output in XML parsed by the expat parser.
>  
> Here is the input and output  -
>  
> <?xml version='1.0' encoding='SHIFT-JIS' ?>
> <!DOCTYPE netapp SYSTEM 'file:/etc/netapp_filer.dtd'>
> <netapp xmlns="http://www.netapp.com/filer/admin" version="1.0">
>   <file-inode-info>
>     <inode-number>1193746</inode-number>
>     <volume-name>vol0</volume-name>
>   </file-inode-info>
> </netapp>
>  
> OUTPUT:
> <?xml version='1.0' encoding='UTF-8' ?>
> <!DOCTYPE netapp SYSTEM '/na_admin/netapp_filer.dtd'>
> <netapp version='1.1' xmlns='http://www.netapp.com/filer/admin'>
> <results status="passed">
>   <volume-name>vol0</volume-name>
>   <volume-fsid>1996999850</volume-fsid>
>   <volume-uuid>42a93940-4ed9-11db-ba89-00a098032816</volume-uuid>
>   <inode-number>1193746</inode-number>
>   <number-of-parents>1</number-of-parents>
>   <inode-paths>
>     <inode-parent-info>
>       <inode-path>/vol/vol0/home/???????? ??????.doc</inode-path>
>     </inode-parent-info>
>   </inode-paths>
> </results>
> </netapp>
>
>  
> As seen above, the client declares the document encoding to be SHIFT-JIS. The server returns the proper data (it seems to be SJIS, as the Japanese characters are represented correctly in the output), but the encoding declared in the output document is UTF-8.
> Now, the strange part is that even if the client declares the document encoding to be UTF-8 in the input, the server behavior is just the same!
>  
> Here are my questions -
> 1. Does expat support SJIS encoding? 
>   

Not by default. You must register an "unknownEncodingHandler" that can handle SHIFT-JIS.
Out of the box, Expat only supports ASCII, ISO-8859-1, UTF-8 and UTF-16 for input.
For an example, look at patch #888879 on the Expat web site.

> 2. If yes, then how does it know the data is SJIS encoded and when does it call the appropriate handler? 
>   

Normally, Expat would reject the input document. Do you know if there is an "unknownEncodingHandler"?
Or more likely, the XML_ParserCreate(const XML_Char *encoding) function is called with a recognized encoding (instead of NULL). This would override the encoding declaration and make Expat treat the document as if it were thus encoded.

> 3. Is the output returned by expat, the SJIS encoded data, or does it convert the data to UTF-8 and return it?
>   

Expat always returns either UTF-8 or UTF-16, depending on how it was built.
My guess is, the server forces one of the built-in encodings when calling XML_ParserCreate(const XML_Char *encoding). This can work as long as there is no sequence of bytes that represents an invalid code point in that encoding.
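
To illustrate why forcing a built-in encoding is fragile, here is a minimal, self-contained C sketch. It is not Expat code, and the function name is_valid_utf8 is made up for this illustration. It only checks whether a byte sequence is well-formed UTF-8. Typical Shift-JIS text is not, so a parser that has been told to expect UTF-8 would be expected to reject such bytes, while pure-ASCII documents pass unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Expat): returns 1 if buf[0..len) is
 * well-formed UTF-8, 0 otherwise. Rejects stray continuation bytes,
 * truncated sequences, overlong forms, surrogates and values > U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;            /* number of continuation bytes expected */
        unsigned long cp, min;   /* decoded value and smallest legal value */

        if (b < 0x80) {          /* plain ASCII byte */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;            /* stray continuation byte or invalid lead */
        }

        if (len - i <= extra)
            return 0;            /* sequence is cut off at end of buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;        /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;            /* overlong, out of range, or surrogate */

        i += extra + 1;
    }
    return 1;
}
```

For example, the four Shift-JIS bytes 93 FA 96 7B are rejected because 0x93 cannot start a UTF-8 sequence, whereas E6 97 A5 E6 9C AC, the UTF-8 encoding of the same two characters, is accepted.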

> 4. Is there a way for Expat to declare to the client that the data is actually SJIS and not UTF-8? We have another parser on the client side (libxml2), which fails with a parsing error when the XML output from Expat is given to it, as the data is Japanese while the encoding declaration is UTF-8.
>   

No, Expat always returns UTF-8 or UTF-16. I think there is an error on the server side.
Since you say the characters returned by Expat are actually SJIS, I assume that the server forces Expat to treat it as one of the built-in encodings (most likely UTF-8).
>  Karl
>
>   

From karl at waclawek.net  Fri Apr 20 14:57:39 2007
From: karl at waclawek.net (Karl Waclawek)
Date: Fri, 20 Apr 2007 08:57:39 -0400
Subject: [Expat-discuss] Build on x86_64 AMD cross compilation
In-Reply-To: <54ac2f0b0704192229g4790355fnd5c9ddfe6ee36047@mail.gmail.com>
References: <54ac2f0b0704192229g4790355fnd5c9ddfe6ee36047@mail.gmail.com>
Message-ID: <4628B8C3.3030902@waclawek.net>

Santhosh Premkumar wrote:
> Hi
>
>   I have a problem in building Expat Library in VC++ 2005 for AMD x86_64 bit
> compiler.
>
>   


Why don't you use the VC++ 6.0 project files (.dsw, .dsp)?
They can be opened and upgraded in VC++ 2005.

Karl

From karl at waclawek.net  Fri Apr 20 15:10:25 2007
From: karl at waclawek.net (Karl Waclawek)
Date: Fri, 20 Apr 2007 09:10:25 -0400
Subject: [Expat-discuss] How is SJIS encoding handled in expat?
In-Reply-To: <7026BCCA258BA2438F885772CA0B431307C4E5CC@exbtc01.hq.netapp.com>
References: <7026BCCA258BA2438F885772CA0B431307C4E5CC@exbtc01.hq.netapp.com>
Message-ID: <4628BBC1.7050408@waclawek.net>

Agarwal, Saumya wrote:
> Thanks Karl. The problem was that the XML_ParserCreate(const XML_Char *encoding) function was being called with "UTF-8", which overrode the encoding declaration, as you suspected.
>
>   
>> Not by default. You must register an "unknownEncodingHandler" that can handle SHIFT-JIS.
>> Out of the box, Expat only supports ASCII, ISO-8859-1, UTF-8 and UTF-16 for input.
>> For an example, look at patch #888879 on the Expat web site. 
>>     
>
> Where can I find an encoding handler which can handle SHIFT-JIS?  Will Expat be able to support both UTF-8 and SHIFT-JIS encoding at the same time if I register such a handler?
>
>   

I don't know of a publicly available one. You could roll your own, using
the docs and the example I mentioned above (for GB2312), or you could
simply convert the SHIFT-JIS input to UTF-8 before parsing.
Just Google for it - there may be some open-source code available.
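
As a starting point for rolling your own, the sketch below shows only the first half of such a handler: classifying each possible first byte of Shift-JIS in the convention Expat documents for the 256-entry map in XML_Encoding. The function names sjis_map_entry and fill_sjis_map are made up for this illustration, and treating 0x00-0x7F as plain ASCII is an assumption (strict JIS X 0201 differs at 0x5C and 0x7E). A working handler would also need a convert callback backed by a full JIS X 0208 table, which is not shown.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: value for one entry of a 256-element byte map,
 * following the convention documented for XML_Encoding.map:
 *   >= 0 : the byte on its own encodes that Unicode code point
 *   -1   : the byte is not valid as the start of a character
 *   -2   : the byte is the first of a two-byte sequence
 * Assumption: bytes 0x00-0x7F are treated as plain ASCII.
 */
int sjis_map_entry(unsigned char b)
{
    if (b <= 0x7F)
        return b;                          /* single-byte, same as ASCII */
    if (b >= 0xA1 && b <= 0xDF)
        return 0xFF61 + (b - 0xA1);        /* half-width katakana block */
    if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC))
        return -2;                         /* lead byte of a 2-byte char */
    return -1;                             /* 0x80, 0xA0, 0xFD-0xFF */
}

/* Fill a caller-provided 256-entry map, e.g. the map member of an
 * XML_Encoding structure handed to an unknown-encoding handler. */
void fill_sjis_map(int map[256])
{
    assert(map != NULL);
    for (int i = 0; i < 256; i++)
        map[i] = sjis_map_entry((unsigned char)i);
}
```

Because the handler is only consulted for encodings Expat does not know, documents declared as UTF-8 would still go through the built-in decoder, so both kinds of input can be parsed by the same program.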

Karl

From rcruz at cpsinet.com  Mon Apr 23 15:12:34 2007
From: rcruz at cpsinet.com (Robert Cruz)
Date: Mon, 23 Apr 2007 08:12:34 -0500
Subject: [Expat-discuss] UnixWare port not part of the standard distribution?
Message-ID: <002c01c785a9$11d4bb70$d105000a@cpsinet.com>

I'm trying to work with an external vendor whose platform we depend on in
many ways.  We were looking into utilizing their XML capabilities, which use
the expat library, however we kept receiving an error message telling us
that we did not have expat installed.  

We run SCO UnixWare 7.14, and I have installed a package for expat 1.95.1
that I got from SCO's skunkware website.  I installed that port, but I still
received the same error message from the third party platform.  

When I contacted them about the error, they stated that the UnixWare port
must not be a part of the standard distribution of the library that they are
using.  I've looked through some of the files in expat 2, and have seen
config code for UnixWare, so I'm not 100% convinced that what they are
saying is accurate.  However, I'm obviously not a developer on the expat
project, and therefore my opinion doesn't really count for much.  

I am curious to see if their position has any merit.  If it does, what needs
to be done to fold any UnixWare specific development into the standard
distribution of the expat library?

Thanks,


Robert Cruz
Senior Programmer
CPSI

6600 Wall Street 
Mobile, Alabama 36695 
Tel: 251.639.8100
Fax: 251.639.8214 

http://www.cpsinet.com


From stevencvernon at comcast.net  Sun Apr 29 07:09:48 2007
From: stevencvernon at comcast.net (Steve Vernon)
Date: Sat, 28 Apr 2007 22:09:48 -0700
Subject: [Expat-discuss] expat Memory Footprint?
Message-ID: <005d01c78a1c$9c4603a0$6402a8c0@dell4700>

What is the memory footprint for expat?  How does it grow with the input document?

I would like to know what is the starting size, then what "variables" it depends upon and what is the factor of expansion for each such variable.  (Of course, I am talking about non-static data size - compiled code size is not likely to be an issue.)

For example, I assume that for parsing that we need to at minimum keep track of all the open start tags (ones that so far have seen no corresponding end tags) and that these are stored in XML_Chars.  But perhaps there is other overhead per open start tag.  And, of course, there could be information stored for DTD items.

Your best estimate of the size would be appreciated; anything within a factor of 2 of the real figure is fine.  It seems that XML_MIN_SIZE affects some size, but I am not certain that it affects data size - again, we won't likely have issues with code size.  We will likely want to use most features of Expat, so we don't want to turn off features.

We have a constrained memory environment and will need to do many parallel parses.  With enough parallelism the total memory footprint could get prohibitive if expat is not frugal.  Essentially every byte may count.

Thanks in advance.