From carlos@pehoe.civil.ist.utl.pt Tue Oct 2 17:15:04 2001
From: carlos@pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Tue Oct 2 16:15:04 2001
Subject: [Expat-discuss] BUFSIZ
Message-ID: <200110022318.AAA23504@pehoe.civil.ist.utl.pt>

Regarding the examples that come with Expat,

In elements.c the buffer has a size indicated by BUFSIZ, which
in turn is defined on my Intel-based Linux system as 8192,
and in outline.c 8192 is explicitly indicated.

What are the implications of using a smaller or larger buffer
in Expat? Of course this is reflected in memory usage, the number of
operations required to parse the whole document, etc...

But why use 2**13?

Sorry if this is a dumb question, I am an XML newbie...

Carlos

From fdrake@acm.org Tue Oct 2 21:23:01 2001
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue Oct 2 20:23:01 2001
Subject: [Expat-discuss] BUFSIZ
In-Reply-To: <200110022318.AAA23504@pehoe.civil.ist.utl.pt>
References: <200110022318.AAA23504@pehoe.civil.ist.utl.pt>
Message-ID: <15290.33442.977801.67256@grendel.zope.com>

Carlos Pereira writes:
 > In elements.c the buffer has a size indicated by BUFSIZ which
 > in turn is defined in my Intel-based Linux system as 8192,
 > and in outline.c 8192 is explicitly indicated.
 >
 > What are the implications of using a smaller or larger buffer
 > in expat, of course this reflects in memory usage, number of
 > operations required to parse the whole document, etc...

The smaller the number of delays for disk reads, the better, of
course, but I don't suspect there's enough control here for that to
really matter -- really controlling the I/O performance with modern
operating systems and disk controllers is non-trivial and not worth
the trouble for most applications. And everything changes as soon as
the disk system is changed; an adaptive piece of software is very
difficult to create and debug.
It probably only pays off for very high-end database systems and in
gigabit-networking research (where the delays could invalidate the
research if the network is waiting for data to be read or written).
So I wouldn't bother worrying about this; I've seen the side where it
matters, and this isn't it!

 > But why use 2**13?

I think this is the most common disk block size on modern hard disks,
but I'm not entirely sure. Making it a multiple of this makes some
sense for I/O efficiency. I don't know just how much of an issue this
is in practice.

  -Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From persicom@acedsl.com Mon Oct 8 19:54:15 2001
From: persicom@acedsl.com (Matthew O. Persico)
Date: Mon Oct 8 18:54:15 2001
Subject: [Expat-discuss] Need to build on Win2000k
Message-ID: <3BC259D4.4070205@acedsl.com>

I know that the recommended installation on Winboxes is to use the
Windows binary. However, when trying to build XML::Parser, I get a
compile error:

D:\opt\Expat\1.95.2\libs\expat.lib : fatal error LNK1106: invalid file
or disk full: cannot seek to 0x3b620589
NMAKE : fatal error U1077: 'link' : return code '0xc'
Stop.

I have gotten this type of error before. It is usually due to trying to
link with some lib file built with VC++6.0. I am using VC++5.0. :-(

I took a shot at using the expat.dsw file in the Expat\Source directory
of the Win32 distribution. I got this error:

LINK : fatal error LNK1104: cannot open file "LIBCMTD.lib"
Error executing link.exe.

Does anyone have any pointers as to how I may proceed?

Thank you
--
Matthew O. Persico

http://www.acecape.com/dsl
AceDSL:The best ADSL in Verizon area

From persicom@acedsl.com Tue Oct 9 19:50:10 2001
From: persicom@acedsl.com (Matthew O. Persico)
Date: Tue Oct 9 18:50:10 2001
Subject: [Expat-discuss] Need to build on Win2000k
References: <3BC259D4.4070205@acedsl.com>
Message-ID: <3BC3AAA9.1020605@acedsl.com>

Matthew O.
Persico wrote:

> I know that the recommended installation on Winboxes is to use the
> Windows binary. However, when trying to build XML::Parser, I get a
> compile error:
>
> D:\opt\Expat\1.95.2\libs\expat.lib : fatal error LNK1106: invalid file
> or disk full: cannot seek to 0x3b620589
> NMAKE : fatal error U1077: 'link' : return code '0xc'
> Stop.
>
> I have gotten this type of error before. It is usually due to trying to
> link with some lib file built with VC++6.0. I am using VC++5.0. :-(
>
> I took a shot at using the expat.dsw file in the Expat\Source directory
> of the Win32 distribution. I got this error
> LINK : fatal error LNK1104: cannot open file "LIBCMTD.lib"
> Error executing link.exe.
>
> Does anyone have any pointers as to how I may proceed?
>
> Thank you
>

Turns out (with a little mental prodding from brc@fourlittlemice.com)
that the library was back on the installation disk, in plain sight, no
CAB files involved (surprise, surprise). I dropped the library into the
DevStudio tree and voila! It linked like a charm.

My next problem is that while I was building Perl's XML::Parser, the
test kept failing because it couldn't find
XML_GetSpecifiedAttributeCount in the expat.dll.

If anyone has any other clues, I'd be grateful again (realizing this
is NOT a Perl group).

--
Matthew O. Persico

http://www.acecape.com/dsl
AceDSL:The best ADSL in Verizon area

From tcpo@yahoo.com Wed Oct 10 03:05:11 2001
From: tcpo@yahoo.com (Stephen Po)
Date: Wed Oct 10 02:05:11 2001
Subject: [Expat-discuss] using XML_SetUnknownEncodingHandler for big5 encoding
Message-ID: <20011010090423.59972.qmail@web10705.mail.yahoo.com>

hi all,

i am trying to use expat with xmls in big5 encoding. i know i need to
set an unknown encoding handler for the job. however, i don't know how
i should implement the callback function even after i've read the docs.
can someone show me some code on how to use
XML_SetUnknownEncodingHandler?
thanks very much,
stephen po

__________________________________________________
Do You Yahoo!?
Make a great connection at Yahoo! Personals.
http://personals.yahoo.com

From una.kearns@documentum.com Fri Oct 12 16:36:05 2001
From: una.kearns@documentum.com (Kearns, Una)
Date: Fri Oct 12 15:36:05 2001
Subject: [Expat-discuss] Problem ignoring non declared entities during parsing
Message-ID: <379250A13595D31190660090278AC63C0598D95E@corpismsg02>

Hi,

I have the following problem -- due to the stage in processing, I will
have XML files that will not be well-formed --- what will be missing
are Doctype Declarations and Entity Declarations.

Is there a way with Expat to add an entity handler so this will still
parse (i.e. ignore this case) --- i.e. I will have entity references
with no entity declarations. So on hitting an entity -- just do
nothing and continue parsing if no declaration is found.

Thanks for your assistance,
Una

From carlos@pehoe.civil.ist.utl.pt Sun Oct 14 09:05:04 2001
From: carlos@pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Sun Oct 14 08:05:04 2001
Subject: [Expat-discuss] attribute line numbers?
Message-ID: <200110141508.QAA28198@pehoe.civil.ist.utl.pt>

Hi all,

Expat newbie alert!

When I find a wrong attribute (detected by my own app code),
is there a way to know the file line where that attribute
actually was?

As far as I can see, XML_GetCurrentLineNumber (parser)
gives me the line where the new element starts, which can be
many lines above the real problem.

Thanks!
Carlos

From fdrake@acm.org Mon Oct 15 07:27:05 2001
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Mon Oct 15 06:27:05 2001
Subject: [Expat-discuss] attribute line numbers?
In-Reply-To: <200110141508.QAA28198@pehoe.civil.ist.utl.pt>
References: <200110141508.QAA28198@pehoe.civil.ist.utl.pt>
Message-ID: <15306.57943.343018.198032@grendel.zope.com>

Carlos Pereira writes:
 > When I find a wrong attribute (detected by my own app code),
 > is there a way to know the file line where that attribute
 > actually was?

XML_GetCurrentLineNumber() is as close as you can get; sorry. Even
if Expat gave you the line number for the name= portion of the
attribute, that could be many (albeit blank) lines from the problem as
well, so I think it unlikely that the problem is really solved with
just the line number. If the source uses an encoding your application
understands, the start of the element is really enough to locate the
attribute value (if using namespaces, make sure you hook the namespace
callback so you can track prefixes for proper decoding).

  -Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From carlos@pehoe.civil.ist.utl.pt Mon Oct 15 13:11:09 2001
From: carlos@pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Mon Oct 15 12:11:09 2001
Subject: [Expat-discuss] Re: attribute line numbers?
Message-ID: <200110151914.UAA30444@pehoe.civil.ist.utl.pt>

>Carlos Pereira writes:
> > When I find a wrong attribute (detected by my own app code),
> > is there a way to know the file line where that attribute
> > actually was?
>
> XML_GetCurrentLineNumber() is as close as you can get; sorry. Even
>if Expat gave you the line number for the name= portion of the
>attribute, that could be many (albeit blank) lines from the problem as
>well, so I think it unlikely that the problem is really solved with
>just the line number.
>If the source uses an encoding your application
>understands, the start of the element is really enough to locate the
>attribute value (if using namespaces, make sure you hook the namespace
>callback so you can track prefixes for proper decoding).
>
>-Fred

Thanks,

Showing the attribute name (which is unique for each element)
and the line number of the element is enough for users to
detect mistakes.

Anyway, I thought Expat could have a second array, for example:

int *line = XML_GetAttributeLineNumbers(parser);
printf("Wrong attribute name %s at line %d\n", attribute[i], line[i]);

The line numbers for the attribute name and the attribute value
could be different, as you said, so in general line[i] != line[i+1]

I am NOT saying that this is a good idea, just asking...
Would this "feature" have an impact on Expat speed?
(after all you still have to count the lines...)

I certainly want to see Expat keeping the crown as the fastest
XML parser under the sun... :-)

Carlos

From persicom@acedsl.com Wed Oct 17 20:49:10 2001
From: persicom@acedsl.com (Matthew O. Persico)
Date: Wed Oct 17 19:49:10 2001
Subject: [Expat-discuss] Error using pre-built windows libs
Message-ID: <3BCE4460.3090207@acedsl.com>

While trying to test the build of Perl module XML::Parser, I get this:

>The procedure entry point XML_GetSpecifiedAttributeCount could not be
>located in the dynamic library expat.dll

I am using a self-built Perl on Win2k, using VC++6.0 (details below).

I tried both the pre-built version of the expat libs
(expat_win32bin_1_95_2.exe from sourceforge) and I built them using the
code that comes with the pre-built version.

I look at the expat code and in xmlparse.c I find this:

>int XML_GetSpecifiedAttributeCount(XML_Parser parser)
>{
>  return nSpecifiedAtts;
>}

So, if it's in the code, does anyone have any ideas as to why it
apparently is not in the library?
Given that these are C and not C++ files, there should be no
name-mangling issues yes?

What could I be doing to track down this problem?

Thank you

perl info:

> Summary of my perl5 (revision 5 version 6 subversion 1) configuration:
>   Platform:
>     osname=MSWin32, osvers=4.0, archname=MSWin32-x86-multi-thread
>     uname=''
>     config_args='undef'
>     hint=recommended, useposix=true, d_sigaction=undef
>     usethreads=undef use5005threads=undef useithreads=define usemultiplicity=define
>     useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
>     use64bitint=undef use64bitall=undef uselongdouble=undef
>   Compiler:
>     cc='cl', ccflags ='-nologo -O1 -MD -DNDEBUG -DWIN32 -D_CONSOLE -DNO_STRICT -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DPERL_MSVCRT_READFIX',
>     optimize='-O1 -MD -DNDEBUG',
>     cppflags='-DWIN32'
>     ccversion='', gccversion='', gccosandvers=''
>     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
>     d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
>     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
>     alignbytes=8, usemymalloc=n, prototype=define
>   Linker and Libraries:
>     ld='', ldflags ='-nologo -nodefaultlib -release -libpath:"D:\opt\perl\5.6.1\lib\MSWin32-x86-multi-thread\CORE" -machine:x86'
>     libpth=d:\mvs\vc98\lib
>     libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib wsock32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
>     perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib wsock32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
>     libc=msvcrt.lib, so=dll, useshrplib=yes, libperl=perl56.lib
>   Dynamic Linking:
>     dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
>     cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -release
-libpath:"D:\opt\perl\5.6.1\lib\MSWin32-x86-multi-thread\CORE" -machine:x86'
>
>
> Characteristics of this binary (from libperl):
>   Compile-time options: MULTIPLICITY USE_ITHREADS PERL_IMPLICIT_CONTEXT PERL_IMPLICIT_SYS
>   Built under MSWin32
>   Compiled at Oct 16 2001 23:39:46
>   %ENV:
>     PERLPATH="D:\opt\perl\5.6.1\bin\MSWin32-x86-multi-thread;D:\opt\perl\5.6.1\bin"
>   @INC:
>     D:/opt/perl/5.6.1/lib/MSWin32-x86-multi-thread
>     D:/opt/perl/5.6.1/lib
>     D:/opt/perl/site/5.6.1/lib/MSWin32-x86-multi-thread
>     D:/opt/perl/site/5.6.1/lib

--
Matthew O. Persico

http://www.acecape.com/dsl
AceDSL:The best ADSL in Verizon area

From pkadakuntla@axsone.com Fri Oct 19 12:42:21 2001
From: pkadakuntla@axsone.com (Kadakuntla, Pankaja)
Date: Fri Oct 19 11:42:21 2001
Subject: [Expat-discuss] UTF-8 data
Message-ID: <4417661F91D5354996C29CDE68B680B6389C47@corpntex01.ctronsoft.com>

When I parse an XML file using Expat I get a parse error,
"XML_ERROR_INVALID_TOKEN". I see the line contains a special
character (0xC2).

I read in the documentation that the Expat parser uses UTF-8 as the
default encoding and has built-in support for the following encodings:

* utf-8
* utf-16
* iso-8859-1
* us-ascii

I have two questions.

1. Is there any way I could tell the parser to ignore such characters?

2. In the program that generates XML I need to validate the characters
for a specific encoding before I write them out to the XML file. Has
anyone done this before?
Can Expat's internal functions be of any use to validate characters for
the specified encoding (e.g. UTF-8, iso-8859-1, etc.)?

From carlos@pehoe.civil.ist.utl.pt Fri Oct 19 13:21:23 2001
From: carlos@pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Fri Oct 19 12:21:23 2001
Subject: [Expat-discuss] explicit unsetting required before new one?
Message-ID: <200110191923.UAA16513@pehoe.civil.ist.utl.pt>

I need to change the element_start
and element_end handlers while parsing.

Do I need to unset the old handlers with:
XML_SetElementHandler (parser, NULL, NULL);
before setting the new handlers?

(Is this just changing the function pointers,
or is some memory deallocation actually taking place
that requires an explicit freeing operation
to avoid, say, a mem leak?)

Carlos

From fdrake@acm.org Fri Oct 19 13:40:21 2001
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri Oct 19 12:40:21 2001
Subject: [Expat-discuss] explicit unsetting required before new one?
In-Reply-To: <200110191923.UAA16513@pehoe.civil.ist.utl.pt>
References: <200110191923.UAA16513@pehoe.civil.ist.utl.pt>
Message-ID: <15312.32703.730077.839325@grendel.zope.com>

Carlos Pereira writes:
 > I need to change the element_start
 > and element_end handlers while parsing.
...
 > (Is this just changing the function pointers,
 > or some memory deallocation is actually taking place,

Just set the new ones; there's no specific call to deallocate memory
or anything until you're completely done with the parser object, and
then you just get rid of the whole thing.

  -Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From carlos@pehoe.civil.ist.utl.pt Wed Oct 24 14:19:07 2001
From: carlos@pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Wed Oct 24 13:19:07 2001
Subject: [Expat-discuss] XML_ParseBuffer
Message-ID: <200110242021.VAA30580@pehoe.civil.ist.utl.pt>

According to the manual, XML_ParseBuffer
returns 0 for bad formation and "non zero" otherwise.
A quick printf test shows this "non-zero" to be 1 on
my Linux system.

Can I rely on this? Does a successful XML_ParseBuffer
always return 1? Or is it safer to always test
against 0?

I am using the usual #define FALSE/TRUE and
it is often cleaner to compare ==TRUE than !=FALSE.

Thanks!
Carlos

From fdrake@acm.org Wed Oct 24 14:36:06 2001
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed Oct 24 13:36:06 2001
Subject: [Expat-discuss] XML_ParseBuffer
In-Reply-To: <200110242021.VAA30580@pehoe.civil.ist.utl.pt>
References: <200110242021.VAA30580@pehoe.civil.ist.utl.pt>
Message-ID: <15319.9279.257162.481855@grendel.zope.com>

Carlos Pereira writes:
 > According to the manual, XML_ParseBuffer
 > returns 0 for bad formation and "non zero" otherwise.
 > A quick printf test shows this "non-zero" to be 1 on
 > my Linux system.
 >
 > Can I rely on this? a successful XML_ParseBuffer
 > always returns 1? or is it more safe to test
 > always against 0?

The documentation doesn't say it is, so assume it's an implementation
detail. The right way to test this is to use the result as a boolean,
not to compare it to a specific value:

    if (!XML_ParseBuffer(...)) {
        /* zero result: handle the error */
        ...
    }
    else {
        /* non-zero result: do useful work */
        ...
    }

  -Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From carlos@pehoe.civil.ist.utl.pt Mon Oct 29 15:33:02 2001
From: carlos@pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Mon Oct 29 15:33:02 2001
Subject: [Expat-discuss] external entity
Message-ID: <200110300039.AAA03492@pehoe.civil.ist.utl.pt>

Hi,

I am a bit confused regarding external entities and expat.

I do want to implement recursive external general entities
in my app; however, after I get the new parser, it's not clear to me
who is in charge (my app code or Expat) of executing the following
tasks:

1) open/close the new file that contains the external entity.

2) do I need to explicitly use fread and XML_ParseBuffer,
as in the first file? say, do I need code like this again?

buffer = XML_GetBuffer (parser, APP_BUFFER);
length = fread (buffer, 1, APP_BUFFER, fp);
if (XML_ParseBuffer(parser, length, end) == FALSE)

(where parser and fp refer to the new parser and file)

3) if an entity calls another external... and this calls
another...etc... and then the last one calls the first one,
creating some sort of infinite loop, who should detect this
error, me or Expat?

I think it would be nice if future releases of Expat could come
with a simple example (say, like outline.c and elements.c)
showing how to use (general/parameter) external entities... :-)

Thanks!
Carlos

From reach_panki@yahoo.com Tue Oct 30 14:29:03 2001
From: reach_panki@yahoo.com (Pankaja Kadakuntla)
Date: Tue Oct 30 14:29:03 2001
Subject: [Expat-discuss] Encoding
Message-ID: <20011030222809.31610.qmail@web14504.mail.yahoo.com>

Hi,
I have a question about Expat's encoding.
The Expat documentation says

"Although expat may accept input in various encodings, the strings
that it passes to the handlers are always encoded in UTF-8. The
application is responsible for any translation of these strings into
other encodings. Also no matter what you put in, as long as it is in a
known encoding, what you get out is UTF-8."
What happens if I have a Japanese document that has some characters
that are not in the UTF-8 character set? How would expat encode such
characters?

From patrick@meer.net Tue Oct 30 16:45:01 2001
From: patrick@meer.net (Patrick McCormick)
Date: Tue Oct 30 16:45:01 2001
Subject: [Expat-discuss] does expat detect illegal utf-8 sequences?
Message-ID: <04d801c161a5$0937e240$a39d9dd1@CG479672a>

I have a problem where users like to use iso-8859-1 without declaring it
in the prolog, like this:

abécdef

expat properly defaults to utf-8 in this case. As I understand utf-8, the
é character (0xE9) has a bitfield that looks like the start of a three
byte sequence. A 3-byte sequence is supposed to look like this:

bytes | bits | representation
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv

The next two bytes (c and d) don't match the 10vvvvvv mask, so écd is
an illegal utf-8 sequence. But expat doesn't throw a well-formedness
error.

Expat uses this macro in xmltok.c to figure out what's illegal:

#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

but I don't understand what it's checking for. Can someone explain?

If the mask I mention above is correct, the check should look something
like this:

#define UTF8_INVALID3(p) \
  (!(((p)[0] & 0xF0) == 0xE0 && \
     ((p)[1] & 0xC0) == 0x80 && \
     ((p)[2] & 0xC0) == 0x80))

It's entirely possible that I am not understanding utf-8 properly - can
someone explain what's supposed to happen with the document above?

Patrick
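
For reference, a minimal sketch of the stricter check Patrick describes,
written as a standalone C function rather than as a patch to xmltok.c.
The name utf8_valid3 is hypothetical and is not part of Expat. It
combines the 1110vvvv 10vvvvvv 10vvvvvv mask test with the two cases the
quoted UTF8_INVALID3 macro appears to reject: 0xED followed by
0xA0..0xBF (which would encode the surrogates U+D800..U+DFFF) and
0xEF 0xBF 0xBE / 0xEF 0xBF 0xBF (U+FFFE and U+FFFF). It also adds an
overlong-form check that neither macro in the message includes. Whether
Expat tests the trail-byte mask somewhere else is not established in
this thread, so treat this as an illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an Expat function.
   Returns 1 if the three bytes at p form a well-formed 3-byte UTF-8
   sequence, 0 otherwise. The caller must guarantee that p[0..2] are
   readable. */
int utf8_valid3(const unsigned char *p)
{
    assert(p != NULL);

    /* Structural check: 1110vvvv 10vvvvvv 10vvvvvv */
    if ((p[0] & 0xF0) != 0xE0) return 0;
    if ((p[1] & 0xC0) != 0x80) return 0;
    if ((p[2] & 0xC0) != 0x80) return 0;

    /* Overlong form: after 0xE0 the second byte must be 0xA0..0xBF,
       otherwise the code point is below U+0800 and needs fewer bytes. */
    if (p[0] == 0xE0 && p[1] < 0xA0) return 0;

    /* Surrogates U+D800..U+DFFF: 0xED followed by 0xA0..0xBF.
       This is the 0xED branch of the quoted macro. */
    if (p[0] == 0xED && (p[1] & 0x20) != 0) return 0;

    /* U+FFFE and U+FFFF: the 0xEF branch of the quoted macro. */
    if (p[0] == 0xEF && p[1] == 0xBF && (p[2] == 0xBE || p[2] == 0xBF))
        return 0;

    return 1;
}
```

With this check the bytes 0xE9 'c' 'd' from the example are rejected,
because 'c' (0x63) does not match the 10vvvvvv mask, while
0xE2 0x82 0xAC (U+20AC) is accepted.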