From WaynePeterson at SierraSystems.com Sun May 9 08:43:09 2010
From: WaynePeterson at SierraSystems.com (Peterson, Wayne)
Date: Sat, 8 May 2010 23:43:09 -0700
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
Message-ID: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>

I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
mostly working but minidom seems to have problems dealing with Windows
cr/lf characters. It creates an extra textnode that needs to be ignored
instead of just returning the xml elements. I have tried different
methods of opening the file but it doesn't seem to make a difference. It
is happiest when reading a file in Unix format.

Wayne Peterson | Consultant
Sierra Systems
(T): 403-264-0955
(C): 403-710-9248
(F): 403-233-2108
7th Floor, Canadian Centre
833 4th Avenue SW
Calgary, Alberta, T2P 3T5
Management Consulting | System Integration | Managed Services
website: www.SierraSystems.com

----Notice Regarding Confidentiality----
This email, including any and all attachments, (this "Email") is intended
only for the party to whom it is addressed and may contain information that
is confidential or privileged. Sierra Systems Group Inc. and its affiliates
accept no responsibility for any loss or damage suffered by any person
resulting from any unauthorized use of or reliance upon this Email. If you
are not the intended recipient, you are hereby notified that any
dissemination, copying or other use of this Email is prohibited. Please
notify us of the error in communication by return email and destroy all
copies of this Email. Thank you.

From stefan_ml at behnel.de Sun May 9 19:27:05 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 09 May 2010 19:27:05 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
Message-ID: <4BE6F069.9050107@behnel.de>

Peterson, Wayne, 09.05.2010 08:43:
> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
> mostly working but minidom seems to have problems dealing with Windows
> cr/lf characters. It creates an extra textnode that needs to be ignored
> instead of just returning the xml elements. I have tried different
> methods of opening the file but it doesn't seem to make a difference. It
> is happiest when reading a file in Unix format.

Whitespace is significant in the W3C DOM, so minidom must provide it in the
DOM tree. It doesn't "have problems" just because it creates text nodes for
that whitespace; that's just the way things work.

Note that the xml.etree.ElementTree package tends to be a lot more user
friendly for XML handling than the minidom package, simply because it
focuses on the XML Infoset and moves text out of the way when dealing with
elements.

Stefan

From dieter at handshake.de Mon May 10 07:50:03 2010
From: dieter at handshake.de (Dieter Maurer)
Date: Mon, 10 May 2010 07:50:03 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
Message-ID: <19431.40587.142951.710412@gargle.gargle.HOWL>

Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>mostly working but minidom seems to have problems dealing with Windows
>cr/lf characters. It creates an extra textnode that needs to be ignored
>instead of just returning the xml elements. I have tried different
>methods of opening the file but it doesn't seem to make a difference. It
>is happiest when reading a file in Unix format.

The parser should not see these "cr/lf" characters at all.

Python strings themselves use only "\n" (aka "lf") to delimit lines.
The "\r" (aka "cr") should only be introduced when those lines
are written to text files. And they should be removed when
those lines are read in again.

Are you sure that you access your files as "text" files?

--
Dieter

From stefan_ml at behnel.de Mon May 10 08:57:43 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 10 May 2010 08:57:43 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <19431.40587.142951.710412@gargle.gargle.HOWL>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <19431.40587.142951.710412@gargle.gargle.HOWL>
Message-ID: <4BE7AE67.9090707@behnel.de>

Dieter Maurer, 10.05.2010 07:50:
> Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>> mostly working but minidom seems to have problems dealing with Windows
>> cr/lf characters. It creates an extra textnode that needs to be ignored
>> instead of just returning the xml elements. I have tried different
>> methods of opening the file but it doesn't seem to make a difference. It
>> is happiest when reading a file in Unix format.
>
> The parser should not see these "cr/lf" characters at all.
>
> Python strings themselves use only "\n" (aka "lf") to delimit lines.
> The "\r" (aka "cr") should only be introduced when those lines
> are written to text files. And they should be removed when
> those lines are read in again.
>
> Are you sure that you access your files as "text" files?

The correct way to parse XML files is as binary data.
Stefan

From dieter at handshake.de Mon May 10 09:07:55 2010
From: dieter at handshake.de (Dieter Maurer)
Date: Mon, 10 May 2010 09:07:55 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <4BE7AE67.9090707@behnel.de>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <19431.40587.142951.710412@gargle.gargle.HOWL>
 <4BE7AE67.9090707@behnel.de>
Message-ID: <19431.45259.366507.508666@gargle.gargle.HOWL>

Stefan Behnel wrote at 2010-5-10 08:57 +0200:
>Dieter Maurer, 10.05.2010 07:50:
>> Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>>> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>>> mostly working but minidom seems to have problems dealing with Windows
>>> cr/lf characters. It creates an extra textnode that needs to be ignored
>>> instead of just returning the xml elements. I have tried different
>>> methods of opening the file but it doesn't seem to make a difference. It
>>> is happiest when reading a file in Unix format.
>>
>> The parser should not see these "cr/lf" characters at all.
>>
>> Python strings themselves use only "\n" (aka "lf") to delimit lines.
>> The "\r" (aka "cr") should only be introduced when those lines
>> are written to text files. And they should be removed when
>> those lines are read in again.
>>
>> Are you sure that you access your files as "text" files?
>
>The correct way to parse XML files is as binary data.

Why do you think so?

The default "minidom" parser seems not to expect "\r\n" line endings....
--
Dieter

From stefan_ml at behnel.de Mon May 10 09:43:25 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 10 May 2010 09:43:25 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <19431.45259.366507.508666@gargle.gargle.HOWL>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <19431.40587.142951.710412@gargle.gargle.HOWL>
 <4BE7AE67.9090707@behnel.de>
 <19431.45259.366507.508666@gargle.gargle.HOWL>
Message-ID: <4BE7B91D.1070404@behnel.de>

Dieter Maurer, 10.05.2010 09:07:
> Stefan Behnel wrote at 2010-5-10 08:57 +0200:
>> Dieter Maurer, 10.05.2010 07:50:
>>> Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>>>> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>>>> mostly working but minidom seems to have problems dealing with Windows
>>>> cr/lf characters. It creates an extra textnode that needs to be ignored
>>>> instead of just returning the xml elements. I have tried different
>>>> methods of opening the file but it doesn't seem to make a difference. It
>>>> is happiest when reading a file in Unix format.
>>>
>>> The parser should not see these "cr/lf" characters at all.
>>>
>>> Python strings themselves use only "\n" (aka "lf") to delimit lines.
>>> The "\r" (aka "cr") should only be introduced when those lines
>>> are written to text files. And they should be removed when
>>> those lines are read in again.
>>>
>>> Are you sure that you access your files as "text" files?
>>
>> The correct way to parse XML files is as binary data.
>
> Why do you think so?
>
> The default "minidom" parser seems not to expect "\r\n" line endings....

Interesting. Then this might really be a bug. There was a change in Python
2.6.5 that broke universal newline handling for the codecs module; this
might hit here. However, according to what the OP described, the cr/lf
characters turn up correctly now, so ISTM that it's the plain '\n' line
ending that needs fixing.
Stefan

From WaynePeterson at SierraSystems.com Mon May 10 16:04:05 2010
From: WaynePeterson at SierraSystems.com (Peterson, Wayne)
Date: Mon, 10 May 2010 07:04:05 -0700
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <19431.40587.142951.710412@gargle.gargle.HOWL>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <19431.40587.142951.710412@gargle.gargle.HOWL>
Message-ID: <35D9A03D476F124C9172E61E8C7D60A204E64F76@SCVANEX5.sierrasys.com>

That's what I thought as well. I was expecting the parser to ignore all
forms of linefeed. I believe I am accessing my files as text files. The
documentation for minidom.parse says you can pass it a file name or a file
object and I have tried it both ways with the same result. Here is the open
statement I am using.

    infile = open(in_path_file, 'r')
    in_xmldoc = minidom.parse(infile)

The input file contains cr/lf line endings, x'0d0a'. When I do something
like

    surveys = form.childNodes

the first node, surveys[0], will contain x'0a', which I have to ignore.

Wayne

-----Original Message-----
From: Dieter Maurer [mailto:dieter at handshake.de]
Sent: Sunday, May 09, 2010 11:50 PM
To: Peterson, Wayne
Cc: xml-sig at python.org
Subject: Re: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf

Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>mostly working but minidom seems to have problems dealing with Windows
>cr/lf characters. It creates an extra textnode that needs to be ignored
>instead of just returning the xml elements. I have tried different
>methods of opening the file but it doesn't seem to make a difference. It
>is happiest when reading a file in Unix format.

The parser should not see these "cr/lf" characters at all.

Python strings themselves use only "\n" (aka "lf") to delimit lines.
The "\r" (aka "cr") should only be introduced when those lines
are written to text files. And they should be removed when
those lines are read in again.

Are you sure that you access your files as "text" files?

--
Dieter

From billk at sunflower.com Mon May 10 19:59:17 2010
From: billk at sunflower.com (Bill Kinnersley)
Date: Mon, 10 May 2010 12:59:17 -0500
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
Message-ID: <4BE84975.2010205@sunflower.com>

> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
> mostly working but minidom seems to have problems dealing with Windows
> cr/lf characters. It creates an extra textnode that needs to be ignored
> instead of just returning the xml elements. I have tried different
> methods of opening the file but it doesn't seem to make a difference. It
> is happiest when reading a file in Unix format.
>
> Wayne Peterson | Consultant
> Sierra Systems

Wayne,

It sounds to me like you're doing everything correctly.

- XML files are text files, and should be read as text.
- In the absence of a DTD, all whitespace is regarded as significant.
  Typically this means yes, there will be a text node between consecutive
  element nodes.
- The XML processor is required to return end-of-line as a single '\n',
  regardless of which OS or programming language.

If you are traversing every node, you'll need to explicitly ignore the text
nodes. More usually you don't have to deal with them, because you know what
nodes you're looking for and pick them out with getElementsByTagName.

From fdrake at acm.org Mon May 10 20:58:55 2010
From: fdrake at acm.org (Fred Drake)
Date: Mon, 10 May 2010 14:58:55 -0400
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <4BE84975.2010205@sunflower.com>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <4BE84975.2010205@sunflower.com>
Message-ID:

On Mon, May 10, 2010 at 1:59 PM, Bill Kinnersley wrote:
> - XML files are text files, and should be read as text.

XML files contain encoded text, and must be handled as binary files.

-Fred

--
Fred L. Drake, Jr.
"Chaos is the score upon which reality is written." --Henry Miller

From stefan_ml at behnel.de Tue May 11 08:16:13 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 11 May 2010 08:16:13 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <4BE84975.2010205@sunflower.com>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <4BE84975.2010205@sunflower.com>
Message-ID: <4BE8F62D.7090107@behnel.de>

Bill Kinnersley, 10.05.2010 19:59:
> - XML files are text files, and should be read as text.

Sorry, but the only sane way to read them is as binary data. Passing unicode
text to the parser will interfere with the encoding declaration at the
beginning.

> - The XML processor is required to return end-of-line as a single '\n',
> regardless of which OS or programming language.

Interesting. I wasn't aware of that, but it's true.
http://www.w3.org/TR/REC-xml/#sec-line-ends

Stefan

From martin at v.loewis.de Tue May 11 09:14:59 2010
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Tue, 11 May 2010 09:14:59 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <19431.45259.366507.508666@gargle.gargle.HOWL>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <19431.40587.142951.710412@gargle.gargle.HOWL>
 <4BE7AE67.9090707@behnel.de>
 <19431.45259.366507.508666@gargle.gargle.HOWL>
Message-ID: <4BE903F3.1050801@v.loewis.de>

>> The correct way to parse XML files is as binary data.
>
> Why do you think so?
>
> The default "minidom" parser seems not to expect "\r\n" line endings....

Why do you say that? It expects them just fine, replacing them with \n
line endings, then inserting those into the DOM tree. Just as it should.

I believe the OP was complaining that it creates those text nodes in
the first place, not that it does or does not specifically do that for
\r\n line endings.

Regards,
Martin

From dieter at handshake.de Tue May 11 09:42:25 2010
From: dieter at handshake.de (Dieter Maurer)
Date: Tue, 11 May 2010 09:42:25 +0200
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <4BE903F3.1050801@v.loewis.de>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <19431.40587.142951.710412@gargle.gargle.HOWL>
 <4BE7AE67.9090707@behnel.de>
 <19431.45259.366507.508666@gargle.gargle.HOWL>
 <4BE903F3.1050801@v.loewis.de>
Message-ID: <19433.2657.55260.95682@gargle.gargle.HOWL>

"Martin v. Löwis" wrote at 2010-5-11 09:14 +0200:
>>> The correct way to parse XML files is as binary data.
>>
>> Why do you think so?
>>
>> The default "minidom" parser seems not to expect "\r\n" line endings....
>
>Why do you say that? It expects them just fine, replacing them with \n
>line endings, then inserting those into the DOM tree. Just as it should.
>I believe the OP was complaining that it creates those text nodes in
>the first place, not that it does or does not specifically do that for
>\r\n line endings.

I may have misunderstood the original problem report.
I read it as: I see "\r\n" text nodes.

--
Dieter

From WaynePeterson at SierraSystems.com Wed May 12 15:46:39 2010
From: WaynePeterson at SierraSystems.com (Peterson, Wayne)
Date: Wed, 12 May 2010 06:46:39 -0700
Subject: [XML-SIG] Parsing XML file with Minidom has problem with cr/lf
In-Reply-To: <4BE6F069.9050107@behnel.de>
References: <35D9A03D476F124C9172E61E8C7D60A204E64EFC@SCVANEX5.sierrasys.com>
 <4BE6F069.9050107@behnel.de>
Message-ID: <35D9A03D476F124C9172E61E8C7D60A204EDBB64@SCVANEX5.sierrasys.com>

Thank you everyone for the excellent replies. As someone noticed, my
original complaint was that the parser was returning linefeeds at all in
the DOM tree. I thought that the Windows cr/lf format was causing this but
now understand that this is what it is supposed to do. I received
conflicting advice on whether to process the XML files as binary or text
but that is a topic for a different thread.

Wayne
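
For readers who land on this thread later, the behaviour discussed above can
be reproduced with a short sketch. It is only an illustration under stated
assumptions, not code from any poster: it is written for a current Python 3
rather than the 2.6.5 used in the thread, and the sample document and the
names SAMPLE and element_children are hypothetical. The input is passed as
bytes, following the advice to hand the parser binary data, and it uses
Windows cr/lf line endings.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, as bytes, with Windows (cr/lf) line endings.
SAMPLE = (b'<?xml version="1.0" encoding="UTF-8"?>\r\n'
          b'<form>\r\n'
          b'  <survey id="1"/>\r\n'
          b'  <survey id="2"/>\r\n'
          b'</form>\r\n')


def element_children(node):
    """Return only the element children of node, skipping text nodes etc."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


doc = minidom.parseString(SAMPLE)
form = doc.documentElement

# The first child of <form> is the whitespace between <form> and the first
# <survey>; the parser hands it over as a text node.
first = form.firstChild
first_is_text = first.nodeType == first.TEXT_NODE
first_data = first.data

# Two ways to get at the elements only: filter by node type, or ask for
# the elements by tag name as suggested in the thread.
survey_ids = [s.getAttribute("id") for s in element_children(form)]
by_tag_ids = [s.getAttribute("id") for s in form.getElementsByTagName("survey")]

# ElementTree keeps whitespace in .text/.tail, so iterating over an element
# yields child elements only.
root = ET.fromstring(SAMPLE)
etree_ids = [child.get("id") for child in root]

print(first_is_text, repr(first_data), survey_ids, by_tag_ids, etree_ids)
```

The whitespace text node shows up regardless of whether the file uses cr/lf
or plain lf line endings, which matches the conclusion of the thread: the
extra node comes from the whitespace itself, not from the Windows format.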