suppressing bad characters in output PCDATA (converting JSON to XML)

Mon Nov 28 07:11:15 EST 2011

On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:

> I'm converting JSON data to XML using the standard library's json and
> xml.dom.minidom modules.  I get the input this way:
> 
> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>                            errors='replace')
> big_json = json.load(input_source)
> input_source.close()
> 
> Then I recurse through the contents of big_json to build an instance of
> xml.dom.minidom.Document (the recursion includes some code to rewrite
> dict keys as valid element names if necessary), 

How are you doing that? What do you consider valid?

> and I save the document:
> 
> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>                        errors='replace')
> doc.writexml(xml_file, encoding='UTF-8')
> xml_file.close()
> 
> 
> I thought this would force all the output to be valid, but xmlstarlet
> gives some errors like these on a few documents:

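It will only force the output to be validly encoded UTF-8 bytes, not 
necessarily valid XML. XML 1.0 forbids most control characters below 
U+0020 (only tab, line feed and carriage return are allowed), however 
they are encoded.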
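
A small sketch of the difference, assuming a Python 3 interpreter (the
element name 'root' and the sample string are made up for illustration).
As far as I can tell minidom writes text nodes through without checking
them for control characters, so the BEL survives encoding and only a
parser complains:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

doc = xml.dom.minidom.Document()
root = doc.createElement('root')  # hypothetical element name
doc.appendChild(root)

# U+0007 (BEL) is a perfectly good Unicode character, so it encodes
# to UTF-8 without any complaint...
root.appendChild(doc.createTextNode(u'bell\x07here'))
xml_bytes = doc.toxml(encoding='UTF-8')

# ...but it is not a legal XML 1.0 character, so parsing the result fails.
try:
    xml.dom.minidom.parseString(xml_bytes)
    well_formed = True
except ExpatError:
    well_formed = False

print(well_formed)
```

Here the serialised bytes still contain the raw 0x07 byte, and reading
them back raises ExpatError.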

> PCDATA invalid Char value 7
> PCDATA invalid Char value 31

What's xmlstarlet, and at what point does it give this error? It doesn't 
appear to be in the standard library.

> I guess I need to process each piece of PCDATA to clean out the control
> characters before creating the text node:
> 
>   text = doc.createTextNode(j)
>   root.appendChild(text)
> 
> What's the best way to do that, bearing in mind that there can be
> multibyte characters in the strings?

Are you mixing unicode and byte strings?

Are you sure that the input source is actually UTF-8? If not, then all 
bets are off: even if the decoding step works, and returns a string, it 
may contain the wrong characters. This might explain why you are getting 
unexpected control characters in the output: they've come from a badly 
decoded input.
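
One way to test that, sketched here under the assumption that you can
afford to have the load fail: decode with errors='strict' instead of
'replace', so bytes that are not valid UTF-8 raise an exception rather
than being quietly replaced. The function name load_json_strict is my
own invention. Note that a clean decode only shows the bytes are valid
UTF-8, not that UTF-8 was what the producer intended.

```python
import codecs
import json

def load_json_strict(path):
    """Load JSON from path, failing loudly on bytes that are not UTF-8."""
    # errors='strict' raises UnicodeDecodeError on undecodable bytes
    # instead of substituting U+FFFD for them.
    with codecs.open(path, 'r', encoding='UTF-8', errors='strict') as f:
        return json.load(f)
```

If this raises UnicodeDecodeError on the troublesome documents, the
input is not what you think it is.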

Another possibility is that your data actually does contain control 
characters where there shouldn't be any.
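
If so, you could strip them before creating the text node. Here is a
minimal sketch, not a definitive implementation: the character ranges
come from the Char production in the XML 1.0 specification, the helper
name strip_invalid_xml_chars is hypothetical, and it assumes Python 3
(or a wide Python 2 build) and that the strings are unicode, not bytes.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD'
    u'\U00010000-\U0010FFFF]')

def strip_invalid_xml_chars(text, replacement=u''):
    """Return text with characters that XML 1.0 forbids replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)
```

You would then write doc.createTextNode(strip_invalid_xml_chars(j)).
Because the substitution works on whole characters rather than bytes,
multibyte characters are left alone.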

-- 
Steven