suppressing bad characters in output PCDATA (converting JSON to XML)

Adam Funk a24061 at ducksburg.com
Tue Nov 29 07:50:59 EST 2011


On 2011-11-28, Steven D'Aprano wrote:

> On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:
>
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules.  I get the input this way:
>> 
>> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>>                            errors='replace')
>> big_json = json.load(input_source)
>> input_source.close()
>> 
>> Then I recurse through the contents of big_json to build an instance of
>> xml.dom.minidom.Document (the recursion includes some code to rewrite
>> dict keys as valid element names if necessary), 
>
> How are you doing that? What do you consider valid?

Regex-replacing all whitespace ('\s+') with '_', and adding 'a_' to
the beginning of any potential tag that doesn't start with a letter.
This is good enough for my purposes.
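In other words, something along these lines (a minimal sketch of that sanitization; the function name is made up, and this deliberately isn't a full XML NCName check):

```python
import re

def make_element_name(key):
    """Rewrite a dict key into a usable XML element name.

    Replaces each run of whitespace with '_' and prefixes 'a_' when the
    key doesn't start with a letter -- good enough for this data, not a
    complete validity check.
    """
    name = re.sub(r'\s+', '_', key)
    if not name[:1].isalpha():
        name = 'a_' + name
    return name
```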

>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>
> It will force the output to be valid UTF-8 encoded to bytes, not 
> necessarily valid XML.

Yes!

>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> What's xmlstarlet, and at what point does it give this error? It doesn't 
> appear to be in the standard library.

It's a command-line tool I use a lot for finding the bad bits in XML;
nothing to do with Python.

http://xmlstar.sourceforge.net/

>> I guess I need to process each piece of PCDATA to clean out the control
>> characters before creating the text node:
>> 
>>   text = doc.createTextNode(j)
>>   root.appendChild(text)
>> 
>> What's the best way to do that, bearing in mind that there can be
>> multibyte characters in the strings?
>
> Are you mixing unicode and byte strings?

I don't think I am.

> Are you sure that the input source is actually UTF-8? If not, then all 
> bets are off: even if the decoding step works, and returns a string, it 
> may contain the wrong characters. This might explain why you are getting 
> unexpected control characters in the output: they've come from a badly 
> decoded input.

I'm pretty sure the input really is UTF-8, but it contains a few
(fairly rare) control characters.

> Another possibility is that your data actually does contain control 
> characters where there shouldn't be any.

I think that's the problem, and I'm looking for an efficient way to
delete them from unicode strings.


-- 
Some say the world will end in fire; some say in segfaults.
                                                 [XKCD 312]


