suppressing bad characters in output PCDATA (converting JSON to XML)

Adam Funk a24061 at ducksburg.com
Tue Nov 29 07:57:22 EST 2011


On 2011-11-28, Stefan Behnel wrote:

> Adam Funk, 25.11.2011 14:50:
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules.  I get the input this way:
>>
>> input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
>
> It doesn't make sense to use codecs.open() with a "b" mode.

OK, thanks.

>> big_json = json.load(input_source)
>
> You shouldn't decode the input before passing it into json.load(), just 
> open the file in binary mode. Serialised JSON is defined as being UTF-8 
> encoded (or BOM-prefixed), not decoded Unicode.

So just do
  input_source = open(input_file, 'rb')
  big_json = json.load(input_source)
?

>> input_source.close()
>
> In case of a failure, the file will not be closed safely. All in all, use 
> this instead:
>
>      with open(input_file, 'rb') as f:
>          big_json = json.load(f)

OK, thanks.

>> Then I recurse through the contents of big_json to build an instance
>> of xml.dom.minidom.Document (the recursion includes some code to
>> rewrite dict keys as valid element names if necessary)
>
> If the name "big_json" is supposed to hint at a large set of data, you may 
> want to use something other than minidom. Take a look at the 
> xml.etree.cElementTree module instead, which is substantially more memory 
> efficient.

Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file.  The output files range from 600 to 6000 bytes.
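
For context, the recursion is roughly this shape (a simplified, untested sketch in
Python 2, not the exact code; sanitize_key() and build_element() are just
illustrative names, and 'item' is a made-up element name for list entries):

  import re
  from xml.dom.minidom import Document

  def sanitize_key(key):
      # Illustrative stand-in for the real key rewriting: keep only safe
      # name characters and make sure the result starts with a letter or '_'.
      name = re.sub(r'[^A-Za-z0-9_.-]', '_', key)
      if not re.match(r'[A-Za-z_]', name):
          name = '_' + name
      return name

  def build_element(doc, name, value):
      # Recursively turn a decoded JSON value into a minidom element.
      elem = doc.createElement(name)
      if isinstance(value, dict):
          for key, child in value.items():
              elem.appendChild(build_element(doc, sanitize_key(key), child))
      elif isinstance(value, list):
          for child in value:
              elem.appendChild(build_element(doc, 'item', child))
      else:
          elem.appendChild(doc.createTextNode(unicode(value)))
      return elem

Each item in the big list then goes through something like
doc = Document(); doc.appendChild(build_element(doc, 'record', item)),
where 'record' is again just a placeholder.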


>> and I save the document:
>>
>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
>> doc.writexml(xml_file, encoding='UTF-8')
>> xml_file.close()
>
> Same mistakes as above. Especially the double encoding is both unnecessary 
> and likely to fail. This is also most likely the source of your problems.

Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding="UTF-8"
all over the place (in an unsuccessful attempt to beat them down).
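
If I've understood the objection, the encoding should then happen in exactly
one place.  Something like this is what I'll try instead (untested sketch,
Python 2; toxml() with an encoding argument returns a UTF-8 byte string, so
the file can be opened in binary mode):

  with open(output_fullpath, 'wb') as xml_file:
      xml_file.write(doc.toxml(encoding='UTF-8'))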


>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>>
>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> This strongly hints at a broken encoding, which can easily be triggered by 
> your erroneous encode-and-encode cycles above.

No, I've checked the JSON input and those exact control characters are
there too.  I want to suppress them (delete or replace with spaces).
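
Concretely, I'm thinking of running every string through something like this
before it goes into a text node (sketch; suppress_bad_chars() is just a name
I made up, and the pattern only covers the low control characters that
xmlstarlet is complaining about):

  import re

  # XML 1.0 forbids the C0 control characters in PCDATA, apart from
  # tab (0x09), LF (0x0A) and CR (0x0D).
  _invalid_xml_chars = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

  def suppress_bad_chars(text):
      # Replace each disallowed control character with a space.
      return _invalid_xml_chars.sub(u' ', text)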

> Also, the kind of problem you present here makes it pretty clear that you 
> are using Python 2.x. In Python 3, you'd get the appropriate exceptions 
> when trying to write binary data to a Unicode file.

Sorry, I forgot to mention the version I'm using, which is "2.7.2+".


-- 
In the 1970s, people began receiving utility bills for
-£999,999,996.32 and it became harder to sustain the 
myth of the infallible electronic brain.  (Stob 2001)


