suppressing bad characters in output PCDATA (converting JSON to XML)

Adam Funk a24061 at ducksburg.com
Fri Dec 2 05:30:13 EST 2011


On 2011-11-29, Stefan Behnel wrote:

> Adam Funk, 29.11.2011 13:57:
>> On 2011-11-28, Stefan Behnel wrote:

>>> If the name "big_json" is supposed to hint at a large set of data, you may
>>> want to use something other than minidom. Take a look at the
>>> xml.etree.cElementTree module instead, which is substantially more memory
>>> efficient.
>>
>> Well, the input file in this case contains one big JSON list of
>> reasonably sized elements, each of which I'm turning into a separate
>> XML file.  The output files range from 600 to 6000 bytes.
>
> It's also substantially easier to use, but if your XML writing code
> already works, why change it?

That module looks useful --- thanks for the tip.  (TBH, I'm using
minidom mainly because I've used it before and the API is similar to
the DOM APIs I've used in other languages.)
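
For anyone following along, here is a rough sketch of the "one JSON list
in, one small XML document per element out" pattern using ElementTree
rather than minidom. It is only an illustration under assumptions: the
record shape (a flat dict), the "record" root tag and the helper name
record_to_xml are all made up for the example and are not from my real
script. It uses xml.etree.ElementTree, since the separate cElementTree
module is gone in recent Python 3 releases.

```python
import json
import xml.etree.ElementTree as ET


def record_to_xml(record, root_tag="record"):
    """Serialize a flat dict as a small XML document (hypothetical shape)."""
    root = ET.Element(root_tag)
    for key, value in record.items():
        # Assumes every key is a legal XML element name.
        child = ET.SubElement(root, key)
        child.text = str(value)
    # encoding="unicode" returns a str with no XML declaration.
    return ET.tostring(root, encoding="unicode")


# Stand-in for the big JSON list; each element becomes its own document.
records = json.loads('[{"id": 1, "title": "a < b"}, {"id": 2, "title": "c & d"}]')
docs = [record_to_xml(r) for r in records]
```

Markup characters such as "<" and "&" in the text are escaped on
serialization, so they need no special handling in the calling code.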


> You should read up on Unicode a bit.

It wouldn't do me any harm.  :-)


>>>> I thought this would force all the output to be valid, but xmlstarlet
>>>> gives some errors like these on a few documents:
>>>>
>>>> PCDATA invalid Char value 7
>>>> PCDATA invalid Char value 31
>>>
>>> This strongly hints at a broken encoding, which can easily be triggered by
>>> your erroneous encode-and-encode cycles above.
>>
>> No, I've checked the JSON input and those exact control characters are
>> there too.
>
> Ah, right, I didn't look closely enough. Those are forbidden in XML:
>
> http://www.w3.org/TR/REC-xml/#charsets
>
> It's sad that minidom (apparently) lets them pass through without even a 
> warning.

Yes, it is!  I've now found this, which seems to fix the problem:

http://bitkickers.blogspot.com/2011/05/stripping-control-characters-in-python.html
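
The general idea, as a minimal sketch (not necessarily what the linked
post does): remove every character outside the XML 1.0 Char production
from the charsets section cited above before the text goes into the
DOM. The function name strip_invalid_xml_chars is just a name chosen
for this example.

```python
import re

# Anything NOT allowed by the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that are illegal in XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Char values 7 (BEL) and 31 (unit separator) are the ones xmlstarlet reported.
cleaned = strip_invalid_xml_chars("bell\x07 and unit sep\x1f ok\n")
```

Tab, newline and carriage return are kept, since XML allows them.
Whether to drop the bad characters silently or replace them with a
visible marker is a judgment call that depends on the data.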


-- 
The internet is quite simply a glorious place. Where else can you find
bootlegged music and films, questionable women, deep seated xenophobia
and amusing cats all together in the same place?         [Tom Belshaw]


