Unicode and rdf

deelan ggg at zzz.it
Wed Mar 10 08:26:20 EST 2004


A.M. Kuchling wrote:

>>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>>Project) and am having great difficulty just getting Python to read
>>the files.  The files are RDF in UTF-8 encoding according to the
>>dmoz.org web site, but I get the following error:
> 
> Oh dear.   
> 
> Around 2001/2002 I worked on Python code for processing dmoz dumps, but gave
> up because the data was so bad -- some categories included content in
> various Chinese encodings despite the file's claim to be UTF-8.  I
> eventually gave up because debugging a program that fails after running for
> six hours is really, really tedious.

Unfortunately it seems that some encoding issues are still there. I've
written this little script to convert the RDF/XML dmoz.org dump to Turtle
(really N-Triples in UTF-8) using rdflib, but it fails after
700 lines or so:

import codecs

from rdflib.TripleStore import TripleStore as Store
from rdflib.BNode import BNode
from rdflib.Literal import Literal

from purple.quoting import quote

# load the whole RDF/XML dump into an in-memory store
store = Store()
store.load('file:structure.rdf')

outfile = codecs.open('structure.ttl', 'w', 'utf-8')

# write every triple out as one N-Triples-style line
for triple in store.triples((None, None, None)):

    s = triple[0]
    if isinstance(s, BNode): # URI or bNode?
        s = '%s' % s
    else:
        s = '<%s>' % s

    p = triple[1]

    o = triple[2]
    if isinstance(o, Literal): # URI, bNode or Literal?
        if o.language:
            o = '"%s"@%s' % (quote(o), o.language)
        elif o.datatype:
            o = '"%s"^^<%s>' % (quote(o), o.datatype)
        else:
            o = '"%s"' % quote(o)
    elif isinstance(o, BNode):
        o = '%s' % o
    else:
        o = '<%s>' % o

    outfile.write('%s <%s> %s .\n' % (s, p, o))


outfile.close()




but it stops, giving:

xml.sax._exceptions.SAXParseException:file:///D|/TMPSTU%7E1/dmoz.org/structure.rdf:712:45: not well-formed (invalid token)

I'm going to try this script with the MusicBrainz data dump and see if
its UTF-8 data is encoded better.
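
To see what is actually sitting at line 712 of the dump, a quick check
along these lines might help. It is only a sketch: it assumes the
"invalid token" really is a byte sequence that is not valid UTF-8 (the
same parser error can also come from stray control characters or a bare
"&"), it is written for a current Python 3 rather than the Python the
script above runs on, and find_invalid_utf8 is a made-up helper name,
not part of rdflib or the standard library. The sample bytes stand in
for the real file.

```python
import io


def find_invalid_utf8(stream, limit=10):
    """Scan a binary stream line by line and collect the lines that
    do not decode as UTF-8.

    Returns a list of (line_number, byte_offset, raw_line) tuples, where
    byte_offset is the position of the first offending byte in the line.
    Stops after `limit` bad lines.
    """
    bad = []
    for lineno, raw in enumerate(stream, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as exc:
            bad.append((lineno, exc.start, raw))
            if len(bad) >= limit:
                break
    return bad


# Invented sample data: the second line carries bytes that are not UTF-8.
# For the real dump, pass open('structure.rdf', 'rb') instead.
sample = b'<d:Title>ok</d:Title>\n<d:Title>\xb1\xb1\xbe\xa9</d:Title>\n'

for lineno, offset, raw in find_invalid_utf8(io.BytesIO(sample)):
    print('line %d, byte %d: %r' % (lineno, offset, raw))
```

If this reports nothing near the line the parser complains about, the
bytes are valid UTF-8 and the problem is more likely a character that
XML itself does not allow.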

-- 
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#me> a foaf:Person ; foaf:nick "deelan" ;
foaf:weblog <http://www.deelan.com/> .


