Unicode and rdf
deelan
ggg at zzz.it
Wed Mar 10 08:26:20 EST 2004
A.M. Kuchling wrote:
>>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>>Project) and am having great difficulty just getting Python to read
>>the files. The files are RDF in UTF-8 encoding according to the
>>dmoz.org web site, but I get the following error:
>
> Oh dear.
>
> Around 2001/2002 I worked on Python code for processing dmoz dumps, but gave
> up because the data was so bad -- some categories included content in
> various Chinese encodings despite the file's claim to be UTF-8. I
> eventually gave up because debugging a program that fails after running for
> six hours is really, really tedious.
Unfortunately it seems that some encoding issues are still there. I've
written this little script to convert the RDF/XML dmoz.org dump to Turtle
(really N-Triples in UTF-8) using rdflib, but it fails after
700 lines or so:
import codecs

from rdflib.TripleStore import TripleStore as Store
from rdflib.BNode import BNode
from rdflib.Literal import Literal
from purple.quoting import quote

# load the whole dump into an in-memory store
store = Store()
store.load('file:structure.rdf')

outfile = codecs.open('structure.ttl', 'w', 'utf-8')
for triple in store.triples((None, None, None)):
    s = triple[0]
    if isinstance(s, BNode): # URI or bNode?
        s = '%s' % s
    else:
        s = '<%s>' % s
    p = triple[1]
    o = triple[2]
    if isinstance(o, Literal): # URI, bNode or Literal?
        if o.language:
            o = '"%s"@%s' % (quote(o), o.language)
        elif o.datatype:
            o = '"%s"^^<%s>' % (quote(o), o.datatype)
        else:
            o = '"%s"' % quote(o)
    elif isinstance(o, BNode):
        o = '%s' % o
    else:
        o = '<%s>' % o
    # one triple per line
    outfile.write('%s <%s> %s .\n' % (s, p, o))
outfile.close()
but it stops with:
xml.sax._exceptions.SAXParseException:file:///D|/TMPSTU%7E1/dmoz.org/structure.rdf:712:45: not well-formed (invalid token)
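One way to narrow down a failure like this might be to scan the dump for
lines that are not valid UTF-8 before handing it to the parser. The sketch
below is only a diagnostic idea, not part of the script above: the function
name find_invalid_utf8 and the sample data are made up for illustration, and
it assumes Python 3. Note that an "invalid token" error can also come from
something other than a bad byte sequence, for instance a control character
that XML does not allow, so a clean result here would not prove the file is
well-formed.

```python
import os
import tempfile


def find_invalid_utf8(path, limit=10):
    """Return (line_number, byte_offset, reason) for lines that fail to decode.

    Hypothetical helper: reads the file as raw bytes, so it never trips
    over the encoding itself, and stops after `limit` problems.
    """
    problems = []
    with open(path, 'rb') as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as exc:
                # exc.start is the byte offset of the bad byte within the line
                problems.append((line_number, exc.start, exc.reason))
                if len(problems) >= limit:
                    break
    return problems


# Self-check on a throwaway file; the second line holds a Latin-1 byte (0xE9).
with tempfile.NamedTemporaryFile(delete=False, suffix='.rdf') as tmp:
    tmp.write(b'<a>ok</a>\n<b>caf\xe9</b>\n<c>ok</c>\n')
    sample_path = tmp.name

report = find_invalid_utf8(sample_path)
os.remove(sample_path)
print(report)
```

Run against structure.rdf, the reported line numbers could be compared with
the 712:45 position from the parser error.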
I'm gonna try this script with the MusicBrainz data dump and see if
its UTF-8 data is encoded better.
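For anyone trying the script without purple.quoting: that module is not
shown here, so what its quote function does is not known from this post. As
a rough stand-in, N-Triples string literals need at least the backslash,
double quote, newline, carriage return and tab escaped. The name
escape_literal below is hypothetical, and this is a minimal sketch in
Python 3, not a replacement with identical behaviour.

```python
def escape_literal(text):
    """Escape a string for use between double quotes in an N-Triples literal.

    Hypothetical stand-in for the quote() helper used in the script;
    the backslash is handled first so later escapes are not doubled.
    """
    replacements = [
        ('\\', '\\\\'),
        ('"', '\\"'),
        ('\n', '\\n'),
        ('\r', '\\r'),
        ('\t', '\\t'),
    ]
    for raw, escaped in replacements:
        text = text.replace(raw, escaped)
    return text


sample = escape_literal('a "b"\n\\c')
print('"%s"' % sample)
```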
--
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#me> a foaf:Person ; foaf:nick "deelan" ;
foaf:weblog <http://www.deelan.com/> .