Unicode and rdf
Richard West
rwest004 at opti.cgi.net
Wed Mar 10 00:45:30 EST 2004
Almost forgot. I'm running Python 2.3.3.
On Tue, 09 Mar 2004 23:41:30 -0600, Richard West
<rwest004 at opti.cgi.net> wrote:
>
>
>I'm trying to parse the RDF dumps from dmoz.org (Open Directory
>Project) and am having great difficulty just getting Python to read
>the files. According to the dmoz.org web site, the files are RDF in
>UTF-8 encoding, but I get the following error:
>
>UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>52376-52378: invalid data
>
>Here's a sample of code that will reproduce the problem:
>
>
>import sys
>import codecs
>from xml.sax import make_parser, handler
>
>def main():
>    f = codecs.open(sys.argv[1], 'r', 'utf-8')
>    parser = make_parser()
>    parser.setContentHandler(dmoz())
>    parser.parse(f)
>
>class dmoz(handler.ContentHandler):
>    def startElement(self, name, attrs):
>        print('%s' % name)
>
>if __name__ == '__main__':
>    main()
>
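For comparison, the sample above can be varied so that the parser receives raw bytes and does the decoding itself, rather than reading through a codecs wrapper. The following is a minimal sketch, assuming a current Python 3 rather than 2.3.3. The names ElementNames and element_names and the in-memory sample document are hypothetical, and whether this avoids the reported error is an assumption that the message does not confirm.

```python
import io
import xml.sax
from xml.sax import handler


class ElementNames(handler.ContentHandler):
    """Collect element names instead of printing them (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names(byte_stream):
    """Parse a binary stream and let the XML parser decode the bytes."""
    collector = ElementNames()
    parser = xml.sax.make_parser()
    parser.setContentHandler(collector)
    parser.parse(byte_stream)
    return collector.names


# Tiny in-memory stand-in for an RDF dump, with one non-ASCII character.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<RDF><Topic>caf\u00e9</Topic></RDF>'
).encode('utf-8')

names = element_names(io.BytesIO(sample))
print(names)
```

For a file on disk, the same function could be called with a handle opened in binary mode, for example open(sys.argv[1], 'rb').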
>
>I'm working with the dump from February 23rd, 2004. The news page for
>the RDF dumps on the dmoz.org web site has an entry from March 3rd,
>2003, which states that the data is being filtered to "prevent UTF-8
>and XML character encoding problems". So I am assuming that the UTF-8
>files I have are valid. I run into the problem with both the
>structure.rdf.u8 file and the content.rdf.u8 file.
>
>What am I doing wrong?
>
>
>-Richard
>
>
>dmoz.org rdf dumps: http://rdf.dmoz.org/
>
>dmoz.org rdf news: http://rdf.dmoz.org/rdf/Changes.html
>
>
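One way to test the assumption that the files are valid UTF-8, independently of the SAX parser, is to decode the raw bytes directly and record where decoding fails. The following is a rough sketch, assuming Python 3; the function name find_invalid_utf8 is hypothetical. It re-slices the data after each failure, so it is meant for a chunk read from around the reported position rather than for a whole multi-gigabyte dump.

```python
def find_invalid_utf8(data):
    """Return (start, end) byte offsets of each undecodable run in data."""
    bad_spans = []
    offset = 0
    while offset < len(data):
        try:
            # Succeeds only if everything from offset onward is valid UTF-8.
            data[offset:].decode('utf-8')
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            bad_spans.append((offset + err.start, offset + err.end))
            # Resume just past the bytes that could not be decoded.
            offset += err.end
    return bad_spans


# Two stray bytes that can never start a UTF-8 sequence.
print(find_invalid_utf8(b'a\xffb\xfec'))
```

An empty result for the region around the reported position would suggest that the data itself is fine and that the problem lies in how the bytes are being read, though that conclusion is only as good as the chunk that was checked.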