Unicode and RDF

Wed Mar 10 00:45:30 EST 2004

Almost forgot.  I'm running Python 2.3.3.

On Tue, 09 Mar 2004 23:41:30 -0600, Richard West
<rwest004 at opti.cgi.net> wrote:

>
>
>I'm trying to parse the RDF dumps from dmoz.org (Open Directory
>Project) and am having great difficulty just getting Python to read
>the files.  According to the dmoz.org web site, the files are RDF
>encoded as UTF-8, but I get the following error:
>
>UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>52376-52378: invalid data
>
>Here's a sample of code that will reproduce the problem:
>
>
>import sys
>import codecs
>from xml.sax import make_parser, handler
>
>def main():
>    f = codecs.open(sys.argv[1], 'r', 'utf-8')
>    parser = make_parser()
>    parser.setContentHandler(dmoz())
>    parser.parse(f)
>
>class dmoz(handler.ContentHandler):
>    def startElement(self, name, attrs):
>        print('%s' % name)
>
>if __name__ == '__main__':
>    main()
>
>
>I'm working with the dump from February 23rd, 2004.  The dmoz.org news
>page for the RDF dumps has an entry from March 3rd, 2003 stating that
>the data is filtered to "prevent UTF-8 and XML character encoding
>problems", so I am assuming that the UTF-8 files I have are valid.  I
>run into the problem with both the structure.rdf.u8 file and the
>content.rdf.u8 file.
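One way to test the assumption that the files are valid UTF-8, independent of the XML parser, is to scan the raw bytes and report where decoding first fails. The following is only a sketch: the function name first_invalid_utf8 and the chunk_size parameter are made up for illustration, and it assumes a current Python 3 interpreter rather than the 2.3.3 mentioned above.

```python
import codecs
import io


def first_invalid_utf8(stream, chunk_size=64 * 1024):
    """Return (byte_offset, bad_bytes) for the first invalid UTF-8
    sequence in a binary stream, or None if everything decodes.

    Hypothetical helper for illustration; not part of any library.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    consumed = 0  # bytes handed to the decoder in earlier iterations
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk  # an empty read means end of stream
        # Bytes the decoder is still holding from a multi-byte sequence
        # that was split across the previous chunk boundary.
        pending = decoder.getstate()[0]
        try:
            decoder.decode(chunk, final)
        except UnicodeDecodeError as err:
            # err.start is relative to pending + chunk, so shift it back
            # to an absolute offset in the stream.
            offset = consumed - len(pending) + err.start
            return offset, err.object[err.start:err.end]
        consumed += len(chunk)
        if final:
            return None
```

Called on a file opened in binary mode, e.g. first_invalid_utf8(open(path, 'rb')), a result of None would suggest the bytes themselves are fine and the problem lies in how they are being read. A tuple gives the offset and the offending bytes, which can be compared with the position in the error message.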
>
>What am I doing wrong?
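One thing that may be worth trying is to hand the parser the raw bytes instead of a stream that has already been decoded by codecs.open. An XML parser normally works out the encoding itself from the XML declaration, and a decoding wrapper that reads in fixed-size pieces could, in principle, split a multi-byte character at a read boundary. Whether that is what happens here is a guess, not a diagnosis. The sketch below assumes a current Python 3 interpreter; the names ElementNameCollector and parse_bytes_stream are made up for illustration.

```python
import io
import xml.sax
from xml.sax import handler


class ElementNameCollector(handler.ContentHandler):
    """Records the name of every element the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_bytes_stream(stream):
    """Parse XML from a binary stream and return the element names seen.

    The stream is passed to the parser undecoded, so the parser applies
    the encoding named in the XML declaration on its own.
    """
    collector = ElementNameCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(collector)
    parser.parse(stream)
    return collector.names
```

For a dump on disk, the equivalent of the original main() would open the file with open(sys.argv[1], 'rb') and pass that file object to parse_bytes_stream. If the parser still rejects the data when given raw bytes, that would point back at the file contents rather than at the way the file is opened.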
>
>
>-Richard
>
>
>dmoz.org rdf dumps: http://rdf.dmoz.org/
>
>dmoz.org rdf news: http://rdf.dmoz.org/rdf/Changes.html
>
>