[Tutor] xpath - html entities issue -- &amp;

bruce badouglas at gmail.com
Tue Oct 4 10:02:37 EDT 2016


Hi.

Just realized I might have a problem with testing a crawl.

I get a page of data via a basic curl. The returned data is
HTML, charset UTF-8.

I did a quick replace('&amp;', '&') and it replaced the '&amp;' as desired,
so the content only had '&' in it.

I then did a parseString/xpath to extract what I wanted, and realized I
have '&amp;' representing the '&' in the returned xpath content.

My question: is there a way to return only the actual character, not
the HTML entity (&amp;)?

I can provide a more comprehensive chunk of code, but I minimized the post to
get to the heart of the issue. Also, I'd prefer not to use a separate parsing
library.

------------------------------------
code chunk

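import libxml2dom

q1 = libxml2dom

# `a` and `tpath` are defined elsewhere in the full script (not shown here)
s2 = q1.parseString(a.toString().strip(), html=1)
tt = s2.xpath(tpath)

# serialize the first matching node back to markup
tt = tt[0].toString().strip()
print "tit " + tt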

-------------------------------------


the content of a.toString() (shortened)
.
.
.
                 <div class="material-group-overview">
                    <div class="icon-book"></div>
                    <h3 class="material-group-title">Organization
Development & Change
                        <span>Edition: 10th</span>
                    </h3>
                    <a class="material-group-toggle-top-link"
id="toggle-top_1" href="javascript:void(0);" title="Click to hide options
for material">

.
.
.

the xpath results are

                <div class="material-group-overview">
                    <div class="icon-book"></div>
                    <h3 class="material-group-title">Organization
Development &amp; Change
                        <span>Edition: 10th</span>
                    </h3>


As you can see, in the results of the xpath (toString()),
the '&' has become '&amp;'.
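
A possible explanation, sketched here with the standard library's
xml.dom.minidom instead of libxml2dom (an assumption, made only so the example
is self-contained): serializing a node back to markup has to escape a bare '&',
while reading the node's text data returns the plain character. The one-line
sample document is hypothetical, built from the title shown above.

```python
from xml.dom.minidom import parseString

# Hypothetical one-element document based on the title in the post.
doc = parseString("<h3>Organization Development &amp; Change</h3>")
h3 = doc.documentElement

serialized = h3.toxml()      # markup form: '&' is written out as '&amp;'
text = h3.firstChild.data    # text form: the parser already decoded the entity

print(serialized)
print(text)
```

If libxml2dom nodes expose their text in a similar way, reading the text rather
than calling toString() might avoid the entity altogether, but that is not
verified here.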

I'm wondering if there's an option that can be used within toString() itself,
or whether you really have to wrap each xpath/toString call with an
unescape()-style step to convert HTML entities to the corresponding characters.

Thanks

