[Tutor] xpath - html entities issue -- &amp;
bruce
badouglas at gmail.com
Tue Oct 4 10:02:37 EDT 2016
Hi.
I just realized I might have a problem with testing a crawl.
I get a page of data via a basic curl. The returned data is
HTML, charset UTF-8.
I did a quick replace('&amp;', '&') and it replaced the '&amp;' as desired,
so the content only had '&' in it.
I then did a parseString/xpath to extract what I wanted, and realized I
have '&amp;' as the representation of the '&' in the returned xpath content.
My question: is there a way/method to return only the actual character, not
the HTML entity (&amp;)?
I can provide a more comprehensive chunk of code, but I minimized the post to
get to the heart of the issue. Also, I'd prefer not to use a separate parsing library.
------------------------------------
code chunk
import libxml2dom
q1=libxml2dom
s2= q1.parseString(a.toString().strip(), html=1)
tt=s2.xpath(tpath)
tt=tt[0].toString().strip()
print "tit "+tt
-------------------------------------
the content of a.toString() (shortened)
.
.
.
<div class="material-group-overview">
<div class="icon-book"></div>
<h3 class="material-group-title">Organization
Development & Change
<span>Edition: 10th</span>
</h3>
<a class="material-group-toggle-top-link"
id="toggle-top_1" href="javascript:void(0);" title="Click to hide options
for material">
.
.
.
the xpath results are
<div class="material-group-overview">
<div class="icon-book"></div>
<h3 class="material-group-title">Organization
Development &amp; Change
<span>Edition: 10th</span>
</h3>
As you can see, in the results of the xpath (toString()),
the & --> &amp;
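To illustrate what I think is going on, here is a minimal sketch. It uses the standard library's xml.dom.minidom rather than libxml2dom, so the calls are stand-ins and not the code from this post. Serializing a node back to markup re-escapes the '&', while the text value of the node holds the plain character.

```python
from xml.dom.minidom import parseString

# Stand-in document; the real page is much larger.
doc = parseString('<h3>Development &amp; Change</h3>')
node = doc.documentElement

# Serializing the node produces markup, so '&' is written as an entity again.
serialized = node.toxml()

# The text node itself stores the decoded character.
text = node.firstChild.data

print(serialized)
print(text)
```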
I'm wondering if there's a process that can be used within the toString(),
or do you really have to wrap each xpath/toString with an unescape() kind of
process to convert HTML entities to the requisite chars?
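The wrapper I have in mind would look something like the following. This is a rough sketch only, assuming Python 3's html.unescape (on Python 2 the equivalent would be HTMLParser.HTMLParser().unescape). The name unescape_markup is a hypothetical helper, and the sample string stands in for the tt[0].toString() result.

```python
import html


def unescape_markup(serialized):
    """Convert entities such as &amp; in a serialized string back to characters.

    Hypothetical helper: it would be applied to every xpath/toString() result.
    """
    return html.unescape(serialized)


# Stand-in for the string returned by tt[0].toString().strip()
sample = '<h3 class="material-group-title">Organization\nDevelopment &amp; Change</h3>'
print(unescape_markup(sample))
```

Note that the unescaped string is meant for display or comparison only; once the bare '&' is back, it is no longer well-formed markup to feed into the parser again.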
Thanks