Extract email address from Java script in html source using python
Friedrich Rentsch
anthra.norell at bluewin.ch
Sun May 24 13:48:27 EDT 2015
On 05/23/2015 04:15 PM, savitha devi wrote:
> What I want exactly: the JavaScript is in the HTML code, and I am trying to
> write a regular expression to find the email address embedded within the
> JavaScript.
>
> On Sat, May 23, 2015 at 2:31 PM, Chris Angelico <rosuav at gmail.com> wrote:
>
>> On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8 at gmail.com> wrote:
>>> I am developing a web scraper using HTMLParser. I need to extract a
>>> text/email address from JavaScript within the HTML code. I am at beginner
>>> level in Python coding and totally lost here. I need some help with this.
>>> The JavaScript code is as below:
>>>
>>> <script type='text/javascript'>
>>> //<!--
>>> document.getElementById('cloak48218').innerHTML = '';
>>> var prefix = 'ma' + 'il' + 'to';
>>> var path = 'hr' + 'ef' + '=';
>>> var addy48218 = 'info' + '@';
>>> addy48218 = addy48218 + 'tsv-neuried' + '.' +
>>> 'de';
>>> document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>>> //-->
>> This is deliberately being done to prevent scripted usage. What
>> exactly are you needing to do this for?
>>
>> You're basically going to have to execute the entire block of
>> JavaScript code, and then decode the entities to get to what you want.
>> Doing it manually is pretty easy; doing it automatically will
>> virtually require a language interpreter.
>>
>> ChrisA
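This is just about nuts and bolts, not about the ethics of presumed
intentions. The idea below is to use SE substitutions to strip the
JavaScript-only syntax from the snippet until what is left is valid Python,
and then to exec the result.
Hope it helps one way or another.
Frederic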
-------------------------------------------------------------------------------
sample = '''//<!--
document.getElementById('cloak48218').innerHTML = '';
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
//-->'''
>>> import SE  # Download from PyPI at https://pypi.python.org/pypi/SE
>>> def make_se_translator ():
        # Make SE substitutions
        subs_list = []
        # Make &# code substitutions
        for i in range (256):
            subs_list.append ('&#%d;=%c' % (i, chr(i)))
        # Delete JavaScript stuff
        subs_list.append (' "document.getElementById(\'cloak48218\').=" ')
        subs_list.append (' "var =" "\n=" //<!--= //-->= ')
        # JavaScript syntax? Tweaks needed to get the sample working
        subs_list.append (' "+ \'\'\'=" \'\'>\'=\'>\' <\/=</ ')
        # Add more as needed, trial-and-error style
        # subs_list.append ( . . . format: ' old=new "delete this=" '
        # Make text
        subs = '\n'.join (subs_list)
        # Make SE translator
        translator = SE.SE (subs)
        # return translator, subs  # print subs, if you want to see what they look like
        return translator
>>> translator = make_se_translator ()
>>> translation = translator (sample)
>>> print translation # See
innerHTML = ''; prefix = 'ma' + 'il' + 'to'; path = 'hr' + 'ef' + '=';
addy48218 = 'info' + '@'; addy48218 = addy48218 + 'tsv-neuried' + '.'
+'de'; innerHTML += '<a ' + path +prefix + ':' + addy48218 + '>' +
addy48218+'</a>';
>>> exec (translation.lstrip ())
>>> print innerHTML
<a href=mailto:info@tsv-neuried.de>info@tsv-neuried.de</a>
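
For comparison, here is a minimal sketch of the regular-expression route the
original poster asked about, using only the standard library rather than SE.
It is written for Python 3 (the session above is Python 2). It assumes the
cloaking script always builds the address in a variable whose name starts
with "addy", out of single-quoted string literals joined with "+", as in the
sample. The names extract_addresses, ASSIGNMENT and TOKEN are made up for
this illustration, and other cloaking schemes would need different patterns.

```python
import re
from html import unescape

SAMPLE = r"""//<!--
document.getElementById('cloak48218').innerHTML = '';
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
//-->"""

# One "addyNNN = expression;" statement with an optional leading "var".
# The expression may span lines and may contain ';' inside string literals
# (for example an HTML entity such as '&#64;').
ASSIGNMENT = re.compile(
    r"(?:var\s+)?\b(addy\w+)\s*=\s*((?:'(?:[^'\\]|\\.)*'|[^;'])*);"
)

# Either a single-quoted string literal or a bare identifier.
TOKEN = re.compile(r"'((?:[^'\\]|\\.)*)'|(\w+)")


def extract_addresses(script):
    """Return the final value of every addy* variable assigned in script.

    Assumption: expressions are plain '+' concatenations of string
    literals and previously assigned addy* variables. JavaScript
    backslash escapes inside literals are not interpreted.
    """
    values = {}
    for name, expr in ASSIGNMENT.findall(script):
        parts = []
        for literal, ident in TOKEN.findall(expr):
            if ident:
                # Reference to a variable assigned earlier (or unknown).
                parts.append(values.get(ident, ""))
            else:
                # String literal; decode HTML entities such as &#64;.
                parts.append(unescape(literal))
        values[name] = "".join(parts)
    return sorted(set(values.values()))


print(extract_addresses(SAMPLE))  # ['info@tsv-neuried.de']
```

Unlike the exec approach, this never runs the scraped text as code, which
seems the safer choice for pages you do not control. With HTMLParser, the
text passed to handle_data while inside a <script> tag could be fed to
extract_addresses directly.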