Extract email address from JavaScript in HTML source using Python

Sun May 24 13:48:27 EDT 2015

On 05/23/2015 04:15 PM, savitha devi wrote:
> What I mean is that the JavaScript is in the HTML code. I am looking for
> a regular expression to find the email address embedded within the
> JavaScript.
>
> On Sat, May 23, 2015 at 2:31 PM, Chris Angelico <rosuav at gmail.com> wrote:
>
>> On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8 at gmail.com> wrote:
>>> I am developing a web scraper using HTMLParser. I need to extract a
>>> text/email address from JavaScript within the HTML code. I am at beginner
>>> level in Python coding and totally lost here. I need some help with this.
>>> The JavaScript code is as below:
>>>
>>> <script type='text/javascript'>
>>>   //<!--
>>>   document.getElementById('cloak48218').innerHTML = '';
>>>   var prefix = 'ma' + 'il' + 'to';
>>>   var path = 'hr' + 'ef' + '=';
>>>   var addy48218 = 'info' + '@';
>>>   addy48218 = addy48218 + 'tsv-neuried' + '.' +
>>> 'de';
>>>   document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>>>   //-->
>> This is deliberately being done to prevent scripted usage. What
>> exactly are you needing to do this for?
>>
>> You're basically going to have to execute the entire block of
>> JavaScript code, and then decode the entities to get to what you want.
>> Doing it manually is pretty easy; doing it automatically will
>> virtually require a language interpreter.
>>
>> ChrisA
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>

This is just about nuts and bolts, not about the ethics of presumed 
intentions.

Hope it helps one way or another.

Frederic

------------------------------------------------------------------------------- 

sample = '''//<!--
  document.getElementById('cloak48218').innerHTML = '';
  var prefix = 'ma' + 'il' + 'to';
  var path = 'hr' + 'ef' + '=';
  var addy48218 = 'info' + '@';
  addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + ''>' + addy48218+'<\/a>';
  //-->'''

 >>> import SE  # Download from PyPI at https://pypi.python.org/pypi/SE

 >>> def make_se_translator ():

     # Make SE substitutions
     subs_list = []

     # Make &# code substitutions
     for i in range (256):
         subs_list.append ('&#%d;=%c' % (i, chr(i)))

     # Delete JavaScript stuff
     subs_list.append (' "document.getElementById(\'cloak48218\').=" ')
     subs_list.append (' "var =" "\n=" //<!--= //-->= ')

     # JavaScript syntax? Tweaks needed to get the sample working
     subs_list.append (' "+ \'\'\'=" \'\'>\'=\'>\' <\/=</ ')

     # Add more as needed, trial-and-error style
     # subs_list.append ( . . . format: ' old=new "delete this=" '

     # Make text
     subs = '\n'.join (subs_list)

     # Make SE translator
     translator = SE.SE (subs)

     # return translator, subs   # print subs if you want to see what they look like
     return translator

 >>> translator = make_se_translator ()

 >>> translation = translator (sample)

 >>> print translation   # See
  innerHTML = ''; prefix = 'ma' + 'il' + 'to'; path = 'hr' + 'ef' + '='; 
addy48218 = 'info' + '@'; addy48218 = addy48218 + 'tsv-neuried' + '.' 
+'de'; innerHTML += '<a ' + path  +prefix + ':' + addy48218 + '>' + 
addy48218+'</a>';

 >>> exec (translation.lstrip ())

 >>> print innerHTML
<a href=mailto:info@tsv-neuried.de>info@tsv-neuried.de</a>
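
For comparison, here is a minimal sketch that avoids both SE and exec and uses only the standard library, which is closer to the regular expression the original question asked for. It is written for Python 3 and rests on assumptions that may not hold for other pages: the variable holding the address is named addy followed by digits, it is only ever assigned or appended to, and every string literal is single-quoted with no escaped quotes inside. The names extract_addresses, ASSIGNMENT and LITERAL are made up for this illustration.

```python
import html
import re

# The script body from the original post, kept raw so backslashes survive.
SAMPLE = r"""//<!--
  document.getElementById('cloak48218').innerHTML = '';
  var prefix = 'ma' + 'il' + 'to';
  var path = 'hr' + 'ef' + '=';
  var addy48218 = 'info' + '@';
  addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
  //-->"""

# An assignment to addy<digits>, up to the semicolon that ends the statement.
# Semicolons inside quoted literals (e.g. in '&#64;') do not end the match.
ASSIGNMENT = re.compile(r"(addy\d+)\s*=\s*((?:'[^']*'|[^;'])*);")

# One single-quoted string literal (assumption: no escaped quotes inside).
LITERAL = re.compile(r"'([^']*)'")


def extract_addresses(script):
    """Rebuild the values of the addy<digits> variables in a script body."""
    values = {}
    for name, expression in ASSIGNMENT.findall(script):
        # Join the quoted fragments of the right-hand side.
        text = "".join(LITERAL.findall(expression))
        # "addyN = addyN + ..." appends to the value collected so far.
        if re.match(re.escape(name) + r"\s*\+", expression):
            text = values.get(name, "") + text
        values[name] = text
    # Decode numeric entities such as &#64; in case the page uses them.
    return [html.unescape(value) for value in values.values()]


print(extract_addresses(SAMPLE))  # prints ['info@tsv-neuried.de']
```

Unlike the exec approach, this never runs any text taken from the page, but it only understands plain string concatenation. Anything more elaborate in the script would still need a real JavaScript interpreter, as Chris pointed out.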