[AstroPy] Reading HTML table with astropy.table
Ivan Valtchanov
ivvv68 at gmail.com
Fri May 21 06:56:15 EDT 2021
Dear all,
I have a strange problem reading an HTML table with astropy.table.
Some columns contain hyperlinks, and even though I supply the necessary
htmldict, those columns are still not read properly. With the same
htmldict I managed to write an HTML table keeping the columns with HTML tags.
Here is a short script that illustrates the issue:
#############
# Read HTML table with HTML tags
#
import io
import bleach
from astropy.table import Table
html_table = """
<html>
<head>
<meta charset="utf-8"/>
<meta content="text/html;charset=UTF-8" http-equiv="Content-type"/>
</head>
<body>
<table>
<thead>
<tr><th>col1</th><th>col2</th><th>col3</th></tr>
</thead>
<tr><td>A</td><td>B</td><td><img alt="image" src="image.jpg" width="300"></td></tr>
</table>
</body>
</html>
"""
# this works as expected: bleach keeps the whitelisted <img> tag
bleach.clean(html_table, tags=['img'], attributes=['src', 'alt', 'width'])
# this doesn't: col3 is read back empty
html_dict = {"raw_html_cols": ["col3"],
             "raw_html_clean_kwargs": {'tags': ['img'],
                                       'attributes': ['src', 'alt', 'width']}}
t = Table.read(io.BytesIO(html_table.encode('utf-8')),
               format='ascii.html', htmldict=html_dict)
>>> print(t)
col1 col2 col3
---- ---- ----
A B --
###############
As you can see, col3 should contain the <img ...> tag, but it is read as
'--' (a masked value). Using bleach.clean with whitelisted tags and
attributes properly keeps the <img ...>.
Any advice on this?
Of course I found a workaround (replace <img with |img and then read
it), but it seems to me I've done everything as explained in the docs.
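
A minimal sketch of that workaround, assuming the placeholder "|img" never
occurs in the real table data. The helper names mask_img_tags and
unmask_img_tags are hypothetical, not part of astropy or bleach; they only
show the string substitution before reading and its reversal afterwards.

```python
# Placeholder taken from the workaround described above; any string that
# cannot appear in the real data would do.
PLACEHOLDER = "|img"


def mask_img_tags(html):
    """Replace each '<img' so the tag opening survives as plain cell text."""
    return html.replace("<img", PLACEHOLDER)


def unmask_img_tags(text):
    """Restore the '<img' opening in a cell value read back from the table."""
    return text.replace(PLACEHOLDER, "<img")


# Round trip on a single cell, without involving any table reader.
cell = '<td><img alt="image" src="image.jpg" width="300"></td>'
masked = mask_img_tags(cell)
restored = unmask_img_tags(masked)
print(masked)
print(restored)
```

In the script above, the masked string would presumably be passed to
Table.read in place of html_table, with unmask_img_tags applied to the
values of col3 afterwards; that step is not exercised in this sketch.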
If this is a bug then I can raise a GitHub issue, but I want to make
sure that I'm not missing something here.
Cheers,
Ivan V