[AstroPy] Reading HTML table with astropy.table

Ivan Valtchanov ivvv68 at gmail.com
Fri May 21 06:56:15 EDT 2021


Dear all,

I have a strange problem reading an HTML table with astropy.table.
Some columns contain hyperlinks, and even though I supply the necessary
htmldict, those columns are still not read properly. With the same
htmldict I managed to write an HTML table that keeps the HTML tags in
those columns.

Here is a short script I wrote to illustrate the issue:

#############
# Read HTML table with HTML tags
#
import io
import bleach
from astropy.table import Table

html_table = """
<html>
<head>
<meta charset="utf-8"/>
<meta content="text/html;charset=UTF-8" http-equiv="Content-type"/>
</head>
<body>
<table>
<thead>
<tr><th>col1</th><th>col2</th><th>col3</th></tr>
</thead>
<tr><td>A</td><td>B</td><td><img alt="image" src="image.jpg"
width="300"></td></tr>
</table>
</body>
</html>
"""

# this one works as expected: the <img> tag is kept
print(bleach.clean(html_table, tags=['img'], attributes=['src', 'alt', 'width']))

html_dict = {
    "raw_html_cols": ["col3"],
    "raw_html_clean_kwargs": {'tags': ['img'],
                              'attributes': ['src', 'alt', 'width']},
}

t = Table.read(io.BytesIO(html_table.encode('utf-8')),
               format='ascii.html', htmldict=html_dict)
#
# this one doesn't: the <img> tag in col3 is lost
print(t)
# col1 col2 col3
# ---- ---- ----
#    A    B   --
###############

As you can see, col3 should contain the <img ...> tag, but it is read as '--'.

Using bleach.clean with whitelisted tags and attributes properly keeps
the <img ...>.

Any advice on this?

I did find a workaround (replace '<img' with '|img' and then read the
table), but it seems to me I have done everything as explained in the docs.
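
A minimal sketch of that workaround, for reference. It assumes the '|img'
marker never occurs anywhere else in the table. The helper names
mask_img_tags and unmask_img_tags are made up for illustration, and the
Table.read call at the end is left as a comment because it is an assumed
usage that has not been run here.

```python
PLACEHOLDER = "|img"  # marker from the workaround; assumed not to occur elsewhere


def mask_img_tags(html: str) -> str:
    """Replace every '<img' with the placeholder so the cell holds plain text."""
    return html.replace("<img", PLACEHOLDER)


def unmask_img_tags(text: str) -> str:
    """Restore '<img' in a value read back from the masked table."""
    return text.replace(PLACEHOLDER, "<img")


row = ('<tr><td>A</td><td>B</td>'
       '<td><img alt="image" src="image.jpg" width="300"></td></tr>')
masked = mask_img_tags(row)
print(masked)

# Assumed usage with astropy (not executed here):
#   t = Table.read(io.BytesIO(mask_img_tags(html_table).encode("utf-8")),
#                  format="ascii.html")
#   t["col3"] = [unmask_img_tags(str(v)) for v in t["col3"]]
```

The substitution is done on the raw string before parsing, so it does not
depend on htmldict at all. The trade-off is that any literal '|img' already
present in the data would be turned into '<img' on the way back.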

If this is a real problem, I can open a GitHub issue, but I want to make
sure that I'm not missing something here.

Cheers,
Ivan V

