[AstroPy] Reading HTML table with astropy.table

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Sun May 23 21:02:05 EDT 2021


Hi Ivan,

The current implementation takes the "text" attribute of each <td> tag.
In your example the third <td> contains only an <img> element, so its text
is empty. The short answer is that there is no built-in option that will
make the reader do what you expect. If you are familiar with beautifulsoup4
parsing, here is the relevant code:

            data_elements = soup.find_all('td')
            if data_elements:
                yield [el.text.strip() for el in data_elements]
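
To see why the cell comes back empty, here is a minimal sketch that runs the
same kind of lookup on the row from your example. It assumes beautifulsoup4 is
installed and uses the stdlib "html.parser" backend; the variable names are
only for illustration and are not part of astropy.

```python
from bs4 import BeautifulSoup

# The data row from the example table, on a single line.
row = ('<tr><td>A</td><td>B</td>'
       '<td><img alt="image" src="image.jpg" width="300"></td></tr>')

soup = BeautifulSoup(row, 'html.parser')
cells = soup.find_all('td')

# What the reader keeps: text nodes only, so the <img> cell is empty.
texts = [el.text.strip() for el in cells]

# What a "raw" option could keep instead: the markup inside each <td>.
raw = [el.decode_contents().strip() for el in cells]

print(texts)
print(raw)
```

The third entry of "texts" is an empty string, which the reader then shows as
a masked value, while the third entry of "raw" still holds the <img> markup.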

One point that we should make clearer in the docs (sorry) is that the
"raw_html_.." options (raw_html_cols, raw_html_clean_kwargs) apply only to
writing, and are ignored when reading.

I could imagine a new option to allow taking whatever string is contained
in the <td> tag instead of just the text. If you agree then please go ahead
and open an issue on GitHub.
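
Until such an option exists, one possible workaround is to pull the raw cell
contents out yourself. The sketch below uses only the standard library; the
class name RawCellParser is hypothetical, it is not how astropy reads HTML,
and it ignores nested tables and other edge cases.

```python
from html.parser import HTMLParser


class RawCellParser(HTMLParser):
    """Collect the raw markup inside each <td>, one list per <tr>."""

    def __init__(self):
        # Keep entities such as &amp; verbatim instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.rows = []      # finished rows that contained <td> cells
        self._row = None    # cells of the row being read
        self._cell = None   # markup pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td':
            if self._row is None:   # tolerate a <td> outside any <tr>
                self._row = []
            self._cell = []
        elif self._cell is not None:
            # Keep nested tags exactly as written in the source.
            self._cell.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:           # skip header rows with no <td>
                self.rows.append(self._row)
            self._row = None
        elif self._cell is not None:
            self._cell.append(f'</{tag}>')

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_entityref(self, name):
        if self._cell is not None:
            self._cell.append(f'&{name};')

    def handle_charref(self, name):
        if self._cell is not None:
            self._cell.append(f'&#{name};')


html = ('<table><thead><tr><th>col1</th><th>col2</th><th>col3</th></tr></thead>'
        '<tr><td>A</td><td>B</td>'
        '<td><img alt="image" src="image.jpg" width="300"></td></tr></table>')

parser = RawCellParser()
parser.feed(html)
parser.close()
print(parser.rows)
```

The resulting list of rows could then be passed to Table(rows=..., names=...)
with the column names supplied by hand.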

Cheers,
Tom

On Fri, May 21, 2021 at 6:56 AM Ivan Valtchanov <ivvv68 at gmail.com> wrote:

> Dear all,
>
> I have a strange problem reading an HTML table with astropy.table.
> Some columns contain hyperlinks and even though I supply the necessary
> htmldict it still cannot read those properly. With the same htmldict I
> managed to write an HTML table keeping the columns with HTML tags.
>
> Here is a short code I did to illustrate the issue:
>
> #############
> # Read HTML table with HTML tags
> #
> import io
> import bleach
> from astropy.table import Table, join
>
> html_table = """
> <html>
> <head>
> <meta charset="utf-8"/>
> <meta content="text/html;charset=UTF-8" http-equiv="Content-type"/>
> </head>
> <body>
> <table>
> <thead>
> <tr><th>col1</th><th>col2</th><th>col3</th></tr>
> </thead>
> <tr><td>A</td><td>B</td><td><img alt="image" src="image.jpg"
> width="300"></td></tr>
> </table>
> </body>
> </html>
> """
>
> # this one works as expected
> bleach.clean(html_table,tags=['img'],attributes=['src','alt','width'])
>
> html_dict = {"raw_html_cols":["col3"], "raw_html_clean_kwargs":
> {'tags': ['img'], 'attributes': ['src','alt','width']}}
>
> t = Table.read(io.BytesIO(bytes(html_table, encoding='utf-8')),
>                format='ascii.html', htmldict=html_dict)
> #
> # this doesn't
> print (t)
> >>> print (t)
> col1 col2 col3
> ---- ---- ----
>    A    B   --
> ###############
>
> As you can see, col3 should be an <img ...> but it's '--'.
>
> Using bleach.clean with whitelisted tags and attributes properly keeps
> the <img ...>.
>
> Any advice on this?
>
> Of course I found a workaround (replace <img with |img) and then read
> it, but it seems to me I've done all as explained in the docs.
>
> If it's a problem then I can raise a github issue, but I want to make
> sure that I'm not missing something here.
>
> Cheers,
> Ivan V
> _______________________________________________
> AstroPy mailing list
> AstroPy at python.org
> https://mail.python.org/mailman/listinfo/astropy
>