XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

Wed Sep 29 23:53:44 EDT 2021

On Thursday, September 30, 2021 at 9:20:37 AM UTC+8, hongy... at gmail.com wrote:
> On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote: 
> > On 2021-09-29 01:22:03 -0700, hongy... at gmail.com wrote: 
> > > I tried to convert a xls file into csv with the following command, but failed: 
> > > 
> > > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls 
> > > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' 
> > > 
> > > The above testing file is located at here [1]. 
> > > 
> > > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls 
> > Why is that file name .xls when it's obviously an HTML file?
> Good catch! Thank you for pointing this out. This file is automatically exported from my university's teaching management system, and it was assigned the .xls extension by default. 
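
For what it's worth, the mismatch can be confirmed from the first bytes of the file before reaching for any converter. Below is a minimal sketch, not a robust detector: `sniff_format` is a hypothetical helper name of my own, the HTML test is only a rough heuristic, and the only facts relied on are the OLE2 signature that legacy .xls files start with and the ZIP signature used by .xlsx.

```python
# Signature of the OLE2 compound file container used by legacy .xls files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP archive; .xlsx files are ZIP containers.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_format(path):
    """Guess a file's real format from its leading bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(2048)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-or-zip"
    # HTML exports may begin with blank lines, as in the error message above,
    # so strip leading whitespace before looking for markup (rough heuristic).
    lowered = head.lstrip().lower()
    if lowered.startswith((b"<!doctype html", b"<html")) or b"<table" in lowered:
        return "html"
    return "unknown"
```

Calling `sniff_format('2021-2022-1.xls')` on a file like the one above would be expected to take the HTML branch, since the content starts with blank lines followed by markup rather than the OLE2 signature.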

Following the comment above, I changed the extension to .html, and the following Python code then does the trick:

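import sys
import pandas as pd

if len(sys.argv) != 2:
    print('Usage: ' + sys.argv[0] + ' input-file', file=sys.stderr)
    sys.exit(1)

# read_html returns a list of DataFrames, one per <table> found in the file.
myhtml_pd = pd.read_html(sys.argv[1])
# In [25]: len(myhtml_pd)
# Out[25]: 3

# Use the third table: skip row 0 and columns 0 and 1,
# then print every remaining non-empty cell.
df = myhtml_pd[2]
for i in df.index:
    if i > 0:
        for j in df.columns:
            if j > 1 and not pd.isnull(df.loc[i, j]):
                print(df.loc[i, j])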
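
Since the original goal was a CSV file, the same slice of the table can also be written out with `DataFrame.to_csv` instead of being printed cell by cell. This is only a sketch: `cells_to_csv` is a hypothetical helper name, and it assumes the table has the default integer row and column labels, so that "rows after the first, columns after the second" matches the `i > 0` and `j > 1` tests in the loop. Unlike the loop, it keeps empty cells as empty fields rather than skipping them.

```python
import io

import pandas as pd


def cells_to_csv(df, out):
    """Write rows 1.. and columns 2.. of df to `out` as CSV (hypothetical helper).

    Assumes default integer labels, so positional slicing selects the same
    cells as the label tests i > 0 and j > 1.
    """
    body = df.iloc[1:, 2:]
    # Drop the index and header so only the cell values are written.
    body.to_csv(out, index=False, header=False)


# Small self-contained demonstration with made-up data.
demo = pd.DataFrame([["h0", "h1", "h2", "h3"],
                     ["a", "b", "c", "d"],
                     ["e", "f", "g", None]])
buf = io.StringIO()
cells_to_csv(demo, buf)
print(buf.getvalue())
```

For the real file, the call would be along the lines of `cells_to_csv(pd.read_html(sys.argv[1])[2], sys.stdout)`.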

HZ