Write tables from Word (.docx) to Excel (.xlsx) using xlsxwriter

Peter Otten __peter__ at web.de
Wed May 27 14:40:25 EDT 2020


BBT wrote:

> On Thursday, 28 May 2020 01:36:26 UTC+8, Peter Otten  wrote:
>> BBT wrote:
>> 
>> > I am trying to parse a Word (.docx) file for tables, then copy these
>> > tables over to Excel using xlsxwriter. This is my code:
>> > 
>> > from docx.api import Document
>> > import xlsxwriter
>> >  
>> > document = Document('/Users/xxx/Documents/xxx/Clauses Sample - Copy v1 - for merge.docx')
>> > tables = document.tables
>> >  
>> > wb = xlsxwriter.Workbook('C:/Users/xxx/Documents/xxx/test clause retrieval.xlsx')
>> > Sheet1 = wb.add_worksheet("Compliance")
>> > index_row = 0
>> >  
>> > print(len(tables))
>> >  
>> > for table in document.tables:
>> >     data = []
>> >     keys = None
>> >     for i, row in enumerate(table.rows):
>> >         text = (cell.text for cell in row.cells)
>> >  
>> >         if i == 0:
>> >             keys = tuple(text)
>> >             continue
>> >         row_data = dict(zip(keys, text))
>> >         data.append(row_data)
>> >         #print (data)
>> >         #big_data.append(data)
>> >         Sheet1.write(index_row,0, str(row_data))
>> >         index_row = index_row + 1
>> >  
>> > print(row_data)
>> >  
>> > wb.close()
>> > 
>> > 
>> > This is my desired output: https://i.stack.imgur.com/9qnbw.png
>> > 
>> > However, here is my actual output: https://i.stack.imgur.com/vpXej.png
>> > 
>> > I am aware that my current output produces a list of strings instead.
>> > 
>> > Is there any way that I can get my desired output using xlsxwriter?
>> 
>> I had to simulate docx.api. With that caveat the following seems to work:
>> 
>> import xlsxwriter
>>  
>> # begin simulation of
>> # from docx.api import Document
>> 
>> class Cell:
>>     def __init__(self, text):
>>         self.text = text
>> 
>> class Row:
>>     def __init__(self, cells):
>>         self.cells = [Cell(c) for c in cells]
>> 
>> class Table:
>>     def __init__(self, data):
>>         self.rows = [
>>             Row(row) for row in data
>>         ]
>> 
>> class Document:
>>     def __init__(self):
>>         self.tables = [
>>             Table([
>>                 ["Hello", "Test"],
>>                 ["est", "ing"],
>>                 ["gg", "ff"]
>>             ]),
>>             Table([
>>                 ["Foo", "Bar", "Baz"],
>>                 ["ham", "spam", "jam"]
>>             ])
>>         ]
>> 
>> document = Document()
>> 
>> # end simulation
>> 
>> wb = xlsxwriter.Workbook("tmp.xlsx")
>> sheet = wb.add_worksheet("Compliance")
>>  
>> offset = 0
>> for table in document.tables:
>>     for y, row in enumerate(table.rows):
>>         for x, cell in enumerate(row.cells):
>>             sheet.write(y + offset, x, cell.text)
>>     offset += len(table.rows) + 1  # one empty row between tables
>> 
>> wb.close()
> 
> 
> Hi Peter, thank you for your efforts :)
> 
> However, what if there are many tables in the Word document? It would be
> tedious to have to code the text in the tables one by one. Can I instead
> point Python at the Word document and let it parse the tables and their
> contents?

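I don't understand. You have docx.api available, so you can replace the
"simulation" part of my example with just these two lines:

from docx.api import Document
document = Document('/Users/xxx/Documents/xxx/Clauses Sample - Copy v1 - for merge.docx')

Nothing in the loop that writes to the worksheet depends on the hard-coded
sample data; it only reads table.rows, row.cells and cell.text, which is
what the real Document provides for every table it finds.

It should work -- it's just that I cannot be sure it will because I could
not test it.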
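
For reference, here is a sketch of how the pieces might fit together once the
simulation is gone. It is untested against a real .docx file. The function
names write_tables() and docx_tables_to_xlsx() and the gap parameter are made
up for illustration; only Document, document.tables, table.rows, row.cells,
cell.text, Workbook, add_worksheet(), write() and close() come from the code
above.

```python
def write_tables(tables, sheet, gap=1):
    """Write the text of every cell of every table to `sheet`.

    Tables are stacked vertically with `gap` empty rows between them.
    Returns the row index at which a further table would start.
    (Hypothetical helper, not part of docx or xlsxwriter.)
    """
    offset = 0
    for table in tables:
        rows = table.rows
        for y, row in enumerate(rows):
            for x, cell in enumerate(row.cells):
                sheet.write(y + offset, x, cell.text)
        offset += len(rows) + gap
    return offset


def docx_tables_to_xlsx(docx_path, xlsx_path, sheet_name="Compliance"):
    """Copy all tables found in a .docx file into one worksheet."""
    # Imported here so that write_tables() can be tried out
    # without either package installed.
    from docx.api import Document
    import xlsxwriter

    document = Document(docx_path)
    wb = xlsxwriter.Workbook(xlsx_path)
    sheet = wb.add_worksheet(sheet_name)
    write_tables(document.tables, sheet)
    wb.close()
```

You would then call docx_tables_to_xlsx() with the path of your Word file and
the path of the workbook you want to create. Because write_tables() only
needs objects with rows, cells and text, it works the same way no matter how
many tables the document contains.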




