csv read clean up and write out to csv

Sacha Rook sacharook at gmail.com
Fri Nov 2 13:25:09 EDT 2012


Hi

I have a problem with a csv file from a supplier, so they export data to csv however the last column in the record is a description which is marked up with html.

trying to automate the processing of this csv to upload elsewhere in a useable format. If i open the csv with csved it looks like all the records aren't escaped correctly as after a while i find html tags and text on the next line/record.

If I 'openwith' excel the description stays on the correct line/record?

I want to use python to read these records in and output a valid csv with the descriptions intact preferably without the html tags so a string of text formatted with newline/CR where appropriate.

So far I have this but don't know where to go from here can someone help me?

import csv

infile = open('c:\data\input.csv', 'rb')
outfile = open('c:\data\output.csv', 'wb')

reader = csv.reader(infile)
writer = csv.writer(outfile)


for line in reader:
    print line
    writer.writerow(line)


The input.csv is set out as follows;
HEADER ROW 1st
"FileDate","ProductID","Name","StandardPrice","DropshipPrice","SRP","Brand","Xline","InStock","Stock","Barcode","Weight","CategoryID","Category","SmallImage","LargeImage","Description"

A COMPLETE RECORD LOOKS LIKE THIS WITH THE DESCRIPTION FIELD POPULATED SOME RECORDS DON'T HAVE THE DESCRIPTION FIELD POPULATED

"2012-11-01T18:28:45.25+00:00","10198","(Venom) PS2 DVD Remote Control (Black)","3.7800","4.3500","12.9800","Venom","true","In Stock","1","5031300025009","200","1339","PC/Games_Console / Playstation / PS2 / Remote Controls","http://www.atssitecentre.co.uk/images/products/10000/10198.gif","http://www.atssitecentre.co.uk/images/products/10000/10198f.jpg","Never have to unplug your joypad / DVD user friendly / Works up to 30 feet from PS/2 / IR wireless technology."

THIS IS AN EXAMPLE OF THE BAD RECORD CAUSING PROBLEMS, THE DESCRIPTION FIELD STARTS ""features:</p>
AS YOU CAN SEE CONTAINS HTML BUT BECAUSE OF THIS FORMAT SUBSEQUENT HTML TAGS ARE
ADDED AS NEW RECORDS ON SUBSEQUENT LINES. 

"2012-11-01T18:28:45.25+00:00","11116","3.5 inch Disk Drive Lock","2.9500","2.9500","9.9500","None","true","In Stock","3","077511994166","131","1332","PC/Games_Console / PC Security / General","http://www.atssitecentre.co.uk/images/products/11000/11116.gif","http://www.atssitecentre.co.uk/images/products/11000/11116f.jpg","features:</p>
<ul>
<li>3 1/2" FDD Lock.</li>
<li>Die casting housing and cylinder chrome plated.</li>
<li>Lock Cover : PBT + GF 15%. (PLASTIC)</li>
<li>2 Keys supplied per lock. <br /></li>
</ul>"

I know I am far from complete but don't know how to proceed :-)
As I said I want to reconstruct a clean record either strip out the html tags or 
at least escape the records appropriately..

Thanks all



More information about the Python-list mailing list