From iermilov at informatik.uni-leipzig.de Thu Dec 6 00:40:49 2012
From: iermilov at informatik.uni-leipzig.de (Ivan Ermilov)
Date: Wed, 05 Dec 2012 23:40:49 -0000
Subject: [Csv] csv sniffer - incorrect dialect identification
Message-ID: <50BFDA3D.3080106@informatik.uni-leipzig.de>

Hello everybody,

I currently have the task of converting ~9500 CSV files to RDF (a corpus
extracted from the publicdata.eu portal), and I use the Python csv module
to extract the headers from each CSV file. I tried to use the sniff method
as in the following example:

> with open(self.resource_dir + self.filename, 'rU') as csvfile:
>     dialect = csv.Sniffer().sniff(csvfile.read(1024))
>     csvfile.seek(0)
>     reader = csv.reader(csvfile, dialect)
>     try:
>         for row in reader:
>             # return the first row (the header)
>             return row
>     except BaseException as e:
>         print str(e)
>         return []

But it fails to determine the comma ',' as the delimiter in some cases
(for instance, it can pick 'i' as the delimiter, which is nonsense in
real-world applications). This is really bad, because the comma is the
most frequently used delimiter and should be detected without mistake.

If I know which delimiters are possible in my corpus, is there a way to
tell the sniffer to choose between them?

Kind regards,
Ivan Ermilov.

From tony at tony.gen.nz Sat Dec 22 05:22:55 2012
From: tony at tony.gen.nz (Tony Wallace)
Date: Sat, 22 Dec 2012 17:22:55 +1300
Subject: [Csv] csv sniffer - incorrect dialect identification
In-Reply-To: <50BFDA3D.3080106@informatik.uni-leipzig.de>
References: <50BFDA3D.3080106@informatik.uni-leipzig.de>
Message-ID: <50D5359F.8040602@tony.gen.nz>

If I were importing 9500 CSV files generated as output from a single
database, I would not even try to use dialect detection. It is better to
determine what the correct dialect is and parse with a statically assigned
dialect. This dialect could be stored in your application metadata or
assigned in code. The reason is that when handling production quantities
of data there are always a few records that trip up code or detection
algorithms.
It is better to find out what the gotchas are and deal with them once and
for all.

Tony
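[Archive note: both suggestions can be illustrated briefly. Ivan's question has a direct answer in the csv module: Sniffer.sniff() accepts an optional delimiters argument that restricts the candidate set. Tony's approach of skipping detection entirely amounts to passing the formatting parameters to csv.reader yourself. A minimal sketch in modern Python 3; the sample data is hypothetical:]

```python
import csv
import io

# Hypothetical CSV sample standing in for the first 1024 bytes of a file.
sample = "name,age,city\nalice,30,leipzig\nbob,25,dresden\n"

# Restrict the sniffer to a known set of candidate delimiters via the
# `delimiters` argument of Sniffer.sniff(); characters outside this set
# (such as 'i') can no longer be chosen.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")

# Tony's alternative: no detection at all. Assign the dialect statically
# by passing the formatting parameters directly to csv.reader.
reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"')
header = next(reader)  # first row, i.e. the column headers
```

The static assignment is the more robust choice for a large homogeneous corpus, since sniffing a 1024-byte sample can always be tripped up by an unlucky file.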