How to find bad row with db api executemany()?

Roy Smith roy at panix.com
Fri Mar 29 20:41:03 EDT 2013


In article <mailman.3971.1364595940.2939.python-list at python.org>,
 Dennis Lee Bieber <wlfraed at ix.netcom.com> wrote:

> If using MySQLdb, there isn't all that much difference... MySQLdb is
> still compatible with MySQL v4 (and maybe even v3), and since those
> versions don't have "prepared statements", .executemany() essentially
> turns into something that creates a newline delimited "list" of
> "identical" (but for argument substitution) statements and submits that
> to MySQL.

Shockingly, that does appear to be the case.  I had thought during my 
initial testing that I was seeing far greater throughput, but as I got 
more into the project and started doing some side-by-side comparisons, 
the differences went away.
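
The side-by-side test was nothing fancy; roughly this, with made-up 
connection details, table, and row count (timing scaffolding only):

    import time
    import MySQLdb

    # Compare one executemany() call against a Python-level loop of
    # execute() calls over the same parameter list.
    conn = MySQLdb.connect(db="staging", charset="utf8")
    cursor = conn.cursor()
    sql = "INSERT INTO t (a, b) VALUES (%s, %s)"
    params = [(i, str(i)) for i in range(10000)]

    t0 = time.time()
    cursor.executemany(sql, params)
    t1 = time.time()
    for row in params:
        cursor.execute(sql, row)
    t2 = time.time()
    print("executemany: %.2fs, loop: %.2fs" % (t1 - t0, t2 - t1))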

We're sucking in a pretty huge amount of data.  The source document is a 
7 GB gzipped XML file.  I'm not sure how big it is uncompressed (we use 
gzip.GzipFile to uncompress on the fly) but I'm guessing something like 
a 30x compression ratio so 200 GB?  The last time we imported the whole 
set, it ran for 21 days!
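
For the curious, the on-the-fly decompression is just gzip.GzipFile 
feeding an incremental XML parser.  A minimal sketch (the file name, 
element tag, and handle() function are made-up stand-ins):

    import gzip
    import xml.etree.cElementTree as etree

    # Stream-decompress and parse incrementally, so the ~200 GB of XML
    # never has to exist on disk or in memory all at once.
    with gzip.GzipFile("dump.xml.gz") as f:
        for event, elem in etree.iterparse(f):
            if elem.tag == "record":
                handle(elem)      # application-specific processing
                elem.clear()      # release parsed elements as we go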

It turns out, the problems we were seeing were all on inserts into a 
new table we'd added.   Apparently the default charset is latin-1, and 
we didn't notice that when we altered the schema!  Once I noticed that 
all the other tables were utf-8 and changed this one to match, the 
problems went away.
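
Spotting it was easy once we knew to look; something along these lines 
(the table name and db_name are stand-ins, and note that CONVERT TO 
CHARACTER SET rewrites the existing rows):

    # List each table's collation; the new table stood out as latin1.
    cursor.execute(
        "SELECT table_name, table_collation "
        "FROM information_schema.tables WHERE table_schema = %s",
        (db_name,))
    for name, collation in cursor.fetchall():
        print("%s %s" % (name, collation))

    # Convert the offending table to match the rest of the schema.
    cursor.execute("ALTER TABLE staging_foo CONVERT TO CHARACTER SET utf8")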

Sadly, I ended up resorting to a truly ugly hack to diagnose the 
problem.  I catch the exception and parse the text message.  Yuck.

    import re
    import MySQLdb

    try:
        self.executemany(self.sql_statement, self.sql_params)
    except MySQLdb.Error as ex:
        code, message = ex.args
        # MySQLdb only identifies the failing row in the message text,
        # e.g. "Incorrect string value: '...' for column 'x' at row 42"
        m = re.search(r".* at row (\d+)$", message)
        if m:
            i = int(m.group(1)) - 1  # Python is 0-indexed, SQL is 1-indexed
            bad_row = self.sql_params[i]  # the offending parameter tuple


The other truly horrible part of the project was when I decided it was 
bad for my import script to have too much schema knowledge hard-wired 
in.  So, I used SQLAlchemy to introspect the database and 
discover the column names, types, and defaults.  It turns out, if an 
integer column has a default (say, 0), the introspected data comes back 
with the default as the string, '0'.  WTF???
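
Here's roughly what I mean; the URL and table name are made up, but the 
point is that a reflected server default comes back as SQL text, not as 
a Python value of the column's type:

    from sqlalchemy import create_engine, MetaData

    engine = create_engine("mysql://user:pass@localhost/staging")
    meta = MetaData()
    meta.reflect(bind=engine, only=["items"])

    col = meta.tables["items"].c.quantity   # an INTEGER column, DEFAULT 0
    # The reflected default is the literal SQL text for zero -- a
    # string -- even though the column's type is Integer.
    print(str(col.server_default.arg))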

Does Postgres's Python adapter handle executemany() in a sane way?  
We're not wedded to MySQL in any way.  We use it for exactly this one 
process.  We get these XML dumps from a supplier's SQL Server database.  
We stage the data in MySQL, then export what we need into MongoDB.  We 
could easily swap out the MySQL staging for Postgres if that worked 
better.

Hmmm, we do take advantage of REPLACE INTO, which is a non-standard 
MySQL extension.  Not sure if Postgres supports that.
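
Our usage is just the obvious thing, with made-up table and column 
names:

    # REPLACE INTO is MySQL's delete-then-insert upsert: any existing
    # row with the same primary/unique key is removed, then the new
    # row is inserted.
    sql = "REPLACE INTO staging_items (id, name) VALUES (%s, %s)"
    cursor.executemany(sql, [(1, "foo"), (2, "bar")])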


