OT Re: Parallel insert to postgresql with thread

Fri Oct 26 00:04:49 EDT 2007

If you're not Scott Daniels, beware that this conversation has gone  
horribly off topic and, unless you have an interest in PostreSQL, you  
may not want to bother reading on...

On Oct 25, 2007, at 9:46 PM, Scott David Daniels wrote:

> Erik Jones wrote:
>>
>> On Oct 25, 2007, at 7:28 AM, Scott David Daniels wrote:
>>> Diez B. Roggisch wrote:
>>>> Abandoned wrote:
>>>>> Hi..
>>>>> I use the threading module for the fast operation. But ....
>>> [in each thread]
>>>>> def save(a,b,c):
>>>>>             cursor.execute("INSERT INTO ...
>>>>>             conn.commit()
>>>>>             cursor.execute(...)
>>>>> How can i insert data to postgresql the same moment ?...
>>>>
>>>> DB modules aren't necessarily thread-safe. Most of the times, a
>>>> connection (and ... cursor) can't be shared between threads.
>>>> So open a connection for each thread.
>>>
>>> Note that your DB server will have to "serialize" your inserts, so
>>> ... a single thread through a single connection to the DB is the way
>>> to go.  Of course it (the DB server) may be clever enough to behave
>>> "as if" they are serialized, but most of your work parallelizing at
>>> your end simply creates new work at the DB server end.
>>
>> Fortunately, in his case, that's not necessarily true....  If he
>> goes the recommended route with a separate connection for each  
>> thread,
>> then Postgres will not serialize multiple inserts coming from  
>> separate
>> connections unless there is something like and ALTER TABLE or REINDEX
>> concurrently happening on the table.
>> The whole serialized inserts thing is strictly something popularized
>> by MySQL and is by no means necessary or standard (as with a lot of
>> MySQL).
>
> But he commits after every insert, which _does_ force serialization  
> (if
> only to provide safe transaction boundaries).  I understand you can  
> get
> clever at how to do it, _but_ preserving ACID properties is exactly  
> what
> I mean by "serialize,"

First, bad idea to work with your own definition of a very domain  
specific and standardized term.  Especially when Postgres's Multi- 
Version Concurrency Control mechanisms are designed specifically for  
the purpose of preserve ACID compliance without forcing serialized  
transactions on the user.

Second, unless he specifically sets his transaction level to  
serializable, he will be working in read-committed mode.  What this  
specifically means is that two (or more) transactions writing to the  
same table will not block any of the others.  Let's say the user has  
two concurrent inserts to run on the same table that, for whatever  
reason, take a while to run (for example, they insert the results of  
some horribly complex or inefficient select), if either is run in  
serializable mode then which ever one starts a fraction of a second  
sooner will run until completion before the second is even allowed to  
begin. In (the default) read-committed mode they will both begin  
executing as soon as they are called and will write their data  
regardless of conflicts.  At commit time (which may be sometime later  
for transactions with multiple statements are used) is when conflicts  
are resolved.  So, if between the two example transactions there does  
turn out to be a conflict betwen their results, whichever commits  
second will roll back and, since the data written by the second  
transaction will not be marked as committed, it will never be visible  
to any other transactions and the space will remain available for  
future transactions.

Here's the relevant portion of the Postgres docs on all of this:  
http://www.postgresql.org/docs/8.2/interactive/mvcc.html

> and while I like to bash MySQL as well as the
> next person, I most certainly am not under the evil sway of the vile
> MySQL cabal.

Good to hear ;)

>
> The server will have to be able to abort each transaction
> _independently_ of the others, and so must serialize any index
> updates that share a page by, for example, landing in the same node
> of a B-Tree.

There is nothing inherent in B-Trees that prevents identical datum  
from being written in them.  If there was the only they'd be good for  
would be unique indexes.  Even if you do use a unique index, as noted  
above, constraints and conflicts are only enforced at commit time.

Erik Jones

Software Developer | Emma®
erik at myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com