Large Two Dimensional Array

Denis McMahon denismfmcmahon at gmail.com
Wed Jan 29 11:32:19 EST 2014


On Tue, 28 Jan 2014 21:25:54 -0800, Ayushi Dalmia wrote:

> Hello,
> 
> I am trying to implement IBM Model 1. In that I need to create a matrix
> of 50000*50000 with double values. Currently I am using dict of dict but
> it is unable to support such high dimensions and hence gives memory
> error. Any help in this regard will be useful. I understand that I
> cannot store the matrix in the RAM but what is the most efficient way to
> do this?

This looks to me like a table with columns:

word1 (varchar 20) | word2 (varchar 20) | connection (double)

might be your best solution, but it's going to be a huge table (50,000 x 
50,000 = 2.5 billion rows).

The primary key should be the combination of word1 and word2 (the 
connection value is data, not part of the key), and you want indexes on 
word1 and word2. The indexes will slow down populating the table, but 
speed up searching it, and I assume that searching is going to be a much 
more frequent operation than populating.
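As a minimal sketch of that schema using Python's standard-library 
sqlite3 module (table and index names here are illustrative, not from 
the original post; note that the composite primary key already covers 
lookups by word1 alone, so only word2 strictly needs its own index):

```python
import sqlite3

# Use a real file path instead of ":memory:" to keep the data on disk.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE translation (
        word1      TEXT NOT NULL,
        word2      TEXT NOT NULL,
        connection REAL NOT NULL,
        PRIMARY KEY (word1, word2)
    )
""")

# The (word1, word2) primary key doubles as an index for word1 lookups;
# word2 gets a separate index so searches on it don't scan the table.
conn.execute("CREATE INDEX idx_word2 ON translation (word2)")
conn.commit()
```

Only rows you actually store take space, so a sparse 50,000 x 50,000 
matrix never needs anywhere near 2.5 billion rows on disk.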

Also, creating a database has the additional advantage that next time you 
want to use the program for a conversion between two languages that 
you've previously built the data for, the data already exists in the 
database, so you don't need to build it again.
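To illustrate that reuse, a later run just reopens the database file and 
queries it; no rebuilding needed. A sketch, again with sqlite3 and made-up 
sample data (executemany batches the inserts, which is much faster than 
one INSERT per row when populating 2.5 billion entries):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would persist between runs
conn.execute("""
    CREATE TABLE translation (
        word1 TEXT NOT NULL, word2 TEXT NOT NULL, connection REAL NOT NULL,
        PRIMARY KEY (word1, word2)
    )
""")

# Populate in batches: executemany with a parameterised statement.
pairs = [("house", "maison", 0.82), ("house", "logis", 0.11)]
conn.executemany("INSERT INTO translation VALUES (?, ?, ?)", pairs)
conn.commit()

# Look up a single matrix entry by its (word1, word2) key.
row = conn.execute(
    "SELECT connection FROM translation WHERE word1 = ? AND word2 = ?",
    ("house", "maison"),
).fetchone()
print(row[0])  # 0.82
```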

I imagine you would have either one table for each language pair, or one 
table for each conversion (treating a->b and b->a as two separate 
conversions).

I'm also guessing that varchar 20 is long enough to hold any of your 
50,000 words in either language; adjust that length if it isn't.

-- 
Denis McMahon, denismfmcmahon at gmail.com
