[Chicago] Perl Follow-up

Sat Mar 13 19:07:08 CET 2010

The sort of thing you suggest is the motive for this program.

The human Y chromosome is about 60 million nucleotide base pairs (60Mpb) 
in length. Other than the mitochondrial DNA, it is the shortest in my 
library. Human chromosome 1 is the longest at about a quarter billion 
bases pairs.

Running some hypothetical analysis on chrY is doable on my home machine. 
Running the reverse complement program on chr1 causes it to bork in 
MS-Windows. I have 2GB of memory. In Ubuntu Linux it will run, but it 
starts thrashing out to the swap partition toward the end.

The point is that with 8GB of memory, and a well focussed goal, doing it 
all in-memory is now quite feasible.

I have given a great deal of thought to other algorithms. If, for 
example, I had several hundred patterns, or motifs, to search for, I 
could use a serious array of state machines, and use indices to resolve 
candidacy and ongoing validity for massively parallel occurrences, in a 
single pass, even if I was doing the reverse complement on the fly. For 
each motif, I would wind up with an array of pointers into the 
chromosome. And I could do it with the sort of generator, or small 
moving window of buffered input, which you propose. This might be just 
the thing for a massive processor array computer to chew on.

The sort of experiment which I intend to do next, would be to look for 
CG-heavy sections of DNA. These indicate good areas for the termination 
of pre-RNA transcription. What they tend to do during transcription is 
bind to themselves, knot up, and break off the transcription. These 
potential transcription-ends are easier to spot than the starting areas, 
and then I would just look upstream to see if there's anything interesting.

As for looking at my code in 6 months: yes, I've done that in GW-BASIC. 
The code didn't work. I shelved it. When I came back fresh, the problem 
was obvious: I was using the same variable "B" for two different things. 
Doh!

c4

Jonathan Hayward wrote:
> One other comment in regard to generators...
> 
> If I remember my biology, the Y chromosome is a short one, perhaps the 
> shortest human chromosome, and it weighed in at 600M (conceivably this 
> could be compressed considerably by straightforwardly using two bits 
> rather than one byte per nucleotide). You may have a lot of memory, but 
> it could be conceivable that you could want to do things with 
> chromosome(s) that cannot easily be loaded in memory, especially with 
> two copies, one before the translation, and one afterwards.
> 
> The following pseudocode is not optimized but takes a file object 
> (opened for reading) and, piece by piece, returns the transformed 
> result, taking a fixed (i.e. O(1)) amount of memory regardless of the 
> size of the data. That is, you should theoretically be able to buy a 
> low-memory computer /at a garage sale/ and run code on this principle 
> over the entire human genome without memory constraints being an issue:
> 
> import string
> 
> def transform(filehandle, block_size=1024):
>     translation = string.maketrans("TACG", "ATGC")
>     while 1:
>         input = filehandle.read(block_size)
>         if not input:
>             return
>         yield input.translate(translation)
> 
> Are you familiar with generators? Basically, a generator is a function 
> that instead of returning once, keeps on yielding a result; an obvious 
> use-case is to take a problem that could be solved by using a large list 
> (using O(n) memory, a bit problematic if you're handling encyclopedias, 
> chromosomes, etc.), and allow another approach that uses an amount of 
> memory that is small and fixed. You don't specifically need as a 
> specification to your program that it store a whole chromosome in memory 
> at once; it's just that the most obvious solutions involve doing that, 
> and generators allow you to transform an arbitrarily large piece of 
> information while using a fixed (and, perhaps, quite low) amount of memory.