Newbie question about text encoding

Fri Feb 27 09:02:37 EST 2015

On 02/27/2015 06:54 AM, Steven D'Aprano wrote:
> Dave Angel wrote:
>
>> On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
>>> Dave Angel wrote:
>>>
>>>> (Although I believe Seymour Cray was quoted as saying that virtual
>>>> memory is a crock, because "you can't fake what you ain't got.")
>>>
>>> If I recall correctly, disk access is about 10000 times slower than RAM,
>>> so virtual memory is *at least* that much slower than real memory.
>>>
>>
>> It's so much more complicated than that, that I hardly know where to
>> start.
>
> [snip technical details]
>
> As interesting as they were, none of those details will make swap faster,
> hence my comment that virtual memory is *at least* 10000 times slower than
> RAM.
>

The term "virtual memory" is used for many aspects of the modern memory 
architecture.  But I presume you're using it in the sense of "running in 
a swapfile" as opposed to running in physical RAM.

Yes, a page fault takes on the order of 10,000 times as long as an
access to a location in L1 cache.  I suspect the ratio is a lot smaller
though if the swapfile is on an SSD.  The first byte is that slow.

But once the fault is resolved, the nearby bytes are in physical memory, 
and some of them are in L3, L2, and L1.  So you're not running in the 
swapfile any more.  And even when you run off the end of the page,
fetching the sequentially adjacent page from a hard disk is much faster
than a random access, since it usually needs no seek.  And if the disk
has well-designed buffering, faster yet.  The OS tries pretty hard to
keep the swapfile unfragmented.
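
To put rough numbers on that, here is a back-of-the-envelope sketch.
The function name, the 4096-byte page size, and the idea that every
access after the fault costs the same as an L1 hit are all my own
simplifying assumptions (the last one is optimistic), so treat the
result as an illustration of amortization and not as a measurement.

```python
def average_cost(fault_cost, hit_cost, accesses_per_fault):
    """Average cost per access when one access in every
    `accesses_per_fault` pays the page-fault price and the rest
    pay the ordinary hit price.  Units are arbitrary."""
    hits = accesses_per_fault - 1
    return (fault_cost + hit_cost * hits) / accesses_per_fault


# Assumed figures: a fault costs 10,000 units, a hit costs 1 unit.
FAULT = 10_000
HIT = 1

# Touch one byte per page: every access is a fault.
worst = average_cost(FAULT, HIT, 1)

# Touch all 4096 bytes of the page after faulting it in once.
sequential = average_cost(FAULT, HIT, 4096)

print(worst, sequential)
```

With those assumed figures the sequential case averages only a few
units per access, which is why the number of faults matters far more
than the cost of any single one.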

The trick is to minimize the number of page faults, especially to random 
locations.  If you're getting lots of them, it's called thrashing.

There are tools to help with that.  To minimize page faults on code, 
linking with a good working-set-tuner can help, though I don't hear of 
people bothering these days.  To minimize page faults on data, choosing 
one's algorithm carefully can help.  For example, in scanning through a
typical matrix, going in row order might touch adjacent locations, while
going in column order might touch scattered ones.
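
A small simulation of the matrix case, as a sketch only.  It assumes a
C-style row-major layout (the default for NumPy arrays, though not how
a plain Python list of lists is stored), 4096-byte pages, and 8-byte
elements; the function name is made up for illustration.  It counts
how often consecutive accesses land on a different page, which is a
crude stand-in for how many faults you would risk if memory were tight.

```python
PAGE_SIZE = 4096   # bytes per page (assumed)
ITEM_SIZE = 8      # bytes per element (assumed, e.g. a C double)


def page_switches(rows, cols, order):
    """Count page changes while scanning a row-major rows x cols
    matrix in either "row" or "column" order."""
    if order == "row":
        indices = ((r, c) for r in range(rows) for c in range(cols))
    else:
        indices = ((r, c) for c in range(cols) for r in range(rows))

    switches = 0
    last_page = None
    for r, c in indices:
        offset = (r * cols + c) * ITEM_SIZE   # row-major byte offset
        page = offset // PAGE_SIZE
        if page != last_page:
            switches += 1
            last_page = page
    return switches


# 512 x 512 elements of 8 bytes: each row fills exactly one page.
print(page_switches(512, 512, "row"))      # one switch per row
print(page_switches(512, 512, "column"))   # one switch per element
```

In row order each page is entered once and used up; in column order
every single access moves to a different page.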

That's not really much different from reading a text file.  If you can
arrange to process it a line at a time, rather than reading the whole
file into memory, you generally minimize your round-trips to disk.  And
if you need to randomly access it, it's quite likely more efficient to
memory map it, in which case it temporarily becomes part of the same
paging system (normally backed by the file itself rather than by the
swapfile).
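
In Python the two approaches might look something like this.  The
function names are just for illustration; the calls themselves are the
standard file iteration and the stdlib mmap module.  Note that
mmap.mmap() raises ValueError on an empty file, which this sketch does
not handle.

```python
import mmap


def longest_line_length(path):
    """Stream the file a line at a time; only the current line
    needs to be held in memory."""
    longest = 0
    with open(path, "rb") as f:
        for line in f:                      # lazy, line-by-line iteration
            longest = max(longest, len(line.rstrip(b"\r\n")))
    return longest


def read_at(path, offset, length):
    """Random access through a read-only memory map; the OS pages
    in only the parts of the file that are actually touched."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]   # slicing returns bytes
```

The first form suits a single pass over the file; the second suits
jumping around in a file too large to want in memory all at once.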

-- 
DaveA