Puzzled: Y am I ending up with extra bytes?

Sat Feb 23 18:56:03 EST 2002

"A.Newby" <deathtospam43423 at altavista.com> wrote in message news:<Xns91BF3BEAA45A2AusYourStandingInIt at 130.133.1.4>...
> Why is this happening? I read large chunks of data from a text file, 
> according to byte locations specified on another file, and for some reason, 
> this function (below), spits out a few extra bytes.
> 
> Here's the code, as entered into the Python shell...... 
> 
> 
>     	index = map(string.rstrip, open('D:\cgi-bin\indx.txt').readlines())
>     	#this opens the index file, which has precise byte locations of each 
>     	#chunk of data I want to extract from the log.txt file, and turns it 
>     	#into a list.
> 
> 
>     	def fish(end, start, deduct):
> 	    	chat = open('D:\cgi-bin\log.txt', 'r')
> 	    	g = int(index[end]) - int(index[start])

This seems to assume that chunks in the log.txt file are contiguous.
E.g. 
log.txt contains
chunk1RLlonger-chunkRLthird-chunkRL
0....v....1....v....2....v....3....v
[R = carriage return 0x0D, L = line feed 0x0A]
and indx.txt contains
0
8
22
35
[The final 35 is an EOF sentinel. Better have one, otherwise you are
cactoid on the last chunk.]

and you expect (remembering that the signature is fish(end, start, deduct)):
fish(2, 1, 0) to give "longer-chunk\n"
and fish(2, 0, 0) to give "chunk1\nlonger-chunk\n"

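Is this so? If it is, it is not difficult to see where actual might
differ from expected by 1 byte per line. The offsets in indx.txt count
the "\r" of every line ending, but text mode on Windows hands you each
"\r\n" as a single "\n". seek() will position you properly, but read(n)
in text mode returns n characters *after* that translation, so it pulls
in one more byte of the file per line than you "expect".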

Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open("foo.txt", "rb")
>>> f.seek(0); f.read()
'01234567\r\nabcdefgh\r\nABCDEFGH\r\n\r\n'
>>> f.seek(0); f.read(20)
'01234567\r\nabcdefgh\r\n'
>>> f = open("foo.txt", "r")
>>> f.seek(0); f.read(20)
'01234567\nabcdefgh\nAB'
>>>
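
For anyone repeating the experiment on a current interpreter, here is a
rough sketch of the same comparison in Python 3 (the session above is
Python 2.2). It assumes Python 3's default newline handling, which
translates "\r\n" to "\n" in text mode on every platform, not just
Windows. The scratch file and the helper name read_both_ways are made up
for the illustration.

```python
import os
import tempfile

# Same 32 bytes as foo.txt in the session above, with DOS line endings.
DATA = b"01234567\r\nabcdefgh\r\nABCDEFGH\r\n\r\n"

def read_both_ways(n):
    # Hypothetical helper: write DATA to a scratch file, then read the
    # first n units back in binary mode and in text mode.
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(DATA)
        with open(path, "rb") as f:
            raw = f.read(n)      # n bytes, exactly as stored
        with open(path, "r", encoding="ascii") as f:
            cooked = f.read(n)   # n characters after newline translation
        return raw, cooked
    finally:
        os.remove(path)

raw, cooked = read_both_ways(20)
print(repr(raw))     # stops at the end of the second line
print(repr(cooked))  # runs two characters into the third line
```

The two extra characters in the text-mode result correspond to the two
line endings inside the span, one per line.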

However an allegedly semi-constant "about 4" is a puzzler. At one extra
byte per line, that would mean most chunks are 4 lines. Are they? I
suggest that you set up a *SMALL* test set like the above, and try
that. If you still have a problem, then you can post the contents of
your test data files, and (with some precision) what the result was.
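
One way to produce such a small test set, sketched in Python 3 under
the assumption that every chunk is a single line ending in "\r\n". The
helper name build_test_set is hypothetical; it only reproduces the
three-chunk example and its offsets.

```python
def build_test_set(chunks, eol=b"\r\n"):
    # Returns (log_bytes, offsets). offsets[i] is the byte position where
    # chunk i starts; the last entry is the total length (the EOF sentinel).
    log = b""
    offsets = [0]
    for chunk in chunks:
        log += chunk.encode("ascii") + eol
        offsets.append(len(log))
    return log, offsets

log, offsets = build_test_set(["chunk1", "longer-chunk", "third-chunk"])
print(offsets)
print(log[offsets[1]:offsets[2]])
```

Write log to disk with open(..., "wb") so that nothing touches the line
endings, and write the offsets one per line to the index file.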

You will probably have to open the log.txt file in binary mode, and
strip out any "\r" you find when you read from the seek position. This
would still work on *x, but go hopelessly wrong on a Mac. Better:
first change "\r\n" to "\n", then change "\r" to "\n".
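
A minimal sketch of that two-step replacement, written for Python 3
bytes as you would get them from a file opened in binary mode. The
function name normalize_newlines is made up for the example.

```python
def normalize_newlines(data):
    # Collapse the two-byte DOS pairs first; doing the lone "\r" pass
    # first would turn every "\r\n" into "\n\n".
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

dos = normalize_newlines(b"one\r\ntwo\r\n")
mac = normalize_newlines(b"one\rtwo\r")
unix = normalize_newlines(b"one\ntwo\n")
print(dos, mac, unix)
```

All three inputs come out the same, which is the point of doing the
replacements in that order.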

> 	    	chat.seek(int(index[start]))
> 	    	print chat.read(g - deduct)
> 
> Now, if I enter the following into the command line ...
> 
> fish(205, 204, 0)
> 
> ,,, I get about four extra characters. God knows why! So that's why I try 

*ABOUT* four or *EXACTLY* four??? Are they the first ~four characters
from the next chunk?

> ...
> 
> fish(205, 204, 4) 
> 
> ... And it seems to work perfectly. I can even "fish" up to about ten 
> "index" lines with it.

Your meaning is unclear here. Show the Python code.

> But as soon as I try and fish out any more than 
> that, I get the dreaded extra 4 bytes of code again. Why?

... and here too.

> 
> Now I know what you're thinking. You're thinking that perhaps my index file 
> is corrupted, and hasn't got an accurate account of precisely what's in the 
> log file. I suspected that might be the case myself, but ... when I try 
> fishing out each chunk of data individually, it works fine! I can even do 
> this ...
> 
> for x in range(1, 90):
> 	fish(211+x, 210 + x, 4)
> 
> ...... without ending up with that extra data I don't want. However, this 
> method is too slow for my purposes.

Not surprising, seeing fish() is opening the log.txt file each time
you call it!
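
A rough sketch of the alternative in Python 3: open the log once, in
binary mode, and hand the open file to fish(). The extra parameters and
the fish_each helper are my own invention, not the original code, and
io.BytesIO stands in for the real open('log.txt', 'rb') so that the
example runs on its own.

```python
import io

def fish(log, index, end, start):
    # log is an already-open binary file object, so nothing is reopened
    # here. Returns the raw bytes from index[start] up to index[end].
    log.seek(index[start])
    return log.read(index[end] - index[start])

def fish_each(log, index, first, last):
    # Pull chunks first .. last-1 one at a time from the same open file.
    return [fish(log, index, i + 1, i) for i in range(first, last)]

# Stand-in for the real log file; the offsets match the small example.
log = io.BytesIO(b"chunk1\r\nlonger-chunk\r\nthird-chunk\r\n")
index = [0, 8, 22, 35]
chunks = fish_each(log, index, 0, 3)
print(chunks)
```

Because the reads are in binary mode, byte counts taken from the index
line up exactly with what read() returns, and no "deduct" fudge is
needed.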

> Plus, I really wanna know what it is 
> that's going wrong.
> 
> Can anyone spot it?
> 
> BTW, if you're curious, this is part of a chat script I'm putting together. 
> 
> 
> cdewin at dingoblue.net.au