[ python-Bugs-1636950 ] Newline skipped in "for line in file"

Wed Jun 27 12:00:22 CEST 2007

Bugs item #1636950, was opened at 2007-01-16 17:56
Message generated for change (Comment added) made by runedevik
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1636950&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.5
Status: Closed
Resolution: Invalid
Priority: 5
Private: No
Submitted By: Andy Monthei (amonthei)
Assigned to: Nobody/Anonymous (nobody)
Summary: Newline skipped in "for line in file"

Initial Comment:
When processing huge fixed-block files, about 7000 bytes wide and several hundred thousand lines long, some pairs of lines are read as one long line with no line break when using "for line in file:". The problem is even worse with the fileinput module: reading in five or six huge files consisting of 4.8 million records causes several hundred pairs of lines to be read as single lines. When a newline is skipped, it is usually followed by several more skipped newlines in the next few hundred lines. I have not noticed any other characters being skipped, only the line break.

O.S. Windows (5, 1, 2600, 2, 'Service Pack 2')
Python 2.5
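
One way to locate the merged pairs described above is to flag every line
that is longer than the fixed record width. The following is only a
minimal sketch in modern Python 3 (not the Python 2.5 used in this
report); the function name find_suspect_lines and the expected_width
parameter are hypothetical, and it assumes the data decodes cleanly in
text mode.

```python
import fileinput


def find_suspect_lines(paths, expected_width):
    """Yield (filename, line_number, length) for over-long lines.

    expected_width is the assumed record width, not counting the
    line terminator. paths is a list of file names.
    """
    # FileInput chains the files together, like fileinput.input(),
    # but without relying on module-level state.
    with fileinput.FileInput(files=paths) as stream:
        for line in stream:
            length = len(line.rstrip("\r\n"))
            if length > expected_width:
                # filelineno() is the line number within the current file.
                yield stream.filename(), stream.filelineno(), length
```

A line reported with roughly twice the expected width would be a
candidate for two records read as one.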

----------------------------------------------------------------------

Comment By: Rune Devik (runedevik)
Date: 2007-06-27 12:00

Message:
Logged In: YES 
user_id=1212666
Originator: NO

Hi

I have the same problem with a huge file (8GB) containing long lines.
Sometimes two lines are merged into one, and when I rerun the test script
that reads the file, it is always the same lines that are merged. The
merging also seems to happen more frequently towards the end of the file. I
tried to reproduce it with a smaller data set (the 10 lines before the two
lines that get merged, the two merged lines themselves, and the 10 lines
after them) but was not able to. However, if I open this huge file in "rb"
mode instead of "r" mode, everything works as it should and no lines are
merged at all! If I copy the file over to Linux and rerun the test script,
no lines are merged (regardless of whether the mode is "r" or "rb"), so
this is Windows-specific and might have something to do with the adding of
\r\n where only \n is found when the file is opened in "r" mode. I have
reproduced it on both Python 2.3.5 and 2.5c1, on both Windows XP and
Windows 2003.

More stats on the input file in both "r" mode and "rb" mode below:

Input file size: 8 695 828 KB

fp = open(file, "r"):
  - total number of lines read:  668909
  - length of the longest line:  13179792
  - length of the shortest line: 89
  - 56 lines contain the content of two lines
  - It is always just two lines that are merged into one!
  - It is always the same lines that are merged when rerunning the test on
    the same file.

fp = open(file, "rb"):
  - total number of lines read:  668965
  - length of the longest line:  13179793
  - length of the shortest line: 90
  - no lines merged
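
Statistics like the ones above could be gathered with a small helper such
as the following. This is a sketch in Python 3, not the original test
script (which is not shown in this thread); the name line_stats is
hypothetical. Note that Python 3 text mode translates newlines the same
way on every platform, so this will not by itself reproduce the
Windows-specific Python 2 behavior reported here. The latin-1 encoding is
an assumption chosen so that any byte value decodes without error.

```python
def line_stats(path, mode):
    """Return (line_count, longest, shortest) for path opened in mode.

    Lengths include the line terminator as seen in that mode.
    """
    # Binary mode must not be given an encoding argument.
    kwargs = {} if "b" in mode else {"encoding": "latin-1"}
    count = 0
    longest = 0
    shortest = None
    with open(path, mode, **kwargs) as fp:
        for line in fp:
            n = len(line)
            count += 1
            longest = max(longest, n)
            shortest = n if shortest is None else min(shortest, n)
    return count, longest, shortest
```

Calling line_stats(path, "r") and line_stats(path, "rb") and comparing
the line counts shows whether any lines were merged. The one-byte
difference in the longest and shortest lengths reported above is
consistent with \r\n being seen as a single \n in "r" mode.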

Regards,
Rune Devik

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2007-01-21 01:46

Message:
Logged In: YES 
user_id=357491
Originator: NO

Well, with Andy saying he can't reproduce the problem I am going to close
as invalid.

Andy, if you ever happen to be able to upload data that triggers it, then
please re-open this bug.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-20 23:53

Message:
Logged In: YES 
user_id=1693612
Originator: YES

I have had no luck creating random data to reproduce the problem, which
leads me to conclude that it was the data itself.  Using a hex editor I
find no problem with the line breaks.

The data that triggers this bug is transferred several times before it gets
to me. It originates on a Unix box, then goes to an IBM mainframe, then to
my Windows machine, and goes through many updates along the way. It may be
an EBCDIC/ASCII conversion or possibly something to do with the mainframe
to PC transfer. Whatever it is, it's in the data itself.

The only thing that bothers me is that Java somehow is not affected by
this bad data.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-18 16:34

Message:
Logged In: YES 
user_id=1693612
Originator: YES

I am using open() for reading the file, no other features. I have also had
fileinput.input(fileList) compound the problem.  Each file that this has
happened to is a fixed-block file either 6990 or 7700 bytes wide, but I
think this is insignificant. When looking at the file in a hex editor,
everything looks fine, and a small Java program using a buffered reader
gives me the correct line count when Python does not.

I'm sure that using something like fp.read(8192) might temporarily solve my
problem, but I will keep working on getting a file I can upload.
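
The fp.read(8192) idea mentioned above could look roughly like this: read
the file in fixed-size binary chunks and count line-feed bytes directly,
bypassing line iteration entirely. This is a sketch in Python 3 under the
assumption that every record ends in \n (with or without a preceding
\r); the function name count_newlines_chunked is hypothetical.

```python
def count_newlines_chunked(path, chunk_size=8192):
    """Count b"\\n" bytes in path by reading fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                # An empty read means end of file.
                break
            # A single byte cannot be split across two chunks, so
            # counting per chunk never misses or double-counts a "\n".
            count += chunk.count(b"\n")
    return count
```

The result can be compared with the number of lines that
"for line in file:" yields for the same file.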

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-01-18 10:23

Message:
Logged In: YES 
user_id=89016
Originator: NO

Are you using any of the Unicode reading features (e.g. codecs.EncodedFile)
or are you using plain open() for reading the file?

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-18 08:12

Message:
Logged In: YES 
user_id=1591633
Originator: NO

I don't know if this helps: I spent the last little while creating and
reading random files that all (seemingly) matched the description you gave
us.  None of these files failed to read properly (i.e., each had the right
number of rows, with lines of the expected length, and definitely no
doubled lines).

Perusing the file source code, I found a detailed discussion of fgets vs
fgetc for finding the next line in the file.  Have you tried reading the
file with fp.read(8192) or similar?  Hopefully you're able to reproduce the
bug with scrubbed data (because I couldn't construct random data to do so).
Good luck.

----------------------------------------------------------------------

Comment By: Mark Roberts (mark-roberts)
Date: 2007-01-18 06:24

Message:
Logged In: YES 
user_id=1591633
Originator: NO

What are the min and max widths of the lines?  This problem is of
particular interest to me.

----------------------------------------------------------------------

Comment By: Andy Monthei (amonthei)
Date: 2007-01-17 22:58

Message:
Logged In: YES 
user_id=1693612
Originator: YES

I cannot upload the files that trigger this because of the data that is
in them, but I am working on getting around that.

In my data, line 617391 in a fixed-block file 6990 bytes wide gets read
in together with the line after it.  The line break where the bug happens
is 0d0a (same as the others), so I am wondering if it is a buffer issue
where the line break falls at the edge; however, no other characters are
ever missed. The total file is 888420 lines and this happens in four spots.
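
The buffer-edge hypothesis above can be checked with simple arithmetic:
for a given line number, compute the byte offsets of its 0d and 0a bytes
and see whether they fall on opposite sides of a buffer boundary. The
sketch below makes several assumptions that are not confirmed in this
thread: the stated record width excludes the two-byte line break, every
record has the same width, buffers are aligned to the start of the file,
and buffer_size is a hypothetical value to be tried (for example 4096 or
8192). The function name is hypothetical as well.

```python
def crlf_straddles_buffer(line_number, record_width, buffer_size):
    """Return True if the CR and LF ending the given 1-based line
    fall in different buffer_size-aligned blocks of the file."""
    record_bytes = record_width + 2  # data plus the 0d0a line break
    # Offset of the LF is the last byte of the line_number-th record.
    lf_offset = line_number * record_bytes - 1
    cr_offset = lf_offset - 1
    return cr_offset // buffer_size != lf_offset // buffer_size
```

For example, crlf_straddles_buffer(617391, 6990, 8192) tests the line
mentioned above against one candidate buffer size.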

I will hopefully have a file to send soon.

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2007-01-16 23:33

Message:
Logged In: YES 
user_id=357491
Originator: NO

Do you happen to have a sample you could upload that triggers the bug?

----------------------------------------------------------------------
