Python text file fetch specific part of line

cs at zip.com.au cs at zip.com.au
Thu Jul 28 03:04:02 EDT 2016


On 27Jul2016 22:12, Arshpreet Singh <arsh840 at gmail.com> wrote:
>I am writing Imdb scrapper, and getting available list of titles from IMDB 
>website which provide txt file in very raw format, Here is the one part of 
>file(http://pastebin.com/fpMgBAjc) as the file provides tags like Distribution  
>Votes,Rank,Title I want to parse title names, I tried with readlines() method 
>but it returns only list which is quite heterogeneous, is it possible that I 
>can parse each value comes under title section?

Just for etiquette: please just post text snippets like that inline in your 
text. Some people don't like fetching random URLs, and some of us are not 
always online when reading and replying to email. Either way, having the text 
in the message, especially when it is small, is preferable.

To your question:

Your sample text looks like this:

    New  Distribution  Votes  Rank  Title
      0000000125  1680661   9.2  The Shawshank Redemption (1994)
      0000000125  1149871   9.2  The Godfather (1972)
      0000000124  786433   9.0  The Godfather: Part II (1974)
      0000000124  1665643   8.9  The Dark Knight (2008)
      0000000133  860145   8.9  Schindler's List (1993)
      0000000133  444718   8.9  12 Angry Men (1957)
      0000000123  1317267   8.9  Pulp Fiction (1994)
      0000000124  1209275   8.9  The Lord of the Rings: The Return of the King (2003)
      0000000123  500803   8.9  Il buono, il brutto, il cattivo (1966)
      0000000133  1339500   8.8  Fight Club (1999)
      0000000123  1232468   8.8  The Lord of the Rings: The Fellowship of the Ring (2001)
      0000000223  832726   8.7  Star Wars: Episode V - The Empire Strikes Back (1980)
      0000000233  1243066   8.7  Forrest Gump (1994)
      0000000123  1459168   8.7  Inception (2010)
      0000000223  1094504   8.7  The Lord of the Rings: The Two Towers (2002)
      0000000232  676479   8.7  One Flew Over the Cuckoo's Nest (1975)
      0000000232  724590   8.7  Goodfellas (1990)
      0000000233  1211152   8.7  The Matrix (1999)

Firstly, I would suggest you not use readlines(): it pulls all the text into 
memory. For small text like this it is OK, but some files can be arbitrarily 
large, so it is something to avoid where convenient. Normally you can just 
iterate over a file and get lines.
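
To illustrate iterating over a file directly, here is a minimal sketch. The 
function name `count_lines` is made up for this example, and `io.StringIO` 
stands in for a real open file so the snippet is self-contained:

```python
import io

def count_lines(fp):
    """Count lines by iterating over the file object itself."""
    count = 0
    for line in fp:  # lines are read on demand, not all loaded up front
        count += 1
    return count

# StringIO stands in for a real file opened with open(); same iteration protocol.
sample = io.StringIO(
    "New  Distribution  Votes  Rank  Title\n"
    "      0000000125  1680661   9.2  The Shawshank Redemption (1994)\n"
)
print(count_lines(sample))
```

With a real file you would write `with open(filename) as fp:` and loop over 
`fp` in the same way.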

You want "text under the Title." Looking at it, I would be inclined to say that 
the first line is a header and the rest consist of 4 columns: a number 
(distribution?), a vote count, a rank and the rest (title plus year).

You can parse data like that like this (untested):

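  # presumes `fp` is an open file reading from the text
  for n, line in enumerate(fp):
    if n == 0:
      # heading, skip it
      continue
    # split on runs of whitespace, at most 3 times, so the title stays whole
    distnum, nvotes, rank, etc = line.split(None, 3)
    etc = etc.rstrip()  # drop the trailing newline
    # ... do stuff with the various fields ...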
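
If you also want the title separated from the year, something along these 
lines may help. It is only a sketch: the name `parse_ratings`, the regular 
expression, and the assumption that every title ends in a 4-digit year in 
parentheses are mine, not anything taken from the IMDB file format, and a line 
with fewer than 4 fields will raise ValueError:

```python
import io
import re

# Assumption: a title ends with a 4-digit year in parentheses, e.g. "(1994)".
TITLE_YEAR = re.compile(r"^(?P<title>.*)\s+\((?P<year>\d{4})\)$")

def parse_ratings(fp):
    """Yield (distribution, votes, rank, title, year) for each data line."""
    for n, line in enumerate(fp):
        if n == 0 or not line.strip():
            continue  # skip the heading and any blank lines
        # At most 3 splits on whitespace, so the title text stays in one piece.
        dist, votes, rank, rest = line.split(None, 3)
        rest = rest.strip()
        m = TITLE_YEAR.match(rest)
        if m:
            title, year = m.group("title"), int(m.group("year"))
        else:
            title, year = rest, None  # no recognisable year; keep text as-is
        yield dist, int(votes), float(rank), title, year

# StringIO stands in for the real file here.
sample = io.StringIO(
    "New  Distribution  Votes  Rank  Title\n"
    "      0000000125  1680661   9.2  The Shawshank Redemption (1994)\n"
    "      0000000133  444718   8.9  12 Angry Men (1957)\n"
)
for row in parse_ratings(sample):
    print(row)
```

The distribution field is kept as a string so that its leading zeros survive.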

I hope that gets you going. If not, return with what code you have, what 
happened, and what you actually wanted to happen and we may help further.

Cheers,
Cameron Simpson <cs at zip.com.au>


