Python text file fetch specific part of line

honeygne at gmail.com
Tue Aug 2 02:55:18 EDT 2016


On Thursday, July 28, 2016 at 1:00:17 PM UTC+5:30, c... at zip.com.au wrote:
> On 27Jul2016 22:12, Arshpreet Singh <arsh840 at gmail.com> wrote:
> >I am writing an IMDb scraper and fetching the list of available titles from 
> >the IMDb website, which provides a txt file in a very raw format. Here is one 
> >part of the file (http://pastebin.com/fpMgBAjc). The file has columns such as 
> >Distribution, Votes, Rank and Title, and I want to parse the title names. I 
> >tried the readlines() method, but it only returns a list, which is quite 
> >heterogeneous. Is it possible to parse each value that comes under the 
> >Title section?
> 
> Just for etiquette: please just post text snippets like that inline in your 
> text. Some people don't like fetching random URLs, and some of us are not 
> always online when reading and replying to email. Either way, having the text 
> in the message, especially when it is small, is preferable.
> 
> To your question:
> 
> Your sample text looks like this:
> 
>     New  Distribution  Votes  Rank  Title
>       0000000125  1680661   9.2  The Shawshank Redemption (1994)
>       0000000125  1149871   9.2  The Godfather (1972)
>       0000000124  786433   9.0  The Godfather: Part II (1974)
>       0000000124  1665643   8.9  The Dark Knight (2008)
>       0000000133  860145   8.9  Schindler's List (1993)
>       0000000133  444718   8.9  12 Angry Men (1957)
>       0000000123  1317267   8.9  Pulp Fiction (1994)
>       0000000124  1209275   8.9  The Lord of the Rings: The Return of the King (2003)
>       0000000123  500803   8.9  Il buono, il brutto, il cattivo (1966)
>       0000000133  1339500   8.8  Fight Club (1999)
>       0000000123  1232468   8.8  The Lord of the Rings: The Fellowship of the Ring (2001)
>       0000000223  832726   8.7  Star Wars: Episode V - The Empire Strikes Back (1980)
>       0000000233  1243066   8.7  Forrest Gump (1994)
>       0000000123  1459168   8.7  Inception (2010)
>       0000000223  1094504   8.7  The Lord of the Rings: The Two Towers (2002)
>       0000000232  676479   8.7  One Flew Over the Cuckoo's Nest (1975)
>       0000000232  724590   8.7  Goodfellas (1990)
>       0000000233  1211152   8.7  The Matrix (1999)
> 
> Firstly, I would suggest you not use readlines(): it pulls all the text into 
> memory. For small text like this it is OK, but some files can be arbitrarily 
> large, so it is something to avoid when convenient. Normally you can just 
> iterate over a file and get lines.
> 
> You want "text under the Title." Looking at it, I would be inclined to say that 
> the first line is a header and the rest consist of 4 columns: a number 
> (distribution?), a vote count, a rank and the rest (title plus year).
> 
> You can parse data like that like this (untested):
> 
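>   # presumes `fp` is an open file reading from the text
>   for n, line in enumerate(fp):
>     if n == 0:
>       # heading, skip it
>       continue
>     # split on runs of whitespace at most 3 times, so the 4th field
>     # keeps the whole title, spaces included
>     distnum, nvotes, rank, etc = line.split(None, 3)
>     etc = etc.strip()  # drop the trailing newline
>     # ... do stuff with the various fields ...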
> 
> I hope that gets you going. If not, return with what code you have, what 
> happened, and what you actually wanted to happen and we may help further.
Thanks, I was able to do it with the following:
https://github.com/alberanid/imdbpy/blob/master/bin/imdbpy2sql.py (it was very helpful)

python imdbpy2sql.py -d <.txt files downloaded from IMDB> -u sqlite:/where/to/save/db --sqlite-transactions
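
For anyone who only wants the titles without loading a database, the approach suggested in the quoted reply can be sketched as follows. This is a minimal sketch, not a tested tool: the names parse_ratings and SAMPLE are made up for illustration, and it assumes every data line carries the four whitespace-separated columns shown in the sample, with the whole title on the same line.

```python
import io

# A few lines copied from the sample in this thread (hypothetical stand-in
# for the real file; open the downloaded .txt file instead in practice).
SAMPLE = """\
New  Distribution  Votes  Rank  Title
      0000000125  1680661   9.2  The Shawshank Redemption (1994)
      0000000125  1149871   9.2  The Godfather (1972)
      0000000124  786433   9.0  The Godfather: Part II (1974)
"""


def parse_ratings(fp):
    """Yield (distribution, votes, rank, title) for each data line in fp."""
    # Iterating over the file object reads one line at a time,
    # so the whole file is never held in memory.
    for n, line in enumerate(fp):
        if n == 0:
            continue  # heading line, skip it
        # Split on runs of whitespace at most 3 times; the 4th field
        # is the rest of the line, i.e. the title with its spaces.
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue  # blank or malformed line, skip it
        distribution, votes, rank, title = fields
        yield distribution, int(votes), float(rank), title.strip()


rows = list(parse_ratings(io.StringIO(SAMPLE)))
titles = [title for _, _, _, title in rows]
print(titles)
```

The distribution field is kept as a string so that its leading zeros are not lost. Lines that do not split into four fields are skipped silently here; raising an error instead may be the better choice for a real import.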
