Python text file fetch specific part of line

Cameron Simpson <cs at zip.com.au>
Fri Jul 29 22:46:55 EDT 2016


On 29Jul2016 18:42, Gordon Levi <gordon at address.invalid> wrote:
>cs at zip.com.au wrote:
>
>>On 28Jul2016 19:28, Gordon Levi <gordon at address.invalid> wrote:
>>>Arshpreet Singh <arsh840 at gmail.com> wrote:
>>>>I am writing an IMDb scraper and getting the available list of titles from
>>>>the IMDb website, which provides a text file in a very raw format. Here is
>>>>one part of the file (http://pastebin.com/fpMgBAjc). The file provides
>>>>headings like Distribution, Votes, Rank, Title, and I want to parse the
>>>>title names. I tried the readlines() method, but it only returns a list
>>>>of quite heterogeneous lines. Is it possible to parse each value that
>>>>comes under the Title heading?
>>>
>>>Beautiful Soup will make your task much easier
>>><https://www.crummy.com/software/BeautifulSoup/>.
>>
>>Did you look at his sample data?
>
>No. I read he was "writing an IMDB scraper, and getting the available
>list of titles from the IMDB web site". It's here
><http://www.imdb.com/>.
>
>> Plain text, not HTML or XML. Beautiful Soup is
>>not what he needs here.
>
>Fortunately the OP told us his application rather than just telling us
>his current problem. His life would be much easier if he ignored the
>plain text he has obtained so far and started again using a Beautiful
>Soup tutorial.

Or bypass IMDB's computer unfriendliness and go straight to
<http://omdbapi.com/>.

You can get JSON directly from it and avoid BS (Beautiful Soup) entirely. BS
is an amazing library, but it is essentially a workaround for
computer-hostile websites: those that provide no clean machine-readable data,
only unstable, mutable HTML output.
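
As a rough illustration, a title lookup against OMDb might look something like
the sketch below. It is untested against the live service and rests on a few
assumptions: the "t" (title), "r" (reply format) and "apikey" query parameters
and the "Response" field in the reply reflect my understanding of the OMDb API
and should be checked against its documentation; the service may require an
API key; and the function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as given in this thread.
OMDB_URL = "http://omdbapi.com/"


def build_query_url(title, api_key=None):
    """Return a lookup URL for a title (parameter names are assumptions)."""
    params = {"t": title, "r": "json"}
    if api_key is not None:
        # Assumption: the service accepts the key as an "apikey" parameter.
        params["apikey"] = api_key
    return OMDB_URL + "?" + urlencode(params)


def parse_reply(text):
    """Decode a JSON reply; return None when the lookup did not succeed."""
    data = json.loads(text)
    # Assumption: replies carry a "Response" field set to the string "True".
    if data.get("Response") != "True":
        return None
    return data


def fetch_title(title, api_key=None):
    """Fetch one title record as a dict, or None if it was not found."""
    with urlopen(build_query_url(title, api_key)) as response:
        return parse_reply(response.read().decode("utf-8"))
```

Calling fetch_title("Blade Runner") would then hand back a plain dict, so
fields such as the title can be read by key with no HTML parsing at all.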

Cheers,
Cameron Simpson <cs at zip.com.au>


