[Tutor] Question about python code that is not working

Mon Jun 19 18:39:14 EDT 2023

Apologies if it feels as if I keep telling you what to do: please answer 
to the list so that (a) others can jump-in and assist you, and (b) if 
anyone else is suffering a similar problem, they can gain (almost) as 
much as you - also a reason for selecting a meaningful subject-line for 
email messages!

Now, let's talk:-

On 20/06/2023 09.56, Arthur Kolbe wrote:
> Hey again! I've been working on my code for some more, many things that 
> needed to be improved. This is the code as of right now:
...
> 
> What I want is this:
> 
> Software
> 
> Enter websites

The way you have broken-down the problem into smaller sub-problems (and 
those into smaller ...) is good analysis and design!

When ready to code, what I do is take that narrative specification and 
turn it into Python function-names and docstrings. This is workable if 
the sub-problems have been broken-down sufficiently - and works on the 
grounds that each sub-problem will require one (or more) functions in 
which to code its solution.
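
As a minimal sketch of that first step (every function-name here is 
hypothetical, taken from the wording of your spec and not from your 
code), the skeleton might start out like this:

```python
# Hypothetical names: they mirror the steps of the narrative spec only.
# Each stub raises NotImplementedError until its sub-problem is solved.


def enter_websites():
    """Return the list of website URLs to be checked."""
    raise NotImplementedError


def crawling_allowed(url):
    """Return True if the site permits crawling of url."""
    raise NotImplementedError


def crawl_website(url):
    """Visit every page of the site; return {page_url: status_code}."""
    raise NotImplementedError


def write_report(results, file_name):
    """Write the site summary and the page statuses to file_name."""
    raise NotImplementedError
```

Each docstring is one line of the spec, restated as a promise which the 
function must (eventually) keep.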

In this fashion, the top-down design becomes a bottom-up construction. 
As the sub-problems are solved, the slightly-larger sub-problems can be 
addressed - often a matter of ensuring that the sub-problem solutions 
"integrate" correctly (hence the term: "integration testing") - and so 
on, until it's 'all done'. Ah, would that life were so easy...

Thus, starting from those function-names and docstring solution-methods, 
I code those sub-problem solutions, one at a time. This means that I can 
also* build a (set of) test(s) to ensure that I've got that (little) bit 
correct - it's so much easier to see where things have gone-wrong if 
there is only one function in-play!

* I (try to - but am human/lazy/often trying to work quickly) use a 
technique called "TDD" (Test-Driven Development) which suggests that one 
should use the spec to write the test *first*, and then write code which 
will *deliver* to spec.
[however, that is possibly a distraction at this moment. So, tuck it 
behind your ear, and come back to it when you're inclined]

Thus, if this sub-problem: "enter websites", is built as a 
self-contained function, can the function be given a URL as argument, 
and respond with the page-header and/or content? There we go - first 
test written, and (making an assumption) first sub-problem solved!

Get the idea?
(see also "Modular Programming")
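
To illustrate, here is one way such a function and its first test might 
look. It is a sketch only: fetch_page, FakeResponse and the "opener" 
parameter are names invented for this example, and I have assumed the 
standard library's urllib.request instead of whatever your code uses. 
Passing-in the opener means the test can run without touching the 
network:

```python
import urllib.request


def fetch_page(url, opener=urllib.request.urlopen):
    """Return (status_code, body_text) for url.

    opener is injectable so that a test can supply a stand-in.
    """
    with opener(url) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


class FakeResponse:
    """Stand-in for the object urlopen returns; used only by the test."""

    status = 200

    def read(self):
        return b"<html><title>Hello</title></html>"

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def test_fetch_page():
    status, body = fetch_page(
        "https://example.com", opener=lambda url: FakeResponse()
    )
    assert status == 200
    assert "<title>Hello</title>" in body
```

NB with the real urlopen, a 404 or 410 arrives as an exception 
(urllib.error.HTTPError) and not as a return-value, so the crawler 
proper would need to catch that and read the exception's .code.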

> Software checks website if crawling forbidden or not

Good practice!
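
In case it is useful: the PSL has urllib.robotparser for exactly this. 
A small sketch follows, assuming the robots.txt lines have already been 
downloaded (crawling_allowed and the sample rules are made-up for the 
illustration):

```python
from urllib.robotparser import RobotFileParser


def crawling_allowed(robots_lines, page_url, user_agent="*"):
    """Return True if the given robots.txt rules permit page_url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of text lines
    return parser.can_fetch(user_agent, page_url)


# Sample rules, invented for this example.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
print(crawling_allowed(rules, "https://example.com/index.html"))
print(crawling_allowed(rules, "https://example.com/private/data.html"))
```

The first call prints True and the second False. Against a live site, 
the parser's set_url() and read() methods will fetch the file for you.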

> If allowed, crawls every page on website, looks for 404/410 pages that 
> were once present on the website (status code 200)

There are command-line tools which do (some of) this: wget can crawl a 
site recursively, whereas curl fetches individual URLs. wget has options 
(--wait, --random-wait) to create/vary a pause between making requests 
of a site - to avoid 'hammering' the server. May be worth a perusal...
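
The same courtesy is easy enough in Python. A sketch, in which 
crawl_politely is a hypothetical name and "fetch" stands for whichever 
function returns the status-code for one URL:

```python
import time


def crawl_politely(urls, fetch, pause_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests.

    sleep is a parameter so that a test can replace the real delay.
    """
    statuses = {}
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(pause_seconds)
        statuses[url] = fetch(url)
    return statuses
```

Because the delay is passed-in, a test can substitute a function which 
merely records the pauses, and so completes instantly.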

> Creates CSV.
> 
>   Two tables. Both two columns. Table 1 Left C: All websites one 
> entered, Right C: EITHER "404 pages found on website", "no 404 pages 
> found on website", "Scraping not allowed" or "Website can't be reached". 
> Remember, the entire websites are supposed to be crawled for 404 pages. 
> So in the second table in the left column all pages that were found. 
> right column status code. This second table in the csv is so that I can 
> Make sure the program did or didnt find 404 pages.

Business folk can't seem to get enough of spreadsheets (although this is 
a .CSV file, as opposed to a spreadsheet built directly with openpyxl or 
some-such).

Whereas a web-site verifier/monitor like this, and Python program[me]s 
which run 'in the background', are often better-off tracking progress 
and results in a "log". There is even a logging library in the PSL 
(Python Standard Library)!
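
By way of illustration, a minimal sketch using that library. The 
function-names, the logger-name, and the choice of WARNING for missing 
pages are all my assumptions, not requirements:

```python
import logging


def make_logger(log_path):
    """Return a logger which appends timestamped lines to log_path."""
    logger = logging.getLogger("crawler")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate lines on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def record_status(logger, url, status):
    """Log pages which have gone as warnings, the rest as info."""
    if status in (404, 410):
        logger.warning("%s returned %s", url, status)
    else:
        logger.info("%s returned %s", url, status)
```

Every page from every web-site then lands in the one file, in the order 
visited, and a search for "WARNING" finds the missing pages.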

> and what is happening right now is this:
> the code when running, creates one file with all pages it finds to the 
> first website I enter, then when done with crawling that website, 
> creates another file with the same name in the same directory, 

Oops!

This problem wouldn't happen if a single log-file were being employed - 
similarly a single workbook (although the same issue would still arise 
if each web-site were delivered as a separate, same-named work-sheet).

The "websites" list of URLs to be inspected needs to be accompanied by a 
'destination' file-name (for this purpose). Alternatively, if you can 
guarantee unique naming, perhaps use urllib.parse to split each URL into 
components and use the web-site (netloc) - with or without TLD - as the 
file-name?
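
A sketch of the second idea (report_name, the ".csv" suffix, and the 
"unknown-site" fall-back are inventions for the example; the TLD is left 
in place):

```python
from urllib.parse import urlsplit


def report_name(url, suffix=".csv"):
    """Build a file-name from the web-site (netloc) part of url."""
    netloc = urlsplit(url).netloc or "unknown-site"
    # A port's colon is not welcome in file-names on every OS.
    return netloc.replace(":", "_") + suffix
```

Thus report_name("https://www.example.com/about") gives 
"www.example.com.csv", and two different web-sites can no longer 
overwrite each other's file.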

> overwriting the old one, where all the pages of the second website I 
> entered are, then the third, and when its done it creates one last file 
> where all the websites that were entered are shown in a table in the 
> left column but for some reason it says "website cant be reached" for 
> them in the right column although like I just said all the pages of the 
> websites were found in the files created before. First two files are 
> just the second table so to say but only for one website, the last file 
> is only the first table so to say but not correct.

Re-read this and note how it is difficult to locate exactly where the 
error starts - because it is only revealed at the reporting stage.

> Thanks for helping in advance!

The 'help' is non-specific. Should you decide to change the delivery 
method or implement the naming-scheme, maybe the problem is solved.

However, better practice will help. If smaller units are tested 
in-isolation, the problem will (likely) be revealed sooner, ie calling a 
function and *not* gaining the result desired (tested for). If you are 
able to narrow-down the code the way you have narrowed-down the 
problem-description, I think you will have it beaten (and the next bug, 
and those in the next program[me]...)
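
For example, the right-hand column of your first table could come from 
one small function, tested on its own. This is only a sketch: 
summarise_site is a hypothetical name, I am assuming the crawl hands 
back a dict of page-URL to status-code (empty if nothing could be 
fetched), and the "Scraping not allowed" case is left out for brevity. I 
can't see what causes the wrong message in your code - the point is that 
a test like this would say whether the fault lies here or earlier:

```python
def summarise_site(page_statuses):
    """Return the first-table message for one web-site.

    page_statuses maps page URL -> HTTP status code; an empty dict is
    taken to mean that no page could be fetched at all.
    """
    if not page_statuses:
        return "Website can't be reached"
    if any(status in (404, 410) for status in page_statuses.values()):
        return "404 pages found on website"
    return "no 404 pages found on website"


def test_summarise_site():
    assert summarise_site({}) == "Website can't be reached"
    assert summarise_site({"/": 200}) == "no 404 pages found on website"
    assert summarise_site({"/": 200, "/old": 404}) == "404 pages found on website"
```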

-- 
Regards,
=dn