Making search on the other site and getting data and writing in xml

Tue Sep 26 01:31:32 EDT 2006

On Mon, 25 Sep 2006 13:51:55 +0200, Fredrik Lundh wrote:

>     http://www.google.com/terms_of_service.html
> 
>     "You may not send automated queries of any sort to Google's system without express
>     permission in advance from Google."

I'm not just being a pedantic weasel here, but what's an automated query?
Google's ToS is a legal document (maybe), and if both parties don't agree
on the meanings of terms, well, then it is a lousy legal document and a
recipe for trouble.

Google don't define "automated query", and I don't think they can. In
fact, the closest they come to defining it is to list three things they
want to prevent, NONE of which have anything to do with the distinction
between automated and non-automated.

(What on earth is "meta-searching"? If you're going to use terms which
don't have a commonly understood meaning, define what they mean.)

If I want to search for "foo", and I type "foo" into the Firefox search
box, is that an automated query?

What if I type "gg: foo" into Konqueror's address bar, which expands to
"http://www.google.com/search?q=foo"? Is it okay if I type the URL by hand
myself?
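
To make the point concrete: a shortcut expansion like that is nothing more
than string substitution. Here is a minimal Python sketch of the idea. The
helper name expand_shortcut and the SHORTCUTS table are my own hypothetical
names, not Konqueror's actual implementation; only the "gg" shortcut and its
URL come from the example above.

```python
from urllib.parse import quote_plus

# Hypothetical shortcut table; only the "gg" entry is taken from the post.
SHORTCUTS = {"gg": "http://www.google.com/search?q={}"}

def expand_shortcut(text):
    """Expand 'gg: foo' into a search URL; return any other input unchanged."""
    prefix, sep, query = text.partition(":")
    template = SHORTCUTS.get(prefix.strip())
    if not sep or template is None:
        # No colon, or an unknown prefix (e.g. "http"): leave the text alone.
        return text
    # URL-encode the query so spaces and punctuation survive the trip.
    return template.format(quote_plus(query.strip()))

print(expand_shortcut("gg: foo"))
```

Nothing is sent anywhere here; the code only builds the same URL I could
have typed by hand.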

Can I use the browser to save the search page to a local HTML file? If
Google says no, how can they possibly hope to stop me?

What if I type this command into my shell?

elinks --dump "http://www.google.com/search?q=foo" > output.html

What if I type

wget "http://www.google.com/search?q=foo"

into the shell? Surely that's no more automated than typing "foo"
into Google's search box. (wget doesn't in fact work, as Google recognises
its user-agent string and blocks it, EVEN in cases where I am using wget
manually. What, can't Google themselves tell the difference between
automatic and non-automatic searching?)

Where is the line I must not cross?

The thing is, Google doesn't want people "reselling" their services, and I
respect Google's intention. But trying to draw a distinction between
"automated" and "non-automated" requests is difficult if not impossible,
as can be seen by the heavy-handed way Google blocks the manual use of
wget. I don't condone the gross abuse of Google's service, but I don't
think an artificial distinction between automated and non-automated is a
useful way to go about it.

Of course, what I think isn't important. If Google wants to write legal
contracts that won't stand up in court (speaking as somebody who isn't a
lawyer and whose legal advice is worthless), they can. But the point is, I
see no ethical or legal reason why a user can't create a script which is
called MANUALLY by the user and does what a browser does, namely send and
receive data from websites (which may or may not include Google). 
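
A rough sketch of the kind of script I mean, assuming Python 3's standard
urllib.request (the function name fetch is hypothetical, and this is an
illustration rather than anything the Original Poster wrote). It makes one
request, once, when the user runs it by hand with a URL of their choosing.

```python
import sys
from urllib.request import urlopen

def fetch(url):
    """Send a single request and return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()

if __name__ == "__main__" and len(sys.argv) > 1:
    # Runs only when invoked manually with a URL argument.
    sys.stdout.buffer.write(fetch(sys.argv[1]))
```

Whether a given site's terms permit even this is, of course, exactly the
question being argued.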

And that, it seems to me, is what the Original Poster wanted.

-- 
Steven D'Aprano