Hacker News

I worked on a web crawler once. You've got no idea how annoying these kinds of websites are to detect.

http://en.wikipedia.org/wiki/Spider_trap

Although I'm sure they've got some smart people working on it at Google.



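You detect them the way Unix detects a symlink loop: it punts on the problem and just errors out (ELOOP) after a fixed number of symlink traversals (POSIX only requires that at least 8 be allowed; the actual limit varies by system).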

The equivalent is a crawl depth limit, which could be a hard limit (dumb), a function of PageRank (smart), of the trustworthiness of the inbound link (smarter), or of the data quality & diversity of the traversed pages (best).
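A rough sketch of that idea, in Python. Everything here is made up for illustration: `depth_budget`, its weighting, and the toy `spider_trap` site are assumptions, not how Googlebot or any real crawler works. It only shows that a breadth-first crawl with a depth budget terminates on an infinite site, and that the budget can grow with trust and quality signals.

```python
from collections import deque
from typing import Callable, List, Set


def depth_budget(base: int, inbound_trust: float, quality: float) -> int:
    """Hypothetical rule: trusted, high-quality sites may be crawled deeper.

    Both signals are assumed to lie in [0, 1], so the budget ranges
    from `base` up to 3 * `base`.
    """
    return base + round(base * inbound_trust) + round(base * quality)


def crawl(start: str, links: Callable[[str], List[str]], budget: int) -> Set[str]:
    """Breadth-first crawl that never expands a page at depth >= budget."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= budget:
            continue  # depth limit reached: record the page, follow no links
        for nxt in links(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def spider_trap(url: str) -> List[str]:
    """Toy infinite site: every page links to two pages never seen before."""
    return [url + "0", url + "1"]


low = crawl("/", spider_trap, depth_budget(2, inbound_trust=0.0, quality=0.0))
high = crawl("/", spider_trap, depth_budget(2, inbound_trust=1.0, quality=1.0))
print(len(low), len(high))
```

With a budget of d the toy trap yields 2^(d+1) - 1 pages, so the crawl always stops; the interesting part in practice is how the budget gets computed, which this sketch just hand-waves.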

There are very good reasons why Googlebot seems to hit you from one IP at a time -- it's a long-running thread that is making all sorts of decisions about your site as it crawls.


Google has indexed 38 pages so far: http://www.google.com.au/search?q=site%3Aianab.com%2Ftrillio...

Tackling the data quality & diversity of the traversed pages (best):

Producing English text from the 40 bits, by driving a generative grammar or a Markov/travesty generator, would make it harder for Google to detect that the pages are auto-generated. Google is unlikely to infer the function f(URL) -> text (or even to attempt it), but it would still limit the recursion for the other reasons you mention.
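For the curious, a minimal sketch of what such an f(URL) -> text could look like, in Python. The tiny `CORPUS`, the function names, and the choice to take 40 bits from a SHA-256 hash of the URL are all assumptions for illustration; the site under discussion presumably carries its 40 bits in the URL itself.

```python
import hashlib
import random
from collections import defaultdict
from typing import Dict, List

# Tiny stand-in corpus; a real travesty generator would train on real prose.
CORPUS = (
    "the crawler follows every link and the link leads to a page "
    "and the page holds a link that the crawler follows again"
).split()


def build_chain(words: List[str]) -> Dict[str, List[str]]:
    """First-order Markov chain: word -> list of words that followed it."""
    chain: Dict[str, List[str]] = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def text_for_url(url: str, n_words: int = 20) -> str:
    """Deterministically map a URL to pseudo-English text."""
    # Take 5 bytes (40 bits) of the URL's hash as the generator seed.
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:5], "big"))
    chain = build_chain(CORPUS)
    word = rng.choice(CORPUS)
    out = [word]
    while len(out) < n_words:
        followers = chain.get(word)
        # Restart from a random word if the current one has no follower.
        word = rng.choice(followers) if followers else rng.choice(CORPUS)
        out.append(word)
    return " ".join(out)


print(text_for_url("/page/0000000001"))
```

The same URL always gives the same text, so the pages look stable across visits, yet nothing is stored anywhere.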

(guessing) sites like hackernews are indexed primarily by recursion (few direct inbound links to specific stories).


> (guessing) sites like hackernews are indexed primarily by recursion (few direct inbound links to specific stories).

Correct. Notice that it is difficult to find old HN comments on Google, since after a while there are no short paths from the home page to them. In practice, all else being equal (quality, length, spamminess, speed, age, uniqueness, PageRank, etc.), a page can afford to be at most about 6 or 7 links deep.
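To make "depth" concrete, here is a small Python sketch that measures the shortest click path from the home page with a breadth-first search and flags pages beyond a cutoff. The toy `site` graph (ten listing pages chained by a "next" link, each linking one story) and the cutoff of 7 are assumptions for illustration, not a model of how HN is actually laid out.

```python
from collections import deque
from typing import Dict, List


def click_depths(home: str, links: Dict[str, List[str]]) -> Dict[str, int]:
    """Shortest number of clicks from `home` to every reachable page (BFS)."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in depths:
                depths[nxt] = depths[page] + 1
                queue.append(nxt)
    return depths


def too_deep(depths: Dict[str, int], max_depth: int = 7) -> List[str]:
    """Pages whose shortest path from the home page exceeds `max_depth`."""
    return sorted(page for page, d in depths.items() if d > max_depth)


# Hypothetical site: page1 -> page2 -> ... -> page10, each pageN links storyN.
site = {f"page{i}": [f"page{i + 1}", f"story{i}"] for i in range(1, 10)}
depths = click_depths("page1", site)
print(too_deep(depths))
```

As stories slide onto later listing pages their depth grows by one per page, which is the effect described above: nothing about the story changed, only its distance from the home page.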


http://drunkmenworkhere.org/ did an interesting study of how some search engines handled infinite sites.



