The simplest route to that for now would be to search the titles via Algolia, sorting by popularity. I could, I suppose, gin up the URLs for that, though as I've already noted, I think I've spammed this particular thread enough with long-list comments. (HN prizes intellectual curiosity, and whilst a few tables might meet that criterion, I think I'm pushing the limits.)
The difference between my data & analysis and Algolia is that Algolia doesn't itself report on either front-page-specific items or on stories which have been repeated. But given a list of front-page stories, or repeated-front-page stories, you can search Algolia ... to surface all instances of those stories. The front page will in general have 30 stories.
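For the curious, that lookup is a one-liner against the public HN Algolia API. A minimal sketch, assuming curl and jq are available; the example title is mine, not from my data:

```sh
# Surface every submission of a given title via the HN Algolia API.
# The /search endpoint ranks by relevance, then points (roughly
# "popularity"); use /search_by_date instead for chronological order.
title="Reflections on Trusting Trust"
curl -s "https://hn.algolia.com/api/v1/search?tags=story&query=$(jq -rn --arg t "$title" '$t|@uri')" |
  jq -r '.hits[] | [.created_at, .points, .objectID, .title] | @tsv'
```

Each objectID maps back to https://news.ycombinator.com/item?id=<objectID>, so this also recovers the HN post URLs I mention below.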
If you're suggesting I list the URLs themselves directly ... as I'm working with the archive data, I don't have those readily available.
My current workflow is, roughly:
- crawl (or update) the front-page archive
- re-render the captured HTML as plain text, using w3m's `-dump` flag
- parse that text into a tagged multi-line-per-record format with the raw title line, parsed title, date (and several sub-element parsings of that), site (as reported by HN), points, submitter, comments, and (an artefact of the original question I'd sought to answer) any US cities or states mentioned
- create various reports and abstracts based on that: "hn-titles", "date summary" (mostly the parsed data arranged on one line for easier awk processing), cities (US and "globally significant"), and US-states reports, etc. (A sketch of the first three steps follows this list.)
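A minimal sketch of the first three steps; the archive URL is HN's real /front endpoint, but the awk patterns are illustrative stand-ins for my actual parser:

```sh
# 1. Fetch one day's front page from HN's archive.
curl -s 'https://news.ycombinator.com/front?day=2020-06-01' > front.html

# 2. Re-render the captured HTML as plain text.
w3m -dump front.html > front.txt

# 3. Parse into a tagged multi-line-per-record format. These
#    patterns are approximations; the real layout needs more care.
awk '
    /^ *[0-9]+\. /  { print "title:", $0 }
    / points by /   {
        print "points:", $1
        for (i = 1; i <= NF; i++)
            if ($i == "by") { print "submitter:", $(i + 1) }
    }
' front.txt > records.txt
```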
Conspicuously absent from the parsing (3rd step) are both the full article URL and the HN post URL.
I've got those in the raw HTML, but I'd need to go back and parse the original, which up until now has been Too Much Work given what I can do with what I have now.
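Were I to bite that off, a rough first pass might look like this; the patterns are assumptions about HN's markup (which has changed over the years), so older snapshots would need adjustment:

```sh
# HN post URLs: every story row carries an item?id=NNN link
# (the comments link at minimum), so harvest and dedupe those.
grep -oE 'item\?id=[0-9]+' front.html | sort -u |
  sed 's|^|https://news.ycombinator.com/|'

# Article URLs: absolute http(s) hrefs, minus HN's own links.
# Matches either quote style, since the markup has varied.
grep -oE "href=[\"']https?://[^\"']+" front.html |
  sed "s/^href=[\"']//" |
  grep -v 'ycombinator\.com'
```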
And if you're wondering how many votes are required to make the front page, here's a summary, by year, of the univariate stats of votes for the 30th story per page, that is, the lowest-ranked:
Note that the first three years were pretty low (min = 1, mean = 2.48, for 2007), but going back 5 years, a story with > 23 points could have made the front page. There are also a few days with < 30 stories, all occurring in 2007, if memory serves.
(This is my first time seeing these particular stats; they're another analysis I'd been thinking of doing for a while. I had calculated the delta between the 1st- and 30th-ranked stories some time back, as well as the variance by day of week, which is also fairly significant.)
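For anyone wanting to reproduce the per-year summary: given one-line-per-day records, it falls out of a short awk pass. The input layout here ("YYYY-MM-DD points" for the 30th-ranked story) is an assumed stand-in for my date-summary format:

```sh
# Per-year min / mean / max of the front-page vote threshold.
awk '{
    year = substr($1, 1, 4)
    n[year]++; sum[year] += $2
    if (!(year in min) || $2 < min[year]) min[year] = $2
    if ($2 > max[year]) max[year] = $2
}
END {
    for (y in n)
        printf "%s  min=%d  mean=%.2f  max=%d  days=%d\n",
               y, min[y], sum[y] / n[y], max[y], n[y]
}' rank30.txt | sort
```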