
That would involve Real Work.

The simplest route to that for now would be to search the titles via Algolia, sorting by popularity. I could, I suppose, gin up the URLs for that, though as I've already noted, I think I've spammed this particular thread enough with long-list comments. (HN prizes intellectual curiosity, and whilst a few tables might meet that criterion, I think I'm pushing the limits.)
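As a rough sketch of what ginning up those URLs could look like, the Python below builds one Algolia search link per story title, sorted by popularity. The query parameters (`query`, `type`, `sort`, `dateRange`) are the ones the hn.algolia.com search page appears to use in its own URLs; treat them as an assumption. The function name and the sample title are purely illustrative.

```python
from urllib.parse import urlencode

# Assumption: the hn.algolia.com search page accepts these query parameters.
ALGOLIA_UI = "https://hn.algolia.com/"


def algolia_title_search_url(title: str) -> str:
    """Build a search URL for one story title, sorted by popularity."""
    params = {
        "query": title,
        "type": "story",
        "sort": "byPopularity",
        "dateRange": "all",
    }
    # urlencode() percent-escapes the title and joins the parameters with "&".
    return ALGOLIA_UI + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical title; a real run would loop over the front-page title list.
    for title in ["Example story title"]:
        print(algolia_title_search_url(title))
```

Feeding it a list of repeated-front-page titles would give one link per story, each surfacing all submissions that match that title.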

The difference between my data & analysis and Algolia is that Algolia doesn't itself report on either front-page-specific items, or on stories which have been repeated. But given a list of front-page stories, or repeated-front-page stories, you can search Algolia ... to surface all instances of those stories. The front page will in general have

If you're suggesting I list the URLs themselves directly ... as I'm working with the archive data, I don't have that readily available.

My current workflow is, roughly:

- crawl (or update) the front-page archive

- rerender the captured HTML as plain text, using w3m's `-dump` flag

- parse that text into a tagged multi-line-per-record format with the raw title line, parsed title, date (and several sub-element parsings of that), site (as reported by HN), points, submitter, comments, and (an artefact of the original question I'd sought to answer) any US cities or states mentioned

- create various reports and abstracts based off of that: "hn-titles", "date summary" (mostly the parsed data arranged on one line for easier awk processing), cities (US and "globally significant") and US states reports, etc.
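The parsing step in the list above might be sketched roughly as follows. This is not the actual script: it assumes the `w3m -dump` output puts each story on a numbered title line followed by a "points by" line, which is a guess at the layout, and the regexes, field names, and sample text are all hypothetical. The city and state matching is left out.

```python
import re

# Assumed layout of a rendered story (two lines per record):
#   "  1. Some title (example.com)"
#   "     120 points by alice 3 hours ago | 45 comments"
TITLE_RE = re.compile(r"^\s*(\d+)\.\s+(.*?)(?:\s+\(([^()\s]+\.[^()\s]+)\))?\s*$")
META_RE = re.compile(r"^\s*(\d+)\s+points?\s+by\s+(\S+).*?(?:(\d+)\s+comments?)?\s*$")


def parse_front_page(text, date):
    """Turn one day's dumped front page into a list of record dicts."""
    records = []
    current = None
    for line in text.splitlines():
        m = TITLE_RE.match(line)
        if m:
            current = {
                "date": date,
                "raw_title": line.strip(),
                "rank": int(m.group(1)),
                "title": m.group(2),
                "site": m.group(3),  # None for text posts with no site
                "points": None,
                "submitter": None,
                "comments": None,
            }
            records.append(current)
            continue
        m = META_RE.match(line)
        if m and current is not None:
            current["points"] = int(m.group(1))
            current["submitter"] = m.group(2)
            # A line ending in "discuss" has no comment count; treat it as 0.
            current["comments"] = int(m.group(3) or 0)
    return records


# Hypothetical sample in the assumed layout.
SAMPLE = """\
  1. Example story one (example.com)
     120 points by alice 3 hours ago | 45 comments
  2. Ask HN: Example question?
     1 point by bob 1 hour ago | discuss
"""

if __name__ == "__main__":
    for rec in parse_front_page(SAMPLE, "2023-06-20"):
        print(rec)
```

Keeping the raw title line alongside the parsed fields mirrors the record format described above, and makes it possible to re-parse later without going back to the HTML.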

Conspicuously absent in the parsing (3rd step) are both the full article URL, and the HN post URL.

I've got those in the raw HTML, but I'd need to go through that and parse the original, which up until now has been Too Much Work given what I can do with what I have now.
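For what it's worth, pulling both URLs out of the raw HTML could look something like the sketch below, using only Python's standard `html.parser`. The markup it expects (a `tr` with class `athing` carrying the item id, and a `span` with class `titleline` wrapping the article link) is an assumption about how the pages are structured, and older archive pages may well differ. The class name `StoryLinkParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

HN_ITEM = "https://news.ycombinator.com/item?id="


class StoryLinkParser(HTMLParser):
    """Collect (post URL, article URL) pairs from one front-page HTML file."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._item_id = None
        self._in_titleline = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if tag == "tr" and "athing" in classes:
            # Assumed: the story row's id attribute is the HN item id.
            self._item_id = a.get("id")
        elif tag == "span" and "titleline" in classes:
            self._in_titleline = True
        elif tag == "a" and self._in_titleline and self._item_id:
            # The first link inside the title span is taken as the article;
            # the flag is cleared so the "(site)" link after it is skipped.
            self.stories.append({
                "post_url": HN_ITEM + self._item_id,
                "article_url": a.get("href"),
            })
            self._in_titleline = False


def extract_story_links(html):
    parser = StoryLinkParser()
    parser.feed(html)
    return parser.stories


# Hypothetical sample row in the assumed markup.
SAMPLE_HTML = """
<tr class="athing" id="101"><td class="title"><span class="titleline">
<a href="https://example.com/a">Example story</a>
<span class="sitebit comhead"> (<a href="from?site=example.com">example.com</a>)</span>
</span></td></tr>
"""

if __name__ == "__main__":
    for story in extract_story_links(SAMPLE_HTML):
        print(story["post_url"], story["article_url"])
```

For text posts the article link would presumably be a relative `item?id=...` href, so a real version would want to resolve those against the site root.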

And if you're wondering how many votes are required to make the front page, here's a summary, by year, of the univariate stats for votes for the 30th story per page, that is, the lowest-ranked:

  Year: 2007 (days: 300)
  n: 172, sum: 427, min: 1, max: 6, mean: 2.482558, median: 2, sd: 1.131546
  
  Year: 2008 (days: 366)
  n: 172, sum: 1442, min: 2, max: 16, mean: 8.383721, median: 8, sd: 3.181337
  
  Year: 2009 (days: 365)
  n: 172, sum: 3493, min: 7, max: 55, mean: 20.308140, median: 20, sd: 6.398405
  
  Year: 2010 (days: 365)
  n: 172, sum: 6582, min: 20, max: 59, mean: 38.267442, median: 39, sd: 8.689477
  
  Year: 2011 (days: 364)
  n: 172, sum: 10312, min: 28, max: 89, mean: 59.953488, median: 61, sd: 13.266858
  
  Year: 2012 (days: 366)
  n: 172, sum: 12492, min: 31, max: 150, mean: 72.627907, median: 74, sd: 17.430260
  
  Year: 2013 (days: 365)
  n: 172, sum: 14354, min: 44, max: 184, mean: 83.453488, median: 82, sd: 21.248547
  
  Year: 2014 (days: 363)
  n: 172, sum: 14513, min: 5, max: 131, mean: 84.377907, median: 85, sd: 19.878349
  
  Year: 2015 (days: 365)
  n: 172, sum: 14770, min: 19, max: 332, mean: 85.872093, median: 70, sd: 47.614140
  
  Year: 2016 (days: 365)
  n: 172, sum: 19451, min: 29, max: 352, mean: 113.087209, median: 97, sd: 61.186786
  
  Year: 2017 (days: 365)
  n: 172, sum: 21843, min: 36, max: 588, mean: 126.994186, median: 103, sd: 76.123514
  
  Year: 2018 (days: 365)
  n: 172, sum: 23678, min: 27, max: 430, mean: 137.662791, median: 111, sd: 80.183359
  
  Year: 2019 (days: 365)
  n: 172, sum: 23138, min: 31, max: 491, mean: 134.523256, median: 116.5, sd: 76.338079
  
  Year: 2020 (days: 366)
  n: 172, sum: 25700, min: 23, max: 551, mean: 149.418605, median: 127.5, sd: 89.049755
  
  Year: 2021 (days: 365)
  n: 172, sum: 28075, min: 50, max: 507, mean: 163.226744, median: 134.5, sd: 90.869361
  
  Year: 2022 (days: 365)
  n: 172, sum: 27565, min: 40, max: 409, mean: 160.261628, median: 139.5, sd: 76.489698
  
  Year: 2023 (days: 172)
  n: 172, sum: 27805, min: 43, max: 616, mean: 161.656977, median: 129.5, sd: 97.531724

Note that the first three years were pretty low (min = 1, mean = 2.48, for 2007), but going back 5 years a story with > 23 points could have made the front page. There are also a few days with < 30 stories, all occurring in 2007 if memory serves.

(This is my first time seeing these particular stats, another analysis I'd been thinking of doing for a while. I had calculated the delta between 1st and 30th ranked stories going back some time. Also the variance by day of week, which is also fairly significant.)
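A per-year summary of the kind shown above could be computed along these lines. This is a minimal sketch, not the script that produced the table: it assumes the input is a list of (date, points) pairs for the 30th-ranked story on each day, and it uses the sample standard deviation, since the original doesn't say which form was used. The function name and sample rows are made up for illustration.

```python
import statistics
from collections import defaultdict


def yearly_summary(rows):
    """Group (ISO date, points) pairs by year and compute univariate stats."""
    by_year = defaultdict(list)
    for date, points in rows:
        by_year[date[:4]].append(points)  # "YYYY-MM-DD" -> "YYYY"

    summary = {}
    for year, pts in sorted(by_year.items()):
        summary[year] = {
            "n": len(pts),
            "sum": sum(pts),
            "min": min(pts),
            "max": max(pts),
            "mean": statistics.mean(pts),
            "median": statistics.median(pts),
            # Assumption: sample standard deviation; needs at least 2 values.
            "sd": statistics.stdev(pts) if len(pts) > 1 else 0.0,
        }
    return summary


if __name__ == "__main__":
    # Made-up rows, one per day: points of the lowest-ranked front-page story.
    rows = [
        ("2007-03-01", 1), ("2007-03-02", 3),
        ("2008-01-01", 8), ("2008-01-02", 10), ("2008-01-03", 6),
    ]
    for year, s in yearly_summary(rows).items():
        print(f"Year: {year}")
        print(", ".join(f"{k}: {v}" for k, v in s.items()))
```

Swapping the grouping key from the year to the day of week would give the day-of-week variance mentioned above with the same code.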


