Off topic but for years I've been using a one-off proxy to strip javascript and crap from my local newspaper site (sfgate.com). It just reads the site with python urllib.request and then does some DOM cleanup with beautiful soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.
Just in the past day or so, sfgate.com put in some kind of anti scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with selenium or maybe I'll just give up on newspapers and get my news from HN ;).
I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.
They are probably just checking headers such as User-Agent and cookies. I would copy whatever your normal browser sends and put it in the urllib.request headers. If that doesn't work, then the check is likely more sophisticated.
$ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: curl/7.54.1' | head -1
HTTP/2 403
$ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' | head -1
HTTP/2 200
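For the urllib.request side, a minimal sketch of the same idea might look like the following. The helper names `build_request` and `fetch` are made up for illustration, the User-Agent string is simply the one from the curl test above, and there is no guarantee the site keeps accepting it.

```python
import urllib.request

# User-Agent copied from the curl test above; any current browser UA
# could be substituted. Whether the site accepts it is an assumption.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) "
    "Gecko/20100101 Firefox/113.0"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url and return (status, body bytes). Makes a real network call."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.status, resp.read()
```

Calling `fetch('https://www.sfgate.com/')` would then hand back the HTML for the usual Beautiful Soup cleanup, assuming the User-Agent is the only thing being checked.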
One "trick" is that Firefox (and I assume Chrome?) lets you copy a request as a curl command - then you can just see if that works in the terminal, and if it does you can binary search for the required headers.
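A rough sketch of that header search in Python, in case it helps. `minimal_headers` and `fake_works` are hypothetical names, and the sketch assumes the server's check is monotone, meaning that sending extra headers never turns a working request into a failing one. It drops headers in halves first, then in smaller chunks, keeping only what the check still needs.

```python
def minimal_headers(headers, works):
    """Shrink headers to the subset that works() still accepts.

    works is a callable taking a dict of headers and returning True when
    the request succeeds. Assumes extra headers never cause a failure.
    """
    needed = dict(headers)
    chunk = max(len(needed) // 2, 1)
    while True:
        keys = list(needed)
        for start in range(0, len(keys), chunk):
            drop = set(keys[start:start + chunk])
            trial = {k: v for k, v in needed.items() if k not in drop}
            if works(trial):
                needed = trial  # that whole chunk was unnecessary
        if chunk == 1:
            return needed
        chunk //= 2


# Stand-in for a real check: pretend the server wants these two headers.
def fake_works(headers):
    return "User-Agent" in headers and "Accept" in headers


copied = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Accept-Language": "en-US",
    "Cookie": "a=b",
    "Referer": "https://example.com/",
    "Connection": "keep-alive",
}
required = minimal_headers(copied, fake_works)
```

In real use, `works` would send the request with the trial headers and report whether the response was a 200 rather than a 403.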
Sounds like an ADA lawsuit waiting to happen. I'd send the editor an email explaining how they've reduced usability of the site; especially if you're a paying customer.