Lexis especially makes even scraping public state laws and admin codes difficult, which is extra frustrating because they are the legal publisher of record in a number of states.
I've been considering trying to launch an OpenStates-style scraper project for US laws and admin codes, but haven't had the time to attack 100 more scrapers. Even with AI help, the volume is significant.
Who will maintain the git repo per state [1] [2]? There is value in a pipeline that continually ingests this data from various sources and pushes it into the Internet Archive, but if you wish to treat it as authoritative, it must have a human minding it, because of entropy and decay, and that kind of upkeep is not free: even the Python Software Foundation runs on a budget of ~$5M/year. Hence my openlaws.us example.
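The push-to-Archive step is honestly the easy part; here's a minimal sketch using the internetarchive package (the identifier, filename, and metadata are made up):

    # pip install internetarchive ; run `ia configure` once to store credentials.
    # The identifier, filename, and metadata below are invented for the example.
    from internetarchive import upload

    responses = upload(
        "nv-nrs-snapshot-2024-q1",        # made-up item identifier
        files=["nrs-2024-q1.zip"],        # the scraped snapshot you built
        metadata={
            "title": "Nevada Revised Statutes snapshot (example)",
            "mediatype": "texts",
            "collection": "opensource",   # the general community-uploads collection
        },
    )
    print([r.status_code for r in responses])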
If it was as easy as writing a scraper and dumping it all in a bucket or repo, it'd already be done. It's just the usual thankless grind of hard work over time.
I know a thing or two about that: 2,400 commits to the scrapers powering OpenStates over the past 9 years.
Even with OpenStates, we have an API but don't "just" dump the bills to git, for legacy nerd reasons.
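If you want the bill data, the API is the supported path. Roughly like this against the v3 API (the endpoint, parameters, and response shape here are from memory, so check docs.openstates.org and grab a free key first):

    # Rough sketch against the OpenStates v3 API; verify the endpoint and
    # parameters at docs.openstates.org before relying on this.
    import requests

    API_KEY = "YOUR_KEY_HERE"             # free keys are available on openstates.org

    resp = requests.get(
        "https://v3.openstates.org/bills",
        params={"jurisdiction": "Nevada", "per_page": 20},
        headers={"X-API-KEY": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    for bill in resp.json()["results"]:
        print(bill["identifier"], bill["title"])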
The nice thing about laws is that the host websites (or PDFs) don't change templates _that_ often, so generally you can rescrape quarterly (or in some states, annually) without a ton of maintenance. With administrative codes you need to scrape more often, but the websites are still pretty stable.
The downside is that the codes in particular are often big: a single scrape might need to make 20,000 or more requests, so you have to be very careful about rate limiting and proxies. Which goes back to my original point that it sucks that accessing this stuff is such a mess.
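Concretely, the throttling looks something like this (not our actual scraper code, just a sketch with a made-up URL and made-up numbers):

    # Sketch of polite scraping for a ~20k-request crawl: fixed delay, backoff on
    # throttling responses, and an honest User-Agent. URL and numbers are made up.
    import time
    import requests

    session = requests.Session()
    session.headers["User-Agent"] = "example-law-scraper/0.1 (contact: you@example.com)"

    def fetch(url, delay=2.0, retries=3):
        """GET with a fixed inter-request delay and backoff on 429/503."""
        for attempt in range(retries):
            time.sleep(delay)                    # ~2s/request => a 20k-page crawl takes ~11 hours
            resp = session.get(url, timeout=30)
            if resp.status_code in (429, 503):   # throttled or overloaded: back off and retry
                time.sleep(delay * 2 ** (attempt + 1))
                continue
            resp.raise_for_status()
            return resp.text
        raise RuntimeError(f"gave up on {url}")

    # Hypothetical usage:
    # html = fetch("https://example.state.us/admin-code/section-1-1.html")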
The assessor's office in my county provides data older than two weeks (IIRC) as a few SQL export dumps, because that's how things were done back in the day.
Current information is gated behind a Web 2.0 view of their live data with severe limits. It wasn't designed to be scraped and is in fact hostile to the attempt. I'd imagine their hosting costs are rising and will keep rising.
I should reach out to them and see what this looks like from their angle. The local commercial real estate community is pretty tech-savvy and I'm wondering if we could all be a bit more proactive around data access.
I'd love to hear your thoughts on county vs state vs national data! I'd be very interested in any bandwidth usage or processing requirement info you might have recorded.
Also, the states themselves can't keep their own laws up to date. In Nevada, the code (NRS) on Nevada's website is out of date. Very embarrassing, IMO, and it's hard to get anything to work because AI can't have a trusted source of data.
The answer is probably embedded within the concept of codification of acts. Legislatures pass acts, which are kind of like diffs for statutory law. But there is no base document, just a series of diffs from the beginning. Somewhere along the way, someone did a lot of work to "codify" the law, and when you go look up 18 USC 1001 and then click "next," you are taking advantage of the codification process.
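In programmer terms: the acts are the patch stream, and the codified code is the tree you get by replaying them in order. A toy sketch (the section numbers and text are invented):

    # Toy model of codification: acts are ordered "diffs" against statutory law,
    # and the codified code is what you get by replaying all of them in order.
    # There is no base document. Section numbers and text are invented.
    acts = [
        {"year": 1999, "changes": {"18-1001": "Whoever makes a false statement ..."}},
        {"year": 2005, "changes": {"18-1001": "Whoever knowingly makes a false statement ..."}},
        {"year": 2011, "changes": {"18-1002": "Records required by this chapter shall be ..."}},
    ]

    def codify(acts):
        """Replay every act in order; the result is the current codified text."""
        code = {}
        for act in acts:
            code.update(act["changes"])
        return code

    print(codify(acts)["18-1001"])   # the section as it reads after the 2005 act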
But the person who did the codification has some rights thereto, meaning that while NV can post every act that passed the legislature, they can’t publish someone else’s codification of the statutes.
This matters very little because everyone just has Westlaw and no one uses the state legislature’s website to cite statutes.
I would argue it does matter, because the public has to know all the law: ignorance of the law is no excuse. If Westlaw is limited or an unreasonable monetary burden on the populace (or possibly, depending on your argument, costs anything at all), then the argument for how you can prosecute someone kind of evaporates, because an essential part of law is that the party under criminal penalty must be put on clear notice of what is illegal.
IMO, this should also extend to opinions -- if there is precedent that guides what the law is, it needs to be published publicly, free of charge, so that the public is put on notice of what the law is. (Someone might mention that PACER is free in small quantities; I would counter that it would cost you a gazillion dollars to be fully informed of all the precedent that forms the full common-law meaning of the laws.) This is especially important for mala prohibita crimes, since there's no way to even guess at them through moral/ethical deduction.
> the argument for how you can prosecute someone kind of evaporates, because an essential part of law is that the party under criminal penalty must be put on clear notice of what is illegal.
I reckon that's why the Sixth Amendment exists, but if you want to make a free PACER, go for it.
The codification needs to become part of the process of passing acts. The government should be required to publish the updated code themselves along with any act that changed it. The whole concept that a commercial entity can have rights to the fully assembled text is terribly broken.
If anybody is worried about the jobs those businesses created, then tell them to pivot into publishing commented editions of the codes (add cross-references, references to relevant court decisions, etc.).
The codification happened hundreds of years ago, though.
But you could do it too! The Congressional Record is a thing, and it publishes all the acts of Congress, all the way back to the beginning.
The problem is that after you were done, the first thing someone would ask you is to cross-cite everything into the West Annotated code because no one else has your code and no one cares about it, because we all have Westlaw.
(Which publishes commented editions of the codes, with cross references, references to relevant court decisions, etc.)
It's all a little bit antiquated, but it works fine. Someday it will change. Back when I was a computer guy, I too thought it should work the way people are describing upthread, but it is what it is.
I would imagine you either start with the first acts of your legislature and codify it from the beginning, or you start with some version of the code you figure you have rights to and go from there. It seems like it would be insane to do that job halfway, but that's not my area of expertise.
I have no idea whatsoever what is going on in Nevada.