Hacker News

Even if successfully litigated, doesn't this just move the scraping activity to less-obvious means, and to better-funded scrapers less concerned with legalities, making LinkedIn's efforts to clamp down upon scraping even more difficult? Is there a business case for LinkedIn to monetize the scraping by selling access to an API instead?

Between botnets, mechanical turks, deep learning, data brokering, the lack of globally enforced privacy laws requiring disclosure of where personal data was sourced, etc., I can't see a way for LinkedIn to prevent others from scraping and profiting from their publicly- and user-accessible data. They'll drive it underground, but if the concern is preventing others from grabbing the data at all, rather than just managing the load it imposes, the data will still leak like a sieve.



LinkedIn gutted their API a couple of years ago, and the information they removed has since moved to a private API.

If your business is in the recruiter space, expect to have your API keys revoked and to receive a cease and desist letter as well.


This sounds like the setup for an escalating arms race with "dark scrapers". I'm guessing that, behind the scenes, LinkedIn figures they can outspend the extralegal scrapers, and likely expects their efforts to deliver halo effects for the rest of Microsoft. It would be educational to hear how LinkedIn plans to take down botnet-based scraping that uses deep learning to identify patterns that successfully mimic human users and bypass their bot detection; that could help other white hats who want to battle bots and malware in general.


I'm skeptical of the potential of dark scrapers at scale. You'd need to simulate too much human behavior to be unidentifiable, and humans are slow.

You would need real-looking bot accounts that you'll use to scrape. You'd need a realistically randomized rate limit, sampling from some distribution conditioned on the type of the source page. You'd need realistic mouse/keyboard movements. Realistic hours of operation: you can't be scraping at 4AM just as heavily as at 4PM and every hour in between. Occasional noise operations, such as searching for a job or getting salary estimates. You'd be geographically constrained: you wouldn't want your bot from Boston to be looking at too many individuals in Houston (regularly). Maybe you'd use a Markov chain to have the bot make decisions? I doubt the blackhats would have good training data for a neural net. You'd need tens of thousands of these bots to cover the LinkedIn user base in reasonable time (say, once a week on average), and these bots would have to coordinate so they neither overlap too much nor leave gaps in who they cover.
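The rate-limit and hours-of-operation pieces above can be sketched in a few lines. This is purely illustrative: the log-normal parameters per page type and the activity curve are invented stand-ins for values that would have to be fit to captured human browsing traces.

```python
import random

# Invented (mu, sigma) parameters for a log-normal distribution of
# inter-request dwell times in seconds, per page type. Real values
# would need to be estimated from actual human sessions.
DELAY_PARAMS = {
    "profile": (2.5, 0.6),   # longer dwell on a profile page
    "search":  (1.8, 0.5),   # quicker scan of a results page
    "feed":    (3.0, 0.8),   # long, noisy dwell on the feed
}

def next_delay(page_type, rng=random):
    """Sample a human-plausible dwell time for the given page type."""
    mu, sigma = DELAY_PARAMS[page_type]
    return rng.lognormvariate(mu, sigma)

def may_act(hour, rng=random):
    """Gate activity to plausible local hours: silent overnight,
    ramping toward a rough mid-day peak, tapering in the evening."""
    if hour < 7 or hour >= 23:
        return False
    peak, width = 13, 10  # invented triangular activity curve
    p = max(0.0, 1.0 - abs(hour - peak) / width)
    return rng.random() < p

rng = random.Random(42)
delays = [next_delay("profile", rng) for _ in range(5)]
```

The same gate would presumably run in the bot's own local time zone, which also handles the geographic-constraint point: a "Boston" bot simply inherits Boston's clock.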

The best use case would be a scraper API that lets you look up batches of specific people, with your bots visiting others only to look realistic.
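A minimal sketch of that idea: pad each batch of requested targets with random decoy visits so a session's browsing history isn't composed solely of customers' lookups. The decoy ratio and names here are invented for illustration.

```python
import random

def build_session_plan(targets, decoy_pool, decoy_ratio=4, rng=random):
    """Interleave requested target profiles with random decoy visits.

    decoy_ratio (invented here) is the number of decoy profiles
    visited per paying-customer target, to dilute the signal.
    """
    plan = list(targets)
    n_decoys = min(len(decoy_pool), decoy_ratio * len(targets))
    plan += rng.sample(decoy_pool, n_decoys)
    rng.shuffle(plan)  # don't visit all targets back-to-back
    return plan

plan = build_session_plan(["alice", "bob"],
                          [f"user{i}" for i in range(100)],
                          rng=random.Random(3))
```

A real operator would presumably also bias the decoy pool toward the same geography and industry as the targets, for the reasons given above.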

(Or maybe not? It's a fun question, but I know fuck all about this. Not my area of expertise.)


Along the lines of "fun question, I'll take a stab at it just for giggles"; this would be far more interesting as an interview question than "estimate how many soccer balls can fit in a 747".

Average botnet size is 20,000 compromised PCs. Srizbi is estimated at 450,000. Another vector I'd explore is teaming up with crypto-miners. As I understand it, CPU mining no longer pays, so miners use only GPUs and ASICs; if that's true, they'll have spare CPU cycles on the hosts that run and manage the mining chips, which they'd probably be willing to rent out for some marginal return, e.g. by running a JVM or some other VM. If we can do that, then we can probably tap 2-3M hosts, many of them rotating in and out per day.
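Those botnet sizes are easy to sanity-check against the "cover the user base weekly" goal from upthread. All the numbers below are rough assumptions for illustration (the user-base figure and per-bot viewing budget especially), not measurements:

```python
# Back-of-envelope check on botnet coverage. Every constant here is
# an assumption chosen for illustration.
PROFILES = 800_000_000        # rough guess at LinkedIn's user base
VIEWS_PER_BOT_PER_DAY = 300   # human-plausible daily browsing budget
BOTS = 20_000                 # "average" botnet size cited above

views_per_week = BOTS * VIEWS_PER_BOT_PER_DAY * 7
weeks_for_full_pass = PROFILES / views_per_week
bots_for_weekly_pass = PROFILES / (VIEWS_PER_BOT_PER_DAY * 7)
```

Under these assumptions a 20,000-node botnet needs roughly 19 weeks per full pass, and weekly coverage would take on the order of 380,000 bots, i.e. closer to Srizbi scale than to the average botnet.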

Throw out an army of mechanical turk assignments to get real humans to register fake accounts. They get paid upon submitting an account and password, which your scraping servers verify, then change the password and commandeer. Perhaps have them register the fake account while running in a container or VM on their computer; the container/VM is instrumented to capture all activity. The activity metrics and data are uploaded to a deep learning system that identifies the patterns that work and the ones that don't, and uses that to guide the developers on what to randomize, and by how much.

Add in a component to randomly invite/follow other fake and real accounts and post Markov-chain-generated copypasta. Set aside a portion of the fake accounts to only build up networks of users. Initially restrict the market of customers to those who only want once-a-year-updated data. As the network builds, use the notification of changes to selectively scrape only changed user profiles, and upsell for more up-to-date profiles at that time.

If I were LinkedIn, I'd probably concentrate on infiltrating botnet operators and shutting them down. It would be one large cat-and-mouse game.


If the goal of such an operation were to effectively create an alternative to LinkedIn, along the lines of other "claim your listing" sites, then this could be a worthwhile cause.



