
Wow, this hits close to home. Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009. I added a check on the driver initialization path and didn't annotate the code as non-paged because frankly I didn't know at the time that the Windows kernel was paged. All my kernel development experience up to that point was with Linux, which isn't paged.

BitLocker is a storage driver, so that code turned into a circular dependency. The attempt to page in the code resulted in a call to that not-yet-paged-in code.
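For context, the usual WDK idiom for controlling this looks roughly like the sketch below. The function name and body are invented (not actual BitLocker code), and the default section for a given codebase can be flipped with #pragma code_seg, so whether you "annotate as paged" or "annotate as non-paged" depends on the project; this is illustrative only:

    #include <ntddk.h>

    // Hypothetical init-path helper -- not actual BitLocker code.
    NTSTATUS FveCheckInitOption(_In_ PUNICODE_STRING RegistryPath);

    // Opt the routine into the pageable "PAGE" section. Anything reachable
    // while servicing paging I/O (as storage-stack code can be) must NOT land
    // here, or paging it back in needs the very code that was paged out.
    #pragma alloc_text(PAGE, FveCheckInitOption)

    NTSTATUS FveCheckInitOption(_In_ PUNICODE_STRING RegistryPath)
    {
        PAGED_CODE();   // checked-build assert: current IRQL must allow page faults
        UNREFERENCED_PARAMETER(RegistryPath);
        /* ... read and validate a configuration option ... */
        return STATUS_SUCCESS;
    }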

The reason I didn't catch it with local testing was that I never tried rebooting with BitLocker enabled on my dev box when I was working on that code. Everyone on the team who did have BitLocker enabled got the BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.



> without even the most basic level of qualification

That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.


Exactly this is what I was missing in the story. Why a mission-critical product like this doesn't go to a limited set of users before going live for the whole user base is beyond the comprehension of everyone who has ever come across software bugs (so billions of people). And that's on top of the part about not testing internally well, or at all. Some clusterfuck must have happened there, which is still better than imagining that this is the normal way the organization operates. That would be a very scary vision. Serious rethinking of trusting this organization is due everywhere!


But that would require hiring staff to manage the process, and that is money taken away from sponsoring an F1 racing team.


The funniest part was seeing Mercedes F1 team pit crew staring at BSODs at their workstations[1] while wearing CrowdStrike t-shirts. Some jokes just write themselves. Imagine if they lose the race because of their sponsor.

But hey, at least they actually dogfood the products of their sponsors instead of just taking money to shill random stuff.

[1] https://www.thedrive.com/news/crowdstrike-sponsored-mercedes...


Or Windows could be made to stop loading drivers that are crashing.

Third-party driver/module crashed more than 3 times in a row -> Third-party driver/module is punished and has to be manually re-enabled.


Because CrowdStrike is an EDR solution, it likely has tamper-proofing features (scheduled tasks, watchdog services, etc.) that re-enable it. These features are designed to prevent malware or manual attackers from disabling it.


These features drive me nuts because they prevent me, the computer owner/admin, from disabling the software. One person thought up techniques like "let's make a scheduled task that sledgehammers out the knobs these 'dumb' users keep turning" and then everyone else decided to copycat that awful practice.


If you're the admin, I would assume you have the ability to disable Crowdstrike. There must be some way to uninstall it, right?


Not if you want to keep the magic green compliance checkbox!


Are you saying that the compliance rule requires that the software can't be uninstalled? Once it's installed it's impossible to uninstall? No one can uninstall it? I have a hard time believing it's impossible to remove the software. In the extreme case, you could reimage the machine and reinstall Windows without Crowdstrike.

Or are you saying that it is possible to uninstall, but once you do that, you're not in compliance, so while it's technically possible to uninstall, you'll be breaking the rules if you do so?


It's obviously the second option.


The person I originally replied to, rkagerer, said there was some technical measure preventing rkagerer from uninstalling it even though rkagerer has admin on the computer.


I was referring to the difficulty overriding the various techniques certain modern software like this use to trigger automatic updates at times outside admin control.

Disabling a scheduled task is easy, but unfortunately vendors are piling on additional less obvious hooks. Eg. Dropbox recreates its scheduled task every time you (run? update?) it, and I've seen others that utilize the various autostart registry locations (there are lots of them) and non-obvious executables to perform similar "repair" operations. You wind up in "Deny Access" whackamole and even that isn't always effective. Uninstalling isn't an option if there's a business need for the software.

The fundamental issue is their developers / product managers have decided they know better than you. For the many users out there who are clueless about IT this may be accurate, but it's frustrating to me and probably others who upvoted the original comment.


Is what you're saying relevant in the Crowdstrike case? If you don't want Crowdstrike and you're an admin, I assume there are instructions that allow you to uninstall it. I assume the tamper-resistant features of Crowdstrike won't prevent you from uninstalling it.


I cannot find that comment. Care to link it?



An admin can obviously disable a scheduled task... It's not "impossible" to remove the software, just annoying.


It's not obvious - the owner of the computer sets the rules.


If you're the owner, just turn it off and uninstall.


Doesn't malware do that as well?

But what other malware has been as successful? Crowdstrike can rest easy knowing it's taken down many of the most critical systems in the world.

Oh, no, actually, if Crowdstrike WAS malware, the authors would be in prison.. not running a $90B company.


It does. Several CrowdStrike alerts popped when I was remediating the broken driver on affected systems.


Wouldn't this be an attack vector? Use some low-hanging bug to bring down an entire security module, allowing you to escalate?


It's currently a DOS by the crashing component, so it's already broken the Availability part of Confidentiality/Integrity/Availability that defines the goals of security.


But a loss of availability is so much more palatable than the others, plus the others often result in manually restricting availability anyway when discovered.


I think the wider societal impact from the loss of availability today - particularly for those in healthcare settings - might suggest this isn't always the case


Availability of a system that can’t ensure data integrity seems equally bad though.


Tell that to the millions of people whose flights were canceled, the surgeries not performed, etc etc.


What is the importance of data integrity? If important pre-op data/instructions are missing or gets saved on the wrong patient record which causes botched surgeries, if there are misprescribed post-op medications, if there is huge confusion and delays in critical follow-up surgeries because of a 100% available system that messed up patient data across hospitals nationwide, if there are malpractice lawsuits putting entire hospitals out of business etc etc, then is that fallout clearly worth having an available system in the first place?


How does crowdstrike protect against instructions being saved on the wrong patient’s record?


Huh? We're talking about hypotheticals here. You're saying availability is clearly more important than data integrity. I'm saying that if a buggy kernel loadable module allowed systems to keep on running as if nothing was wrong, but actually caused data integrity problems while the system is running, that's just as bad or worse.


Or anyone who owns CrowdStrike shares.


They’d surely have used some kind of Unix if uptime mattered.


Before you get all smug, recognize that Linux has the exact same architecture; it just wasn't impacted - this time.


Too late, I was born smug.

If Linux and Windows have similar architectural flaws, Microsoft must have some massive execution problems. They are getting embarrassed in QA by a bunch of hobbyists, lol.


I'm sure the people who missed their flights because of this disagree.


Or families of those who die.


If you're planning around bugs in security modules, you're better off disabling them - malware routinely uses bugs in drivers to escalate, so the bug you're allowing can make the escalation vector even more powerful, since it now gets Ring 0 access early in boot.


> Wouldn't this be an attack vector?

Isn't DoSing your own OS an attack vector? and a worse one when it's used in critical infrastructure where lives are at stake.

There is a reasonable balance to strike, sometimes it's not a good idea to go to extreme measures to prevent unlikely intrusion vectors due to the non-monetary costs.

See: The optimal amount of fraud is non-zero.


In the absence of a Crowdstrike bug, if an attacker is able to cause Crowdstrike to trigger a bluescreen, I assume the attacker would be able to trigger a bluescreen in some other way. So I don't think this is a good argument for removing the check.


That assumes it's more likely than crowdstrike mass bricking all of these computers... this is the balance, it's not about possibility, it's about probability.


I think we're in agreement. I now realize my previous comment replied to the wrong comment. I meant to reply to Lx1oG-AWb6h_ZG0. Sorry.


Requires state level social engineering.

Might be why North Koreans are trying to get work-from-home jobs.

https://www.businessinsider.com/woman-helped-north-korea-fin...


It does. CrowdStrike forced itself into the boot process. Normal Windows drivers will be disabled automatically if they cause a crash.


I use Explorer Patcher on a Windows 11 machine. It had such a history of crash loops with Explorer that they implemented this circuit-breaker functionality.


It's baffling how fast and wide the blast radius was for this Crowdstrike update. Quite impressive actually, if you think about it - updating billions of systems that quickly.


Certainly living up to the name


Indeed, far more damage caused than any actual malware!


This was my first thought too. I'm not that familiar with the space, but I would think for something this sensitive the rollout would be staggered at least instead of what looks like globally all at the same time.


This is the bit I am still trying to understand. In CrowdStrike you can define how many updates a host is allowed to be behind, i.e. n (latest), n-1 (one behind), n-2, etc. This update was applied both to hosts on a 'latest' policy and to the n-2 hosts. To me it appears that there was more to this than just a corrupt update, otherwise how was this policy ignored? Unless the policy doesn't apply to this kind of update as deeply and only covers some smaller policy aspect, which would also be very concerning.

I guess we won't really know until they release the post mortem...


Yeah, my guess is that they roll out the updates to every client at the same time, and then have the client implement the n-1/2/whatever part locally. That worked great-ish until they pushed a corrupt (empty) update file which crashed the client when it tried to interpret the contents... Not ideal, and obviously there isn't enough internal testing before sending stuff out to actual clients.
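If that guess is right, the local gate would be shaped something like the sketch below (field names invented, nothing here is from CrowdStrike); the point is that a crash while parsing the freshly pushed file happens before any such gate can run:

    #include <stdbool.h>
    #include <stdint.h>

    // Purely illustrative sketch of a client-side "n-1 / n-2" policy check.
    typedef struct {
        uint32_t latest_version;     // newest content version the server pushed
        uint32_t allowed_lag;        // 0 = latest, 1 = n-1, 2 = n-2, ...
        uint32_t installed_version;  // what this host currently runs
    } update_policy_t;

    static bool should_apply(const update_policy_t *p, uint32_t candidate)
    {
        // Only take versions that are at least `allowed_lag` behind the newest
        // and newer than what is already installed.
        if (candidate + p->allowed_lag > p->latest_version)
            return false;
        return candidate > p->installed_version;
    }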


But do you ever get free worldwide advertising showing that everyone uses your product? CrowdStrike sure did, and I'm sure they'll use that to sell it to more people.


That is the right way to do it.


> It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.

Discussed elsewhere, it is claimed that the file causing the crash was a data file that had been corrupted in the delivery process. So the development team and their CI probably tested a good version, but the customers received a bad one.

If that is true, the problem is that the driver uses an unsigned file at all, so all customer machines are continuously at risk of local attacks. And then it does not do any integrity check on the data it contains, which is a big no-no for all untrusted data, whether in user space or the kernel.


If the file was signed, wouldn't that have prevented the corrupted file from being loaded?

I assume if the signed file was hacked (or parts were missing), then it wouldn't pass verification.


> And then it does not do any integrity check on the data it contains, which is a big no-no for all untrusted data, whether in user space or the kernel.

To me, this is the inexcusable sin. These updates should be signed and signatures validated before the file is read. Ideally the signing/validating would be handled before distribution so that when this file was corrupted, the validation would have failed here.

But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
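Even a dirt-simple pre-parse check would have rejected a zeroed-out file. Something like the sketch below - the header layout and magic value are made up, since the real channel-file format isn't public:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define UPDATE_MAGIC 0x31445055u   /* "UPD1" little-endian, invented */

    typedef struct {
        uint32_t magic;
        uint32_t version;
        uint32_t payload_len;   // bytes that follow this header
    } update_header_t;

    static bool update_looks_sane(const uint8_t *buf, size_t len)
    {
        update_header_t hdr;

        if (len < sizeof hdr)
            return false;                 // too short to even hold a header
        memcpy(&hdr, buf, sizeof hdr);
        if (hdr.magic != UPDATE_MAGIC)
            return false;                 // an all-zero or corrupted file dies here
        if (hdr.payload_len > len - sizeof hdr)
            return false;                 // declared length must fit in the file
        return true;
    }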


Still a staggered roll-out would have reduced the impact.


https://news.ycombinator.com/item?id=41006104#41006555

the flawed data was added in a post-processing step of the configuration update, which is after it's been tested internally but before it's copied to their update servers

per a new/green account


“And so that’s why we recommend using phased rollouts” -Every DevOps engineer from now on


“But that costs us money and time” - some suit.


"And they promise fast threat mitigation... Let allow them to take over EVERYTHING! With remote access, of course. Some form of overwatch of what they in/out by our staff ? Meh... And it even allow us to do cuts in headcount and infra by $<digits_here> a year."


So have we decided to stop using checksums or something?


Perhaps it was the checksum/signature process!


Ya gotta keep checksumming until you find a fixed point.


when something is changed, we usually re-test. that's the whole point of testing anyway. :)


> I didn't know at the time that the Windows kernel was paged.

At uni I had a professor in database systems, who did not like written exams, but mostly did oral exams. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. So in my explanation I made the difference for kernel space and user space. I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never does paging for its own memory. I did not argue on that and passed the exam with the best grade. Did not check that book again to verify my claim. I have never done any kernel space development even vaguely close to memory management, so still today I don't know the exact details.

However, what strikes me here: when that exam happened, in 1985ish, the NT kernel did not exist yet, I believe. However, IIRC a significant part of the DEC VMS kernel team went to Microsoft to work on the NT kernel. So the concept of paging (part of) kernel memory went with them? Whether VMS --> WNT (every letter incremented by one) is just a coincidence or intentionally the next baby of those developers I have never understood. As Linux has shown us, today much bigger systems can be successfully handled without the extra complications of paging kernel memory. Whether it's a good idea I don't know; at least it's not a necessary one.


If you want to hear the history of [DEC/VMS] NT from the horse's mouth:

https://www.youtube.com/watch?v=xi1Lq79mLeE


Oh oh, 3 hours 10. I watched around half of it.

The VMS --> WNT acronym relationship was not mentioned, maybe it was just made up later.

One thing I did not know (or maybe not remember) is that NT was originally developed exclusively for the Intel i860, one of Intel's attempts to do RISC. Of course in the late 1980s CISC seemed doomed and everyone was moving to RISC. The code name of the i860 was N10. So that might well be the inside origin of NT, with the marketing name New Technology retrofitted only later.


Here's a direct link:

https://youtu.be/xi1Lq79mLeE?t=4314

"New Technology", if you want to search the transcript. Per Dave, marketing did not want to use "NT" for "New Technology" because they thought no one would buy new technology.


Actually it was not only the x86 hardware that was not originally planned for the NT kernel; the Windows user space was not the first candidate either. POSIX and maybe even OS/2 were earlier goals.

So the current x86 Windows monoculture came about as an accident because the strategically planned new options did not materialize. The user-space change should finally debunk the theory that VMS advancing into WNT was a secret plot by the engineers involved. It was probably a coincidence discovered after the fact.


https://www.usenix.org/system/files/1311_05-08_mickens.pdf

"Perhaps the worst thing about being a systems person is that other, non-systems people think that they understand the daily tragedies that compose your life. For example, a few weeks ago, I was debugging a new network file system that my research group created. The bug was inside a kernel-mode component, so my machines were crashing in spectacular and vindic- tive ways. After a few days of manually rebooting servers, I had transformed into a shambling, broken man, kind of like a computer scientist version of Saddam Hussein when he was pulled from his bunker, all scraggly beard and dead eyes and florid, nonsensical ramblings about semi-imagined enemies. As I paced the hallways, muttering Nixonian rants about my code, one of my colleagues from the HCI group asked me what my problem was. I described the bug, which involved concur- rent threads and corrupted state and asynchronous message delivery across multiple machines, and my coworker said, “Yeah, that sounds bad. Have you checked the log files for errors?” I said, “Indeed, I would do that if I hadn’t broken every component that a logging system needs to log data. I have a network file system, and I have broken the network, and I have broken the file system, and my machines crash when I make eye contact with them. I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood.”


Ah, the joys of trying to come up with creative ways to get feedback from your code when literally nothing is available. Can I make the beeper beep in morse code? Can I just put a variable delay in the code and time it with a stopwatch to know which value was returned from that function? Ughh.


Some of us have worked on embedded systems or board bringup. Scope and logic analyzer ... Serial port a luxury.

IIRC Windows has good support for debugging device drivers via the serial port. Overall the tooling for dealing with device drivers in Windows is not bad, including some special-purpose static analysis tools and some pretty good testing.


This is why power users want that standard old two-digit '7 segment' display to show off that ONE hex code the BIOS writes at various steps...

When stuff breaks, not if, WHEN it breaks, this at least gives a fighting chance at isolating the issue.


Yeah. Been there, done that. Write to an unused address decode to trigger the logic analyzer when I got to a specific point in the code, so I could scroll back through the address bus and figure out what the program counter had done for me to get to that piece of code.


Old school guys at my first job could send the contents of the program counter to the speaker, and diagnose problems by the sound of it.

Definitely Old School Cool


I call this "throwing dye in the water".


I certainly used beeping for debugging more than once! : - )


Quoting James Mickens is always the winning move. I recommend the entire collection of his wisdom, https://mickens.seas.harvard.edu/wisdom-james-mickens


James Mickens’s Monitorama 2014 presentation had me laughing to the point of tears. “Look a word cloud!”

Title: "Computers are a Sadness, I am the Cure" https://vimeo.com/95066828


Say "word count" one more time!


Somebody get this man a serial port, or maybe a PC Speaker to Morse out diagnostics signals.


That's beautiful.


This is an interesting piece of creative writing, but virtual machines already existed in 2013. There are very few reasons to experiment on your dev machine.


OS / driver development needs to be done on bare metal sometimes.


At the time, Mickens worked at Microsoft Research, and with the Windows kernel development team. There may only be a few reasons to experiment on your dev machine, but that's one environment where they have those reasons.


Sometimes you have to debug on a real machine. When you do, you'd usually use a serial port for your debug output. Everything has one.


>Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009.

Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.


Win8. I've been seeing your blog posts show up here and there on HN over the years, so I was half expecting you to pick up on my self-doxx. I'll ping you offline.


"It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification."

It was my understanding that MS now signs 3rd party kernel mode code, with quality requirements. In which case why did they fail to prevent this?


Drivers have had to be signed forever and pass pretty rigorous test suites and static analysis.

The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...


There’s a design problem here if the driver can’t be self-contained in such a way that it’s possible to roll back the kernel to a known good state.


How so? Preventing roll-backs of software updates is a "security feature" in most cases, for better and for worse. Yeah, it would be convenient for tinkerers or in rare events such as these, but it would be a security issue the other 99.99...% of the time for enterprise users, where security is the main concern.


I don't really understand this, many Linux distributions like Universal Blue advertise rollbacks as a feature. How is preventing a roll-back a "security feature"?


Imagine a driver has an exploitable vulnerability that is fixed in an update. If an attacker can force a rollback to the vulnerable older version, then the system is still vulnerable. Disallowing the rollback fixes this.
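As a sketch, the gate is nothing more than a monotonic version check on trusted metadata (names invented, purely illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    // Illustrative anti-rollback check: the offered package's version (taken
    // from its *verified* signature metadata) must not go backwards, so an
    // attacker can't reintroduce a known-vulnerable older build.
    static bool allow_install(uint32_t installed_version, uint32_t offered_version)
    {
        return offered_version >= installed_version;
    }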


ohh


> Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...

    "What are you complaining about? It works on my machine."™


> In which case why did they fail to prevent this?

"Oh, crowdstrike? Yeah, yeah, here's that Winodws kernel code signing key you paid for."


You can pay for it and sign a file full of null characters. Signing has nothing to do with quality from what I understand.


"Yours sincerely,

Crowdstrike

---

PS - If you get hit by some massive crash, we refer you to our company's name. What were you expecting?"


[flagged]


Please explain this comment. How is the Crowdstrike incident related to the Key Bridge collision?


I think he's implying there was some sort of conspiracy by foreign actors.


This is what I don’t get, it’s extremely hard for me to believe this didn’t get caught in CI when things started blue screening. Every place I ever did test rebooting/powercycling was part of CI, with various hardware configs. This was before even our lighthouse customers even saw it.


What makes you think they have CI after what happened?


Apparently the flaw was added to the config file in post-processing after it had completed testing. So they thought they had testing, but actually didn't.


Disgruntled employee trying to use Crowd Strike to start a General Strike?


I was thinking, this doesn't seem like it's a case of all these machines being on an old version of Windows, or some specific version, that is having issues, where QA just missed one particular variant in their smoke testing. It seems like it's every Windows instance with that software, so either they don't have basic automated testing, or someone pushed this outside of the normal process.


> Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.

Up the chain to automated test machines, right?


You would think automated tests would come before your teammates' workstations / commits to head.


Did I mention this was 15 years ago? Software development back then looked very different than it does now, especially in Wincore. There was none of this "Cloud-native development" stuff that we all know and love today. GitHub was just about 1 year old. Jenkins wouldn't be a thing for another 2 years.

In this case the "automated test" flipped all kinds of configuration options with repeated reboots of a physical workstation. It took hours to run the tests, and your workstation would be constantly rebooting, so you wouldn't be accomplishing anything else for the rest of the day. It was faster and cheaper to require 8 devs to rollback to yesterday's build maybe once every couple of quarters than to snarl the whole development process with that.

The tests still ran, but they were owned and run by a dedicated test engineer prior to merging the branch up.


Jenkins was called Hudson from 2005 until 2011, and version control is much, much older.

I'm surprised you didn't have two or more workstations.


Sorry, the comment wasn't meant to be a personal judgement on you.


I'm completely ignorant on the topic but isn't rebooting a default test for kernel code, given how sensitive it is?


Oh I rebooted, I just didn't happen to have the right configuration options to invoke the failure when I rebooted. Not every dev workstation was bluescreening, just the ones with the particular feature enabled.


But as someone already pointed out, the issue was seen on all kinds of windows hosts. Not just the ones running a specific version, specific update etc.


That sounds like it was caught by luck, unless there was some test explicitly with that configuration in the QA process?


A lot of QA, especially at the system level, is just luck. That’s why it’s so important to dogfood internally imho.

And by internally I don’t just mean the development team, but anyone and everyone at the company who is allowed to have access to early builds.


There's "something that requires highly specific conditions managed to slip past QA" and then there's "our update brought down literally everyone using the software". This isn't a matter of bad luck.


Maybe through luck, they're gonna uncover another xz utils backdoor, MS version, but it's probably gonna get covered up because, Microsoft


What does this mean?

Windows kernel paged, linux non paged?


The memory used by the Windows kernel is either Paged or Non-Paged. Non-Paged means the memory is pinned in physical RAM. Paged means it might be swapped out to disk and paged back in when needed. OP was working on BitLocker, a file system driver, which handles disk IO. It must be pinned in physical RAM to be available at all times; otherwise, if it's paged out, an incoming IO request would find the driver code missing from memory and try to page it back in, which triggers another IO request, creating an infinite loop. The Windows kernel usually crashes at that point to prevent a runaway system and stops at the point of failure to let you fix the problem.
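Roughly, the two pool types look like this in driver code (a minimal sketch, not from any real driver):

    #include <ntddk.h>

    #define DEMO_TAG 'omeD'   // arbitrary pool tag for illustration

    VOID PoolExample(VOID)
    {
        // Non-paged pool: always resident in RAM, safe to touch at raised IRQL
        // and on the paging I/O path -- what a storage driver's hot data needs.
        PVOID ioBuffer = ExAllocatePoolWithTag(NonPagedPoolNx, 4096, DEMO_TAG);

        // Paged pool: may be written out to the pagefile; touching it at the
        // wrong time (raised IRQL, or while servicing paging I/O) bugchecks.
        PVOID configData = ExAllocatePoolWithTag(PagedPool, 4096, DEMO_TAG);

        if (ioBuffer)   ExFreePoolWithTag(ioBuffer, DEMO_TAG);
        if (configData) ExFreePoolWithTag(configData, DEMO_TAG);
    }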


Thank you!


Linux is a bit unusual in that kernel memory is generally physically mapped, and unless you use vmalloc any memory you allocate has to correspond to pages backed by RAM. This also ties into how file IO happens, swapping, and how Linux's approach to IO is actually closer to Multics and OS/400 than OG Unix.

Many other systems instead default to using the full power of virtual memory, including swapping kernel space to disk, with only the things that explicitly need to be kept in RAM being allocated from "non-paged" or "wired" memory.
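To make the Linux side concrete, a minimal in-kernel sketch (module boilerplate omitted):

    #include <linux/errno.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void *small_buf;
    static void *big_buf;

    static int alloc_demo(void)
    {
        // kmalloc: physically contiguous memory from the kernel's direct map;
        // it is never swapped out.
        small_buf = kmalloc(4096, GFP_KERNEL);

        // vmalloc: virtually contiguous only, stitched together from individual
        // pages mapped into the vmalloc area -- still backed by RAM, not swap.
        big_buf = vmalloc(4 * 1024 * 1024);

        if (!small_buf || !big_buf)
            return -ENOMEM;
        return 0;
    }

    static void free_demo(void)
    {
        kfree(small_buf);
        vfree(big_buf);
    }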

EDIT: fixed spelling thanks to writing on phone.


Linux kernel memory isn’t paged out to disk, while Windows kernel memory can be: https://knowledge.broadcom.com/external/article/32146/third-...


Has that changed? I remember always creating a swap partition that was meant to be at least the size of RAM


I do not mean this to be blamey in any way shape or form and am asking only about the process:

Shouldn’t that have been caught in code review?


My manager actually blamed the more senior developer who reviewed my code for that one.


Must have been DNS... when they did the deployment run and the necessary code was pulled and the DNS failed and then the wrong code got compiled...</sarcasm>

that they don't even do staged/A-B pushes was also <mind-blown-away>

But the most ironic part was: https://www.theregister.com/2024/07/18/security_review_failu...


So the key test, the test that was not run, was to turn the machine off and on again? Classic windows.



