Wow, this hits close to home. Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009. I added a check on the driver initialization path and didn't annotate the code as non-paged because frankly I didn't know at the time that the Windows kernel was paged. All my kernel development experience up to that point was with Linux, which isn't paged.
BitLocker is a storage driver, so that code turned into a circular dependency: the attempt to page in the code resulted in a call to that not-yet-paged-in code.
The reason I didn't catch it with local testing was that I never tried rebooting with BitLocker enabled on my dev box while I was working on that code. Everyone on the team who did have BitLocker enabled got a BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team of about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
> without even the most basic level of qualification
That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.
Exactly this is what I was missing in the story. Why a mission-critical product like this doesn't go to a limited set of users before going live for the whole user base is beyond the comprehension of anyone who has ever come across software bugs (so, billions of people). And that's after we've already gotten past the part about not testing internally well, or at all. Some clusterfuck must have happened there, which is still better than imagining that this is the normal way the organization operates. That would be a very scary vision. Serious rethinking of trusting this organization is due everywhere!
The funniest part was seeing the Mercedes F1 pit crew staring at BSODs at their workstations[1] while wearing CrowdStrike t-shirts. Some jokes just write themselves. Imagine if they lose the race because of their sponsor.
But hey, at least they actually dogfood the products of their sponsors instead of just taking money to shill random stuff.
Because CrowdStrike is an EDR solution, it likely has tamper-proofing features (scheduled tasks, watchdog services, etc.) that re-enable it. These features are designed to prevent malware or manual attackers from disabling it.
These features drive me nuts because they prevent me, the computer owner/admin, from disabling it. One person thought up techniques like "let's make a scheduled task that sledgehammers out the knobs these 'dumb' users keep turning", and then everyone else decided to copycat that awful practice.
Are you saying that the compliance rule requires the software to be impossible to uninstall? That once it's installed, no one can uninstall it? I have a hard time believing it's impossible to remove the software. In the extreme case, you could reimage the machine and reinstall Windows without CrowdStrike.
Or are you saying that it is possible to uninstall, but once you do that, you're not in compliance, so while it's technically possible to uninstall, you'll be breaking the rules if you do so?
The person I originally replied to, rkagerer, said there was some technical measure preventing rkagerer from uninstalling it even though rkagerer has admin on the computer.
I was referring to the difficulty of overriding the various techniques that certain modern software like this uses to trigger automatic updates at times outside the admin's control.
Disabling a scheduled task is easy, but unfortunately vendors are piling on additional, less obvious hooks. E.g. Dropbox recreates its scheduled task every time you (run? update?) it, and I've seen others that use the various autostart registry locations (there are lots of them) and non-obvious executables to perform similar "repair" operations. You wind up in a "Deny Access" game of whack-a-mole, and even that isn't always effective. Uninstalling isn't an option if there's a business need for the software.
The fundamental issue is that their developers / product managers have decided they know better than you. For the many users out there who are clueless about IT, that may be accurate, but it's frustrating to me and probably to others who upvoted the original comment.
Is what you're saying relevant in the Crowdstrike case? If you don't want Crowdstrike and you're an admin, I assume there are instructions that allow you to uninstall it. I assume the tamper-resistant features of Crowdstrike won't prevent you from uninstalling it.
It's currently a DoS by the crashing component, so it's already broken the Availability part of the Confidentiality/Integrity/Availability triad that defines the goals of security.
But a loss of availability is so much more palatable than the others, plus the others often result in manually restricting availability anyway when discovered.
I think the wider societal impact from the loss of availability today - particularly for those in healthcare settings - might suggest this isn't always the case
What about the importance of data integrity? If important pre-op data/instructions go missing or get saved to the wrong patient record, causing botched surgeries; if post-op medications are misprescribed; if there is huge confusion and delay in critical follow-up surgeries because a 100% available system messed up patient data across hospitals nationwide; if malpractice lawsuits put entire hospitals out of business; etc., then is that fallout clearly worth having an available system in the first place?
Huh? We're talking about hypotheticals here. You're saying availability is clearly more important than data integrity. I'm saying that if a buggy kernel loadable module allowed systems to keep on running as if nothing was wrong, but actually caused data integrity problems while the system is running, that's just as bad or worse.
If Linux and Windows have similar architectural flaws, Microsoft must have some massive execution problems. They are getting embarrassed in QA by a bunch of hobbyists, lol.
If you're planning around bugs in security modules, you're better off disabling them: malware routinely uses bugs in drivers to escalate, so the bug you're allowing for can make the escalation vector even more powerful, since now it gets Ring 0 and early loading.
Isn't DoSing your own OS an attack vector? And a worse one when it's used in critical infrastructure where lives are at stake.
There is a reasonable balance to strike, sometimes it's not a good idea to go to extreme measures to prevent unlikely intrusion vectors due to the non-monetary costs.
In the absence of a Crowdstrike bug, if an attacker is able to cause Crowdstrike to trigger a bluescreen, I assume the attacker would be able to trigger a bluescreen in some other way. So I don't think this is a good argument for removing the check.
That assumes it's more likely than CrowdStrike mass-bricking all of these computers... this is the balance: it's not about possibility, it's about probability.
I use Explorer Patcher on a Windows 11 machine. It had such a history of crash loops with Explorer that they implemented this circuit-breaker functionality.
It's baffling how fast and wide the blast radius was for this Crowdstrike update. Quite impressive actually, if you think about it - updating billions of systems that quickly.
This was my first thought too. I'm not that familiar with the space, but I would think for something this sensitive the rollout would be staggered at least instead of what looks like globally all at the same time.
This is the bit I am still trying to understand. In CrowdStrike you can define how many updates behind a host stays, i.e. n (latest), n-1 (one behind), n-2, etc. This update was applied both to hosts on the 'latest' policy and to the n-2 hosts. To me it appears there was more to this than just a corrupt update, otherwise how was this policy ignored? Unless the policy doesn't actually apply this deeply to these content updates and only covers some smaller aspect, which would also be very concerning.
I guess we won't really know until they release the post mortem...
Yeah, my guess is that they roll out the updates to every client at the same time, and then have the client implement the n-1/2/whatever part locally. That worked great-ish until they pushed a corrupt (empty) update file which crashed the client when it tried to interpret the contents... Not ideal, and obviously there isn't enough internal testing before sending stuff out to actual clients.
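Purely as a sketch of that guess (nothing here reflects CrowdStrike's actual design; the names and the sequence-number scheme are invented), a client-side n-1/n-2 gate only helps if it runs before the downloaded content is ever parsed; if the parser itself chokes on a corrupt file first, the gate never gets a say:

    #include <stdint.h>

    /* Hypothetical client-side staleness gate for "n-1 / n-2" update policies. */
    struct channel_update {
        uint32_t sequence;   /* monotonically increasing release number */
        /* ... payload ... */
    };

    static int should_apply(const struct channel_update *u,
                            uint32_t latest_sequence,
                            uint32_t versions_behind /* 0 = latest, 1 = n-1, 2 = n-2 */)
    {
        if (u->sequence > latest_sequence)
            return 0;                                  /* newer than anything we know of */
        return (latest_sequence - u->sequence) >= versions_behind;
    }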
But you do get free worldwide advertising that everyone uses your product. CrowdStrike sure did, and I'm sure they'll use that to sell it to more people.
> It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
As discussed elsewhere, it is claimed that the file causing the crash was a data file that had been corrupted in the delivery process. So the development team and their CI probably tested a good version, but the customers received a bad one.
If that is true, the problem is that the driver uses an unsigned file at all, so all customer machines are continuously at risk from local attacks. And then it does not do any integrity check on the data the file contains, which is a big no-no for all untrusted data, whether in user space or kernel.
> And then it does not do any integrity check on the data it contains, which is a big no no for all untrusted data, whether user space or kernel.
To me, this is the inexcusable sin. These updates should be signed and signatures validated before the file is read. Ideally the signing/validating would be handled before distribution so that when this file was corrupted, the validation would have failed here.
But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
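As a minimal illustration of that point (the header layout, field names, and magic value below are hypothetical, not CrowdStrike's actual channel-file format), even a dumb sanity check run before any parsing would reject an empty or zeroed file:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical update-file header, assumed for illustration only. */
    #define UPDATE_MAGIC 0x43533031u   /* "CS01" */

    struct update_header {
        uint32_t magic;        /* must equal UPDATE_MAGIC             */
        uint32_t version;      /* format version this parser supports */
        uint32_t payload_len;  /* bytes following the header          */
    };

    /* Reject anything empty, truncated, or not in the expected format,
     * before a single field of the payload is interpreted. */
    static int update_is_sane(const uint8_t *buf, size_t len)
    {
        struct update_header hdr;

        if (buf == NULL || len < sizeof(hdr))
            return 0;                    /* empty or truncated file */

        memcpy(&hdr, buf, sizeof(hdr));
        if (hdr.magic != UPDATE_MAGIC)
            return 0;                    /* zeroed or corrupted header */
        if (hdr.version != 1)
            return 0;                    /* format we don't understand */
        if (hdr.payload_len != len - sizeof(hdr))
            return 0;                    /* length field contradicts reality */

        return 1;                        /* safe to hand to the real parser */
    }

A signature check should still come first; the point is that a cheap structural check like this is the last line of defense when everything upstream has failed.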
The flawed data was added in a post-processing step of the configuration update, after it had been tested internally but before it was copied to their update servers.
"And they promise fast threat mitigation... Let allow them to take over EVERYTHING! With remote access, of course. Some form of overwatch of what they in/out by our staff ? Meh...
And it even allow us to do cuts in headcount and infra by $<digits_here> a year."
> I didn't know at the time that the Windows kernel was paged.
At uni I had a professor in database systems who did not like written exams and mostly did oral exams. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. In my explanation I distinguished between kernel space and user space; I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never pages its own memory. I did not argue the point and passed the exam with the best grade. I never did check that book again to verify my claim, and I have never done any kernel-space development even vaguely close to memory management, so to this day I don't know the exact details.
However, what strikes me here: when that exam happened, around 1985, the NT kernel did not exist yet, I believe. But IIRC a significant part of the DEC VMS kernel team went to Microsoft to work on the NT kernel, so the concept of paging (a part of) kernel memory went with them? Whether VMS --> WNT, every letter incremented by one, is just a coincidence or intentionally the next baby of those developers, I have never understood. As Linux has shown, today much bigger systems can be handled successfully without the extra complications of paging kernel memory. Whether that's a good idea I don't know; at least it's not a necessary one.
The VMS --> WNT acronym relationship was not mentioned, maybe it was just made up later.
One thing I did not know (or maybe did not remember) is that NT was originally developed exclusively for the Intel i860, one of Intel's attempts at RISC. Of course in the late 1980s CISC seemed doomed and everyone was moving to RISC. The code name of the i860 was N10, so that might well be the inside origin of "NT", with the marketing name "New Technology" retrofitted only later.
"New Technology", if you want to search the transcript. Per Dave, marketing did not want to use "NT" for "New Technology" because they thought no one would buy new technology.
Actually, it was not only the x86 hardware that wasn't originally planned for the NT kernel; Windows user space was not the first candidate either. POSIX and maybe even OS/2 were earlier goals.
So the current x86 Windows monoculture came about by accident, because the strategically planned options did not materialize. The user-space history should finally debunk the theory that VMS advancing into WNT was a secret plot by the engineers involved; it was probably a coincidence discovered after the fact.
"Perhaps the worst thing about being a systems person is that
other, non-systems people think that they understand the daily
tragedies that compose your life. For example, a few weeks ago,
I was debugging a new network file system that my research
group created. The bug was inside a kernel-mode component,
so my machines were crashing in spectacular and vindic-
tive ways. After a few days of manually rebooting servers, I
had transformed into a shambling, broken man, kind of like a
computer scientist version of Saddam Hussein when he was
pulled from his bunker, all scraggly beard and dead eyes and
florid, nonsensical ramblings about semi-imagined enemies.
As I paced the hallways, muttering Nixonian rants about my
code, one of my colleagues from the HCI group asked me what
my problem was. I described the bug, which involved concur-
rent threads and corrupted state and asynchronous message
delivery across multiple machines, and my coworker said,
“Yeah, that sounds bad. Have you checked the log files for
errors?” I said, “Indeed, I would do that if I hadn’t broken every
component that a logging system needs to log data. I have a
network file system, and I have broken the network, and I have
broken the file system, and my machines crash when I make
eye contact with them. I HAVE NO TOOLS BECAUSE I’VE
DESTROYED MY TOOLS WITH MY TOOLS. My only logging
option is to hire monks to transcribe the subjective experience
of watching my machines die as I weep tears of blood.”
Ah, the joys of trying to come up with creative ways to get feedback from your code when literally nothing is available. Can I make the beeper beep in morse code? Can I just put a variable delay in the code and time it with a stopwatch to know which value was returned from that function? Ughh.
Some of us have worked on embedded systems or board bringup. Scope and logic analyzer ... Serial port a luxury.
IIRC Windows has good support for debugging device drivers over a serial port. Overall the tooling for dealing with device drivers on Windows is not bad, including a special-purpose static analysis tool and some pretty good testing tools.
Yeah. Been there, done that. Write to an unused address decode to trigger the logic analyzer when I got to a specific point in the code, so I could scroll back through the address bus and figure out what the program counter had done for me to get to that piece of code.
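Something like this, in other words (the address and checkpoint IDs are made up; the only requirement is that the decode is otherwise unused so the analyzer can trigger on it):

    /* Bare-metal sketch: write a checkpoint ID to an unused, memory-mapped
     * address so a logic analyzer triggered on that address records both
     * "we got here" and which checkpoint it was. */
    #define DEBUG_TRIGGER_ADDR ((volatile unsigned int *)0x40000000u) /* assumed unused decode */

    static inline void debug_marker(unsigned int checkpoint_id)
    {
        *DEBUG_TRIGGER_ADDR = checkpoint_id;   /* shows up as a bus write to trigger on */
    }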
This is an interesting piece of creative writing, but virtual machines already existed in 2013. There are very few reasons to experiment on your dev machine.
At the time, Mickens worked at Microsoft Research, and with the Windows kernel development team. There may only be a few reasons to experiment on your dev machine, but that's one environment where they have those reasons.
>Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009.
Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.
Win8. I've been seeing your blog posts show up here and there on HN over the years, so I was half expecting you to pick up on my self-doxx. I'll ping you offline.
"It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification."
It was my understanding that MS now signs 3rd-party kernel-mode code, with quality requirements. In which case, why did they fail to prevent this?
Drivers have had to be signed forever and pass pretty rigorous test suites and static analysis.
The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...
How so? Preventing roll-backs of software updates is a "security feature" in most cases, for better and for worse. Yeah, it would be convenient for tinkerers or in rare events such as these, but it would be a security issue the other 99.9..99% of the time for enterprise users, where security is the main concern.
I don't really understand this, many Linux distributions like Universal Blue advertise rollbacks as a feature. How is preventing a roll-back a "security feature"?
Imagine a driver has an exploitable vulnerability that is fixed in an update. If an attacker can force a rollback to the vulnerable older version, then the system is still vulnerable. Disallowing the rollback fixes this.
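As a minimal sketch of that idea (names invented; real implementations tie this to secure boot, a TPM counter, or similar so the recorded version can't itself be rewound):

    #include <stdint.h>

    /* Hypothetical anti-rollback check: refuse any candidate older than
     * what is already installed. Reinstalling the same version is allowed. */
    static int update_allowed(uint32_t installed_version, uint32_t candidate_version)
    {
        return candidate_version >= installed_version;
    }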
This is what I don't get; it's extremely hard for me to believe this didn't get caught in CI when things started blue-screening. Everywhere I've ever worked, test rebooting/power-cycling was part of CI, with various hardware configs, and that ran before even our lighthouse customers saw anything.
Apparently the flaw was added to the config file in post-processing after it had completed testing. So they thought they had testing, but actually didn't.
I was thinking this doesn't seem like a case where all these machines were still on an old or otherwise specific version of Windows that was having issues, and QA just missed that one particular variant in their smoke testing. It seems like it was every Windows instance with that software, so either they don't have basic automated testing, or someone pushed this outside the normal process.
> Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
Did I mention this was 15 years ago? Software development back then looked very different than it does now, especially in Wincore. There was none of this "Cloud-native development" stuff that we all know and love today. GitHub was just about 1 year old. Jenkins wouldn't be a thing for another 2 years.
In this case the "automated test" flipped all kinds of configuration options with repeated reboots of a physical workstation. It took hours to run the tests, and your workstation would be constantly rebooting, so you wouldn't accomplish anything else for the rest of the day. It was faster and cheaper to have 8 devs roll back to yesterday's build maybe once every couple of quarters than to snarl the whole development process with that.
The tests still ran, but they were owned and run by a dedicated test engineer prior to merging the branch up.
Oh I rebooted, I just didn't happen to have the right configuration options to invoke the failure when I rebooted. Not every dev workstation was bluescreening, just the ones with the particular feature enabled.
But as someone already pointed out, the issue was seen on all kinds of Windows hosts, not just the ones running a specific version, specific update, etc.
There's "something that requires highly specific conditions managed to slip past QA" and then there's "our update brought down literally everyone using the software". This isn't a matter of bad luck.
The memory used by the Windows kernel is either paged or non-paged. Non-paged means the memory is pinned in physical RAM; paged means it might be swapped out to disk and paged back in when needed. OP was working on BitLocker, a storage driver that handles disk I/O. That code must be pinned in physical RAM so it is available at all times; otherwise, if it were paged out, an incoming I/O request would find the driver code missing from memory and try to page it in, which triggers another I/O request through that same driver, and so on. The Windows kernel crashes at that point to prevent a runaway system and stops at the point of failure to let you fix the problem.
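For readers who haven't seen the mechanics, here is a minimal sketch using the public WDK conventions (defaults inside the Windows codebase itself may differ, as the parent's story suggests, and the function names here are hypothetical): pageable routines are explicitly placed in the PAGE section and assert with PAGED_CODE() that a page fault is legal where they run; anything left out of that section stays resident.

    #include <ntddk.h>

    DRIVER_INITIALIZE DriverEntry;
    NTSTATUS CheckSomething(void);   /* hypothetical helper on the init path */

    /* Routines placed in the PAGE section may be paged out. Anything that can
     * run at DISPATCH_LEVEL or sit on the paging I/O path must NOT go here. */
    #ifdef ALLOC_PRAGMA
    #pragma alloc_text(PAGE, CheckSomething)
    #endif

    NTSTATUS CheckSomething(void)
    {
        /* On checked builds, asserts IRQL <= APC_LEVEL, i.e. that taking a
         * page fault right here is legal. */
        PAGED_CODE();
        return STATUS_SUCCESS;
    }

    NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        UNREFERENCED_PARAMETER(DriverObject);
        UNREFERENCED_PARAMETER(RegistryPath);
        /* If a pageable routine ends up reachable from the path that services
         * paging I/O for the very volume holding this driver, paging it back
         * in becomes the circular dependency described above. */
        return CheckSomething();
    }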
Linux is a bit unusual in that kernel memory is generally physically mapped, and unless you use vmalloc any memory you allocate has to correspond to pages backed by RAM. This also ties into how file I/O happens, how swapping works, and why Linux's approach to I/O is actually closer to Multics and OS/400 than to OG Unix.
Many other systems instead default to using the full power of virtual memory, including swapping kernel space to disk, with only the things that explicitly need to stay in RAM being allocated from "non-paged" or "wired" memory.
Must have been DNS... when they did the deployment run and the necessary code was pulled and the DNS failed and then the wrong code got compiled...</sarcasm>
That they don't even do staged/A-B pushes was also <mind-blown-away>.