Wow, this hits close to home. Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009. I added a check on the driver initialization path and didn't annotate the code as non-paged because frankly I didn't know at the time that the Windows kernel was paged. All my kernel development experience up to that point was with Linux, which isn't paged.
BitLocker is a storage driver, so that code turned into a circular dependency: the attempt to page in the code resulted in a call to that not-yet-paged-in code.
The reason I didn't catch it with local testing was that I never tried rebooting with BitLocker enabled on my dev box while I was working on that code. Everyone on the team who did have BitLocker enabled got a BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team of about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
> without even the most basic level of qualification
That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.
Exactly this is what I was missing in the story. Why a mission-critical product like this doesn't go to a limited set of users before going live for the whole user base is beyond the comprehension of anyone who has ever come across software bugs (so, billions of people). And that's after we've already gotten past the part about not testing internally well, or at all. Some clusterfuck must have happened there, which is still better than imagining that this is the normal way the organization operates. That would be a very scary vision. Serious rethinking of trusting this organization is due everywhere!
The funniest part was seeing the Mercedes F1 pit crew staring at BSODs at their workstations[1] while wearing CrowdStrike t-shirts. Some jokes just write themselves. Imagine if they lose the race because of their sponsor.
But hey, at least they actually dogfood the products of their sponsors instead of just taking money to shill random stuff.
Because CrowdStrike is an EDR solution, it likely has tamper-proofing features (scheduled tasks, watchdog services, etc.) that re-enable it. These features are designed to prevent malware or manual attackers from disabling it.
These features drive me nuts because they prevent me, the computer owner/admin, from disabling it. One person thought up techniques like "let's make a scheduled task that sledgehammers out the knobs these 'dumb' users keep turning", and then everyone else decided to copycat that awful practice.
Are you saying that the compliance rule requires the software to be impossible to uninstall? That once it's installed, no one can uninstall it? I have a hard time believing it's impossible to remove the software. In the extreme case, you could reimage the machine and reinstall Windows without CrowdStrike.
Or are you saying that it is possible to uninstall, but once you do that, you're not in compliance, so while it's technically possible to uninstall, you'll be breaking the rules if you do so?
The person I originally replied to, rkagerer, said there was some technical measure preventing rkagerer from uninstalling it even though rkagerer has admin on the computer.
I was referring to the difficulty of overriding the various techniques that certain modern software like this uses to trigger automatic updates at times outside the admin's control.
Disabling a scheduled task is easy, but unfortunately vendors are piling on additional, less obvious hooks. E.g. Dropbox recreates its scheduled task every time you (run? update?) it, and I've seen others that use the various autostart registry locations (there are lots of them) and non-obvious executables to perform similar "repair" operations. You wind up in a "Deny Access" game of whack-a-mole, and even that isn't always effective. Uninstalling isn't an option if there's a business need for the software.
The fundamental issue is that their developers / product managers have decided they know better than you. For the many users out there who are clueless about IT, that may be accurate, but it's frustrating to me and probably to others who upvoted the original comment.
Is what you're saying relevant in the Crowdstrike case? If you don't want Crowdstrike and you're an admin, I assume there are instructions that allow you to uninstall it. I assume the tamper-resistant features of Crowdstrike won't prevent you from uninstalling it.
It's currently a DoS by the crashing component, so it's already broken the Availability part of the Confidentiality/Integrity/Availability triad that defines the goals of security.
But a loss of availability is so much more palatable than the others, plus the others often result in manually restricting availability anyway when discovered.
I think the wider societal impact from the loss of availability today - particularly for those in healthcare settings - might suggest this isn't always the case
What about the importance of data integrity? If important pre-op data/instructions go missing or get saved to the wrong patient record, causing botched surgeries; if post-op medications are misprescribed; if there is huge confusion and delay in critical follow-up surgeries because a 100% available system messed up patient data across hospitals nationwide; if malpractice lawsuits put entire hospitals out of business; etc., then is that fallout clearly worth having an available system in the first place?
Huh? We're talking about hypotheticals here. You're saying availability is clearly more important than data integrity. I'm saying that if a buggy kernel loadable module allowed systems to keep on running as if nothing was wrong, but actually caused data integrity problems while the system is running, that's just as bad or worse.
If Linux and Windows have similar architectural flaws, Microsoft must have some massive execution problems. They are getting embarrassed in QA by a bunch of hobbyists, lol.
If you're planning around bugs in security modules, you're better off disabling them: malware routinely uses bugs in drivers to escalate, so the bug you're allowing for can make the escalation vector even more powerful, since now it gets Ring 0 and early loading.
Isn't DoSing your own OS an attack vector? And a worse one when it's used in critical infrastructure where lives are at stake.
There is a reasonable balance to strike, sometimes it's not a good idea to go to extreme measures to prevent unlikely intrusion vectors due to the non-monetary costs.
In the absence of a Crowdstrike bug, if an attacker is able to cause Crowdstrike to trigger a bluescreen, I assume the attacker would be able to trigger a bluescreen in some other way. So I don't think this is a good argument for removing the check.
That assumes it's more likely than CrowdStrike mass-bricking all of these computers... this is the balance: it's not about possibility, it's about probability.
I use Explorer Patcher on a Windows 11 machine. It had such a history of crash loops with Explorer that they implemented this circuit-breaker functionality.
It's baffling how fast and wide the blast radius was for this Crowdstrike update. Quite impressive actually, if you think about it - updating billions of systems that quickly.
This was my first thought too. I'm not that familiar with the space, but I would think for something this sensitive the rollout would be staggered at least instead of what looks like globally all at the same time.
This is the bit I am still trying to understand. In CrowdStrike you can define how many updates behind a host stays, i.e. n (latest), n-1 (one behind), n-2, etc. This update was applied both to hosts on the 'latest' policy and to the n-2 hosts. To me it appears there was more to this than just a corrupt update, otherwise how was this policy ignored? Unless the policy doesn't actually apply this deeply to these content updates and only covers some smaller aspect, which would also be very concerning.
I guess we won't really know until they release the post mortem...
Yeah, my guess is that they roll out the updates to every client at the same time, and then have the client implement the n-1/2/whatever part locally. That worked great-ish until they pushed a corrupt (empty) update file which crashed the client when it tried to interpret the contents... Not ideal, and obviously there isn't enough internal testing before sending stuff out to actual clients.
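Purely as a sketch of that guess (nothing here reflects CrowdStrike's actual design; the names and the sequence-number scheme are invented), a client-side n-1/n-2 gate only helps if it runs before the downloaded content is ever parsed; if the parser itself chokes on a corrupt file first, the gate never gets a say:

    #include <stdint.h>

    /* Hypothetical client-side staleness gate for "n-1 / n-2" update policies. */
    struct channel_update {
        uint32_t sequence;   /* monotonically increasing release number */
        /* ... payload ... */
    };

    static int should_apply(const struct channel_update *u,
                            uint32_t latest_sequence,
                            uint32_t versions_behind /* 0 = latest, 1 = n-1, 2 = n-2 */)
    {
        if (u->sequence > latest_sequence)
            return 0;                                  /* newer than anything we know of */
        return (latest_sequence - u->sequence) >= versions_behind;
    }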
But you do get free worldwide advertising that everyone uses your product. CrowdStrike sure did, and I'm sure they'll use that to sell it to more people.
> It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
As discussed elsewhere, it is claimed that the file causing the crash was a data file that had been corrupted in the delivery process. So the development team and their CI probably tested a good version, but the customers received a bad one.
If that is true, the problem is that the driver uses an unsigned file at all, so all customer machines are continuously at risk from local attacks. And then it does not do any integrity check on the data the file contains, which is a big no-no for all untrusted data, whether in user space or kernel.
> And then it does not do any integrity check on the data it contains, which is a big no no for all untrusted data, whether user space or kernel.
To me, this is the inexcusable sin. These updates should be signed and signatures validated before the file is read. Ideally the signing/validating would be handled before distribution so that when this file was corrupted, the validation would have failed here.
But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
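As a minimal illustration of that point (the header layout, field names, and magic value below are hypothetical, not CrowdStrike's actual channel-file format), even a dumb sanity check run before any parsing would reject an empty or zeroed file:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical update-file header, assumed for illustration only. */
    #define UPDATE_MAGIC 0x43533031u   /* "CS01" */

    struct update_header {
        uint32_t magic;        /* must equal UPDATE_MAGIC             */
        uint32_t version;      /* format version this parser supports */
        uint32_t payload_len;  /* bytes following the header          */
    };

    /* Reject anything empty, truncated, or not in the expected format,
     * before a single field of the payload is interpreted. */
    static int update_is_sane(const uint8_t *buf, size_t len)
    {
        struct update_header hdr;

        if (buf == NULL || len < sizeof(hdr))
            return 0;                    /* empty or truncated file */

        memcpy(&hdr, buf, sizeof(hdr));
        if (hdr.magic != UPDATE_MAGIC)
            return 0;                    /* zeroed or corrupted header */
        if (hdr.version != 1)
            return 0;                    /* format we don't understand */
        if (hdr.payload_len != len - sizeof(hdr))
            return 0;                    /* length field contradicts reality */

        return 1;                        /* safe to hand to the real parser */
    }

A signature check should still come first; the point is that a cheap structural check like this is the last line of defense when everything upstream has failed.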
The flawed data was added in a post-processing step of the configuration update, after it had been tested internally but before it was copied to their update servers.
"And they promise fast threat mitigation... Let allow them to take over EVERYTHING! With remote access, of course. Some form of overwatch of what they in/out by our staff ? Meh...
And it even allow us to do cuts in headcount and infra by $<digits_here> a year."
> I didn't know at the time that the Windows kernel was paged.
At uni I had a professor in database systems who did not like written exams and mostly did oral exams. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. In my explanation I distinguished between kernel space and user space; I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never pages its own memory. I did not argue the point and passed the exam with the best grade. I never did check that book again to verify my claim, and I have never done any kernel-space development even vaguely close to memory management, so to this day I don't know the exact details.
However, what strikes me here: when that exam happened, around 1985, the NT kernel did not exist yet, I believe. But IIRC a significant part of the DEC VMS kernel team went to Microsoft to work on the NT kernel, so the concept of paging (a part of) kernel memory went with them? Whether VMS --> WNT, every letter incremented by one, is just a coincidence or intentionally the next baby of those developers, I have never understood. As Linux has shown, today much bigger systems can be handled successfully without the extra complications of paging kernel memory. Whether that's a good idea I don't know; at least it's not a necessary one.
The VMS --> WNT acronym relationship was not mentioned, maybe it was just made up later.
One thing I did not know (or maybe did not remember) is that NT was originally developed exclusively for the Intel i860, one of Intel's attempts at RISC. Of course in the late 1980s CISC seemed doomed and everyone was moving to RISC. The code name of the i860 was N10, so that might well be the inside origin of "NT", with the marketing name "New Technology" retrofitted only later.
"New Technology", if you want to search the transcript. Per Dave, marketing did not want to use "NT" for "New Technology" because they thought no one would buy new technology.
Actually, it was not only the x86 hardware that wasn't originally planned for the NT kernel; Windows user space was not the first candidate either. POSIX and maybe even OS/2 were earlier goals.
So the current x86 Windows monoculture came about by accident, because the strategically planned options did not materialize. The user-space history should finally debunk the theory that VMS advancing into WNT was a secret plot by the engineers involved; it was probably a coincidence discovered after the fact.
"Perhaps the worst thing about being a systems person is that
other, non-systems people think that they understand the daily
tragedies that compose your life. For example, a few weeks ago,
I was debugging a new network file system that my research
group created. The bug was inside a kernel-mode component,
so my machines were crashing in spectacular and vindic-
tive ways. After a few days of manually rebooting servers, I
had transformed into a shambling, broken man, kind of like a
computer scientist version of Saddam Hussein when he was
pulled from his bunker, all scraggly beard and dead eyes and
florid, nonsensical ramblings about semi-imagined enemies.
As I paced the hallways, muttering Nixonian rants about my
code, one of my colleagues from the HCI group asked me what
my problem was. I described the bug, which involved concur-
rent threads and corrupted state and asynchronous message
delivery across multiple machines, and my coworker said,
“Yeah, that sounds bad. Have you checked the log files for
errors?” I said, “Indeed, I would do that if I hadn’t broken every
component that a logging system needs to log data. I have a
network file system, and I have broken the network, and I have
broken the file system, and my machines crash when I make
eye contact with them. I HAVE NO TOOLS BECAUSE I’VE
DESTROYED MY TOOLS WITH MY TOOLS. My only logging
option is to hire monks to transcribe the subjective experience
of watching my machines die as I weep tears of blood.”
Ah, the joys of trying to come up with creative ways to get feedback from your code when literally nothing is available. Can I make the beeper beep in morse code? Can I just put a variable delay in the code and time it with a stopwatch to know which value was returned from that function? Ughh.
Some of us have worked on embedded systems or board bringup. Scope and logic analyzer ... Serial port a luxury.
IIRC Windows has good support for debugging device drivers over a serial port. Overall the tooling for dealing with device drivers on Windows is not bad, including a special-purpose static analysis tool and some pretty good testing tools.
Yeah. Been there, done that. Write to an unused address decode to trigger the logic analyzer when I got to a specific point in the code, so I could scroll back through the address bus and figure out what the program counter had done for me to get to that piece of code.
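Something like this, in other words (the address and checkpoint IDs are made up; the only requirement is that the decode is otherwise unused so the analyzer can trigger on it):

    /* Bare-metal sketch: write a checkpoint ID to an unused, memory-mapped
     * address so a logic analyzer triggered on that address records both
     * "we got here" and which checkpoint it was. */
    #define DEBUG_TRIGGER_ADDR ((volatile unsigned int *)0x40000000u) /* assumed unused decode */

    static inline void debug_marker(unsigned int checkpoint_id)
    {
        *DEBUG_TRIGGER_ADDR = checkpoint_id;   /* shows up as a bus write to trigger on */
    }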
This is an interesting piece of creative writing, but virtual machines already existed in 2013. There are very few reasons to experiment on your dev machine.
At the time, Mickens worked at Microsoft Research, and with the Windows kernel development team. There may only be a few reasons to experiment on your dev machine, but that's one environment where they have those reasons.
>Doing a page fault where you can't in the kernel is exactly what I did with my very first patch I submitted after I joined the Microsoft BitLocker team in 2009.
Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.
Win8. I've been seeing your blog posts show up here and there on HN over the years, so I was half expecting you to pick up on my self-doxx. I'll ping you offline.
"It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification."
It was my understanding that MS now signs 3rd-party kernel-mode code, with quality requirements. In which case, why did they fail to prevent this?
Drivers have had to be signed forever and pass pretty rigorous test suites and static analysis.
The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...
How so? Preventing roll-backs of software updates is a "security feature" in most cases, for better and for worse. Yeah, it would be convenient for tinkerers or in rare events such as these, but it would be a security issue the other 99.9..99% of the time for enterprise users, where security is the main concern.
I don't really understand this, many Linux distributions like Universal Blue advertise rollbacks as a feature. How is preventing a roll-back a "security feature"?
Imagine a driver has an exploitable vulnerability that is fixed in an update. If an attacker can force a rollback to the vulnerable older version, then the system is still vulnerable. Disallowing the rollback fixes this.
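As a minimal sketch of that idea (names invented; real implementations tie this to secure boot, a TPM counter, or similar so the recorded version can't itself be rewound):

    #include <stdint.h>

    /* Hypothetical anti-rollback check: refuse any candidate older than
     * what is already installed. Reinstalling the same version is allowed. */
    static int update_allowed(uint32_t installed_version, uint32_t candidate_version)
    {
        return candidate_version >= installed_version;
    }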
This is what I don't get; it's extremely hard for me to believe this didn't get caught in CI when things started blue-screening. Everywhere I've ever worked, test rebooting/power-cycling was part of CI, with various hardware configs, and that ran before even our lighthouse customers saw anything.
Apparently the flaw was added to the config file in post-processing after it had completed testing. So they thought they had testing, but actually didn't.
I was thinking this doesn't seem like a case where all these machines were still on an old or otherwise specific version of Windows that was having issues, and QA just missed that one particular variant in their smoke testing. It seems like it was every Windows instance with that software, so either they don't have basic automated testing, or someone pushed this outside the normal process.
> Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
Did I mention this was 15 years ago? Software development back then looked very different than it does now, especially in Wincore. There was none of this "Cloud-native development" stuff that we all know and love today. GitHub was just about 1 year old. Jenkins wouldn't be a thing for another 2 years.
In this case the "automated test" flipped all kinds of configuration options with repeated reboots of a physical workstation. It took hours to run the tests, and your workstation would be constantly rebooting, so you wouldn't accomplish anything else for the rest of the day. It was faster and cheaper to have 8 devs roll back to yesterday's build maybe once every couple of quarters than to snarl the whole development process with that.
The tests still ran, but they were owned and run by a dedicated test engineer prior to merging the branch up.
Oh I rebooted, I just didn't happen to have the right configuration options to invoke the failure when I rebooted. Not every dev workstation was bluescreening, just the ones with the particular feature enabled.
But as someone already pointed out, the issue was seen on all kinds of Windows hosts, not just the ones running a specific version, specific update, etc.
There's "something that requires highly specific conditions managed to slip past QA" and then there's "our update brought down literally everyone using the software". This isn't a matter of bad luck.
The memory used by the Windows kernel is either paged or non-paged. Non-paged means the memory is pinned in physical RAM; paged means it might be swapped out to disk and paged back in when needed. OP was working on BitLocker, a storage driver that handles disk I/O. That code must be pinned in physical RAM so it is available at all times; otherwise, if it were paged out, an incoming I/O request would find the driver code missing from memory and try to page it in, which triggers another I/O request through that same driver, and so on. The Windows kernel crashes at that point to prevent a runaway system and stops at the point of failure to let you fix the problem.
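For readers who haven't seen the mechanics, here is a minimal sketch using the public WDK conventions (defaults inside the Windows codebase itself may differ, as the parent's story suggests, and the function names here are hypothetical): pageable routines are explicitly placed in the PAGE section and assert with PAGED_CODE() that a page fault is legal where they run; anything left out of that section stays resident.

    #include <ntddk.h>

    DRIVER_INITIALIZE DriverEntry;
    NTSTATUS CheckSomething(void);   /* hypothetical helper on the init path */

    /* Routines placed in the PAGE section may be paged out. Anything that can
     * run at DISPATCH_LEVEL or sit on the paging I/O path must NOT go here. */
    #ifdef ALLOC_PRAGMA
    #pragma alloc_text(PAGE, CheckSomething)
    #endif

    NTSTATUS CheckSomething(void)
    {
        /* On checked builds, asserts IRQL <= APC_LEVEL, i.e. that taking a
         * page fault right here is legal. */
        PAGED_CODE();
        return STATUS_SUCCESS;
    }

    NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        UNREFERENCED_PARAMETER(DriverObject);
        UNREFERENCED_PARAMETER(RegistryPath);
        /* If a pageable routine ends up reachable from the path that services
         * paging I/O for the very volume holding this driver, paging it back
         * in becomes the circular dependency described above. */
        return CheckSomething();
    }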
Linux is a bit unusual in that kernel memory is generally physically mapped, and unless you use vmalloc any memory you allocate has to correspond to pages backed by RAM. This also ties into how file I/O happens, how swapping works, and why Linux's approach to I/O is actually closer to Multics and OS/400 than to OG Unix.
Many other systems instead default to using the full power of virtual memory, including swapping kernel space to disk, with only the things that explicitly need to stay in RAM being allocated from "non-paged" or "wired" memory.
Must have been DNS... when they did the deployment run and the necessary code was pulled and the DNS failed and then the wrong code got compiled...</sarcasm>
That they don't even do staged/A-B pushes was also <mind-blown-away>.