If I understand correctly, user namespace support was the major thing missing for the secure use of Linux Containers (LXC). This should bring widespread, extremely rapid container-based virtualization under Linux closer to reality. Red Hat is nominally backing this via libvirt, but mainly offers paravirt-based solutions. IBM, which authored a lot of the kernel code, seems to be interested in the same stuff for large-scale servers. We live in interesting times.
Container-based virtualization is already here, but without user namespaces root privilege in the container implies root privilege in the host. That's no longer the case; interesting times indeed.
Maybe. Remember it is new code. In principle it should be no worse than constraining a user, but there are still risks of kernel compromise (there was one just the other day). I would check whether you use any kernel modules that are not in the main tree, though. Also, I don't believe it is full root, i.e. it can only do operations that have been whitelisted (e.g. creating other namespaces); otherwise you could just use mknod and overwrite the host hard drive. So code you want to run as root may not work. You can do things like open low-numbered ports, though, if you open a network namespace, as long as the host sets up bridging for it.
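To make that concrete, here's a rough C sketch (untested, and assumes a kernel with CONFIG_USER_NS and a distro that permits unprivileged user namespaces) of what "root that isn't really root" looks like: unshare into new user and network namespaces, map uid 0 to your unprivileged uid, and watch mknod of a block device still get refused:

```c
/* Rough sketch: unshare into new user + network namespaces as an
 * unprivileged user, map uid 0 inside to our real uid outside, and show
 * that "root" in there still can't mknod a block device. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

int main(void)
{
    uid_t outer = geteuid();
    char map[64];
    int fd;

    /* New user namespace: full capabilities, but only over that namespace.
     * New network namespace: just a loopback; the host can bridge in a veth. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET) == -1) {
        perror("unshare");
        return 1;
    }

    /* Map uid 0 inside the namespace to our unprivileged uid outside. */
    snprintf(map, sizeof(map), "0 %u 1", (unsigned)outer);
    fd = open("/proc/self/uid_map", O_WRONLY);
    if (fd == -1 || write(fd, map, strlen(map)) == -1) {
        perror("uid_map");
        return 1;
    }
    close(fd);

    printf("euid inside the namespace: %u\n", (unsigned)geteuid()); /* 0 */

    /* Not real root: creating a device node for the host's disk is refused. */
    if (mknod("/tmp/fake-sda", S_IFBLK | 0600, makedev(8, 0)) == -1)
        printf("mknod: %s (as expected)\n", strerror(errno));

    return 0;
}
```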
I've had the same sentiment for many years... Linux is indeed a miracle.
But after spending a lot of time with it recently (e.g. getting LXC running last night), I've concluded that it would be very hard to design a system with a security API that is worse than Linux.
The issue is that Linus doesn't design ANYTHING. He doesn't believe in design; he only believes in evolution.
Unix was designed, whereas Linux is mostly a bunch of code bolted on top of Unix. It's not sustainable in the long term. Someone needs to actually design something eventually, so there is a stable base for more evolution.
Spend some time looking through these:
- traditional Unix ACL-based security
- traditional resource limits
- chroot (not secure, but used as a "part" of many security solutions)
- capabilities
- seccomp
- LSM-based
  - SELinux
  - AppArmor
  - ...
- LXC
  - cgroups
  - namespaces (apparently completed with this kernel release)
  - LXC user space tools
- ptrace sandboxing
  - (at least a dozen projects use this)
- user mode linux
And you'll realize it's just a huge mess. I'm sure the complexity makes Linux measurably more insecure in practice. Or it just provides employment for a lot of people -- who knows.
There's never going to be a way to clean this all up, since people are relying on all of it.
I don't have that much experience with the alternatives; I'm sure they're messy in their own right. (I've used many OSes, but not security-wise.) But this definitely has me looking towards FreeBSD and such. Too bad it is more expensive on EC2.
I mentioned Minix 3 here before -- it's probably a pipe dream, but being a microkernel, it seems like a good basis for a future secure Unix. It actually was designed in some sense.
From what I gather people take the existence of root escalation exploits on Linux for granted. If that weren't so (and it shouldn't be with a microkernel), then traditional Unix security might actually cover a lot of cases that all these hacks on top are patching up.
EDIT: Also, Linux should look to DJB for guidance. Out of all the hairiness, how do you even do this on Linux (or any Unix)? http://cr.yp.to/unix/disablenetwork.html It just seems crazy.
With the disclaimer that I very much have a dog in the fight, you might want to look at illumos[1] and its distributions like SmartOS[2] and OmniOS[3]. It has a secure, robust container model (with a hat tip to FreeBSD jails for providing inspiration over a decade ago) and a mature least-privilege model that minimizes attack surface -- not to mention ZFS, DTrace, KVM and other goodies. At the very least, you can take solace in knowing that others share your desire for cleaner alternatives...
Thanks for the links; I had heard of SmartOS but not known much about the technology.
What sort of disappointed me about LXC is that you end up with an init process and 7 or 8 children of it in each container. I am more interested in sandboxing at the level of a single process. In a lot of cases you just want to run somebody else's Python code and look at its stdout; you don't need to spin up init and family to do that.
There are a hundred and one projects like this but most of them seem half-baked.
Capsicum [1] looks like what I'm interested in; there seemed to be effort around a Linux port a couple years ago but I don't think it happened. Does Illumos/SmartOS provide anything like this?
You don't need init in each container, and the encouraged model of having a whole distro in a container is bonkers. Play around with clone(2)/unshare(2) directly and it is fairly simple. All you need to know about pid 1 is that if it terminates your namespace goes away, and orphan processes will reparent to it (and some signals are blocked). If you have a single process then this doesn't really matter. You can do all this from Python, I expect; I have done it all from Lua with no issues.
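Something like this rough sketch (not production code; it needs root or CAP_SYS_ADMIN unless you also throw in CLONE_NEWUSER as above) is all it takes to run a single command as pid 1 of its own PID and mount namespaces, with no init and no distro image:

```c
/* Rough sketch: run a single command as pid 1 of fresh PID and mount
 * namespaces via clone(2). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static char child_stack[1024 * 1024];

static int child(void *arg)
{
    char **argv = arg;
    /* In here we are pid 1: when we exit the namespace goes away, and any
     * orphaned descendants get reparented to us in the meantime. */
    printf("pid inside the namespace: %d\n", (int)getpid());
    execvp(argv[0], argv);
    perror("execvp");
    return 127;
}

int main(int argc, char **argv)
{
    pid_t pid;
    int status;

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    /* The stack grows down, so pass the top of the buffer. */
    pid = clone(child, child_stack + sizeof(child_stack),
                CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, &argv[1]);
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```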
OK, from what I understand "LXC" is basically the user-space tools that give you the distro in the container... it's more of a VM model.
But yeah, I think I just need the underlying cgroups, and possibly some of the namespaces. Although I don't care all that much if untrusted code can see what processes are running, just as long as it can't affect them.
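For the cgroups half, the raw interface is just a filesystem. A quick sketch, assuming the v1 memory controller is mounted at /sys/fs/cgroup/memory (the mount point is distro-dependent, the group name "sandbox" is made up, and this normally needs root):

```c
/* Quick sketch of the raw cgroup (v1) interface: make a group under the
 * memory controller, cap it at 64 MB, and move ourselves into it before
 * exec'ing untrusted code. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int write_str(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fputs(value, f) >= 0) ? 0 : -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}

int main(void)
{
    char pid[32];

    /* Creating the directory creates the group. */
    if (mkdir("/sys/fs/cgroup/memory/sandbox", 0755) == -1)
        perror("mkdir");   /* EEXIST is fine if the group already exists */

    if (write_str("/sys/fs/cgroup/memory/sandbox/memory.limit_in_bytes",
                  "67108864") == -1)
        perror("memory.limit_in_bytes");

    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    if (write_str("/sys/fs/cgroup/memory/sandbox/tasks", pid) == -1)
        perror("tasks");

    /* From here on this process and its children are capped at 64 MB;
     * exec the untrusted code next. */
    return 0;
}
```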
Just curious what you were using containers for from Lua? Sounds interesting.
I started using them largely for testing netlink code, as it is much easier to create some isolated network devices than risk messing about with the real ones. This is part of a fairly comprehensive Linux binding for Lua https://github.com/justincormack/ljsyscall
Have you read Spender's (the grsec maintainer) recent comments on Linux security development practices? They provided interesting insight imho, even if they were a bit flame-y.
Unix was designed? You wish. The very early parts of Unix were designed, but pretty much everything after that was bolted on under the banner of "worse is better".
For disabling the network, as linked, the new seccomp mechanism is probably what you need, as you can blacklist the socket system calls (or better, whitelist). But it is pretty new.
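Roughly like this (a blacklist sketch only, x86_64-specific since 32-bit x86 multiplexes through socketcall(2), and needs Linux >= 3.5; a real sandbox should whitelist instead):

```c
/* Blacklist sketch: a seccomp-BPF filter that makes socket(2) fail with
 * EACCES, so the process can't create new network sockets. */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif

int main(void)
{
    struct sock_filter filter[] = {
        /* Bail out if we are not on the arch this filter was written for. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* Make socket(2) fail with EACCES; allow everything else. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EACCES),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required so an unprivileged process is allowed to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl");
        return 1;
    }

    if (socket(AF_INET, SOCK_STREAM, 0) == -1)
        perror("socket");   /* fails with EACCES from here on */
    return 0;
}
```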
I have high hopes for the automatic NUMA balancing work. We are getting a lot of cores per socket, each capable of generating a lot of memory traffic, and the disparity between local and remote access, which was already pretty large, continues to grow.
That said, the scheduler does pretty well; it beats manual binding unless you put a lot of experimentation into the binding.
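For comparison, this is the kind of manual binding the automatic balancer is supposed to make unnecessary (a sketch using libnuma, link with -lnuma; node 0 is just an example):

```c
/* Sketch of manual NUMA binding with libnuma: pin the thread to one node
 * and allocate its memory on that node, so every access stays local. */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;

    if (numa_available() == -1) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    /* Run on node 0 and take memory from node 0; the automatic balancer
     * tries to achieve the same locality without this hand-holding. */
    numa_run_on_node(0);
    char *buf = numa_alloc_onnode(len, 0);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;          /* touch pages so they are actually placed */

    numa_free(buf, len);
    return 0;
}
```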
Does anyone else love reading these, even though they are mostly way over your head? There is something about kernel release notes that makes them fascinating...
3.7.x has been borderline breaking on my laptop for a month now and I'm fairly sure that the issues I've experienced affect a large number of users (excessive heat, low battery life on Sandy/Ivy-Bridge based notebooks). Compiling 3.8 as I'm typing this.
Pretty excited about seeing F2FS in mobile devices eventually, as mobile devices still need all the I/O performance they can get, considering most manufacturers are using cheap flash storage, and even the high-end ones aren't that fast.
At the moment it's had some reports of syncing issues (e.g. not syncing when it should) and a few performance oddities. All that being said, I'm not sure I've seen any reports of outright corruption, but it is fairly new compared to the ext family. I'd say stick with ext4/ext3 if you're using the desktop SSD for anything serious; if not, go for it -- it doesn't look like it'll kill your cat and/or wife.
Oh, I don't care if it eats my root install. I've got a "reinstall everything and reconfigure everything" script that gets me back to 99% after a new install. Plus, I keep my home partition on a "totally stable TM" RAID10 BTRFS array, so I don't mind a bit of risk.
_So far_ my experience has been btrfs is stable, meaning I haven't had any data corruption and TRIM works great. That being said I have multiple backups and images of the system :)
Storing file data inside inodes is a really great feature -- it should make file-system-as-DB approaches usable in a lot of cases where they didn't make sense before.
It's human nature to tool worship. I am no exception. But please, let's worship tools that enable radical innovation. The Linux kernel was that tool--in 1999. It's still improving, but it's not worthy of this much attention. It merely allows existing innovation to work better. It's like celebrating the latest Xeon processor--cool, yes, but not worth this much collective distraction.
Sad. This had Thunderbolt hotplug working on my Mac at one RC, and then it got removed and didn't get added back in. The Thunderbolt Ethernet adapter does work, but it was kernel panicking yesterday on rc7. I'll have to see tonight if it's fixed.
To others, where would I report such an issue if it's still broken?