I have a rather extensive homelab, painstakingly set up over time. It works great, I love it. A few questions for you guys:
- My real problem is disaster recovery. It would take me forever to replicate everything, if I could even remember it all. Router configurations, switch configurations, NAS, all the various docker containers scattered across different vlans, etc. I mapped out my network early on but failed to keep it up to date over time. Is there a good tool to draw, document, and keep up-to-date diagrams of my infra?
- Backup and upgrading is also a persistent problem for me. I will often set up a container, come back to it 6 months later, and have no idea what I did. I have dozens of containers scattered across different machines (NUCs, NAS, desktops, servers, etc). Every container service feels like it has its own convention for where the bind mounts need to go, what user it should be run as, what permissions etc it needs. I can't keep track of it all in my head, especially after the fact. I just want to be able to hit backup, restore, and upgrade on some centralized interface. It makes me miss the old cattle days with VM clone/snapshot. I still have a few VMs running on a proxmox machine that is sort of close, but nothing like that for the entire home lab.
I really want to get to a point, or at least move towards a solution where I could in theory, torch my house and do a full disaster recovery restore of my entire setup.
There has to be something simpler than going full kubernetes to manage a home setup. What do you guys use?
I think the core problem you have is a lack of consistency. I'd try to trim down the number of different device types, deployment methods, etc. that you have. It's too hard to scale wide as one or two people, but you can go very deep in one stack: you get to know it well, and if everything is set up exactly the same way, you end up in a spot where either everything works or nothing works, which ultimately forces everything to work well.
I accidentally wiped the drives on my server last year, but it wasn't so bad due to my setup.
My strat is having a deploy.ps1 script in every project folder that sets the project up. 80% of the time this is making the VM, rcloning the files, and installing/starting the service if needed. Roughly 3ish lines, using custom commands. It takes about 100ms to deploy, assuming the VM is up. Sometimes the script gets more complicated, but the general idea is that whatever system I use under the covers, running deploy.ps1 will set it up: no internet, no dependencies, this script will work until the heat death of the universe.
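For illustration, a minimal sketch of the pattern (shown as plain shell rather than my PowerShell; the remote name, paths, and service name are made up):
#!/bin/sh
set -e
# push the project files to the target VM
rclone sync ./files vm-host:/srv/myproject
# install/start the service if the project has one
ssh vm-host 'systemctl enable --now myproject.service'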
After losing everything I reran the deploys (I'd put them all into a list) and got everything back.
I'm with you on some of the routing setups; mine aren't 100% documented either. I think my router/switch config is honestly too complex and I should just tone it down.
Oh, I wish companies like Framework or System76 would launch a reproducible manufacturing process where you code your hardware similarly to how Nix/Guix manage builds. Disaster recovery would be much easier. Perhaps Super Micro Computer can do this already, but they target data centers. One can only dream.
I think the complexity of your infrastructure is not helpful, so I would start by reducing it.
I personally use one single server (Fujitsu D3417-B, Xeon E3-1225 v5, 64GB ECC, and a WD SN850x 2TB NVMe) with Proxmox, plus an OpenWRT router (Banana Pi BPI-R3). The Proxmox box draws ~12W and the OpenWRT router around 4.5W at idle, and with NodeJS I can run MeshCommander on OpenWRT for remote administration.
Since I use ZFS (with native encryption), I can do a full backup on an external drive via:
# create backup pool on external drive
zpool create -f rpoolbak /dev/sdb
# create snapshot on proxmox NVMe
zfs snapshot -r "rpool@backup-2024-01-19"
# recursively send the snapshot to the external drive (initial backup)
# pv only is there to monitor the transfer speed
zfs send -R --raw rpool@backup-2024-01-19 | pv | zfs recv -Fdu rpoolbak
An incremental backup can be done by adding the -I option and providing two snapshots instead of one, to mark the start and end of the incremental stream:
# create new snapshot
zfs snapshot -r "rpool@backup-2024-01-20"
# only send everything between 2024-01-19 and 2024-01-20
zfs send -RI --raw rpool@backup-2024-01-19 rpool@backup-2024-01-20 | pv | zfs recv -Fdu rpoolbak
That is easy, fast, and pretty reliable. If you use `zfs-auto-snapshot`, you can rewind the filesystem in 15-minute steps.
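Rewinding to one of those automatic snapshots is a single command (the dataset and snapshot names here are just illustrative):
# roll back to an auto-snapshot (-r also destroys any snapshots newer than it)
zfs rollback -r rpool/data/vm-100-disk-0@zfs-auto-snap_frequent-2024-01-20-1415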
This is also pretty helpful for rewinding virtual machines[1]. Recently my NVMe (a Samsung 980 Pro) died, and restoring the backup via zfs (the backup commands above in reverse order, from rpoolbak to rpool) took about 2 hours for 700GB, and I was back online with my server. I was pretty happy with the result, although I know that ZFS is kind of "experimental" in some cases, especially encryption.
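For completeness, the restore is essentially the same pipeline with the pools swapped (a sketch, run from a live system after recreating rpool; exact flags depend on your layout):
# send the backup pool's snapshot back onto the rebuilt rpool
zfs send -R --raw rpoolbak@backup-2024-01-20 | pv | zfs recv -Fdu rpool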
I've been trying to re-bootstrap my own homelab and I'm sort of gridlocked because I know I will forget things or, yes, "can't keep it all in my head"... so I've been spending a lot of time documenting things and trying to make it simple, vanilla, and community-supported when possible. As a result I have no homelab, or "homeprod" as some others here have coined it :)
I also found a friend who I am convincing of the same issues and so we may try mirroring our homelab documentation/process so that the "bus factor" isn't 1.
I think people mix up "homelab" and "homeprod". There are some other comments here mentioning homeprod. To me, what you describe sounds more like a homeprod setup. But I like the "lab" part, where one can just relax and experiment with lots of stuff.
I do have an ECC instance for data, backed up to remote S3. I could torch it and it would still remain a lab: I could just buy new hardware and restore. But that only covers the valuable data (isolated to a couple of containers), not a "restore of my entire setup".
I’m quite a homelab novice and I’m always tinkering, so I have the same problem as you.
So I spent time creating a robust backup solution. Every week I back up all the docker volumes for each container to an external NAS drive.
Then I also run a script which collects the docker compose config for every single container into one gigantic compose file.
The benefit of this is that I can make tweaks to containers and not have to manually record every tweak I made; I know that on Sunday at midnight it all gets backed up.
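A rough sketch of the weekly job (the NAS mount point and stack paths are whatever yours are; this is the idea, not my exact script):
#!/bin/sh
set -e
# archive every named docker volume to the NAS mount
for vol in $(docker volume ls -q); do
  docker run --rm -v "$vol":/data:ro -v /mnt/nas/backup:/backup alpine \
    tar czf "/backup/$vol-$(date +%F).tar.gz" -C /data .
done
# collect every stack's compose file into one big reference file
find /opt/stacks -name 'docker-compose.yml' -exec cat {} + \
  > "/mnt/nas/backup/all-compose-$(date +%F).yml"
(The naively concatenated file isn't a valid compose file on its own, but it records every container's config in one place.)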
k8s is a lot easier for homelabs than it used to be, and imo it's quicker than nix for building a declarative homelab. templates like this one can deploy a cluster in a few hours: https://github.com/onedr0p/cluster-template
I deliberately nuked my on-prem cluster a few weeks ago and was fully restored within 2 hours (including host OS reinstalls), and most of that was waiting for backup restores over my slow internet connection. I think the break-even for me was around 15-20 containers: managing backups, config, etc. scales pretty well with k8s.