I have a rather extensive homelab, painstakingly set up over time. It works great, I love it. A few questions for you guys:
- My real problem is disaster recovery. It would take me forever to replicate everything, if I could even remember it all. Router configurations, switch configurations, NAS, all the various docker containers scattered across different vlans, etc. I mapped out my network early on but failed to keep it up to date over time. Is there a good tool to draw, document, and keep up-to-date diagrams of my infra?
- Backup and upgrading is also a persistent problem for me. I will often set up a container, come back to it 6 months later, and have no idea what I did. I have dozens of containers scattered across different machines (NUCs, NAS, desktops, servers, etc). Every container service feels like it has its own convention for where the bind mounts need to go, what user it should be run as, what permissions etc it needs. I can't keep track of it all in my head, especially after the fact. I just want to be able to hit backup, restore, and upgrade on some centralized interface. It makes me miss the old cattle days with VM clone/snapshot. I still have a few VMs running on a proxmox machine that is sort of close, but nothing like that for the entire home lab.
I really want to get to a point, or at least move towards a solution where I could in theory, torch my house and do a full disaster recovery restore of my entire setup.
There has to be something simpler than going full kubernetes to manage a home setup. What do you guys use?
I think the core problem you have is a lack of consistency. I'd try to trim down the number of different device types, deployment methods, etc. that you have. It's too hard to scale wide as one or two people, but you can go very deep in one stack: you get to know it well, and if everything is set up exactly the same way, you end up in a spot where either everything works or nothing works, which ultimately forces everything to work well.
I accidentally wiped the drives on my server last year, but it wasn't so bad due to my setup.
My strat is having a deploy.ps1 script in every project folder that sets the project up. 80% of the time this is making the VM, rcloning the files, and installing/starting the service if needed. Roughly 3ish lines, using custom commands. It takes about 100ms to deploy, assuming the VM is up. Sometimes the script gets more complicated, but the general idea is that whatever system I use under the covers, running deploy.ps1 will set it up: no internet, no dependencies, this script will work until the heat death of the universe.
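For illustration, a minimal sketch of the pattern (shown as plain shell rather than my PowerShell; the remote name, paths, and service name are made up):
#!/bin/sh
set -e
# push the project files to the target VM
rclone sync ./files vm-host:/srv/myproject
# install/start the service if the project has one
ssh vm-host 'systemctl enable --now myproject.service'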
After losing everything I reran the deploys (I'd put them all into a list) and got everything back.
I'm with you on some of the routing setups; mine aren't 100% documented either. I think my router/switch config is honestly too complex and I should just tone it down.
Oh, I wish companies like Framework or System76 would launch a reproducible manufacturing process where you code your hardware similarly to how Nix/Guix manage builds. Disaster recovery would be much easier. Perhaps Super Micro Computer can do this already, but they target data centers. One can only dream.
I think the complexity of your infrastructure is not helpful, so I would start by reducing it.
I personally use one single server (Fujitsu D3417-B, Xeon E3-1225 v5, 64GB ECC, and a WD SN850x 2TB NVMe) with Proxmox, plus an OpenWRT router (Banana Pi BPI-R3). The Proxmox box draws ~12W and the OpenWRT router around 4.5W at idle, and with NodeJS I can run MeshCommander on OpenWRT for remote administration.
Since I use ZFS (with native encryption), I can do a full backup on an external drive via:
# create backup pool on external drive
zpool create -f rpoolbak /dev/sdb
# create snapshot on proxmox NVMe
zfs snapshot -r "rpool@backup-2024-01-19"
# recursively send the snapshot to the external drive (initial backup)
# pv only is there to monitor the transfer speed
zfs send -R --raw rpool@backup-2024-01-19 | pv | zfs recv -Fdu rpoolbak
An incremental backup can be done by adding the -I option and providing two snapshots instead of one, to mark the start and end of the incremental stream:
# create new snapshot
zfs snapshot -r "rpool@backup-2024-01-20"
# only send everything between 2024-01-19 and 2024-01-20
zfs send -RI --raw rpool@backup-2024-01-19 rpool@backup-2024-01-20 | pv | zfs recv -Fdu rpoolbak
That is easy, fast, and pretty reliable. If you use `zfs-auto-snapshot`, you can rewind the filesystem in 15-minute steps.
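Rewinding to one of those automatic snapshots is a single command (the dataset and snapshot names here are just illustrative):
# roll back to an auto-snapshot (-r also destroys any snapshots newer than it)
zfs rollback -r rpool/data/vm-100-disk-0@zfs-auto-snap_frequent-2024-01-20-1415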
This is also pretty helpful for rewinding virtual machines[1]. Recently my NVMe (a Samsung 980 Pro) died, and restoring the backup via zfs (the backup commands above in reverse order, from rpoolbak to rpool) took about 2 hours for 700GB, and I was back online with my server. I was pretty happy with the result, although I know that ZFS is kind of "experimental" in some cases, especially encryption.
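For completeness, the restore is essentially the same pipeline with the pools swapped (a sketch, run from a live system after recreating rpool; exact flags depend on your layout):
# send the backup pool's snapshot back onto the rebuilt rpool
zfs send -R --raw rpoolbak@backup-2024-01-20 | pv | zfs recv -Fdu rpool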
I've been trying to re-bootstrap my own homelab and I'm sort of gridlocked because I know I will forget things or, yes, "can't keep it all in my head"... so I've been spending a lot of time documenting things and trying to make it simple, vanilla, and community-supported when possible. As a result I have no homelab, or "homeprod" as some others here have coined it :)
I also found a friend who I am convincing of the same issues and so we may try mirroring our homelab documentation/process so that the "bus factor" isn't 1.
I think people mix up "homelab" and "homeprod". There are some other comments here mentioning homeprod. To me, what you describe sounds more like a homeprod setup. But I like the "lab" part, where one can just relax and experiment with lots of stuff.
I do have an ECC instance for data, backed up to remote S3. I could torch it and it would still remain a lab: I could just buy new hardware and restore. But that only covers the valuable data (isolated to a couple of containers), not a "restore of my entire setup".
I’m quite a homelab novice and I’m always tinkering, so I have the same problem as you.
So I spent time creating a robust backup solution. Every week I back up all the docker volumes for each container to an external NAS drive.
Then I also run a script which collects the docker compose config for every single container into one gigantic compose file.
The benefit of this is that I can make tweaks to containers and not have to manually record every tweak I made; I know that on Sunday at midnight it all gets backed up.
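A rough sketch of the weekly job (the NAS mount point and stack paths are whatever yours are; this is the idea, not my exact script):
#!/bin/sh
set -e
# archive every named docker volume to the NAS mount
for vol in $(docker volume ls -q); do
  docker run --rm -v "$vol":/data:ro -v /mnt/nas/backup:/backup alpine \
    tar czf "/backup/$vol-$(date +%F).tar.gz" -C /data .
done
# collect every stack's compose file into one big reference file
find /opt/stacks -name 'docker-compose.yml' -exec cat {} + \
  > "/mnt/nas/backup/all-compose-$(date +%F).yml"
(The naively concatenated file isn't a valid compose file on its own, but it records every container's config in one place.)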
k8s is a lot easier for homelabs than it used to be, and imo it's quicker than nix for building a declarative homelab. templates like this one can deploy a cluster in a few hours: https://github.com/onedr0p/cluster-template
I deliberately nuked my on-prem cluster a few weeks ago and was fully restored within 2 hours (including host OS reinstalls), and most of that was waiting for backup restores over my slow internet connection. I think the break-even for me was around 15-20 containers: managing backups, config, etc. scales pretty well with k8s.