Owner / Builder

docker01 Collapse and Rebuild

Rebuilding a failed Docker host into a more segmented, documented, and resilient homelab service architecture.

Overview

This project documents the collapse of my original docker01 virtual machine, the containment steps taken to protect the wider Proxmox environment, and the rebuild that followed.

What began as a failed service host became a broader infrastructure correction. Instead of trying to preserve an increasingly unclear VM state, I used the incident to rebuild around clearer service boundaries, safer change habits, stronger documentation, and a more deliberate operating model for the homelab.

Problem This Project Solved

The original docker01 VM had gradually accumulated too many responsibilities while I was expanding the homelab. It was being used for containerized services, reverse proxy testing, documentation-related workflows, and future service planning all at once.

During that period, I made an unsafe storage change: I reduced LVM-thin capacity after allocated space had already been consumed. That caused host-level storage instability and made docker01 unreliable enough that continued repair work felt riskier than a controlled rebuild.
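
A capacity check of this kind can be scripted before any storage reduction. The following is a minimal sketch, not the procedure used during the incident: shrink_is_safe is a hypothetical helper, the 20 percent headroom is an assumed margin, and the sizes passed in are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: decide whether reducing storage to a target size
# still leaves room for data that has already been written.
# Arguments: used_gib target_gib [headroom_percent]; whole numbers only.
shrink_is_safe() {
  used_gib="$1"
  target_gib="$2"
  headroom_pct="${3:-20}"   # assumed safety margin, not a Proxmox default
  # Required space = used data plus the headroom margin (integer math, rounded up).
  required_gib=$(( (used_gib * (100 + headroom_pct) + 99) / 100 ))
  if [ "$target_gib" -ge "$required_gib" ]; then
    echo "safe: target ${target_gib}G >= required ${required_gib}G"
  else
    echo "unsafe: target ${target_gib}G < required ${required_gib}G"
    return 1
  fi
}
```

On the Proxmox host, the real usage figures would come from the LVM tools, for example: lvs -o lv_name,lv_size,data_percent,metadata_percent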

The immediate problem was a failed Docker host. The deeper problem was that the environment had outgrown its early, loosely defined structure.

Response Goals

The response had several priorities:

  • Stop making additional changes to an untrusted VM
  • Confirm that the failure was isolated to docker01
  • Protect the healthy parts of the Proxmox environment before rebuilding
  • Decide whether repair or rebuild was the better path
  • Replace the failed VM with a cleaner, more purpose-driven layout
  • Preserve the lessons in a durable incident writeup and follow-on documentation

Environment at the Time of Failure

The incident affected a growing homelab environment built around Proxmox VE.

Relevant systems included:

  • Proxmox VE as the hypervisor
  • docker01 as the affected Docker host
  • jellyfin01 as a dedicated media server VM
  • wg-home-gw as the WireGuard gateway VM
  • NAS-backed storage supporting the wider environment

At the time, docker01 had become the center of too much experimental work, which made it harder to reason about its state when the failure occurred.

Failure and Initial Assessment

The failure became visible through repeated data transfer issues, declining trust in the VM, and uncertainty about whether further service deployment work would compound the problem.

The immediate technical trigger was shrinking LVM-thin storage after its allocated space had already been consumed, which left the storage in an unsafe state. The broader architectural weaknesses were just as important:

  • Too many unrelated duties assigned to a single VM
  • Fast-moving experimentation without enough standardization
  • Limited rollback planning during early development
  • Incomplete documentation of the VM’s evolving role
  • A growing gap between “what the host was called” and “what it was actually responsible for”

The question stopped being “How do I rescue docker01?” and became:

“What rebuild would leave the homelab in a better place than it was before?”

Containment and Protection

Before replacing the failed Docker host, I focused on preventing one bad VM from becoming a larger environment problem.

The containment process included:

  • Pausing further deployment work on docker01
  • Verifying that the Proxmox host, Jellyfin VM, WireGuard gateway, and NAS storage were not showing the same failure pattern
  • Creating backups and snapshots for the healthy VMs before continuing
  • Intentionally avoiding a host-level backup of the unstable LVM-thin state
  • Scoping the rebuild around the failed workload rather than destabilizing the entire environment

This was one of the most important operational lessons from the incident: when a system becomes untrusted, the priority is not speed. The priority is to protect what is still known-good.
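
The snapshot step for the healthy VMs can be sketched as a small dry-run helper. This is an illustration under assumptions: plan_snapshots is a hypothetical function, and the VM IDs and snapshot name are placeholders rather than values from this environment. It only prints qm snapshot commands so they can be reviewed before anything runs on the Proxmox host.

```shell
#!/bin/sh
# Print (do not run) a `qm snapshot` command for each healthy VM.
# VM IDs and the snapshot name are placeholders for illustration.
plan_snapshots() {
  snap_name="$1"
  shift
  for vmid in "$@"; do
    printf 'qm snapshot %s %s\n' "$vmid" "$snap_name"
  done
}
```

Calling plan_snapshots pre_rebuild 101 102 prints one command per VM ID. Once the list looks right, the output can be piped to sh on the Proxmox host to take the snapshots.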

Rebuild Strategy

I chose to rebuild instead of repair.

That decision was based on several factors:

  • The VM was early enough in its lifecycle that a clean rebuild was practical
  • The existing configuration had become harder to trust than to reproduce
  • A rebuild offered a better opportunity to separate duties more clearly
  • The recovery process could be documented and reused later
  • Rebuilding reduced the amount of technical debt being carried forward

Rather than recreating docker01 as another general-purpose host, I split the work into more intentional roles.

Rebuild Path

1. Replaced the single failed VM with purpose-built systems

Two new VMs were created to separate concerns that had previously been mixed together:

  • mediabe01 for Docker Compose-managed backend services and media-adjacent workloads
  • documentation01 for Git-based documentation work, Markdown editing, diagrams, and repository workflows

This reduced sprawl and made each VM easier to describe, troubleshoot, and evolve.

2. Re-established a clean Linux baseline

Ubuntu Server was installed on the rebuilt media backend host with standard initial setup:

  • User account creation
  • SSH access
  • System updates
  • Foundational package installation

This helped reset the deployment process to something known and repeatable.

3. Reinstalled Docker and Compose with a cleaner operating model

Docker Engine and the Compose plugin were installed on mediabe01, and the primary user was added to the docker group for day-to-day container management.

The rebuild also created a more consistent working path for Compose files and service configuration:

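# Create a shared home for Compose projects.
sudo mkdir -p /opt/docker
# Hand ownership to the current user and their primary group (quoted to avoid word splitting).
sudo chown -R "$(id -un):$(id -gn)" /opt/docker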

The goal was not only to restore container hosting, but to give future services a clearer home.
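
One way to keep that home tidy is a directory per service. The sketch below assumes such a layout; the per-service structure, the new_service_dir helper, and the compose.yaml placeholder are illustrative choices, not a record of how mediabe01 is actually organized.

```shell
#!/bin/sh
# Hypothetical layout: one directory per service, each holding its own compose.yaml.
# base defaults to /opt/docker; pass another path to try it somewhere harmless.
new_service_dir() {
  name="$1"
  base="${2:-/opt/docker}"
  mkdir -p "$base/$name" || return 1
  # Create an empty placeholder compose.yaml only if one does not exist yet.
  [ -e "$base/$name/compose.yaml" ] || : > "$base/$name/compose.yaml"
  echo "$base/$name"
}
```

Running new_service_dir whoami would create /opt/docker/whoami with an empty compose.yaml to fill in, and print the new path.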

4. Validated the foundation with a simple container

Before returning to larger service deployments, I tested Docker and network reachability with a simple whoami container.

That validation step helped confirm:

  • Docker was functioning
  • Container deployment worked
  • Networking behaved as expected
  • The environment was ready for more meaningful services later
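
A validation service like this needs very little configuration. The following is a minimal sketch rather than the exact file used in the rebuild: the traefik/whoami image, host port 8080, and the write_whoami_compose helper are assumptions made for illustration.

```shell
#!/bin/sh
# Write a minimal Compose file for a whoami test service into the given directory.
# Assumptions: the traefik/whoami image and host port 8080 are illustrative choices.
write_whoami_compose() {
  dir="$1"
  mkdir -p "$dir" || return 1
  cat > "$dir/compose.yaml" <<'EOF'
services:
  whoami:
    image: traefik/whoami
    ports:
      - "8080:80"
EOF
}
```

After write_whoami_compose /opt/docker/whoami, the service can be started with docker compose -f /opt/docker/whoami/compose.yaml up -d, checked with curl http://localhost:8080, and removed again with the same compose command ending in down.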

5. Verified the wider environment after the rebuild

Post-rebuild validation included:

  • Docker daemon functionality
  • Compose workflow readiness
  • GitHub SSH authentication
  • SMB/NAS storage accessibility
  • Inter-VLAN routing confirmation
  • VM snapshot verification

The rebuild was only considered successful once the system was not merely online, but believable again.
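
Checks like these are easier to repeat when they share one small wrapper. The sketch below is an assumption about how such a checklist could be scripted, not the script used here: run_check is a hypothetical helper, and the mount path and IP address in the commented examples are placeholders.

```shell
#!/bin/sh
# Run a named check and print PASS or FAIL based on the command's exit status.
run_check() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Example checks (commented out; hosts and paths are placeholders):
# run_check "docker daemon"  docker info
# run_check "compose plugin" docker compose version
# run_check "nas mount"      mountpoint -q /mnt/nas
# run_check "inter-vlan"     ping -c 1 -W 2 192.0.2.10
# run_check "github ssh"     sh -c "ssh -T git@github.com 2>&1 | grep -q 'successfully authenticated'"
```

The GitHub check matches on the greeting text because ssh -T git@github.com exits with a non-zero status even when authentication succeeds, since GitHub does not provide shell access.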

What Changed After the Rebuild

The strongest outcome of the incident was not the replacement VM itself. It was the improvement in the way the environment was organized.

Before

docker01 had become a general-purpose container host for too many unrelated efforts. Its name described the technology in use, not the role it served.

After

The homelab moved toward clearer system ownership:

  • mediabe01 for media backend and Compose-managed service work
  • jellyfin01 for the dedicated media server
  • documentation01 for Git, Markdown, and infrastructure documentation workflows
  • Storage isolated through the Storage VLAN
  • Infrastructure management kept distinct through the Management VLAN

This gave the environment a better foundation for future service growth.

Operational Improvements

The incident pushed several habits from “good ideas” into actual operating standards:

  • Snapshot before risky changes
  • Preserve known-good systems before acting on unstable ones
  • Separate VM responsibilities more deliberately
  • Treat rebuilds as a valid recovery strategy when repair would preserve confusion
  • Validate foundations with simple tests before layering on complexity
  • Record decisions while details are still fresh
  • Build documentation as part of operations, not after operations

What I’m Learning

This project taught me that good infrastructure work is not just about making services run. It is about making the environment easier to understand, safer to change, and less fragile when something does go wrong.

Specific lessons included:

  • Why storage changes require more caution than “the UI allows it”
  • How quickly a general-purpose VM can accumulate hidden complexity
  • When a rebuild is healthier than a repair
  • Why backups and snapshots matter most before the next risky step
  • How role separation improves both troubleshooting and future scaling
  • Why documentation becomes more valuable immediately after an incident
  • How small validation services can reduce uncertainty before deploying larger workloads

Current Status

The failed docker01 design has been retired.

The rebuilt environment is now organized around more explicit VM roles, with mediabe01 handling containerized backend services and documentation01 supporting Git-based homelab documentation workflows. The incident itself has been documented in the public homelab repository and continues to shape how future projects are scoped, backed up, and written down.

Planned Improvements

  • Create a standard VM build checklist
  • Publish a Docker host baseline setup guide
  • Add formal backup schedule documentation
  • Document restore testing procedures
  • Add monitoring with Uptime Kuma
  • Publish reverse proxy deployment notes
  • Document firewall rules between VLANs
  • Create reusable Docker Compose deployment templates

Related Documentation

  • Full incident report at GitHub: homelab/docs/incidents/docker01 Collapse and Rebuild.md
  • VLAN design documentation at GitHub: homelab/docs/networking/vlan-design.md
  • Future: Docker host baseline guide
  • Future: backup and restore procedures
  • Future: VLAN/firewall documentation