docker01 Collapse and Rebuild

Rebuilding a failed Docker host into a more segmented, documented, and resilient homelab service architecture.

Overview

This project documents the collapse of my original docker01 virtual machine, the containment steps taken to protect the wider Proxmox environment, and the rebuild that followed.

What began as a failed service host became a broader infrastructure correction. Instead of trying to preserve an increasingly unclear VM state, I used the incident to rebuild around clearer service boundaries, safer change habits, stronger documentation, and a more deliberate operating model for the homelab.

Problem This Project Solved

The original docker01 VM had gradually accumulated too many responsibilities while I was expanding the homelab. It was being used for containerized services, reverse proxy testing, documentation-related workflows, and future service planning all at once.

During that period, I made an unsafe storage change: I reduced LVM-thin (thin-provisioned LVM) capacity after allocated space had already been consumed. That caused host-level storage instability and made docker01 unreliable enough that continued repair work felt riskier than a controlled rebuild.
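
A minimal sketch of the kind of pre-check that could flag a change like this before it is applied. The function name can_shrink, the GiB units, and the 20% safety margin are assumptions made for illustration, not something taken from the original setup. It only compares numbers and does not resize anything.

```shell
#!/usr/bin/env bash
# Sketch: refuse a storage reduction when the target size does not leave
# headroom above what is already consumed.
# Sizes are whole GiB; the default 20% margin is an arbitrary assumption.

# can_shrink USED_GIB TARGET_GIB [MARGIN_PERCENT]
# Returns 0 when the target is at least used + margin, 1 otherwise.
can_shrink() {
  local used=$1 target=$2 margin=${3:-20}
  # Required size = used space plus margin percent of used space (integer math).
  local required=$(( used + used * margin / 100 ))
  [ "$target" -ge "$required" ]
}

# On a Proxmox host, current thin pool usage can be read with lvs, for example:
#   lvs -o lv_name,lv_size,data_percent,metadata_percent pve/data
# (pve/data is the pool name on a default install; yours may differ.)

if can_shrink 80 120; then
  echo "ok: 120 GiB leaves headroom over 80 GiB used"
fi
if ! can_shrink 80 90; then
  echo "refuse: 90 GiB is too close to 80 GiB used"
fi
```

The point of the margin is that "larger than what is used right now" is not the same as safe: thin-provisioned volumes can keep growing after the check runs.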

The immediate problem was a failed Docker host. The deeper problem was that the environment had outgrown its early, loosely defined structure.

Response Goals

The response had several priorities:

Stop making additional changes to an untrusted VM
Confirm that the failure was isolated to docker01
Protect the healthy parts of the Proxmox environment before rebuilding
Decide whether repair or rebuild was the better path
Replace the failed VM with a cleaner, more purpose-driven layout
Preserve the lessons in a durable incident writeup and follow-on documentation

Environment at the Time of Failure

The incident affected a growing homelab environment built around Proxmox VE.

Relevant systems included:

Proxmox VE as the hypervisor
docker01 as the affected Docker host
jellyfin01 as a dedicated media server VM
wg-home-gw as the WireGuard gateway VM
NAS-backed storage supporting the wider environment

At the time, docker01 had become the center of too much experimental work, which made it harder to reason about its state when the failure occurred.

Failure and Initial Assessment

The failure became visible through repeated data transfer issues, declining trust in the VM, and uncertainty about whether additional service deployment work would compound the problem.

The immediate technical trigger was shrinking LVM-thin storage below capacity that had already been consumed. The broader architectural weaknesses were just as important:

Too many unrelated duties assigned to a single VM
Fast-moving experimentation without enough standardization
Limited rollback planning during early development
Incomplete documentation of the VM’s evolving role
A growing gap between “what the host was called” and “what it was actually responsible for”

The question stopped being “How do I rescue docker01?” and became:

“What rebuild would leave the homelab in a better place than it was before?”

Containment and Protection

Before replacing the failed Docker host, I focused on preventing one bad VM from becoming a larger environment problem.

The containment process included:

Pausing further deployment work on docker01
Verifying that the Proxmox host, Jellyfin VM, WireGuard gateway, and NAS storage were not showing the same failure pattern
Creating backups and snapshots for the healthy VMs before continuing
Intentionally avoiding a host-level backup of the unstable LVM-thin state
Scoping the rebuild around the failed workload rather than destabilizing the entire environment
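
The snapshot and backup step above might look roughly like the following when run on the Proxmox host. This is a sketch under stated assumptions: the VM IDs (101, 102), the snapshot name, and the backup storage ID nas-backup are hypothetical placeholders. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: snapshot and back up the known-good VMs before touching the failed one.
# Hypothetical values: VM IDs, snapshot name, and the "nas-backup" storage ID.
# DRY_RUN=1 (the default) prints commands instead of executing them.

DRY_RUN=${DRY_RUN:-1}
SNAP_NAME="pre_rebuild"
BACKUP_STORAGE="nas-backup"
HEALTHY_VMS="101 102"   # for example jellyfin01 and wg-home-gw

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

protect_healthy_vms() {
  local vmid
  for vmid in $HEALTHY_VMS; do
    # Point-in-time snapshot stored alongside the VM.
    run qm snapshot "$vmid" "$SNAP_NAME"
    # Full backup written to separate storage.
    run vzdump "$vmid" --mode snapshot --storage "$BACKUP_STORAGE"
  done
}

protect_healthy_vms
```

Leaving the failed VM out of HEALTHY_VMS is deliberate: the list names only systems that are still trusted. Setting DRY_RUN=0 would execute the commands for real.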

This was one of the most important operational lessons from the incident: when a system becomes untrusted, the priority is not speed. The priority is to protect what is still known-good.

Rebuild Strategy

I chose to rebuild instead of repair.

That decision was based on several factors:

The VM was early enough in its lifecycle that a clean rebuild was practical
The existing configuration had become harder to trust than to reproduce
A rebuild offered a better opportunity to separate duties more clearly
The recovery process could be documented and reused later
Rebuilding reduced the amount of technical debt being carried forward

Rather than recreating docker01 as another general-purpose host, I split the work into more intentional roles.

Rebuild Path

1. Replaced the single failed VM with purpose-built systems

Two new VMs were created to separate concerns that had previously been mixed together:

mediabe01 for Docker Compose-managed backend services and media-adjacent workloads
documentation01 for Git-based documentation work, Markdown editing, diagrams, and repository workflows

This reduced sprawl and made each VM easier to describe, troubleshoot, and evolve.

2. Re-established a clean Linux baseline

Ubuntu Server was installed on the rebuilt media backend host with standard initial setup:

User account creation
SSH access
System updates
Foundational package installation

This helped reset the deployment process to something known and repeatable.

3. Reinstalled Docker and Compose with a cleaner operating model

Docker Engine and the Compose plugin were installed on mediabe01, and the primary user was added to the Docker group for day-to-day container management.

The rebuild also created a more consistent working path for Compose files and service configuration:

sudo mkdir -p /opt/docker
sudo chown -R "$USER":"$USER" /opt/docker

The goal was not only to restore container hosting, but to give future services a clearer home.

4. Validated the foundation with a simple container

Before returning to larger service deployments, I tested Docker and network reachability with a simple whoami container.

That validation step helped confirm:

Docker was functioning
Container deployment worked
Networking behaved as expected
The environment was ready for more meaningful services later
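
A validation service like this can be very small. The sketch below writes a minimal Compose file for a whoami container. The traefik/whoami image, host port 8080, and the whoami subdirectory are illustrative assumptions, since the original deployment details are not recorded here.

```shell
#!/usr/bin/env bash
# Sketch: generate a minimal Compose file for a whoami test service.
# Assumptions: traefik/whoami image (serves on container port 80),
# host port 8080, and a per-service directory under /opt/docker.

DOCKER_ROOT=${DOCKER_ROOT:-/opt/docker}

# Creates $DOCKER_ROOT/whoami/compose.yaml and prints its path.
write_whoami_compose() {
  local dir="$DOCKER_ROOT/whoami"
  mkdir -p "$dir" || return 1
  cat > "$dir/compose.yaml" <<'EOF'
services:
  whoami:
    image: traefik/whoami
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
  echo "$dir/compose.yaml"
}

# Typical use on the Docker host (not executed by this script):
#   write_whoami_compose
#   docker compose -f /opt/docker/whoami/compose.yaml up -d
#   curl -fsS http://localhost:8080
#   docker compose -f /opt/docker/whoami/compose.yaml down
```

A successful curl response exercises image pulls, container startup, and port publishing in one step, which is most of what later services depend on.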

5. Verified the wider environment after the rebuild

Post-rebuild validation included:

Docker daemon functionality
Compose workflow readiness
GitHub SSH authentication
SMB/NAS storage accessibility
Inter-VLAN routing confirmation
VM snapshot verification

The rebuild was considered successful only once the system was not merely online but trustworthy again.
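
Checks like the ones listed above can be collected into a small PASS/FAIL runner so they are repeatable after future changes. This is a sketch, not the script used during the rebuild: the mount path /mnt/nas and the address 10.0.30.10 are hypothetical, and the helper names are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: a tiny PASS/FAIL runner for post-rebuild checks.

FAILED=0

# check LABEL COMMAND [ARGS...]
# Runs the command quietly, prints PASS or FAIL, and counts failures.
check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    FAILED=$((FAILED + 1))
  fi
}

# GitHub's SSH test exits non-zero even on success, so match its greeting.
github_ssh_ok() {
  ssh -T -o BatchMode=yes git@github.com 2>&1 | grep -q "successfully authenticated"
}

# Intended to run on the rebuilt Docker host; returns 0 only if all checks pass.
run_post_rebuild_checks() {
  check "docker daemon"  docker info
  check "compose plugin" docker compose version
  check "github ssh"     github_ssh_ok
  check "nas mount"      mountpoint -q /mnt/nas      # hypothetical mount path
  check "storage vlan"   ping -c 1 -W 2 10.0.30.10   # hypothetical NAS address
  [ "$FAILED" -eq 0 ]
}

# Snapshot verification happens on the Proxmox host instead, for example:
#   qm listsnapshot <vmid>
```

Calling run_post_rebuild_checks prints one line per check, so a failed item is visible at a glance instead of being inferred from a service that happens to be up.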

What Changed After the Rebuild

The strongest outcome of the incident was not the replacement VM itself. It was the improvement in the way the environment was organized.

Before

docker01 had become a general-purpose container host for too many unrelated efforts. Its name described the technology in use, not the role it served.

After

The homelab moved toward clearer system ownership:

mediabe01 for media backend and Compose-managed service work
jellyfin01 for the dedicated media server
documentation01 for Git, Markdown, and infrastructure documentation workflows
Storage isolated through the Storage VLAN
Infrastructure management kept distinct through the Management VLAN

This gave the environment a better foundation for future service growth.

Operational Improvements

The incident pushed several habits from “good ideas” into actual operating standards:

Snapshot before risky changes
Preserve known-good systems before acting on unstable ones
Separate VM responsibilities more deliberately
Treat rebuilds as a valid recovery strategy when repair would preserve confusion
Validate foundations with simple tests before layering on complexity
Record decisions while details are still fresh
Build documentation as part of operations, not after operations

What I’m Learning

This project taught me that good infrastructure work is not just about making services run. It is about making the environment easier to understand, safer to change, and less fragile when something does go wrong.

Specific lessons included:

Why storage changes require more caution than “the UI allows it”
How quickly a general-purpose VM can accumulate hidden complexity
When a rebuild is healthier than a repair
Why backups and snapshots matter most before the next risky step
How role separation improves both troubleshooting and future scaling
Why documentation becomes more valuable immediately after an incident
How small validation services can reduce uncertainty before deploying larger workloads

Current Status

The failed docker01 design has been retired.

The rebuilt environment is now organized around more explicit VM roles, with mediabe01 handling containerized backend services and documentation01 supporting Git-based homelab documentation workflows. The incident itself has been documented in the public homelab repository and continues to shape how future projects are scoped, backed up, and written down.

Planned Improvements

Create a standard VM build checklist
Publish a Docker host baseline setup guide
Add formal backup schedule documentation
Document restore testing procedures
Add monitoring with Uptime Kuma
Publish reverse proxy deployment notes
Document firewall rules between VLANs
Create reusable Docker Compose deployment templates

Related Documentation

Full incident report on GitHub: homelab/docs/incidents/docker01 Collapse and Rebuild.md
VLAN design documentation on GitHub: homelab/docs/networking/vlan-design.md
Future: Docker host baseline guide
Future: backup and restore procedures
Future: firewall documentation