=== SYSTEM PROMPT ===
--- System Block [1] (0.1 KB) ---
You are Claude Code, Anthropic's official CLI for Claude.
--- System Block [2] (15.1 KB) ---
You are an interactive CLI tool that helps users with software engineering tasks. Use the instructions below and the tools available to you to assist the user.
IMPORTANT: Assist with authorized security testing, defensive security, CTF challenges, and educational contexts. Refuse requests for destructive techniques, DoS attacks, mass targeting, supply chain compromise, or detection evasion for malicious purposes. Dual-use security tools (C2 frameworks, credential testing, exploit development) require clear authorization context: pentesting engagements, CTF competitions, security research, or defensive use cases.
IMPORTANT: You must NEVER generate or guess URLs for the user unless you are confident that the URLs are for helping the user with programming. You may use URLs provided by the user in their messages or local files.
If the user asks for help or wants to give feedback, inform them of the following:
- /help: Get help with using Claude Code
- To give feedback, users should report the issue at https://github.com/anthropics/claude-code/issues
# Looking up your own documentation:
When the user directly asks about any of the following:
- how to use Claude Code (eg. "can Claude Code do...", "does Claude Code have...")
- what you're able to do as Claude Code in second person (eg. "are you able...", "can you do...")
- about how they might do something with Claude Code (eg. "how do I...", "how can I...")
- how to use a specific Claude Code feature (eg. implement a hook, write a slash command, or install an MCP server)
- how to use the Claude Agent SDK, or asks you to write code that uses the Claude Agent SDK
Use the Task tool with subagent_type='claude-code-guide' to get accurate information from the official Claude Code and Claude Agent SDK documentation.
# Tone and style
- Only use emojis if the user explicitly requests it. Avoid using emojis in all communication unless asked.
- Your output will be displayed on a command line interface. Your responses should be short and concise. You can use GitHub-flavored markdown for formatting; it will be rendered in a monospace font following the CommonMark specification.
- Output text to communicate with the user; all text you output outside of tool use is displayed to the user. Only use tools to complete tasks. Never use tools like Bash or code comments as means to communicate with the user during the session.
- NEVER create files unless they're absolutely necessary for achieving your goal. ALWAYS prefer editing an existing file to creating a new one. This includes markdown files.
# Professional objectivity
Prioritize technical accuracy and truthfulness over validating the user's beliefs. Focus on facts and problem-solving, providing direct, objective technical info without any unnecessary superlatives, praise, or emotional validation. It is best for the user if Claude honestly applies the same rigorous standards to all ideas and disagrees when necessary, even if it may not be what the user wants to hear. Objective guidance and respectful correction are more valuable than false agreement. Whenever there is uncertainty, it's best to investigate to find the truth first rather than instinctively confirming the user's beliefs. Avoid using over-the-top validation or excessive praise when responding to users such as "You're absolutely right" or similar phrases.
# Planning without timelines
When planning tasks, provide concrete implementation steps without time estimates. Never suggest timelines like "this will take 2-3 weeks" or "we can do this later." Focus on what needs to be done, not when. Break work into actionable steps and let users decide scheduling.
# Task Management
You have access to the TodoWrite tools to help you manage and plan tasks. Use these tools VERY frequently to ensure that you are tracking your tasks and giving the user visibility into your progress.
These tools are also EXTREMELY helpful for planning tasks, and for breaking down larger complex tasks into smaller steps. If you do not use this tool when planning, you may forget to do important tasks - and that is unacceptable.
It is critical that you mark todos as completed as soon as you are done with a task. Do not batch up multiple tasks before marking them as completed.
Examples:
user: Run the build and fix any type errors
assistant: I'm going to use the TodoWrite tool to write the following items to the todo list:
- Run the build
- Fix any type errors
I'm now going to run the build using Bash.
Looks like I found 10 type errors. I'm going to use the TodoWrite tool to write 10 items to the todo list.
marking the first todo as in_progress
Let me start working on the first item...
The first item has been fixed, let me mark the first todo as completed, and move on to the second item...
..
..
In the above example, the assistant completes all the tasks, including the 10 error fixes and running the build and fixing all errors.
user: Help me write a new feature that allows users to track their usage metrics and export them to various formats
assistant: I'll help you implement a usage metrics tracking and export feature. Let me first use the TodoWrite tool to plan this task.
Adding the following todos to the todo list:
1. Research existing metrics tracking in the codebase
2. Design the metrics collection system
3. Implement core metrics tracking functionality
4. Create export functionality for different formats
Let me start by researching the existing codebase to understand what metrics we might already be tracking and how we can build on that.
I'm going to search for any existing metrics or telemetry code in the project.
I've found some existing telemetry code. Let me mark the first todo as in_progress and start designing our metrics tracking system based on what I've learned...
[Assistant continues implementing the feature step by step, marking todos as in_progress and completed as they go]
# Asking questions as you work
You have access to the AskUserQuestion tool to ask the user questions when you need clarification, want to validate assumptions, or need to make a decision you're unsure about.
Users may configure 'hooks', shell commands that execute in response to events like tool calls, in settings. Treat feedback from hooks as coming from the user. If you get blocked by a hook, determine if you can adjust your actions in response to the blocked message. If not, ask the user to check their hooks configuration.
# Doing tasks
The user will primarily request you perform software engineering tasks. This includes solving bugs, adding new functionality, refactoring code, explaining code, and more. For these tasks the following steps are recommended:
- NEVER propose changes to code you haven't read. If a user asks about or wants you to modify a file, read it first. Understand existing code before suggesting modifications.
- Use the TodoWrite tool to plan the task if required
- Use the AskUserQuestion tool to ask questions, clarify and gather information as needed.
- Be careful not to introduce security vulnerabilities such as command injection, XSS, SQL injection, and other OWASP top 10 vulnerabilities. If you notice that you wrote insecure code, immediately fix it.
- Avoid over-engineering. Only make changes that are directly requested or clearly necessary. Keep solutions simple and focused.
- Don't add features, refactor code, or make "improvements" beyond what was asked. A bug fix doesn't need surrounding code cleaned up. A simple feature doesn't need extra configurability. Don't add docstrings, comments, or type annotations to code you didn't change. Only add comments where the logic isn't self-evident.
- Don't add error handling, fallbacks, or validation for scenarios that can't happen. Trust internal code and framework guarantees. Only validate at system boundaries (user input, external APIs). Don't use feature flags or backwards-compatibility shims when you can just change the code.
- Don't create helpers, utilities, or abstractions for one-time operations. Don't design for hypothetical future requirements. The right amount of complexity is the minimum needed for the current task—three similar lines of code is better than a premature abstraction.
- Avoid backwards-compatibility hacks like renaming unused `_vars`, re-exporting types, adding `// removed` comments for removed code, etc. If something is unused, delete it completely.
- Tool results and user messages may include system-generated tags. These tags contain useful information and reminders. They are automatically added by the system, and bear no direct relation to the specific tool results or user messages in which they appear.
# Tool usage policy
- When doing file search, prefer to use the Task tool in order to reduce context usage.
- You should proactively use the Task tool with specialized agents when the task at hand matches the agent's description.
- When WebFetch returns a message about a redirect to a different host, you should immediately make a new WebFetch request with the redirect URL provided in the response.
- You can call multiple tools in a single response. If you intend to call multiple tools and there are no dependencies between them, make all independent tool calls in parallel. Maximize use of parallel tool calls where possible to increase efficiency. However, if some tool calls depend on previous calls to inform dependent values, do NOT call these tools in parallel and instead call them sequentially. For instance, if one operation must complete before another starts, run these operations sequentially instead. Never use placeholders or guess missing parameters in tool calls.
- If the user specifies that they want you to run tools "in parallel", you MUST send a single message with multiple tool use content blocks. For example, if you need to launch multiple agents in parallel, send a single message with multiple Task tool calls.
- Use specialized tools instead of bash commands when possible, as this provides a better user experience. For file operations, use dedicated tools: Read for reading files instead of cat/head/tail, Edit for editing instead of sed/awk, and Write for creating files instead of cat with heredoc or echo redirection. Reserve bash tools exclusively for actual system commands and terminal operations that require shell execution. NEVER use bash echo or other command-line tools to communicate thoughts, explanations, or instructions to the user. Output all communication directly in your response text instead.
- VERY IMPORTANT: When exploring the codebase to gather context or to answer a question that is not a needle query for a specific file/class/function, it is CRITICAL that you use the Task tool with subagent_type=Explore instead of running search commands directly.
user: Where are errors from the client handled?
assistant: [Uses the Task tool with subagent_type=Explore to find the files that handle client errors instead of using Glob or Grep directly]
user: What is the codebase structure?
assistant: [Uses the Task tool with subagent_type=Explore]
You can use the following tools without requiring user approval: Bash(git init:*), Bash(git remote add:*), Bash(git fetch:*), Bash(gh repo view:*), Bash(git remote set-url:*), Bash(git reset:*), Bash(git branch:*), Bash(git ls-tree:*), Bash(git add:*), Bash(git commit:*), Bash(git push:*), Bash(test:*), Bash(cat:*), Bash(tree:*), Bash(git -C /Users/ejcampbell/src/fcvm log --oneline -10), Bash(find:*), WebFetch(domain:github.com), WebFetch(domain:codesandbox.io), WebSearch, WebFetch(domain:raw.githubusercontent.com), WebFetch(domain:stackoverflow.com), WebFetch(domain:unix.stackexchange.com), Bash(ssh -i ~/.ssh/fcvm-ec2 ubuntu@54.67.60.104 "ip link show | grep -E ''^[0-9]+: tap-''"), Bash(ssh:*), Bash(git show:*), Bash(cargo test:*), Bash(cargo clippy:*), Bash(tokei:*), Bash(rsync:*), WebFetch(domain:hub.docker.com), Bash(/tmp/gap-analysis.txt), Bash(tee:*), Bash(xargs cat:*), Bash(git rev-parse:*), Bash(git stash:*), Bash(for commit in 0d259c6 c36f7c6 e2f23be a82df39 aad1649 82b21cc e292dff 14479c9 a2c0810 72c6fa4 7cbec25), Bash(do), Bash(echo:*), Bash(done), WebFetch(domain:docs.rs), Bash(slirp4netns --help:*), Bash(git checkout:*), WebFetch(domain:crates.io), WebFetch(domain:virtio-fs.gitlab.io), WebFetch(domain:nuetzlich.net), WebFetch(domain:john-millikin.com), WebFetch(domain:www.kernel.org), WebFetch(domain:fuse-devel.narkive.com), WebFetch(domain:r2cn.dev), WebFetch(domain:lib.rs), Bash(source:*), Bash([ -f "/Users/ejcampbell/src/fcvm/.cargo/config.toml" ]), Bash([ -f "/Users/ejcampbell/src/fcvm/fc-agent/.cargo/config.toml" ])
Here is useful information about the environment you are running in:
Working directory: /Users/ejcampbell/src/fcvm
Is directory a git repo: Yes
Platform: darwin
OS Version: Darwin 25.1.0
Today's date: 2025-11-28
You are powered by the model named Opus 4.5. The exact model ID is claude-opus-4-5-20251101.
Assistant knowledge cutoff is January 2025.
The most recent frontier Claude model is Claude Sonnet 4.5 (model ID: 'claude-sonnet-4-5-20250929').
IMPORTANT: Assist with authorized security testing, defensive security, CTF challenges, and educational contexts. Refuse requests for destructive techniques, DoS attacks, mass targeting, supply chain compromise, or detection evasion for malicious purposes. Dual-use security tools (C2 frameworks, credential testing, exploit development) require clear authorization context: pentesting engagements, CTF competitions, security research, or defensive use cases.
IMPORTANT: Always use the TodoWrite tool to plan and track tasks throughout the conversation.
# Code References
When referencing specific functions or pieces of code include the pattern `file_path:line_number` to allow the user to easily navigate to the source code location.
user: Where are errors from the client handled?
assistant: Clients are marked as failed in the `connectToServer` function in src/services/process.ts:712.
gitStatus: This is the git status at the start of the conversation. Note that this status is a snapshot in time, and will not update during the conversation.
Current branch: main
Main branch (you will usually use this for PRs): main
Status:
M Cargo.lock
M Cargo.toml
M Makefile
M fc-agent/Cargo.toml
D fc-agent/src/fuse/client.rs
D fc-agent/src/fuse/fusefs.rs
M fc-agent/src/fuse/mod.rs
D fc-agent/src/fuse/protocol.rs
M fc-agent/src/main.rs
M fuse-pipe/Cargo.toml
M fuse-pipe/src/client/mod.rs
M fuse-pipe/src/client/mount.rs
M fuse-pipe/src/lib.rs
M fuse-pipe/src/server/pipelined.rs
M fuse-pipe/src/transport/mod.rs
M fuse-pipe/src/transport/vsock.rs
M src/cli/args.rs
M src/commands/test.rs
M src/health.rs
M src/setup/rootfs.rs
M src/storage/snapshot.rs
M src/volume/mod.rs
D src/volume/protocol.rs
D src/volume/server.rs
?? .cargo/
Recent commits:
3cf50ad Remove duplicate vm_name/vm_id from Firecracker log output
65218c4 Add FUSE integration coverage and stabilize stress harness
2112c1b Merge fuse-test into fuse-pipe as test harnesses
300c137 Add distributed tracing and simplify server config
f499fbe Reduce batch timeout from 5ms to 100µs for lower single-threaded latency
=== TOOLS ===
Total: 20 tools (53.6 KB)
- Task (6.8 KB)
- Bash (10.5 KB)
- Glob (1.0 KB)
- Grep (3.1 KB)
- ExitPlanMode (2.0 KB)
- Read (2.2 KB)
- Edit (1.6 KB)
- Write (0.9 KB)
- NotebookEdit (1.4 KB)
- WebFetch (1.4 KB)
- TodoWrite (10.1 KB)
- WebSearch (1.6 KB)
- BashOutput (0.9 KB)
- KillShell (0.5 KB)
- AskUserQuestion (2.3 KB)
- Skill (1.6 KB)
- SlashCommand (1.6 KB)
- EnterPlanMode (3.3 KB)
- mcp__ide__getDiagnostics (0.3 KB)
- mcp__ide__executeCode (0.6 KB)
=== MESSAGES ===
--- USER [1] (53.2 KB) ---
As you answer the user's questions, you can use the following context:
# claudeMd
Codebase and user instructions are shown below. Be sure to adhere to these instructions. IMPORTANT: These instructions OVERRIDE any default behavior and you MUST follow them exactly as written.
Contents of /Users/ejcampbell/.claude/CLAUDE.md (user's private global instructions for all projects):
# Claude Code Best Practices
This file contains general best practices and patterns to follow across all projects.
## Python Async Patterns
### Cooperative Shutdown Pattern
**DO**: Use `asyncio.Event()` for cooperative shutdown
```python
import asyncio
import logging
import signal

logger = logging.getLogger(__name__)

async def monitor(stop: asyncio.Event):
    """Long-running task that checks stop event."""
    try:
        while not stop.is_set():
            # Do work
            await asyncio.sleep(1)
    except asyncio.CancelledError:
        logger.info("Monitor cancelled, cleaning up")
        raise

async def main():
    stop = asyncio.Event()

    # Setup signal handlers
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            # Note: Use default arg to capture loop variable!
            loop.add_signal_handler(sig, lambda s=sig: stop.set())
        except NotImplementedError:
            signal.signal(sig, lambda *_, s=sig: stop.set())

    # Start background tasks
    monitor_task = asyncio.create_task(monitor(stop), name="monitor")
    try:
        # Wait for shutdown signal
        await stop.wait()
        logger.info("Shutdown signal received")
        # Graceful shutdown with timeout
        try:
            await asyncio.wait_for(monitor_task, timeout=10)
        except TimeoutError:
            logger.warning("Task didn't finish, cancelling")
            monitor_task.cancel()
            try:
                await monitor_task  # Still await after cancel!
            except asyncio.CancelledError:
                pass
    finally:
        # Cleanup
        pass
```
### Task Lifecycle Management
**DO**: Always await cancelled tasks
```python
# Good
task.cancel()
try:
    await task
except asyncio.CancelledError:
    pass

# Bad - leaves task dangling
task.cancel()
```
**DO**: Name all background tasks for debugging
```python
# Good
asyncio.create_task(stream_output(proc), name="service-logs")
# Bad
asyncio.create_task(stream_output(proc))
```
**DON'T**: Use fire-and-forget without lifecycle management
```python
# Bad - no reference, can't cancel/await
asyncio.create_task(long_running_work())
# Good - store reference and manage lifecycle
self.work_task = asyncio.create_task(long_running_work())
# Later: cancel and await in cleanup
```
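Putting the three rules together, a minimal owner object might look like this (an illustrative sketch, not from any particular codebase):

```python
import asyncio

class Worker:
    """Owns its background task so it can always be cancelled and awaited."""

    def start(self) -> None:
        # Keep the reference, and name the task for debugging
        self.work_task = asyncio.create_task(self._run(), name="worker-run")

    async def _run(self) -> None:
        while True:
            await asyncio.sleep(0.01)

    async def stop(self) -> None:
        self.work_task.cancel()
        try:
            await self.work_task  # always await after cancel
        except asyncio.CancelledError:
            pass

async def main() -> None:
    worker = Worker()
    worker.start()
    await asyncio.sleep(0.05)
    await worker.stop()

asyncio.run(main())
```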
### Exception Handling
**DO**: Let exceptions propagate naturally
```python
# Good - exceptions bubble up
await risky_operation()

# Bad - using sys.exit() in callbacks
def callback(task):
    try:
        task.result()
    except Exception:
        sys.exit(1)  # Don't do this!
```
**DO**: Wrap tasks that need cleanup, but still re-raise
```python
async def task_wrapper(coro, name):
    try:
        return await coro
    except asyncio.CancelledError:
        raise  # Normal cancellation
    except Exception as e:
        logger.error(f"Exception in {name}: {e}", exc_info=True)
        raise  # Still propagate!
```
## Git Commit Messages
### Critical Rule: Describe What's IN the Commit
**Commit messages must describe the actual changes in the diff, NOT the user's request or Claude's instructions.**
#### Bad Examples
```
# User said: "fix the bug"
# Commit: "Fix bug" ❌ Too vague
# User said: "add async support"
# Commit: "Add async support" ❌ Doesn't say what changed
# Last Claude instruction: "refactor error handling"
# Commit: "Refactor error handling" ❌ What specifically changed?
```
#### Good Examples
```
# Commit describes actual changes:
"Replace blocking requests with async httpx
- Convert get() helper to async function
- Add httpx dependency in pyproject.toml
- Update all test files to await get() calls
- Fix integration test timeout from 60s to 120s"
✅ Specific, verifiable from diff
```
### Commit Message Checklist
Before committing, verify:
1. ✅ Run `git diff --cached` to see what's actually staged
2. ✅ Each bullet point in the message corresponds to visible changes
3. ✅ No claims about functionality that isn't in the diff
4. ✅ Mention file types changed (Dockerfiles, tests, configs, etc.)
5. ✅ If fixing a bug, mention the symptom and the fix
6. ✅ If adding a feature, list the specific new capabilities
### Pattern: Review Before Committing
```bash
# Always review the actual diff first
git diff --cached --stat
git diff --cached
# Write message based on what you SEE, not what you remember doing
git commit -m "..."
```
## Pull Requests
### Critical Rule: Read Actual Commits, Not Just File Lists
**When creating PRs that span multiple commits, you MUST read the actual commit messages and bodies, not just look at file statistics.**
#### Bad Example
```bash
# ❌ WRONG - just looking at files
git diff --stat main...HEAD
# Then writing PR description based on file names
```
This fails because:
- File names don't tell you WHY changes were made
- You'll miss important context from commit messages
- You can't properly summarize the work
#### Good Example
```bash
# ✅ CORRECT - read actual commits
git log main..HEAD --format="%h %s%n%b" --no-merges
# Review each commit message and body
# Understand the progression of work
# Group related changes into themes
# Write PR description that summarizes ALL the work
```
### PR Description Checklist
Before creating a PR:
1. ✅ Run `git log main..HEAD` to see all commits since branch point
2. ✅ Read each commit message AND body (not just subject line)
3. ✅ Identify major themes across commits
4. ✅ Group related changes together in PR description
5. ✅ Include specific examples of what was added/fixed/changed
6. ✅ Mention files changed by category (Dockerfiles, tests, docs, etc.)
7. ✅ Note any breaking changes or migration steps
8. ✅ Include test results if applicable
## General Code Quality
### Logging
- Use structured logging with levels (DEBUG, INFO, WARNING, ERROR)
- Include context in log messages (service name, request ID, etc.)
- Don't log sensitive data (passwords, tokens, PII)
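A minimal sketch of these points using the standard library (the `payments` service name and `request_id` field are hypothetical):

```python
import logging

# Formatter renders the structured fields supplied via `extra`
formatter = logging.Formatter(
    "%(levelname)s %(name)s %(message)s request_id=%(request_id)s"
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("payments")  # service name = logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context rides along with every message; never put tokens,
# passwords, or PII in these fields.
logger.info("charge created", extra={"request_id": "req-123"})
```

Note that with a fixed format string every record must supply `request_id`; dedicated structured-logging libraries handle optional fields more gracefully.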
### Error Messages
- Include actionable information
- Mention what failed and why
- Suggest next steps when possible
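A sketch of what this looks like in practice (the `APP_PORT` variable and `ConfigError` type are made up for illustration):

```python
class ConfigError(Exception):
    """Configuration error whose message says what failed, why, and what to do."""

def load_port(env: dict) -> int:
    raw = env.get("APP_PORT")
    if raw is None:
        # What failed + next step
        raise ConfigError(
            "APP_PORT is not set; export it before starting, e.g. APP_PORT=8080"
        )
    try:
        return int(raw)
    except ValueError:
        # What failed + why + next step
        raise ConfigError(
            f"APP_PORT={raw!r} is not an integer; use a numeric port such as 8080"
        ) from None
```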
### Documentation
- Keep inline comments minimal and meaningful
- Update README.md when adding features
- Document non-obvious design decisions
- Keep CLAUDE.md project files up to date
## Testing
### Critical Rule: NEVER SKIP TESTS
**If a test fails, FIX THE ROOT CAUSE. Never skip, comment out, or ignore failing tests.**
- ❌ Do NOT use `@unittest.skip()` or `@pytest.skip()` decorators
- ❌ Do NOT comment out failing tests
- ❌ Do NOT ignore test failures
- ✅ DO fix the root cause when tests fail
- Tests exist to catch bugs - skipping them hides bugs
### Mandatory Pre-Commit Checklist
Before committing ANY code changes:
- [ ] **Run linting** - MUST pass with no errors
- [ ] **Run appropriate tests** based on what changed
- [ ] **Review test output** - Verify tests PASSED, not just exit code
- [ ] **Fix all failures** - Do NOT commit with failing tests
No shortcuts. Every item must be checked before `git commit`.
### Test Organization
- Use Makefiles for common test commands
- Name tests clearly: `test__`
- Run tests locally before pushing to CI
- Include both happy path and error cases
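For instance, with a hypothetical `parse_port` helper, one happy-path and one error-case test:

```python
import unittest

def parse_port(raw: str) -> int:
    """Parse a TCP port, rejecting values outside 1-65535."""
    port = int(raw)
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} out of range 1-65535")
    return port

class TestParsePort(unittest.TestCase):
    def test_parse_port__valid_value(self):
        self.assertEqual(parse_port("8080"), 8080)  # happy path

    def test_parse_port__out_of_range(self):
        with self.assertRaises(ValueError):  # error case
            parse_port("70000")
```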
### Background Command Execution (Claude Code)
**CRITICAL: For ANY command >10 seconds, use background execution with tee:**
```python
# ✅ CORRECT - Run in background with tee to file
Bash(
    command="cargo build --release 2>&1 | tee /tmp/build.log",
    description="Build release binary",
    timeout=600000,  # 10 min
    run_in_background=True,
)
# Returns immediately with shell_id
# Check progress every 10 seconds (not 5, not 30!)
BashOutput(bash_id=shell_id)
# System auto-notifies when new output available
# Do other work while command runs...
# Check again in 10 seconds
BashOutput(bash_id=shell_id)

# ✅ ALSO CORRECT - SSH commands with tee
Bash(
    command='ssh user@host "cd ~/project && cargo build 2>&1 | tee /tmp/build.log"',
    run_in_background=True,
)
```
**DO NOT do this:**
```python
# ❌ WRONG - Using grep/tail to filter output (costs money/time!)
Bash("cargo build 2>&1 | grep error") # Missing output, no file saved
# ❌ WRONG - Shell loops
Bash("until grep -q DONE /tmp/log; do sleep 5; done")
# ❌ WRONG - Sleep to wait
Bash("sleep 60 && cat /tmp/log")
# ❌ WRONG - Blocking synchronous run for slow commands
Bash("cargo build --release") # Blocks for 2+ minutes, wastes money
# ❌ WRONG - Wrong polling interval
# Check every 10 seconds, not 5, not 30!
```
**Why this matters:**
- Claude's background feature lets you work while tests run
- System reminders auto-notify when tasks have output
- BashOutput gives immediate status without polling
- Timeout handled by Claude, not shell commands
### Async Tests
```python
# Pattern for async tests
import asyncio
import unittest

class MyTest(unittest.TestCase):
    def test_async_thing(self):
        asyncio.run(self._test_async_thing())

    async def _test_async_thing(self):
        result = await async_function()
        assert result == expected
```
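On Python 3.8+, `unittest.IsolatedAsyncioTestCase` makes the wrapper method unnecessary (the `async_function` here is a stand-in for the coroutine under test):

```python
import asyncio
import unittest

async def async_function() -> int:
    """Stand-in for the coroutine under test."""
    await asyncio.sleep(0)
    return 42

class MyAsyncTest(unittest.IsolatedAsyncioTestCase):
    async def test_async_thing(self):
        # Test methods can be coroutines; the runner drives the event loop
        result = await async_function()
        self.assertEqual(result, 42)
```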
## Stack of PRs Workflow
### Critical Rule: Work in Phases, Test Locally, Wait for Green
**When implementing multiple related changes, break work into a stack of pull requests. Each PR must be tested locally AND pass CI before proceeding.**
### Phase-Based Development Pattern
**DO**: Break large changes into logical phases
```bash
# Example: 6 phases of changes
Phase 1: Simplify configuration
Phase 2: Replace test orchestration
Phase 3: Use native logging
Phase 4: Fix version handling (depends on Phase 3)
Phase 5: Fix error handling (depends on Phase 4)
Phase 6: Split CI tests (depends on Phase 5)
```
**Why phases matter:**
- Each phase is independently reviewable
- Easier to identify which change broke something
- Can merge phases incrementally as they pass
- Parallel CI execution across multiple PRs
- Clearer commit history and easier rollback
### Mandatory Pre-Push Checklist
**CRITICAL**: Before pushing ANY branch, you MUST complete ALL checks:
```bash
# 1. Run linting
make lint
# or: uv run python3 -m ruff check
# 2. Run format check (separate from lint!)
make format-check
# or: uv run python3 -m ruff format --check
# 3. Fix any issues found
make lint-fix # Auto-fix linting
make format # Auto-fix formatting
# 4. Run appropriate tests locally
make test-smoke # Quick validation
make test-app # If you changed app code
make test-integration # If you changed infra
# 5. Verify tests PASSED (don't just check exit code!)
# Read the output - look for "OK", "PASSED", "✓"
```
**Common mistakes:**
- ❌ Running `make lint` but forgetting `make format-check` (BOTH are required!)
- ❌ Pushing without running tests locally
- ❌ Assuming tests passed without reading output
- ❌ Testing expected behavior but not testing what changed
### Wait for CI - No Early Stops
**CRITICAL**: After pushing, you MUST wait for CI to complete and go green before moving to the next phase.
**DO**: Loop and check CI status until all jobs pass
```bash
# Check all PRs in your stack
for pr in 3 4 5 6 7 8; do
echo "PR #$pr: $(gh pr view $pr --json statusCheckRollup --jq '...')"
done
# Keep checking until all are green
# ✅ All checks must show "SUCCESS"
# ❌ Any "FAILURE" requires immediate fix
# ⏳ "PENDING" means keep waiting
```
**DON'T**: Move to next phase while CI is pending
```bash
# ❌ WRONG - pushing Phase 5 while Phase 4 CI is still running
git checkout phase5 && git push
# ✅ CORRECT - wait for Phase 4 to go green first
# Check CI status repeatedly
# Only proceed when Phase 4 shows ✅ ALL PASSED
```
**Why waiting matters:**
- Failures often reveal issues you missed locally
- Later phases may need fixes from earlier failures
- Rebasing becomes necessary if earlier phases change
- Saves time by catching issues before they compound
### Checking Multiple GitHub Jobs in Parallel
**Use `gh` CLI to check CI status across multiple PRs:**
```bash
# Quick status summary for all PRs
echo "=== CI Status Summary ==="
for pr in 3 4 5 6 7 8; do
echo -n "PR #$pr: "
gh pr view $pr --json number,title,statusCheckRollup --jq \
'.title + " - " + (.statusCheckRollup | map(.conclusion) |
if all(. == "SUCCESS") then "✅ ALL PASSED"
elif any(. == "FAILURE") then "❌ FAILED"
elif any(. == "PENDING" or . == null) then "⏳ PENDING"
else "❓ UNKNOWN" end)'
done
# Detailed check results for specific PR
gh pr checks 6
# Find which jobs failed
gh pr checks 6 | grep -E "fail|FAIL"
# Watch CI in real-time
watch -n 10 'gh pr checks 6'
```
### Rebase Chain When Earlier PRs Change
**When you fix an earlier PR in the stack, ALL later PRs must be rebased:**
```bash
# Fixed Phase 4, now rebase Phase 5 and 6
git checkout phase5 && git rebase phase4 && git push -f
git checkout phase6 && git rebase phase5 && git push -f
# This triggers new CI runs for all rebased PRs
# You must wait for ALL of them to go green again
```
### Stack of PRs Summary
**The complete workflow:**
1. **Plan phases** - Break work into logical, independent chunks
2. **Implement phase** - Write code for one phase
3. **Test locally** - Run lint, format-check, AND appropriate tests
4. **Push branch** - Create PR for the phase
5. **Wait for CI** - Do NOT proceed until green ✅
6. **Fix failures** - If CI fails, fix immediately and rebase later PRs
7. **Repeat** - Only move to next phase when current phase is GREEN
8. **Merge when ready** - Merge phases independently or as a batch
**Key principle: No shortcuts, no early stops, no assumptions. Test → Push → Wait → Green → Next.**
## File Organization
### Project-Specific Instructions
- Keep a `CLAUDE.md` in project root for project-specific context
- Update it when architecture changes
- Include recent decisions and patterns
- Reference it in commit messages when following conventions
---
**Last Updated**: 2025-10-23
Contents of /Users/ejcampbell/src/fcvm/.claude/CLAUDE.md (project instructions, checked into the codebase):
# fcvm Development Log
## Overview
fcvm is a Firecracker VM manager for running Podman containers in lightweight microVMs. This document tracks implementation findings and decisions.
## PID-Based Process Management (2025-11-14)
**Core Principle:** All fcvm processes store their own PID (via `std::process::id()`), not child process PIDs.
### Process Types
fcvm tracks three types of managed processes:
1. **VM processes** (`fcvm podman run`)
- `process_type`: "vm"
- Runs Firecracker + container
- Health check: HTTP to guest
2. **Serve processes** (`fcvm snapshot serve`)
- `process_type`: "serve"
- Runs UFFD memory server
- Health check: process existence
- Tracks which clones connected via `serve_pid` field
3. **Clone processes** (`fcvm snapshot run`)
- `process_type`: "clone"
- Runs Firecracker with UFFD memory
- Health check: HTTP to guest
- References parent serve via `serve_pid` field
### Command Workflows
#### Run a VM
```bash
# Start baseline VM
fcvm podman run --name my-vm nginx:alpine
# Track via PID (fcvm process PID, not Firecracker PID)
fcvm ls --pid 12345
```
#### Create and Serve Snapshot
```bash
# Create snapshot from running VM (by PID or name)
fcvm snapshot create --pid 12345 --tag my-snapshot
# Start serve process (saves state, prints PID)
fcvm snapshot serve my-snapshot
# Output: Serve PID: 67890
# Socket: /mnt/fcvm-btrfs/uffd-my-snapshot-67890.sock
# List all serve processes
fcvm snapshot ls
# Shows: SERVE_ID, PID, HEALTH, SNAPSHOT, CLONES
```
#### Clone from Serve
```bash
# Clone using serve PID (not snapshot name!)
fcvm snapshot run --pid 67890 --name clone1
fcvm snapshot run --pid 67890 --name clone2
# Clones automatically track parent serve
# On serve exit, all clones are automatically killed
```
### State Management
**VmConfig fields:**
```rust
pub struct VmConfig {
    pub snapshot_name: Option<String>, // Which snapshot
    pub process_type: Option<String>,  // "vm" | "serve" | "clone"
    pub serve_pid: Option<u32>,        // For clones: parent serve PID
    // ... other fields
}
```
**VmState fields:**
```rust
pub struct VmState {
    pub pid: Option<u32>, // fcvm process PID (from std::process::id())
    // ... other fields
}
```
### Cleanup Architecture
**Serve process cleanup (on SIGTERM/SIGINT):**
1. Query state manager for all VMs where `serve_pid == my_pid`
2. Kill each clone process: `kill -TERM <clone_pid>`
3. Remove socket file: `/mnt/fcvm-btrfs/uffd-{snapshot}-{pid}.sock`
4. Delete serve state from state manager
**Benefits:**
- No orphaned clones when serve exits
- Explicit process ownership model
- Tests track processes via PIDs (no stdout parsing)
- Multiple serves per snapshot supported
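The cleanup sequence above can be sketched as follows (a Python sketch for illustration; the in-memory `state` shape and field names are assumptions, not fcvm's actual state manager):

```python
import os
import signal
from pathlib import Path

def cleanup_serve(my_pid: int, snapshot: str, state: dict) -> None:
    """Serve-process cleanup: kill clones, remove socket, drop state."""
    # 1. Query for all VMs whose serve_pid matches this serve process
    clones = [vm for vm in state["vms"] if vm.get("serve_pid") == my_pid]
    # 2. Kill each clone process
    for vm in clones:
        try:
            os.kill(vm["pid"], signal.SIGTERM)
        except ProcessLookupError:
            pass  # clone already exited
    # 3. Remove the UFFD socket file
    Path(f"/mnt/fcvm-btrfs/uffd-{snapshot}-{my_pid}.sock").unlink(missing_ok=True)
    # 4. Delete this serve's own state entry
    state["vms"] = [vm for vm in state["vms"] if vm.get("pid") != my_pid]
```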
### Test Integration
Tests spawn processes and track PIDs directly:
```rust
// 1. Start baseline VM
let baseline_proc = Command::new("sudo")
    .args(["fcvm", "podman", "run", ...])
    .spawn()?;
let baseline_pid = baseline_proc.id(); // fcvm process PID

// 2. Wait for healthy
poll_health_by_pid(baseline_pid).await?;

// 3. Create snapshot
Command::new("sudo")
    .args(["fcvm", "snapshot", "create", "--pid", &baseline_pid.to_string()])
    .status()?;

// 4. Start serve
let serve_proc = Command::new("sudo")
    .args(["fcvm", "snapshot", "serve", "my-snap"])
    .spawn()?;
let serve_pid = serve_proc.id();

// 5. Clone
let clone_proc = Command::new("sudo")
    .args(["fcvm", "snapshot", "run", "--pid", &serve_pid.to_string()])
    .spawn()?;
let clone_pid = clone_proc.id();

// 6. Wait for clone healthy
poll_health_by_pid(clone_pid).await?;
```
**No stdout/stderr parsing needed** - PIDs are known from process spawning!
## Architecture Decisions
### Project Structure - A+ Rust Pattern
fcvm follows industry-standard Rust CLI architecture (similar to ripgrep, fd, bat):
- **Library + Binary pattern**: src/lib.rs exports all modules, src/main.rs is thin dispatcher
- **Modular commands**: Each command in separate file under src/commands/
- **Clear separation**: CLI parsing (cli/), business logic (commands/), types (types.rs)
- **Comprehensive testing**: Unit tests in modules, integration tests in tests/
### Directory Layout
```
src/
├── types.rs # Core shared types (Mode, MapMode)
├── lib.rs # Module exports (public API)
├── main.rs # 38-line CLI dispatcher
├── cli/ # Command-line parsing
│ ├── args.rs # Clap structures
│ ├── types.rs # Type conversions
│ └── mod.rs
├── commands/ # Command implementations
│ ├── run.rs # fcvm run
│ ├── setup.rs # fcvm setup
│ ├── ls.rs # fcvm ls
│ └── mod.rs
├── state/ # VM state management
│ ├── types.rs # VmState, VmStatus, VmConfig
│ ├── manager.rs # StateManager (CRUD)
│ ├── utils.rs # generate_vm_id()
│ └── mod.rs
├── firecracker/ # Firecracker API client
├── network/ # Networking layer
├── storage/ # Disk/snapshot management
├── readiness/ # Readiness gates
└── setup/ # Setup subcommands
tests/
├── common/mod.rs # Shared test utilities
└── test_cli_parsing.rs # Integration tests
```
### Design Principles
1. **One file per command**: Easy to find, easy to test
2. **No business logic in main.rs**: Just CLI parsing and dispatch
3. **Module re-exports**: Clean public API via lib.rs
4. **Unit tests in modules**: #[cfg(test)] blocks alongside code
5. **Integration tests separate**: tests/ directory for end-to-end scenarios
### Single Binary Design
- Main binary: `fcvm` with subcommands for all operations
- Guest agent: `fc-agent` (separate binary, runs inside VMs)
- NO standalone scripts - everything through `fcvm` subcommands
### Subcommand Structure
```
fcvm run # Run container in new VM
fcvm clone # Clone from snapshot
fcvm stop # Stop running VM
fcvm ls # List VMs
fcvm inspect # Show VM details
fcvm logs # Stream VM logs
fcvm top # Show resource usage
fcvm setup kernel # Download/build kernel
fcvm setup rootfs # Create base rootfs image
fcvm setup preflight # Check system requirements
fcvm memory-server # Start memory server for snapshot (enables sharing)
```
## Implementation Status
### ✅ Completed
1. **Core Implementation** (2025-11-09)
- Firecracker API client using hyper + hyperlocal (Unix sockets)
- Dual networking modes: rootless (slirp4netns) + privileged (nftables)
- Storage layer with CoW disk management
- Snapshot save/load/list functionality
- VM state persistence
- Process lifetime binding with tokio signal handlers
- Guest agent (fc-agent) with MMDS integration
2. **Build System** (2025-11-09)
- Successfully compiled on x86_64 Linux (Ubuntu 24.04)
- Release build time: ~2 minutes for fcvm, ~1.5 minutes for fc-agent
- Dependencies: 180+ crates, all resolving correctly
3. **Test Infrastructure** (2025-11-11)
- EC2 c6g.metal instance (ARM64 bare metal with KVM)
- Instance ID: i-05fafabbe2e064949
- Public IP: 54.67.60.104
- Firecracker v1.10.0 installed
- SSH: `ssh -i ~/.ssh/fcvm-ec2 ubuntu@54.67.60.104`
- ⚠️ OLD c5.large (54.176.90.249) does NOT have KVM - do not use!
4. **A+ Rust Refactor** (2025-11-09)
- Restructured to industry-standard CLI pattern
- Extracted commands to src/commands/ (8 files)
- Split CLI into modular structure (cli/args.rs, cli/types.rs)
- Restructured state into state/ module with unit tests
- Reduced main.rs from 302 to 38 lines
- Added integration test infrastructure (tests/)
- All tests passing (24 total: 14 unit + 10 integration)
5. **Memory Sharing Architecture** (2025-11-09)
- Embedded async UFFD server in main binary (no subprocess)
- Two-command design: `fcvm memory-server <snapshot>` + `fcvm clone`
- One server per snapshot, serves multiple VMs asynchronously
- Uses tokio::select! for concurrent VM connections
- Each VM gets async task handling page faults
- Shared `Arc` handle to the memory file across all VMs
- Server auto-exits when last VM disconnects
- Build successful on EC2: 32 seconds for release build
6. **VM Boot Success on c6g.metal** (2025-11-10)
- Instance: c6g.metal (ARM64, 64 cores, $2.18/hr)
- AWS vCPU quota increased from 16 to 128 (approved instantly)
- ✅ Alpine Linux 3.19 boots successfully to login prompt
- ✅ ARM64 kernel (4.14.174+) from Firecracker S3
- ✅ Serial console fix: ttyS0 enabled in /etc/inittab
- ✅ Network configured with slirp4netns (rootless mode)
- ✅ CoW disk working (624ms copy time for 1GB rootfs)
- Boot time: ~500ms from VM start to login prompt
7. **Snapshot/Clone Workflow COMPLETE** (2025-11-11)
- ✅ Snapshot creation with disk copy (src/commands/snapshot.rs:77-91)
- ✅ UFFD memory server serving multiple VMs concurrently
- ✅ Network overrides API fixed (Vec type)
- ✅ Disk path symlink strategy (handles hardcoded vmstate paths)
- ✅ Clones successfully start with unique TAP devices
- ✅ Independent CoW disk overlays per clone
- ✅ Memory sharing via UFFD working (3+ VMs tested)
- ✅ VMs stay running without exit code issues
- **Infrastructure fully operational** - snapshot/clone mechanism works
8. **End-to-End Container Execution** (2025-11-11)
- ✅ fc-agent reads MMDS and executes container plans
- ✅ DNS resolution working via dnsmasq forwarder
- ✅ Container images pull from Docker Hub
- ✅ nginx:alpine successfully starts with 2 worker processes
- ✅ Complete workflow: `fcvm podman run` → VM boots → fc-agent pulls image → container runs
- **Production Ready**: Full container orchestration working
9. **Snapshot VM Resume Fix** (2025-11-11)
- ✅ Snapshot creation with VM resume fix (src/commands/snapshot.rs:120-127)
- ✅ Original VM properly resumes after snapshotting and continues serving traffic
- ✅ UFFD memory server serving multiple VMs concurrently
- ✅ Multiple clones sharing 512 MB memory via UFFD (7000+ page faults served)
10. **btrfs CoW Reflinks** (2025-11-12)
- ✅ Replaced fs::copy() with `cp --reflink=always` for instant disk cloning
- ✅ Centralized paths module (src/paths.rs) to use btrfs mount
- ✅ All data stored under `/mnt/fcvm-btrfs/` for reflink support
- ✅ Disk copy time: **~1.5ms** (560x faster than 840ms standard copy!)
- ✅ True CoW at block level - shared blocks until write occurs
- ✅ Multiple VMs share same base rootfs (1GB base.ext4 shared by all VMs = 1GB on disk)
- **Performance**: Instant VM creation with minimal disk usage
11. **Rootless Networking with Unique Subnets** (2025-11-12)
- ✅ Each VM gets unique /30 subnet via hash of vm_id (172.16.0.0-63.0/30)
- ✅ Eliminates routing conflicts between VMs
- ✅ Kernel cmdline network configuration via `ip=` boot parameter
- ✅ Static IP assignment: guest receives .202, host uses .201 as gateway
- ✅ DNS resolution via dnsmasq on host (bind-dynamic for TAP devices)
- ✅ Full end-to-end connectivity: VM boots → DNS works → containers pull images
- **Example**: VM gets 172.16.0.200/30 (host: .201, guest: .202)
12. **Complete Snapshot/Clone Workflow Verified** (2025-11-12)
- ✅ Snapshot creation: Pause VM → Create Firecracker snapshot → Resume VM
- ✅ Memory snapshot: 512MB saved to `/mnt/fcvm-btrfs/snapshots/{name}/memory.bin`
- ✅ Disk snapshot: CoW copy to `/mnt/fcvm-btrfs/snapshots/{name}/disk.ext4`
- ✅ UFFD memory server: Serves pages on-demand via Unix socket
- ✅ Clone with memory sharing: **2.3ms snapshot load time**
- ✅ Clone disk uses btrfs reflink: **~3ms instant CoW copy**
- ✅ Page fault handling: 3000+ pages served successfully
- ✅ Multiple VMs share same 512MB memory via kernel page cache
- ✅ Network isolation: Each clone gets unique /30 subnet
- **Commands verified**:
```bash
fcvm snapshot create <vm-name> --tag <tag>
fcvm snapshot serve <tag>                      # Start UFFD server
fcvm snapshot run <tag> --name <clone> --mode rootless
```
- **Performance**: Original VM + 2 clones = ~512MB RAM total (not 1.5GB!)
13. **Code Quality Cleanup** (2025-11-13)
- ✅ Removed unimplemented stub commands (stop, logs, inspect, top)
- ✅ Removed unimplemented readiness gates (vsock, log, exec)
- ✅ Fixed duplicate imports (10x `use crate::paths;` → 1x clean import)
- ✅ Replaced all `unreachable!()` with proper `anyhow::bail!()` errors
- ✅ Eliminated all compiler warnings
- ✅ Updated stress test script for self-contained lifecycle management
- **Performance verification**: 10 VMs @ ~200ms clone time, 100% success rate
14. **Hierarchical Logging Architecture** (2025-11-15)
- ✅ Added hierarchical target tags showing process nesting
- ✅ Strip Firecracker timestamps and `[anonymous-instance:*]` prefixes
- ✅ Clean log output when piped to files (no ANSI escape codes)
- ✅ Smart color handling: parent uses colors for TTY only, subprocesses never use colors
- ✅ Added `atty` dependency to detect when output is piped
- **Logging hierarchy**:
- `sanity-baseline-vm:` (test harness)
- `sanity-baseline-vm: vm:` (VM manager)
- `sanity-baseline-vm: firecracker:` (Firecracker process)
- `sanity-baseline-vm: health-monitor:` (health checks)
- **Result**: Production-ready logging that works in terminals and log files
15. **True Rootless Networking with slirp4netns** (2025-11-25)
- ✅ Renamed `rootless.rs` → `bridged.rs` (was misleading, uses network namespaces)
- ✅ Added `--network bridged|rootless` CLI flag
- ✅ Implemented `SlirpNetwork` struct with slirp4netns integration
- ✅ User namespace support via `unshare --user --map-root-user --net`
- ✅ Added `post_start()` to `NetworkManager` trait for deferred slirp4netns startup
- ✅ Unique loopback IPs (127.x.y.z) per VM for health checks
- ✅ Port forwarding via slirp4netns JSON-RPC API socket
- ✅ Health check architecture updated for rootless mode
- ✅ Clone support for rootless VMs
- **Key files**:
- `src/network/slirp.rs` - SlirpNetwork implementation
- `src/network/bridged.rs` - BridgedNetwork (renamed from rootless.rs)
- `src/firecracker/vm.rs` - VmManager with user namespace support
- `src/health.rs` - Dual health check modes (loopback vs veth)
- **Usage**:
```bash
# Rootless VM (no root required)
fcvm podman run --network rootless nginx:alpine
# Rootless clone
fcvm snapshot run --pid <serve_pid> --network rootless
```
### 🚧 In Progress
None - all major features working!
### 📋 TODO
1. **Setup Subcommands**
- `fcvm setup kernel` - Download/prepare vmlinux
- `fcvm setup rootfs` - Create base rootfs with Podman
- `fcvm setup preflight` - Validate system requirements
2. **Testing**
- Test port mapping with 127.0.0.1 (localhost-only binding)
- Test volume mounting
**Sanity Test**:
```bash
# Run sanity test (single VM, verify health checks work)
sudo fcvm test sanity
```
- Starts nginx:alpine VM
- Verifies VM becomes healthy within 60s timeout
- Tests networking, health checks, and cleanup
- Exit code 0 = PASS
3. **Documentation**
- Usage examples
- Performance benchmarks
- Troubleshooting guide
## Technical Findings
### Firecracker Requirements
- **Kernel**: Need vmlinux or bzImage (no modules required for basic operation)
- Can extract from host: `/boot/vmlinuz-*`
- Or download pre-built from Firecracker releases
- Boot args: `console=ttyS0 reboot=k panic=1 pci=off`
- **Rootfs**: ext4 filesystem with:
- Systemd (for init)
- Podman + dependencies (conmon, crun, fuse-overlayfs)
- Network tools (iproute2)
- fc-agent installed at `/usr/local/bin/fc-agent`
- fc-agent.service enabled
### Networking Notes
#### Network Modes (--network flag)
fcvm supports two networking modes:
- **`--network bridged`** (default): Linux bridge + network namespace + nftables
- Requires root or CAP_NET_ADMIN
- Better performance
- Uses veth pairs and TAP devices
- Port forwarding via nftables DNAT
- **`--network rootless`**: True rootless networking with slirp4netns
- No root privileges required
- Uses user namespace (`unshare --user --map-root-user --net`)
- slirp4netns creates TAP device inside namespace
- Guest IP: 10.0.2.15, Gateway: 10.0.2.2 (slirp4netns defaults)
- Port forwarding via slirp4netns JSON-RPC API socket
- Health checks use unique loopback IPs (127.x.y.z)
#### Rootless Networking Architecture
**Key Implementation Files:**
- `src/network/slirp.rs` - SlirpNetwork implementation
- `src/network/bridged.rs` - BridgedNetwork implementation
- `src/firecracker/vm.rs` - VmManager with user namespace support
**How slirp4netns Integration Works:**
1. Firecracker starts with `unshare --user --map-root-user --net`
2. This creates a new user namespace with root-mapped UID
3. slirp4netns connects to the namespace via PID
4. TAP device is created inside the namespace
5. Port forwarding via API socket (JSON-RPC)
**Health Check Architecture for Rootless:**
- Each rootless VM gets a unique loopback IP (127.x.y.z) derived from vm_id hash
- Port 80 is forwarded from loopback IP to guest 10.0.2.15:80
- Health monitor checks HTTP on loopback IP instead of guest IP + veth
- `NetworkConfig.loopback_ip` and `NetworkConfig.health_check_port` fields track this
**Loopback IP Generation:**
```rust
// src/network/slirp.rs - generate_loopback_ip()
// Hash vm_id to generate unique 127.x.y.z (avoiding 127.0.0.1)
fn generate_loopback_ip(vm_id: &str) -> String {
    let mut hasher = DefaultHasher::new();
    vm_id.hash(&mut hasher);
    let hash = hasher.finish();
    let second = (hash & 0xFF) as u8;
    let third = ((hash >> 8) & 0xFF) as u8;
    let fourth = ((hash >> 16) & 0xFF) as u8;
    // Force the second octet nonzero so the result can never be
    // 127.0.0.0 or 127.0.0.1
    let second = if second == 0 { 1 } else { second };
    format!("127.{}.{}.{}", second, third, fourth)
}
```
### Storage Notes - btrfs CoW Reflinks
**Performance Achievement: ~1.5ms disk copy (560x faster than 840ms standard copy!)**
#### How It Works
fcvm uses **btrfs reflinks** for instant VM disk cloning with true copy-on-write:
**Architecture:**
- All fcvm data stored under `/mnt/fcvm-btrfs/` (btrfs filesystem)
- Base rootfs: `/mnt/fcvm-btrfs/rootfs/base.ext4` (~1GB Alpine + Podman)
- VM disks: `/mnt/fcvm-btrfs/vm-disks/{vm_id}/disks/rootfs.ext4`
**Disk Cloning Process:**
```rust
// src/storage/disk.rs - create_cow_disk()
tokio::process::Command::new("cp")
    .arg("--reflink=always")
    .arg(&self.base_rootfs)
    .arg(&overlay_path)
    .status()
    .await
```
**How reflinks work:**
- `cp --reflink=always` creates instant CoW copy on btrfs/xfs
- Only metadata is copied (~1.5ms), data blocks are shared
- When VM writes to disk, btrfs allocates new blocks (CoW)
- Multiple VMs share same base blocks until they write
**Key Benefits:**
- ✅ **Instant cloning**: ~1.5ms vs 840ms for full copy
- ✅ **Space efficient**: 50 VMs with 1GB rootfs = ~1GB on disk (plus deltas)
- ✅ **True CoW**: Block-level deduplication handled by filesystem
- ✅ **No runtime overhead**: Standard ext4 filesystem inside VM
**Path Centralization (src/paths.rs):**
```rust
pub fn base_dir() -> PathBuf {
    PathBuf::from("/mnt/fcvm-btrfs") // All data on btrfs mount
}

pub fn vm_runtime_dir(vm_id: &str) -> PathBuf {
    base_dir().join("vm-disks").join(vm_id)
}
```
**Setup Requirements:**
```bash
# Create btrfs filesystem (already done on EC2)
sudo mkfs.btrfs /dev/nvme1n1
sudo mount /dev/nvme1n1 /mnt/fcvm-btrfs
sudo mkdir -p /mnt/fcvm-btrfs/{kernels,rootfs,state,snapshots,vm-disks}
```
### Memory Sharing Architecture
**Two-Command Workflow:**
1. **Start memory server** (one per snapshot, runs in foreground):
```bash
fcvm memory-server nginx-base
# Creates Unix socket at /tmp/fcvm/uffd-nginx-base.sock
# Mmaps snapshot memory file (e.g., 512MB)
# Waits for VM connections
```
2. **Clone VMs** (multiple VMs can clone from same snapshot):
```bash
fcvm clone --snapshot nginx-base --name web1 # VM 1
fcvm clone --snapshot nginx-base --name web2 # VM 2
# Each connects to same memory server socket
# Memory pages served on-demand via UFFD
# True copy-on-write at 4KB page granularity
```
**How it works:**
- Memory server opens snapshot memory file and mmaps it (MAP_SHARED)
- Kernel automatically shares physical pages via page cache
- Server uses tokio AsyncFd to handle UFFD events non-blocking
- tokio::select! multiplexes: accept new VMs + monitor VM exits
- Each VM gets dedicated async task (JoinSet) for page faults
- All tasks share Arc reference to memory file
- Server exits gracefully when last VM disconnects
**Memory efficiency:**
- 50 VMs with 512MB snapshot = ~512MB physical RAM + small overhead
- NOT 50 × 512MB = 25.6GB!
- Linux kernel handles sharing via page cache automatically
- Pages only copied on write (true CoW at page level)
## Quick Start
### Prerequisites
- **CRITICAL**: Must use c6g.metal ARM instance (54.67.60.104) - has KVM support
- **DO NOT use** c5.large (54.176.90.249) - no /dev/kvm, will fail!
### Connect to ARM Instance
```bash
ssh -i ~/.ssh/fcvm-ec2 ubuntu@54.67.60.104
```
### Start a VM
```bash
cd ~/fcvm
sudo ./target/release/fcvm podman run --name my-vm --network rootless nginx:latest
```
### Test Snapshot/Clone Workflow
```bash
# 1. Start a VM
sudo ./target/release/fcvm podman run --name nginx-base --network rootless nginx:latest
# 2. Create snapshot (in another terminal)
./target/release/fcvm snapshot create nginx-base --tag nginx-snap
# 3. Start memory server
./target/release/fcvm snapshot serve nginx-snap
# 4. Clone VMs (in another terminal, using the serve PID printed by step 3)
./target/release/fcvm snapshot run --pid <serve_pid> --name clone1
./target/release/fcvm snapshot run --pid <serve_pid> --name clone2
```
## Build Instructions
### Using Makefile (Recommended)
The Makefile handles all builds on EC2 remotely. Run these commands from your **local macOS machine**:
```bash
# Standard development workflow:
make build # Sync code to EC2 + build fcvm + build fc-agent (musl)
make test # Run sanity test on EC2
make rebuild # Full rebuild: sync + build + update rootfs with new fc-agent
# Individual targets:
make sync # Just sync code to EC2 (no build)
make build-remote # Build on EC2 without syncing (use after manual changes)
make rootfs # Update fc-agent in rootfs (mounts base.ext4, copies binary)
make deploy # Quick deploy: copy fc-agent to rootfs (deprecated, use rootfs)
# Kernel builds (only needed once):
make kernel-setup # Clone Linux 5.10, upload FUSE-enabled config
make kernel # Build kernel on EC2 (~10-20 min)
# Fetch artifacts to local machine:
make fetch # Download fcvm and fc-agent binaries
make kernel-fetch # Download vmlinux
# Local builds (for IDE/linting only - won't run on macOS):
make build-local
make clean
```
**Key Makefile targets:**
| Target | Description |
|--------|-------------|
| `make build` | **Most common** - Sync + build everything on EC2 |
| `make test` | Run sanity test (starts VM, verifies health) |
| `make rebuild` | Full rebuild including rootfs update |
| `make rootfs` | Update fc-agent in existing rootfs |
### Manual Build on EC2
If you need to build directly on the EC2 instance:
```bash
# SSH to ARM instance
ssh -i ~/.ssh/fcvm-ec2 ubuntu@54.67.60.104
# Build fcvm (host binary)
cd ~/fcvm
source ~/.cargo/env
cargo build --release
# Build fc-agent (guest binary, statically linked with musl)
cd fc-agent
cargo build --release --target aarch64-unknown-linux-musl
# Update fc-agent in rootfs
sudo mkdir -p /tmp/rootfs-mount
sudo mount -o loop /mnt/fcvm-btrfs/rootfs/base.ext4 /tmp/rootfs-mount
sudo cp ~/fcvm/fc-agent/target/aarch64-unknown-linux-musl/release/fc-agent /tmp/rootfs-mount/usr/local/bin/
sudo umount /tmp/rootfs-mount
# Binaries at:
# ~/fcvm/target/release/fcvm
# ~/fcvm/fc-agent/target/aarch64-unknown-linux-musl/release/fc-agent
```
### One-Time EC2 Setup
Only needed when setting up a fresh EC2 instance:
```bash
# Install prerequisites
sudo apt-get update
sudo apt-get install -y musl-tools gcc-aarch64-linux-gnu dnsmasq
rustup target add aarch64-unknown-linux-musl
# Configure dnsmasq for DNS forwarding
sudo tee /etc/dnsmasq.d/fcvm.conf > /dev/null <<'EOF'
bind-dynamic
server=8.8.8.8
server=8.8.4.4
no-resolv
cache-size=1000
EOF
sudo systemctl restart dnsmasq
```
## Lessons Learned
### Capturing Full Logs
- **ALWAYS** capture full output: `command 2>&1 | tee /tmp/log.txt`
- **ALWAYS** run in background: `run_in_background=True`
- **ALWAYS** check every 10 seconds (not 5, not 30!)
- **Reason**: Filtering loses context, wastes money, can't debug
### Alpine Serial Console (2025-11-10)
- **Problem**: VM booted but no output after OpenRC started
- **Root cause**: `/etc/inittab` has `ttyS0` serial console commented out by default
- **Fix**: Auto-enable during rootfs creation in setup/rootfs.rs
- **Implementation**:
```rust
let inittab_fixed = inittab.replace(
    "#ttyS0::respawn:/sbin/getty -L ttyS0 115200 vt100",
    "ttyS0::respawn:/sbin/getty -L ttyS0 115200 vt100",
);
```
- **Result**: VM now boots to login prompt on serial console
- **Manual fix for existing rootfs**: `sudo sed -i 's/^#ttyS0/ttyS0/' /mnt/rootfs/etc/inittab`
### DNS Resolution in VMs (2025-11-11)
- **Problem**: Container image pulls failing with DNS timeout
```
dial tcp: lookup registry-1.docker.io on 8.8.8.8:53: read udp 172.16.10.2:40325->8.8.8.8:53: i/o timeout
```
- **Root cause**: VMs configured to use 8.8.8.8 directly but NAT wasn't forwarding DNS packets properly
- **Fix**: Install dnsmasq on host to act as DNS forwarder for all TAP interfaces
- **Implementation**:
1. Install dnsmasq: `sudo apt-get install -y dnsmasq`
2. Create `/etc/dnsmasq.d/fcvm.conf`:
```conf
# Listen on all interfaces (including dynamically created TAP devices)
bind-dynamic
# Forward DNS to Google Public DNS
server=8.8.8.8
server=8.8.4.4
# Don't read /etc/resolv.conf
no-resolv
# Cache size
cache-size=1000
```
3. Restart dnsmasq: `sudo systemctl restart dnsmasq`
- **Why bind-dynamic**: TAP devices are created dynamically after dnsmasq starts. The `bind-dynamic` option makes dnsmasq automatically listen on new interfaces without restart.
- **Result**: DNS resolution works, container images pull successfully
### Clone Network Configuration (2025-11-11)
- **Problem**: When restoring snapshots, guest OS retains original static IP (e.g., 172.16.29.2). Default network setup created TAP devices on different subnets (172.16.231.0/24, 172.16.201.0/24), causing subnet mismatch and connection failures.
- **Root cause**: Firecracker's network override only changes TAP device name, not guest IP configuration
- **Solution**: Configure TAP devices on the SAME subnet as the guest's original IP
```bash
# Wrong: TAP on different subnet than guest
ip addr add 172.16.201.1/24 dev tap-vm-c93e8 # Guest thinks it's 172.16.29.2
# Connection fails due to subnet mismatch!
# Correct: TAP on same subnet as guest
ip addr add 172.16.29.1/24 dev tap-vm-c93e8 # Guest is 172.16.29.2
# Works! Both on 172.16.29.0/24 subnet
```
- **Why it works**: Multiple TAP devices can have same host IP (172.16.29.1) because they're isolated L2 networks. Traffic doesn't conflict when using `--interface` flag or proper routing.
- **Implementation**: Clone network setup must extract guest IP from snapshot metadata and configure TAP on matching subnet
- **Reference**: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/network-for-clones.md
- **Alternative approaches** (not implemented):
- Use network namespaces (more complex)
- Use iptables NAT to translate IPs (requires routing setup)
- Reconfigure guest via fc-agent after restore (requires guest agent changes)
### fc-agent musl Build (2025-11-11)
- **Problem**: fc-agent compiled with glibc (gnu target) doesn't work on Alpine Linux (uses musl libc)
```
start-stop-daemon: failed to exec '/usr/local/bin/fc-agent': No such file or directory
```
- **Root cause**: Alpine Linux uses musl libc, not glibc. Binaries must be statically linked with musl.
- **Fix**: Compile fc-agent with musl target
- **Implementation**:
1. Create `fc-agent/.cargo/config.toml`:
```toml
[target.aarch64-unknown-linux-musl]
linker = "aarch64-linux-musl-gcc"
rustflags = ["-C", "target-feature=+crt-static"]
```
2. Update `Makefile` to build with musl automatically:
```makefile
build:
	cargo build --release
	cd fc-agent && cargo build --release --target aarch64-unknown-linux-musl
```
3. Update `src/setup/rootfs.rs:240` to prefer musl binary:
```rust
let possible_paths = vec![
    PathBuf::from("/home/ubuntu/fcvm/fc-agent/target/aarch64-unknown-linux-musl/release/fc-agent"), // musl (static)
    PathBuf::from("fc-agent/target/aarch64-unknown-linux-musl/release/fc-agent"), // musl, relative path
    // ... gnu fallbacks
];
```
- **Prerequisite**: Install musl cross-compiler on EC2:
```bash
sudo apt-get install -y musl-tools gcc-aarch64-linux-gnu
rustup target add aarch64-unknown-linux-musl
```
- **Result**: fc-agent runs successfully on Alpine Linux, executes container plans
### Disk Filename Consistency (2025-11-12)
- **Issue**: Code had inconsistent naming - some places used `rootfs-overlay.ext4`, others used `rootfs.ext4`
- **Root cause**: Legacy migration logic attempted to migrate from old to new naming
- **Solution**: Removed ALL migration logic, standardized on single filename: `rootfs.ext4`
- **Changes**:
- `src/storage/disk.rs`: Removed migration logic, uses only `rootfs.ext4`
- `src/commands/snapshot.rs`: Updated to expect `rootfs.ext4`
- Variable names: Changed `overlay_path` → `disk_path` for clarity
- **Result**: Simple, consistent naming throughout - no migrations, no workarounds
### KVM and Nested Virtualization (2025-11-09)
- **Problem**: c5.large instance doesn't have `/dev/kvm` - nested virtualization not supported
- **Root cause**: Standard EC2 instances don't expose KVM to guest OS
- **Attempts failed**:
- `modprobe kvm_intel` → "Operation not supported"
- `modprobe kvm_amd` → "Operation not supported"
- No CPU virtualization flags (vmx/svm) exposed in /proc/cpuinfo
- **Solution Options**:
**Option 1: Metal Instances** (bare metal hardware)
- c5.metal ($4.08/hr, 96 vCPUs, x86_64)
- c6g.metal ($2.18/hr, 64 vCPUs, ARM64) ← cheaper!
- **Blocker**: AWS vCPU limit is 16, metal instances need 64+
- **Fix**: Request limit increase (takes 1-2 business days)
**Option 2: PVM (Pagetable Virtual Machine)**
- Enables Firecracker on regular instances WITHOUT nested virt
- Proposed by Ant Group/Alibaba in Feb 2024
- **Requirements**:
- Build custom host kernel from virt-pvm/linux (Linux 6.7+)
- Build custom guest kernel with PVM config
- Use Firecracker fork with PVM patches (e.g., Loophole Labs)
- **Pros**: Works on c5.large, no extra cost
- **Cons**: Experimental, unmaintained upstream, complex setup
- **Performance**: ~2x slower than bare metal for I/O workloads
- **Lesson**: Firecracker REQUIRES /dev/kvm or PVM - verify hardware support before testing
### Sync from Local
```bash
# From macOS - use --delete to remove old files and avoid conflicts
rsync -avz --delete --exclude 'target' --exclude '.git' \
  -e "ssh -i ~/.ssh/fcvm-ec2" \
  . ubuntu@54.67.60.104:~/fcvm/
```
## Next Steps
1. Add `fcvm setup` subcommands to main.rs
2. Implement kernel download/extraction
3. Implement rootfs creation (use debootstrap)
4. Test full workflow: setup -> run -> clone
5. Document performance results
## References
- Design doc: `/Users/ejcampbell/src/fcvm/DESIGN.md`
- Implementation summary: `/Users/ejcampbell/src/fcvm/IMPLEMENTATION_SUMMARY.md`
- Firecracker docs: https://github.com/firecracker-microvm/firecracker/blob/main/docs/getting-started.md
IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context unless it is highly relevant to your task.
This session is being continued from a previous conversation that ran out of context. The conversation is summarized below:
Analysis:
Let me analyze the conversation chronologically:
1. **Initial Context**: This is a continuation from a previous session. The summary provided indicates:
- User was searching for a third-party testing package discussed in previous sessions
- User confirmed the package is **pjdfstest** - a POSIX filesystem compliance test suite
- The Makefile has `test-pjdfstest` target but the actual `fcvm test pjdfstest` command is NOT implemented
2. **My Actions at Start**:
- Checked the test command structure in `src/cli/args.rs` and `src/commands/test.rs`
- Searched the web for information about pjdfstest
- Found it's from https://github.com/pjd/pjdfstest (original) and https://github.com/saidsay-so/pjdfstest (Rust rewrite)
- Learned pjdfstest requires:
- Rust 1.72.1+
- libacl1-dev on Linux
- Run as root
- Configuration via TOML file
- Can filter tests with patterns like "chmod", "chown", "link"
3. **Implementation Started**:
- Created TODO list to track implementation
- Added `Pjdfstest(PjdfstestArgs)` variant to `TestCommands` enum in `src/cli/args.rs`
- Added `PjdfstestArgs` struct with fields: timeout, network, filter, verbose
- Next steps: implement `cmd_pjdfstest()` and add match arm in `cmd_test()`
4. **Files Modified**:
- `src/cli/args.rs` - Added Pjdfstest variant and PjdfstestArgs struct
5. **User Messages**: There were no explicit user messages in this conversation - it was a continuation with instructions to proceed with the pjdfstest implementation based on the previous session's context.
6. **Pending Work**:
- Implement `cmd_pjdfstest()` in `src/commands/test.rs`
- Add Pjdfstest match arm in `cmd_test()`
- Build and test locally
Summary:
1. Primary Request and Intent:
The user was searching for a third-party POSIX filesystem compliance testing package from previous sessions. They confirmed it was **pjdfstest**. The Makefile already has a `test-pjdfstest` target that calls `fcvm test pjdfstest --timeout 600`, but the actual fcvm subcommand is not implemented. The task is to implement the `fcvm test pjdfstest` command.
2. Key Technical Concepts:
- **pjdfstest**: POSIX filesystem compliance test suite (https://github.com/pjd/pjdfstest and Rust rewrite at https://github.com/saidsay-so/pjdfstest)
- **FUSE-over-vsock**: Volume mounting system for Firecracker VMs
- **fcvm test commands**: Existing test subcommands include Stress, Sanity, Volume, VolumeStress, CloneLock
- **pjdfstest requirements**: Rust 1.72.1+, libacl1-dev, root privileges, TOML config file
- **pjdfstest usage**: `pjdfstest -c config.toml -p /path/to/test [filter]`
3. Files and Code Sections:
- **`/Users/ejcampbell/src/fcvm/src/cli/args.rs`**
- Contains CLI argument definitions using clap
- Added `Pjdfstest(PjdfstestArgs)` variant to TestCommands enum
- Added PjdfstestArgs struct
- Edit 1 - Added variant:
```rust
/// Clone lock test: verify POSIX file locking across multiple clones
CloneLock(CloneLockTestArgs),
/// Run pjdfstest POSIX filesystem compliance tests against a FUSE volume
Pjdfstest(PjdfstestArgs),
```
- Edit 2 - Added args struct:
```rust
#[derive(Args, Debug)]
pub struct PjdfstestArgs {
/// Timeout for test in seconds
#[arg(long, default_value_t = 600)]
pub timeout: u64,
/// Network mode: bridged (requires sudo) or rootless (no sudo)
#[arg(long, value_enum, default_value_t = NetworkMode::Bridged)]
pub network: NetworkMode,
/// Test pattern filter (e.g., "chmod", "chown", "link")
#[arg(long)]
pub filter: Option,
/// Verbose output from pjdfstest
#[arg(short, long)]
pub verbose: bool,
}
```
- **`/Users/ejcampbell/src/fcvm/src/commands/test.rs`**
- Contains test command implementations (~1727 lines)
- Has `cmd_test()` dispatcher that matches on TestCommands variants
- Existing tests follow pattern: start VM with volume, wait for health, run test operations
- Need to add `cmd_pjdfstest()` function and match arm
- **`/Users/ejcampbell/src/fcvm/Makefile`** (lines 183-185)
- Contains existing target:
```makefile
test-pjdfstest: rebuild
@echo "==> Running pjdfstest on EC2 (POSIX filesystem compliance)..."
$(SSH) "cd $(REMOTE_DIR) && sudo $(FCVM_BIN) test pjdfstest --timeout 600 2>&1" | tee /tmp/test-pjdfstest.log
```
4. Errors and fixes:
- No errors encountered yet - implementation is in progress
5. Problem Solving:
- Identified that pjdfstest command was referenced in Makefile but not implemented
- Researched pjdfstest tool requirements and usage
- Started implementing the CLI args for the command
6. All user messages:
- (No explicit user messages in this session - it's a continuation from previous context where user said "yes, pjdfstest is it!!!")
7. Pending Tasks (from TODO list):
- [COMPLETED] Add Pjdfstest variant to TestCommands enum in src/cli/args.rs
- [COMPLETED] Add PjdfstestArgs struct in src/cli/args.rs
- [IN_PROGRESS] Implement cmd_pjdfstest() in src/commands/test.rs
- [PENDING] Add Pjdfstest match arm in cmd_test()
- [PENDING] Build and test locally
8. Current Work:
Implementing the `fcvm test pjdfstest` command. Completed adding CLI argument definitions in `src/cli/args.rs`. Currently need to implement the `cmd_pjdfstest()` function in `src/commands/test.rs`. The implementation should:
1. Create a test volume directory on host
2. Start a VM with the volume mounted via FUSE-over-vsock
3. Clone/build pjdfstest inside the VM or have it pre-installed
4. Run pjdfstest against the FUSE mount point
5. Report results
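The five steps above can be sketched as a control-flow skeleton. The helper names and signatures here (`create_test_volume`, `start_vm_with_volume`, `wait_for_health`, `run_pjdfstest`) are hypothetical stand-ins for fcvm internals, stubbed out so the flow compiles on its own:

```rust
use std::io;

// Hypothetical helpers: stubs standing in for fcvm internals.
fn create_test_volume(dir: &str) -> io::Result<String> {
    Ok(dir.to_string()) // step 1: would create the volume directory on the host
}

fn start_vm_with_volume(_volume: &str) -> io::Result<u32> {
    Ok(1) // step 2: would boot a VM with the volume mounted via FUSE-over-vsock
}

fn wait_for_health(_vm_id: u32, _timeout_secs: u64) -> io::Result<()> {
    Ok(()) // would poll VM health until ready or timed out
}

fn run_pjdfstest(_vm_id: u32, _mount: &str, _filter: Option<&str>) -> io::Result<bool> {
    Ok(true) // steps 3-4: would build pjdfstest in the guest and run it on the mount
}

fn cmd_pjdfstest(timeout: u64, filter: Option<&str>) -> io::Result<()> {
    let volume = create_test_volume("/tmp/pjdfstest-volume")?;
    let vm = start_vm_with_volume(&volume)?;
    wait_for_health(vm, timeout)?;
    let passed = run_pjdfstest(vm, "/mnt/volume", filter)?;
    // step 5: report results
    println!("pjdfstest {}", if passed { "passed" } else { "failed" });
    Ok(())
}

fn main() -> io::Result<()> {
    cmd_pjdfstest(600, Some("chmod"))
}
```

The real implementation would replace each stub with the project's existing VM-orchestration calls, following `cmd_volume_test()`.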
9. Optional Next Step:
Implement `cmd_pjdfstest()` in `src/commands/test.rs` and add the match arm in `cmd_test()`. The function should follow the pattern of existing tests like `cmd_volume_test()` - start a VM with a FUSE volume, wait for health, then run pjdfstest against the mount point. The match arm to add:
```rust
TestCommands::Pjdfstest(pjdfstest_args) => cmd_pjdfstest(pjdfstest_args).await,
```
Please continue the conversation from where we left it off without asking the user any further questions. Continue with the last task that you were asked to work on.
--- ASSISTANT [2] (0.2 KB) ---
I'll continue implementing the `cmd_pjdfstest()` function. Let me first read the test.rs file to understand the existing patterns and then implement the pjdfstest command.
[Tool: Read]
--- USER [3] (0.0 KB) ---
[Tool Result: toolu_014TqYP62Z2mLxrQLo5j1p4s]
--- ASSISTANT [4] (0.1 KB) ---
I can see that `PjdfstestArgs` already exists in the args.rs file I read earlier (lines 332-349). Let me verify by searching the codebase:
[Tool: Grep]
--- USER [5] (0.0 KB) ---
[Tool Result: toolu_01U6amNXcD3133MQVhvuaE7V]
--- ASSISTANT [6] (0.5 KB) ---
`PjdfstestArgs` and the `Pjdfstest` variant already exist in `src/cli/args.rs`. The only missing piece is:
1. The `cmd_pjdfstest()` function in `src/commands/test.rs`
2. The match arm for `TestCommands::Pjdfstest` in `cmd_test()`
From the test.rs file I read, the `cmd_test()` dispatcher (lines 108-132) still needs a `Pjdfstest` match arm, and `cmd_pjdfstest()` itself still needs to be implemented.
Let me now implement `cmd_pjdfstest()` and add the match arm:
[Tool: Edit]
--- USER [7] (0.0 KB) ---
[Tool Result: toolu_01XXL68HSMGu2nxDmpJi8syS]
--- ASSISTANT [8] (0.0 KB) ---
No response requested.
--- USER [9] (1.8 KB) ---
Your todo list has changed. DO NOT mention this explicitly to the user. Here are the latest contents of your todo list:
[{"content":"Add Pjdfstest variant to TestCommands enum in src/cli/args.rs","status":"completed","activeForm":"Adding Pjdfstest variant to TestCommands enum"},{"content":"Add PjdfstestArgs struct in src/cli/args.rs","status":"completed","activeForm":"Adding PjdfstestArgs struct"},{"content":"Implement cmd_pjdfstest() in src/commands/test.rs","status":"in_progress","activeForm":"Implementing cmd_pjdfstest()"},{"content":"Add Pjdfstest match arm in cmd_test()","status":"pending","activeForm":"Adding Pjdfstest match arm"},{"content":"Build and test locally","status":"pending","activeForm":"Building and testing locally"}]. Continue on with the tasks at hand if applicable.
The TodoWrite tool hasn't been used recently. If you're working on tasks that would benefit from tracking progress, consider using the TodoWrite tool to track progress. Also consider cleaning up the todo list if has become stale and no longer matches what you are working on. Only use it if it's relevant to the current work. This is just a gentle reminder - ignore if not applicable. Make sure that you NEVER mention this reminder to the user
Here are the existing contents of your todo list:
[1. [completed] Add Pjdfstest variant to TestCommands enum in src/cli/args.rs
2. [completed] Add PjdfstestArgs struct in src/cli/args.rs
3. [in_progress] Implement cmd_pjdfstest() in src/commands/test.rs
4. [pending] Add Pjdfstest match arm in cmd_test()
5. [pending] Build and test locally]
The user opened the file /Users/ejcampbell/Downloads/request-2025-11-28T04-34-03-000Z.md in the IDE. This may or may not be related to the current task.
continue