When a container is deleted but its snapshot survives (dev image rebuild,
containerd metadata loss, manual ctr deletion), the reconciliation path
previously created a fresh container and unconditionally destroyed the
old snapshot via prepareSnapshot, causing complete data loss.
Manager.Start now detects orphaned snapshots before EnsureBot runs,
exports /data to a backup archive, and restores it into the new
container's snapshot before the task starts.
* fix(containerd): prevent silent network failures from leaving containers unreachable (#202)
* fix(containerd): prevent silent network failures from leaving containers unreachable
Container network setup failures were silently swallowed at multiple
points in the call chain, leaving containers in a "running but
unreachable" ghost state. This patch closes every silent-failure path:
- setupCNINetwork: return error when CNI yields no usable IP
- Manager.Start: roll back container when IP is empty instead of
returning success
- ensureContainerAndTask: extract setupNetworkOrFail with 1 retry,
propagate error to callers
- ReconcileContainers: stop reporting "healthy" when network setup fails
- recoverContainerIP: retry up to 2 times with backoff for transient
CNI failures (IPAM lock contention, etc.)
- gRPC Pool: evict connections stuck in Connecting state for >30s
* fix(containerd): clean stale cni0 bridge on startup to prevent MAC error
After a Docker container restart, the cni0 bridge interface can linger
with a zeroed MAC (00:00:00:00:00:00) and DOWN state. The CNI bridge
plugin then fails with "could not set bridge's mac: invalid argument",
making all MCP containers unreachable.
Two-layer fix:
- Entrypoint: delete cni0 and flush IPAM state before starting containerd
- Go: detect bridge MAC errors in setupCNINetwork and auto-delete cni0
before retrying, as defense-in-depth for runtime recovery
* fix(containerd): use exec.CommandContext to satisfy noctx linter
* fix(mcp): propagate network errors from replaceContainerSnapshot
Network setup failure after snapshot replace (rollback/commit) was
silently swallowed — the container would start but remain unreachable
via gRPC. Return the error so callers (CreateSnapshot, RollbackVersion,
etc.) surface the failure instead of reporting success.
* feat(container): add explicit data workflows and snapshot rollback
Make container upgrades and recreation data-safe by adding explicit preserve, export, import, restore, and rollback flows across the backend, SDK, and web UI.
* fix(container): resolve go lint issues
Fix formatting and lint violations introduced by the container data workflow changes so the Go CI lint job passes cleanly.
Replace the host bind-mount + containerd exec approach with a per-bot
in-container gRPC server (ContainerService, port 9090). All file I/O,
exec, and MCP stdio sessions now go through gRPC instead of running
shell commands or reading host-mounted directories.
Architecture changes:
- cmd/mcp: rewritten as a gRPC server (ContainerService) with full
file and exec API (ReadFile, WriteFile, ListDir, ReadRaw, WriteRaw,
Exec, Stat, Mkdir, Rename, DeleteFile)
- internal/mcp/mcpcontainer: protobuf definitions and generated stubs
- internal/mcp/mcpclient: gRPC client wrapper with connection pool
(Pool) and Provider interface for dependency injection
- mcp.Manager: add per-bot IP cache, gRPC connection pool, and
SetContainerIP/MCPClient methods; remove DataDir/Exec helpers
- containerd.Service: remove ExecTask/ExecTaskStreaming; network setup
now returns NetworkResult{IP} for pool routing
- internal/fs/service.go: deleted (replaced by mcpclient)
- handlers/fs.go: deleted; MCP stdio session logic moved to mcp_stdio.go
- container provider Executor: all tools (read/write/list/edit/exec)
now call gRPC client instead of running shell via exec
- storefs, containerfs, media, skills, memory: all I/O ported to
mcpclient.Provider
Database:
- migration 0022: drop host_path column from containers table
One-time data migration:
- migrateBindMountData: on first Start() after upgrade, copies old
bind-mount data into the container via gRPC, then renames src dir
to prevent re-migration; runs in background goroutine
Bug fixes:
- mcp_stdio: callRaw now returns full JSON-RPC envelope
{"jsonrpc","id","result"|"error"} matching protocol spec;
explicit "initialize" call now advances session init state to
prevent duplicate handshake on next non-initialize call
- mcpclient Pool: properly evict stale gRPC connection after snapshot
replace (container process recreated); use SetContainerIP instead
of direct map write so IP changes always evict pool entry
- migrateBindMountData: walkErr on directories now counted as failure
so partially-walked trees don't get incorrectly marked as migrated
- cmd/mcp/Dockerfile: removed dead file (docker/Dockerfile.mcp is the
canonical production build)
Tests:
- provider_test.go: restored with bufconn in-process gRPC mock
(fakeContainerService + staticProvider), 14 cases covering all 5
tools plus edge cases
- mcp_session_test.go: new, covers JSON-RPC envelope, init state
machine, pending cleanup on cancel/close, readLoop cancel
- storefs/service_test.go: restored (pure function roundtrip tests)
Add mcp-build.sh that compiles the MCP binary and packages it as an
OCI image layer on top of the base rootfs, imported directly into
containerd. Air triggers rebuild on code changes, cleaning stale
containers automatically.
Consolidate dev-only files (Dockerfiles, entrypoint, config, build
script) into devenv/ to separate dev tooling from production artifacts.
Skip image pull when already imported to speed up dev startup.
* feat(devenv): add containerized development environment
Replace local-process dev workflow with a fully containerized stack
using docker compose. This enables consistent development across
machines without requiring local Go/Node toolchains or containerd.
- Add Dockerfile.server.dev with containerd + CNI networking support
- Add Dockerfile.web.dev for frontend dev server
- Add server-dev-entrypoint.sh for containerd lifecycle management
- Expand devenv/docker-compose.yml with server, agent, web, migrate
and deps services with proper health checks and dependency ordering
- Update app.dev.toml to use container service names instead of localhost
- Refactor mise.toml dev tasks to drive docker compose workflow
- Support agent_gateway.server_addr in config package for inter-container
communication
* feat(devenv): add hot-reload and registry mirror support
- Add air for Go server hot-reload in dev containers
- Fix agent_gateway host in dev config (0.0.0.0 -> agent)
- Add configurable registry mirror for China mainland users
- Unify MCP image refs via MCPConfig.ImageRef()
* feat(scripts): add China mainland mirror option to install script
Prompt users to opt-in to memoh.cn mirror during installation,
which applies docker-compose.cn.yml overlay and sets registry
in config.toml for MCP image pulls.
Merge upstream fx refactor and adapt all services to use go.uber.org/fx
for dependency injection. Resolve conflicts in main.go, server.go,
and service constructors while preserving our domain model changes.
- Fix telegram adapter panic on shutdown (double close channel)
- Fix feishu adapter processing messages after stop
- Increase directory lookup timeout from 2s to 5s
- Rename chat module to conversation with flow-based architecture
- Move channelidentities into channel/identities subpackage
- Add channel/route for routing logic
- Add message service with event hub
- Add MCP providers: container, directory, schedule
- Refactor Feishu/Telegram adapters with directory and stream support
- Add platform management page and channel badges in web UI
- Update database schema for conversations, messages and channel routes
- Add @memoh/shared package for cross-package type definitions
Align channel identity and bind flow across backend and app-facing layers, including generated swagger artifacts and package lock updates while excluding docs content changes.
- Refactor channel manager with support for Sender/Receiver interfaces and hot-swappable adapters.
- Implement identity routing and pre-authentication logic for inbound messages.
- Update database schema to support bot pre-auth keys and extended channel session metadata.
- Add Telegram and Feishu channel configuration and adapter enhancements.
- Update Swagger documentation and internal handlers for channel management.
Co-authored-by: Cursor <cursoragent@cursor.com>
Major changes:
1. Core Architecture: Decoupled Bots from Users. Bots now have independent lifecycles, member management (bot_members), and dedicated configurations.
2. Channel Gateway:
- Implemented a unified Channel Manager supporting Feishu, Telegram, and Local (Web/CLI) adapters.
- Added message processing pipeline to normalize interactions across different platforms.
- Introduced a Contact system for identity binding and guest access policies.
3. Database & Tooling:
- Consolidated all migrations into 0001_init with updated schema for bots, channels, and contacts.
- Optimized sqlc.yaml to automatically track the migrations directory.
4. Agent Enhancements:
- Introduced ToolContext to provide Agents with platform-aware execution capabilities (e.g., messaging, contact lookups).
- Added tool logging and fallback mechanisms for toolChoice execution.
5. UI & Docs: Updated frontend stores, UI components, and Swagger documentation to align with the new Bot-centric model.