* fix(containerd): prevent silent network failures from leaving containers unreachable
Container network setup failures were silently swallowed at multiple
points in the call chain, leaving containers in a "running but
unreachable" ghost state. This patch closes every silent-failure path:
- setupCNINetwork: return error when CNI yields no usable IP
- Manager.Start: roll back container when IP is empty instead of
returning success
- ensureContainerAndTask: extract setupNetworkOrFail with 1 retry,
propagate error to callers
- ReconcileContainers: stop reporting "healthy" when network setup fails
- recoverContainerIP: retry up to 2 times with backoff for transient
CNI failures (IPAM lock contention, etc.); see the sketch after this list
- gRPC Pool: evict connections stuck in Connecting state for >30s
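A minimal sketch of the bounded retry used by recoverContainerIP, with the real CNI call reduced to a callback; the backoff values and all names other than recoverContainerIP are illustrative assumptions, not the actual implementation:

```go
package containerd

import (
	"context"
	"fmt"
	"time"
)

// recoverIPWithRetry sketches the bounded retry: one initial attempt plus
// up to 2 retries with doubling backoff for transient CNI failures such
// as IPAM lock contention. The setup callback stands in for the real CNI
// call; the 500ms initial backoff is an assumption.
func recoverIPWithRetry(ctx context.Context, id string,
	setup func(context.Context, string) (string, error)) (string, error) {
	backoff := 500 * time.Millisecond
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		ip, err := setup(ctx, id)
		if err == nil && ip != "" {
			return ip, nil // usable IP: done
		}
		if err == nil {
			// Mirror the setupCNINetwork change above: an empty IP is an
			// error, never silent success.
			err = fmt.Errorf("CNI returned no usable IP for %s", id)
		}
		lastErr = err
		if attempt < 2 {
			select {
			case <-ctx.Done():
				return "", ctx.Err()
			case <-time.After(backoff):
				backoff *= 2
			}
		}
	}
	return "", lastErr
}
```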
* fix(containerd): clean stale cni0 bridge on startup to prevent MAC error
After a Docker container restart, the cni0 bridge interface can linger
with a zeroed MAC (00:00:00:00:00:00) and DOWN state. The CNI bridge
plugin then fails with "could not set bridge's mac: invalid argument",
making all MCP containers unreachable.
Two-layer fix:
- Entrypoint: delete cni0 and flush IPAM state before starting containerd
- Go: detect bridge MAC errors in setupCNINetwork and auto-delete cni0
before retrying, as defense-in-depth for runtime recovery
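A sketch of the Go-side recovery layer, assuming the error is matched by the substring quoted above and the bridge is dropped via the `ip` CLI; the helper name is hypothetical:

```go
package containerd

import (
	"context"
	"os/exec"
	"strings"
)

// cleanStaleCNIBridge sketches the defense-in-depth layer: if CNI setup
// failed with the bridge-MAC error, delete the lingering cni0 so the next
// CNI ADD recreates it with a valid MAC. Substring matching and the
// `ip link delete` invocation are assumptions for illustration.
func cleanStaleCNIBridge(ctx context.Context, setupErr error) error {
	if setupErr == nil ||
		!strings.Contains(setupErr.Error(), "could not set bridge's mac") {
		return setupErr // not the stale-bridge case; propagate as-is
	}
	// Drop the stale bridge; the caller then retries setupCNINetwork.
	return exec.CommandContext(ctx, "ip", "link", "delete", "cni0").Run()
}
```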
* fix(containerd): use exec.CommandContext to satisfy noctx linter
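The change is mechanical; a sketch of the pattern the linter wants, with an illustrative helper name:

```go
package containerd

import (
	"context"
	"os/exec"
)

// noctx flags plain exec.Command because the child process outlives a
// cancelled request; exec.CommandContext kills it when ctx is done.
func runCmd(ctx context.Context, name string, args ...string) ([]byte, error) {
	return exec.CommandContext(ctx, name, args...).CombinedOutput()
}
```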
* feat(container): add explicit data workflows and snapshot rollback
Make container upgrades and recreation data-safe by adding explicit preserve, export, import, restore, and rollback flows across the backend, SDK, and web UI.
* fix(container): resolve go lint issues
Fix formatting and lint violations introduced by the container data workflow changes so the Go CI lint job passes cleanly.
* refactor(mcp): replace host bind-mount + exec with per-bot in-container gRPC server
Replace the host bind-mount + containerd exec approach with a per-bot
in-container gRPC server (ContainerService, port 9090). All file I/O,
exec, and MCP stdio sessions now go through gRPC instead of running
shell commands or reading host-mounted directories.
Architecture changes:
- cmd/mcp: rewritten as a gRPC server (ContainerService) with full
file and exec API (ReadFile, WriteFile, ListDir, ReadRaw, WriteRaw,
Exec, Stat, Mkdir, Rename, DeleteFile)
- internal/mcp/mcpcontainer: protobuf definitions and generated stubs
- internal/mcp/mcpclient: gRPC client wrapper with connection pool
(Pool) and Provider interface for dependency injection (sketched
after this list)
- mcp.Manager: add per-bot IP cache, gRPC connection pool, and
SetContainerIP/MCPClient methods; remove DataDir/Exec helpers
- containerd.Service: remove ExecTask/ExecTaskStreaming; network setup
now returns NetworkResult{IP} for pool routing
- internal/fs/service.go: deleted (replaced by mcpclient)
- handlers/fs.go: deleted; MCP stdio session logic moved to mcp_stdio.go
- container provider Executor: all tools (read/write/list/edit/exec)
now call gRPC client instead of running shell via exec
- storefs, containerfs, media, skills, memory: all I/O ported to
mcpclient.Provider
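A minimal sketch of how a tool call flows through the new layer, assuming an illustrative shape for mcpclient.Provider and the pooled client; the real interfaces likely differ:

```go
package mcpclient

import "context"

// FileClient is an illustrative stand-in for the generated
// ContainerService gRPC client; only the calls used below are shown.
type FileClient interface {
	ReadFile(ctx context.Context, path string) ([]byte, error)
	WriteFile(ctx context.Context, path string, data []byte) error
}

// Provider hands out pooled per-bot connections; the Pool keeps one gRPC
// conn per container IP and is what gets evicted when the IP changes.
type Provider interface {
	Client(ctx context.Context, botID string) (FileClient, error)
}

// readFile shows the executor path after this change: no shell exec, no
// host-mounted directory, just a gRPC round-trip into the container.
func readFile(ctx context.Context, p Provider, botID, path string) ([]byte, error) {
	c, err := p.Client(ctx, botID)
	if err != nil {
		return nil, err
	}
	return c.ReadFile(ctx, path)
}
```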
Database:
- migration 0022: drop host_path column from containers table
One-time data migration:
- migrateBindMountData: on first Start() after upgrade, copies old
bind-mount data into the container via gRPC, then renames src dir
to prevent re-migration; runs in background goroutine
Bug fixes:
- mcp_stdio: callRaw now returns full JSON-RPC envelope
{"jsonrpc","id","result"|"error"} matching protocol spec;
explicit "initialize" call now advances session init state to
prevent duplicate handshake on next non-initialize call
- mcpclient Pool: properly evict stale gRPC connection after snapshot
replace (container process recreated); use SetContainerIP instead
of direct map write so IP changes always evict pool entry
- migrateBindMountData: walkErr on directories now counted as failure
so partially-walked trees don't get incorrectly marked as migrated
- cmd/mcp/Dockerfile: removed dead file (docker/Dockerfile.mcp is the
canonical production build)
Tests:
- provider_test.go: restored with bufconn in-process gRPC mock
(fakeContainerService + staticProvider), 14 cases covering all 5
tools plus edge cases; the bufconn pattern is sketched after this list
- mcp_session_test.go: new, covers JSON-RPC envelope, init state
machine, pending cleanup on cancel/close, readLoop cancel
- storefs/service_test.go: restored (pure function roundtrip tests)
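The bufconn pattern from provider_test.go, reduced to its core; registration of fakeContainerService is left as a callback since the generated stubs aren't shown here:

```go
package mcpclient_test

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/test/bufconn"
)

// newBufconnClient serves a gRPC server on an in-memory listener and
// dials it without any real network, which is what lets the tests mock
// ContainerService in-process.
func newBufconnClient(t *testing.T, register func(*grpc.Server)) *grpc.ClientConn {
	t.Helper()
	lis := bufconn.Listen(1 << 20)
	srv := grpc.NewServer()
	register(srv) // e.g. register fakeContainerService here
	go func() { _ = srv.Serve(lis) }()
	t.Cleanup(srv.Stop)

	conn, err := grpc.NewClient("passthrough:///bufnet",
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return lis.DialContext(ctx)
		}),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatalf("dial bufnet: %v", err)
	}
	t.Cleanup(func() { _ = conn.Close() })
	return conn
}
```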
* fix(containerd): harden snapshot lifecycle and container replacement
- Fix DeleteContainer FAILED_PRECONDITION by cleaning up stopped task
entries before container deletion
- Fix CreateSnapshot leaving container in broken state: commit turns
the active snapshot read-only, so the full cycle (stop → commit →
prepare → delete → recreate → start) is now applied consistently
- Use context.WithoutCancel for atomic container replacement sequences
to prevent cancelled HTTP requests from corrupting container state
(sketched after this list)
- Use dctx for DB operations (recordSnapshotVersion/insertEvent) to
avoid orphan snapshots in containerd without matching DB records
- Restart task + network after snapshot replacement, fixing Exec after
CreateVersion where the container had no running task
- Extract replaceContainerSnapshot helper to deduplicate the prepare →
delete → recreate → start pattern across three call sites
- Move snapshot list data fetching into Manager.ListBotSnapshotData to
encapsulate per-container locking; remove exported LockBot method
- Use UnixNano for snapshot names to avoid second-precision collisions
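The detach pattern behind the context.WithoutCancel bullet, sketched with the replacement steps reduced to a callback; only WithoutCancel itself is real here:

```go
package containerd

import "context"

// runDetached sketches how the replacement sequence (stop, commit,
// prepare, delete, recreate, start) is shielded from client aborts:
// context.WithoutCancel (Go 1.21+) keeps the parent's values but ignores
// its cancellation, so a dropped HTTP request can no longer interrupt the
// sequence halfway and leave the container in a broken state.
func runDetached(ctx context.Context, steps func(context.Context) error) error {
	dctx := context.WithoutCancel(ctx)
	return steps(dctx) // the same dctx also covers the DB writes above
}
```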
* feat(models): add per-model probe testing and auto-detect in UI
Move health probes from provider level to model level for precise
testing with real model_id and client_type. Provider test is now a
simple reachability check.
Backend:
- Add POST /models/:id/test endpoint that probes the model's provider
using its actual model_id and client_type (sketched below)
- Add model healthcheck checker for bot health checks (chat/memory/embedding)
- Simplify provider test to reachability-only
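A sketch of what the per-model probe does conceptually; the types and the probe callback are illustrative assumptions, not the real handler:

```go
package models

import (
	"context"
	"time"
)

// Illustrative types; the real model record and probe differ.
type Model struct {
	ModelID    string // provider-side model identifier
	ClientType string // which client protocol to probe with
}

type ProbeResult struct {
	OK        bool
	LatencyMS int64
	Err       string
}

// probeModel sketches POST /models/:id/test: unlike the old
// provider-level check, it exercises the exact model_id/client_type
// pair, so a reachable provider with a misconfigured model still fails.
func probeModel(ctx context.Context, m Model,
	probe func(context.Context, Model) error) ProbeResult {
	start := time.Now()
	if err := probe(ctx, m); err != nil {
		return ProbeResult{Err: err.Error(),
			LatencyMS: time.Since(start).Milliseconds()}
	}
	return ProbeResult{OK: true, LatencyMS: time.Since(start).Milliseconds()}
}
```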
Frontend:
- Auto-probe models on mount with status indicator (green/yellow/red dot + latency)
- Auto-probe provider reachability on load and on provider switch
- Fix missing faBolt icon registration
- Manual re-probe via refresh button
Closes #117
* fix(models): increase probe timeout to 15s for slow providers
Some providers (e.g. DashScope) exceed the 5s probe timeout, causing
false-negative "context deadline exceeded" errors. Increase per-probe
timeout to 15s and healthcheck overall timeout to 30s.
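A sketch of the resulting timeout split; only the context handling is the point, the probe slice is illustrative:

```go
package models

import (
	"context"
	"time"
)

// runProbes sketches the budgets after this fix: 15s per probe, 30s for
// the whole healthcheck, so one slow provider cannot starve the rest.
func runProbes(ctx context.Context, probes []func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	for _, probe := range probes {
		pctx, pcancel := context.WithTimeout(ctx, 15*time.Second)
		err := probe(pctx)
		pcancel() // release the per-probe timer promptly
		if err != nil {
			return err
		}
	}
	return nil
}
```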
* fix(sdk): regenerate exports after merge conflict
Resolve duplicate SDK exports introduced by merge conflict resolution so the web build can compile again while preserving new model probe endpoints.
* feat: add email service with multi-adapter support
Implement a full-stack email service with global provider management,
per-bot bindings with granular read/write permissions, outbox audit
storage, and MCP tool integration for direct mailbox access.
Backend:
- Email providers: CRUD with dynamic config schema (generic SMTP/IMAP, Mailgun)
- Generic adapter: go-mail (SMTP) + go-imap/v2 (IMAP IDLE real-time push via
UnilateralDataHandler + UID-based tracking + periodic check fallback)
- Mailgun adapter: mailgun-go/v5 with dual inbound mode (webhook + poll)
- Bot email bindings: per-bot provider binding with independent r/w permissions
- Outbox: outbound email audit log with status tracking
- Trigger: inbound emails push a notification to bot_inbox (from/subject
only; the LLM reads full content on demand via MCP tools)
- MailboxReader interface: on-demand IMAP queries for listing/reading
emails (sketched after this list)
- MCP tools: email_accounts, email_send, email_list (paginated mailbox),
email_read (by UID) — all with multi-binding and provider_id selection
- Webhook: /email/mailgun/webhook/:config_id (JWT-skipped, signature-verified)
- DB migration: 0019_add_email (email_providers, bot_email_bindings, email_outbox)
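An illustrative shape for the on-demand read path; the method names and types are assumptions, not the actual MailboxReader interface:

```go
package email

import "context"

// EmailSummary carries only what the trigger stores (from/subject plus
// the UID needed to fetch the rest later).
type EmailSummary struct {
	UID     uint32
	From    string
	Subject string
}

// MailboxReader sketches the on-demand IMAP query surface: the LLM pulls
// full bodies only when asked, instead of receiving them in the inbox.
type MailboxReader interface {
	// List returns paginated summaries, backing the email_list tool.
	List(ctx context.Context, mailbox string, offset, limit int) ([]EmailSummary, error)
	// Read fetches one full message by UID, backing the email_read tool.
	Read(ctx context.Context, mailbox string, uid uint32) ([]byte, error)
}
```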
Frontend:
- Email Providers page: /email-providers with MasterDetailSidebarLayout
- Dynamic config form rendered from ordered provider meta schema with i18n keys
- Bot detail: Email tab with bindings management + outbox audit table
- Sidebar navigation entry
- Full i18n support (en + zh)
- Auto-generated SDK from Swagger
Closes #17
* feat(email): trigger bot conversation immediately on inbound email
Instead of only storing an inbox item and waiting for the next chat,
the email trigger now proactively invokes the conversation resolver
so the bot processes new emails right away — aligned with the
schedule/heartbeat trigger pattern.
* fix: lint
---------
Co-authored-by: Acbox <acbox0328@gmail.com>
* feat(devenv): add containerized development environment
Replace local-process dev workflow with a fully containerized stack
using docker compose. This enables consistent development across
machines without requiring local Go/Node toolchains or containerd.
- Add Dockerfile.server.dev with containerd + CNI networking support
- Add Dockerfile.web.dev for frontend dev server
- Add server-dev-entrypoint.sh for containerd lifecycle management
- Expand devenv/docker-compose.yml with server, agent, web, migrate
and deps services with proper health checks and dependency ordering
- Update app.dev.toml to use container service names instead of localhost
- Refactor mise.toml dev tasks to drive docker compose workflow
- Support agent_gateway.server_addr in config package for inter-container
communication
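The config addition, sketched as the TOML-tagged struct shape it implies; the field layout and type names are assumptions:

```go
package config

// AgentGateway sketches the [agent_gateway] TOML table: ServerAddr lets
// the agent container reach the server by its compose service name
// instead of localhost. Exact field layout is illustrative.
type AgentGateway struct {
	ServerAddr string `toml:"server_addr"`
}

type Config struct {
	AgentGateway AgentGateway `toml:"agent_gateway"`
}
```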
* feat(devenv): add hot-reload and registry mirror support
- Add air for Go server hot-reload in dev containers
- Fix agent_gateway host in dev config (0.0.0.0 -> agent)
- Add configurable registry mirror for China mainland users
- Unify MCP image refs via MCPConfig.ImageRef()
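A sketch of the unified image-ref helper; only the ImageRef() name comes from the commit, the fields are assumptions:

```go
package config

// MCPConfig sketches the one place that decides whether a mirror
// registry prefixes the MCP image reference.
type MCPConfig struct {
	Registry string // optional mirror, empty for the default registry
	Image    string // MCP image reference
}

func (c MCPConfig) ImageRef() string {
	if c.Registry == "" {
		return c.Image
	}
	return c.Registry + "/" + c.Image
}
```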
* feat(scripts): add China mainland mirror option to install script
Prompt users to opt-in to memoh.cn mirror during installation,
which applies docker-compose.cn.yml overlay and sets registry
in config.toml for MCP image pulls.
* fix(containerd): restore MCP networking after server container restart
A server container restart drops the cni0 bridge, veth pairs, and
iptables masquerade rules in its network namespace while MCP tasks keep
running in containerd.
Reconcile and ensureContainerAndTask now re-run SetupNetwork for already-
running tasks so outbound connectivity is restored.
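A sketch of the reconcile change, with the callback shape and names illustrative:

```go
package containerd

import "context"

// reconcileNetwork sketches the new behavior: a task found already
// running no longer skips network setup, because a server restart can
// wipe the bridge and NAT rules while the task keeps going.
func reconcileNetwork(ctx context.Context, taskRunning bool,
	setupNetwork func(context.Context) error) error {
	if !taskRunning {
		return nil // the normal start path handles networking
	}
	// Re-apply network setup for the surviving task to restore outbound
	// connectivity (cni0, veth, masquerade).
	return setupNetwork(ctx)
}
```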
* feat(channels): propagate conversation type to the agent prompt
Propagate conversation type (direct/group/thread) from channel adapters
all the way to the agent prompt. Store conversation_type on bot_channel_routes
so the bot knows whether a message originates from a p2p chat, group, or thread.
Schema changes are folded into the 0001 init migration (destructive update).
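A sketch of the values carried end to end; the string values mirror direct/group/thread above, the Go names are illustrative:

```go
package channels

// ConversationType tells the agent prompt what kind of conversation a
// message came from.
type ConversationType string

const (
	ConversationDirect ConversationType = "direct" // p2p chat
	ConversationGroup  ConversationType = "group"
	ConversationThread ConversationType = "thread"
)
```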
* feat(health): dispatch bot health checks individually by key
- Refactor RuntimeChecker interface: CheckKeys() + RunCheck() for
individual check dispatch instead of batch-all (see the sketch below)
- Add GET /bots/:id/checks/keys to list all available check keys
- Add GET /bots/:id/checks/run/:key to evaluate a single check
- MCP ConnectionChecker probes each active connection independently
via tools/list with 8s timeout
- Keep container checks (init/record/task/data_path) as fast builtins
- Graceful network setup failure in containerd handler (log warning
instead of killing task) for containerd-in-docker compatibility
In containerd-in-docker mode, SetupNetwork fails because netns is
unavailable. Previously this killed the task, making stdio MCP tools
unusable. Now the task continues running with a warning log, since
stdio MCP communication does not require networking.
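A sketch of the refactored interface; only the method names CheckKeys and RunCheck come from the description, the result type is an assumption:

```go
package health

import "context"

// CheckResult is an illustrative result shape.
type CheckResult struct {
	Key string
	OK  bool
	Err string
}

// RuntimeChecker enumerates its checks and runs them one at a time
// instead of batch-all.
type RuntimeChecker interface {
	// CheckKeys lists the checks this checker can run,
	// backing GET /bots/:id/checks/keys.
	CheckKeys() []string
	// RunCheck evaluates a single check by key,
	// backing GET /bots/:id/checks/run/:key.
	RunCheck(ctx context.Context, key string) CheckResult
}
```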
* feat(mcp): import/export MCP connections in standard mcpServers format
- Accept standard mcpServers item format (command/args/env/url/headers)
- Auto-infer connection type: command -> stdio, url -> http/sse
(sketched after this list)
- Add PUT /bots/:bot_id/mcp/import for batch import from mcpServers dict
- Add GET /bots/:bot_id/mcp/export for standard format export
- Add UpsertMCPConnectionByName SQL for import upsert by name
- Preserve is_active state on import upsert
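A sketch of the inference rule; the entry struct mirrors the standard mcpServers item fields listed above, while the JSON tags and helper name are illustrative:

```go
package mcp

import "errors"

// MCPServerEntry mirrors a standard mcpServers item.
type MCPServerEntry struct {
	Command string            `json:"command,omitempty"`
	Args    []string          `json:"args,omitempty"`
	Env     map[string]string `json:"env,omitempty"`
	URL     string            `json:"url,omitempty"`
	Headers map[string]string `json:"headers,omitempty"`
}

// inferConnectionType encodes the stated rule: command implies stdio,
// url implies http/sse.
func inferConnectionType(e MCPServerEntry) (string, error) {
	switch {
	case e.Command != "":
		return "stdio", nil
	case e.URL != "":
		return "http", nil // or "sse", depending on the endpoint
	default:
		return "", errors.New("mcpServers entry has neither command nor url")
	}
}
```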