Commit Graph

23 Commits

Author SHA1 Message Date
BBQ abbb14c59f fix(containerd): prevent silent network failures from leaving containers unreachable (#202)
* fix(containerd): prevent silent network failures from leaving containers unreachable

Container network setup failures were silently swallowed at multiple
points in the call chain, leaving containers in a "running but
unreachable" ghost state. This patch closes every silent-failure path:

- setupCNINetwork: return error when CNI yields no usable IP
- Manager.Start: roll back container when IP is empty instead of
  returning success
- ensureContainerAndTask: extract setupNetworkOrFail with 1 retry,
  propagate error to callers
- ReconcileContainers: stop reporting "healthy" when network setup fails
- recoverContainerIP: retry up to 2 times with backoff for transient
  CNI failures (IPAM lock contention, etc.)
- gRPC Pool: evict connections stuck in Connecting state for >30s

* fix(containerd): clean stale cni0 bridge on startup to prevent MAC error

After a Docker container restart, the cni0 bridge interface can linger
with a zeroed MAC (00:00:00:00:00:00) and DOWN state. The CNI bridge
plugin then fails with "could not set bridge's mac: invalid argument",
making all MCP containers unreachable.

Two-layer fix:
- Entrypoint: delete cni0 and flush IPAM state before starting containerd
- Go: detect bridge MAC errors in setupCNINetwork and auto-delete cni0
  before retrying, as defense-in-depth for runtime recovery

* fix(containerd): use exec.CommandContext to satisfy noctx linter
2026-03-07 17:50:01 +08:00
BBQ 21999b49f4 feat(container): add explicit data workflows and snapshot rollback (#193)
* feat(container): add explicit data workflows and snapshot rollback

Make container upgrades and recreation data-safe by adding explicit preserve, export, import, restore, and rollback flows across the backend, SDK, and web UI.

* fix(container): resolve go lint issues

Fix formatting and lint violations introduced by the container data workflow changes so the Go CI lint job passes cleanly.
2026-03-06 17:57:48 +08:00
BBQ 3feb03aca7 ci: add go lint and race test workflow (#187) 2026-03-05 11:25:33 +08:00
BBQ 9ceabf68c4 feat(mcp): replace bind-mount+exec with in-container gRPC service (#179)
Replace the host bind-mount + containerd exec approach with a per-bot
in-container gRPC server (ContainerService, port 9090). All file I/O,
exec, and MCP stdio sessions now go through gRPC instead of running
shell commands or reading host-mounted directories.

Architecture changes:
- cmd/mcp: rewritten as a gRPC server (ContainerService) with full
  file and exec API (ReadFile, WriteFile, ListDir, ReadRaw, WriteRaw,
  Exec, Stat, Mkdir, Rename, DeleteFile)
- internal/mcp/mcpcontainer: protobuf definitions and generated stubs
- internal/mcp/mcpclient: gRPC client wrapper with connection pool
  (Pool) and Provider interface for dependency injection
- mcp.Manager: add per-bot IP cache, gRPC connection pool, and
  SetContainerIP/MCPClient methods; remove DataDir/Exec helpers
- containerd.Service: remove ExecTask/ExecTaskStreaming; network setup
  now returns NetworkResult{IP} for pool routing
- internal/fs/service.go: deleted (replaced by mcpclient)
- handlers/fs.go: deleted; MCP stdio session logic moved to mcp_stdio.go
- container provider Executor: all tools (read/write/list/edit/exec)
  now call gRPC client instead of running shell via exec
- storefs, containerfs, media, skills, memory: all I/O ported to
  mcpclient.Provider

Database:
- migration 0022: drop host_path column from containers table

One-time data migration:
- migrateBindMountData: on first Start() after upgrade, copies old
  bind-mount data into the container via gRPC, then renames src dir
  to prevent re-migration; runs in background goroutine

Bug fixes:
- mcp_stdio: callRaw now returns full JSON-RPC envelope
  {"jsonrpc","id","result"|"error"} matching protocol spec;
  explicit "initialize" call now advances session init state to
  prevent duplicate handshake on next non-initialize call
- mcpclient Pool: properly evict stale gRPC connection after snapshot
  replace (container process recreated); use SetContainerIP instead
  of direct map write so IP changes always evict pool entry
- migrateBindMountData: walkErr on directories now counted as failure
  so partially-walked trees don't get incorrectly marked as migrated
- cmd/mcp/Dockerfile: removed dead file (docker/Dockerfile.mcp is the
  canonical production build)

Tests:
- provider_test.go: restored with bufconn in-process gRPC mock
  (fakeContainerService + staticProvider), 14 cases covering all 5
  tools plus edge cases
- mcp_session_test.go: new, covers JSON-RPC envelope, init state
  machine, pending cleanup on cancel/close, readLoop cancel
- storefs/service_test.go: restored (pure function roundtrip tests)
2026-03-04 21:50:08 +08:00
BBQ 7730096696 fix(containerd): restore CNI MASQUERADE after container restart (#167)
cni.Remove() failure on stale iptables state blocked the retry
cni.Setup(), leaving bot containers without SNAT/MASQUERADE.

- Ignore cni.Remove() error so retry Setup always runs
- Add global MASQUERADE rule in entrypoints as belt-and-suspenders

Closes #161
2026-03-03 16:00:46 +08:00
BBQ ee587b8ef5 fix(mcp): fix snapshot management and encapsulate locking (#169)
- Fix DeleteContainer FAILED_PRECONDITION by cleaning up stopped task
  entries before container deletion
- Fix CreateSnapshot leaving container in broken state: commit turns
  the active snapshot read-only, so the full cycle (stop → commit →
  prepare → delete → recreate → start) is now applied consistently
- Use context.WithoutCancel for atomic container replacement sequences
  to prevent cancelled HTTP requests from corrupting container state
- Use dctx for DB operations (recordSnapshotVersion/insertEvent) to
  avoid orphan snapshots in containerd without matching DB records
- Restart task + network after snapshot replacement, fixing Exec after
  CreateVersion where the container had no running task
- Extract replaceContainerSnapshot helper to deduplicate the prepare →
  delete → recreate → start pattern across three call sites
- Move snapshot list data fetching into Manager.ListBotSnapshotData to
  encapsulate per-container locking; remove exported LockBot method
- Use UnixNano for snapshot names to avoid second-precision collisions
2026-03-03 15:59:57 +08:00
BBQ f68b675efd fix(containerd): normalize image references for containerd compatibility (#138)
Containerd does not auto-prepend "docker.io/" to short Docker Hub names
like "memohai/mcp:latest", causing it to treat "memohai" as a registry
host and fail with EOF. Add NormalizeImageRef() to ensure all image
references are fully qualified before being passed to containerd.
2026-02-26 20:22:36 +08:00
BBQ d2878d841b fix(containerd): use image RootFS for snapshot parent chain ID (#132)
The previous snapshotParentFromLayers manually decompressed layer
blobs to compute diffIDs, which could diverge from the IDs that
containerd's Unpack uses (e.g. gzip vs zstd). Use image.RootFS()
instead — the canonical source containerd itself relies on.
2026-02-26 19:55:46 +08:00
Ran 65c4d6f793 feat(container): support for apple container 2026-02-23 22:40:46 +08:00
BBQ 4fc0ca5110 fix(container): propagate host timezone to all containers
Replace TZ env var with /etc/localtime bind-mount in docker-compose
and inject timezone spec opts into containerd bot containers.
2026-02-20 03:22:31 +08:00
BBQ bc374fe8cd refactor: content-addressed assets, cross-channel multimodal, infra simplification (#63)
* refactor(attachment): multimodal attachment refactor with snapshot schema and storage layer

- Add snapshot schema migration (0008) and update init/versions/snapshots
- Add internal/attachment and internal/channel normalize for unified attachment handling
- Move containerfs provider from internal/media to internal/storage
- Update agent types, channel adapters (Telegram/Feishu), inbound and handlers
- Add containerd snapshot lineage and local_channel tests
- Regenerate sqlc, swagger and SDK

* refactor(media): content-addressed asset system with unified naming

- Replace asset_id foreign key with content_hash as sole identifier
  for bot_history_message_assets (pure soft-link model)
- Remove mime, size_bytes, storage_key from DB; derive at read time
  via media.Resolve from actual storage
- Merge migrations 0008/0009 into single 0008; keep 0001 as canonical schema
- Add Docker initdb script for deterministic migration execution order
- Fix cross-channel real-time image display (Telegram → WebUI SSE)
- Fix message disappearing on refresh (null assets fallback)
- Fix file icon instead of image preview (mime derivation from storage)
- Unify AssetID → ContentHash naming across Go, Agent, and Frontend
- Change storage key prefix from 4-char to 2-char for directory sharding
- Add server-entrypoint.sh for Docker deployment migration handling

* refactor(infra): embedded migrations, Docker simplification, and config consolidation

- Embed SQL migrations into Go binary, removing shell-based migration scripts
- Consolidate config files into conf/ directory (app.example.toml, app.docker.toml, app.dev.toml)
- Simplify Docker setup: remove initdb.d scripts, streamline nginx config and entrypoint
- Remove legacy CLI, feishu-echo commands, and obsolete incremental migration files
- Update install script and docs to require sudo for one-click install
- Add mise tasks for dev environment orchestration

* chore: recover migrations

---------

Co-authored-by: Acbox <acbox0328@gmail.com>
2026-02-19 00:20:27 +08:00
斬風千雪 0bdc31311c improvement(mcp): make CNI binary & data path configurable (#55) 2026-02-17 17:57:13 +08:00
晨苒 5b05f13f1a Merge pull request #36 from ix64/refactor/fx
refactor: using fx
2026-02-12 15:35:01 +08:00
Ran 01cb6c85db fix(deploy): many docker compose bug 2026-02-12 08:23:25 +08:00
MengYX 6548c31597 refactor: using fx 2026-02-11 10:25:40 +08:00
Ran 8b0d90d7b4 fix: cni allocation bug 2026-02-09 08:47:18 +08:00
Ran 26dd8651b7 feat: go cni lifecycle manage 2026-02-08 21:39:34 +08:00
Ran 4e661bae76 fix: mcp containerd fifo 2026-02-07 22:14:38 +08:00
BBQ 46d2968e2c feat: refactor logging system to slog with DI and component tagging 2026-02-01 15:23:57 +08:00
Ran ce825d51c4 feat: mcp fs operate 2026-02-01 01:41:55 +08:00
Ran bb5482b982 refact: go mcp tool in containerd 2026-01-28 04:48:32 +07:00
Ran 0edaba4e74 fix: update go dependencies 2026-01-20 23:23:07 +07:00
Ran d40cc581d2 refactor: initial go service 2026-01-20 00:04:23 +07:00