Memoh/internal/handlers/containerd.go
Menci d5b410d7e3 refactor(workspace): new workspace v3 container architecture (#244)
* feat(mcp): workspace container with bridge architecture

Migrate MCP containers to use UDS-based bridge communication instead of
TCP gRPC. Containers now mount runtime binaries and Unix domain sockets
from the host, eliminating the need for a dedicated MCP Docker image.

- Remove Dockerfile.mcp and entrypoint.sh in favor of standard base images
- Add toolkit Dockerfile for building MCP binary separately
- Containers use bind mounts for /opt/memoh (runtime) and /run/memoh (UDS)
- Update all config files with new runtime_path and socket_dir settings
- Support custom base images per bot (debian, alpine, ubuntu, etc.)
- Legacy container detection and TCP fallback for pre-bridge containers
- Frontend: add base image selector in container creation UI
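For orientation, gRPC can dial the bind-mounted socket directly via its
unix:// target scheme, so no TCP port is exposed. A minimal client-side
sketch (the socket path and insecure transport are illustrative
assumptions, not the bridge package's actual setup):

    package main

    import (
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func main() {
        // Dial the UDS bind-mounted from the host under /run/memoh.
        // Path and credentials here are illustrative only.
        conn, err := grpc.Dial(
            "unix:///run/memoh/bridge.sock",
            grpc.WithTransportCredentials(insecure.NewCredentials()),
        )
        if err != nil {
            log.Fatalf("dial bridge socket: %v", err)
        }
        defer conn.Close()
        // conn would back a generated bridge client stub from here.
    }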

* feat(container): SSE progress bar for container creation

Add real-time progress feedback during container image pull and creation
using Server-Sent Events, without breaking the existing synchronous JSON
API (content negotiation via Accept header).

Backend:
- Add PullProgress/LayerStatus types and OnProgress callback to
  PullImageOptions (containerd service layer)
- DefaultService.PullImage polls ContentStore.ListStatuses every 500ms
  when OnProgress is set (sketched after this list); AppleService ignores it
- CreateContainer handler checks Accept: text/event-stream and switches
  to SSE branch: pulling → pull_progress → creating → complete/error
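A rough sketch of that polling shape, using containerd's content store
API; the function name and callback plumbing are illustrative, only the
500ms interval and ListStatuses call come from the bullet above:

    package sketch

    import (
        "context"
        "time"

        "github.com/containerd/containerd/content"
    )

    // pollPullProgress forwards in-flight layer statuses every 500ms
    // until the pull context is done. Each content.Status carries
    // Offset/Total byte counts for one layer download.
    func pollPullProgress(ctx context.Context, store content.Store, onProgress func([]content.Status)) {
        ticker := time.NewTicker(500 * time.Millisecond)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                statuses, err := store.ListStatuses(ctx)
                if err != nil {
                    continue
                }
                onProgress(statuses)
            }
        }
    }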

Frontend:
- handleCreateContainer/handleRecreateContainer use fetch + SSE instead
  of the SDK's synchronous postBotsByBotIdContainer
- Progress bar shows layer-level pull progress (offset/total) during
  pulling phase and indeterminate animation during creating phase
- i18n keys added for pullingImage and creatingContainer (en/zh)

* fix(container): clear stale legacy route and type create SSE

* fix(ci): resolve lint errors and arm64 musl node.js download

- Fix unused-receiver lint: rename `s` to `_` on stub methods in
  manager_legacy_test.go
- Fix sloglint: use slog.DiscardHandler instead of
  slog.NewTextHandler(io.Discard, nil) (see the snippet after this list)
- Handle missing arm64 musl Node.js builds: unofficial-builds.nodejs.org
  does not provide arm64 musl binaries, fall back to glibc build
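The sloglint change referenced above boils down to:

    package main

    import (
        "io"
        "log/slog"
    )

    func main() {
        // Before: allocates a whole text handler just to discard output.
        _ = slog.New(slog.NewTextHandler(io.Discard, nil))

        // After (Go 1.24+): the dedicated no-op handler sloglint prefers.
        _ = slog.New(slog.DiscardHandler)
    }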

* fix(lint): address errcheck, staticcheck, and gosec findings

- Discard os.Setenv/os.Remove return values explicitly with _
- Use omitted receiver name instead of _ (staticcheck ST1006)
- Tighten directory permissions from 0o755 to 0o750 (gosec G301)

* fix(lint): sanitize socket path to satisfy gosec G703

filepath.Clean the env-sourced socket path before os.Remove
to avoid path-traversal taint warning.

* fix(lint): use nolint directive for gosec G703 on socket path

filepath.Clean does not satisfy gosec's taint analysis. The socket
path comes from MCP_SOCKET_PATH env (operator-configured) or a
compiled-in default, not from end-user input.
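The suppression ends up looking roughly like this; the fallback constant
and function name are hypothetical stand-ins:

    package main

    import (
        "os"
        "path/filepath"
    )

    // defaultSocketPath is a hypothetical compiled-in fallback.
    const defaultSocketPath = "/run/memoh/mcp.sock"

    func cleanupSocket() {
        socketPath := os.Getenv("MCP_SOCKET_PATH")
        if socketPath == "" {
            socketPath = defaultSocketPath
        }
        // The path is operator-configured or compiled in, never end-user
        // input, so the taint finding is suppressed at the callsite.
        //nolint:gosec // G703: operator-controlled path, not user input
        _ = os.Remove(filepath.Clean(socketPath))
    }

    func main() { cleanupSocket() }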

* refactor: rename MCP container/bridge to workspace/bridge

Split internal/mcp/ to separate container lifecycle management from
Model Context Protocol connections, eliminating naming confusion:

- internal/mcp/ (container mgmt) → internal/workspace/
- internal/mcp/mcpclient/ → internal/workspace/bridge/
- internal/mcp/mcpcontainer/ → internal/workspace/bridgepb/
- cmd/mcp/ → cmd/bridge/
- config: MCPConfig → WorkspaceConfig, [mcp] → [workspace]
- container prefix: mcp-{id} → workspace-{id}
- labels: mcp.bot_id → memoh.bot_id, add memoh.workspace=v1
- socket: mcp.sock → bridge.sock, env BRIDGE_SOCKET_PATH
- runtime: /opt/memoh/runtime/mcp → /opt/memoh/runtime/bridge
- devenv: mcp-build.sh → bridge-build.sh

Legacy containers are detected by their mcp- name prefix and handled
via the existing fallback path.

* fix(container): use memoh.workspace=v3 label value

* refactor(container): drop LegacyBotLabelKey, infer bot ID from container name

Legacy containers use mcp-{botID} naming, so bot ID can be derived
via TrimPrefix instead of looking up the mcp.bot_id label.
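Roughly (helper name illustrative):

    package workspace

    import "strings"

    const legacyPrefix = "mcp-"

    // legacyBotID recovers the bot ID from a legacy container name such
    // as "mcp-abc123". TrimPrefix returns its input unchanged when the
    // prefix is absent, so HasPrefix doubles as legacy detection.
    func legacyBotID(containerName string) (string, bool) {
        if !strings.HasPrefix(containerName, legacyPrefix) {
            return "", false
        }
        return strings.TrimPrefix(containerName, legacyPrefix), true
    }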

* fix(workspace): resolve containers via manager and drop gateway container ID

* docs: fix stale mcp references in AGENTS.md and DEPLOYMENT.md

* refactor(workspace): move container lifecycle ownership into manager

* dev: isolate local devenv from prod config

* toolkit: support musl node runtime

* containerd: fix fallback resolv.conf permissions

* web: preserve container create progress on completion

* web: add bot creation wait hint

* fix(workspace): preserve image selection across recreate

* feat(web): shorten default docker hub image refs

* fix(container): address code review findings

- Remove synchronous CreateContainer path (SSE-only now)
- Move flusher check before WriteHeader to avoid committed 200 on error
- Fix legacy container IP not cached via ensureContainerAndTask path
- Add atomic guard to prevent stale pull_progress after PullImage returns
- Defensive copy for tzEnv slice to avoid mutating shared backing array
- Restore network failure severity in restartContainer (return + Error)
- Extract duplicate progress bar into ContainerCreateProgress component
- Fix codesync comments to use repo-relative paths
- Add SaaS image validation note and kernel version comment on reaper

* refactor(devenv): extract toolkit install into shared script

Unify the Node.js + uv download logic into docker/toolkit/install.sh,
used by the production Dockerfile and runnable locally for dev.

Dev environment no longer bakes toolkit into the Docker image — it is
volume-mounted from .toolkit/ instead, so wrapper script changes take
effect immediately without rebuilding. The entrypoint checks for the
toolkit directory and prints a clear error if missing.

* fix(ci): address go ci failures

* chore(docker): remove unused containerd image

* refactor(config): rename workspace image key

* fix(workspace): fix legacy container data loss on migration and stop swallowing errors

Three root causes were identified and fixed:

1. Delete() used hardcoded "workspace-" prefix to look up legacy "mcp-"
   containers, causing GetContainer to return NotFound. CleanupBotContainer
   then silently skipped the error and deleted the DB record without ever
   calling PreserveData. Fix: resolve the actual container ID via
   ContainerID() (DB → label → scan) before operating.

2. Multiple restore error paths were silently swallowed (logged as Warn
   but not returned), so the user saw HTTP 200/204 with no data and no
   error. Fix: all errors in the preserve/restore chain now block the
   workflow and propagate to the caller.

3. tarGzDir used cached DirEntry.Info() for tar header size, which on
   overlayfs can differ from the actual file size, causing "archive/tar:
   write too long". Fix: open the file first, Fstat the fd for a
   race-free size, and use LimitReader as a safeguard (sketched below).
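A sketch of the point-3 fix; the function name and tar plumbing are
illustrative, not the real tarGzDir:

    package sketch

    import (
        "archive/tar"
        "io"
        "os"
    )

    // addFile writes one regular file into tw, sizing the header from
    // the opened fd (f.Stat, i.e. fstat) rather than the cached
    // DirEntry info, so overlayfs size drift cannot trigger
    // "archive/tar: write too long".
    func addFile(tw *tar.Writer, path, name string) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        fi, err := f.Stat() // stats the open fd, race-free vs. the bytes we read
        if err != nil {
            return err
        }
        hdr, err := tar.FileInfoHeader(fi, "")
        if err != nil {
            return err
        }
        hdr.Name = name
        if err := tw.WriteHeader(hdr); err != nil {
            return err
        }
        // LimitReader keeps the copy within the header size even if the
        // file grows mid-copy.
        _, err = io.Copy(tw, io.LimitReader(f, fi.Size()))
        return err
    }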

Also adds a "restoring" SSE phase so the frontend shows a progress
indicator ("Restoring data, this may take a while...") during data
migration on container recreation.

* refactor(workspace): single-point container ID resolution

Replace the `containerID func(string) string` field with a single
`resolveContainerID(ctx, botID)` method that resolves the actual
container ID via DB → label → scan → fallback. All ~16 lookup
callsites across manager.go, dataio.go, versioning.go, and
manager_lifecycle.go now go through this single resolver, which
correctly handles both legacy "mcp-" and new "workspace-" containers.

Only `ensureBotWithImage` inlines `ContainerPrefix + botID` for
creating brand-new containers — every other path resolves dynamically.
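In shape, the resolver looks something like this; the lookup sources
are assumed stand-ins for the real DB/label/scan internals:

    package workspace

    import "context"

    // ContainerPrefix is the new-style name prefix.
    const ContainerPrefix = "workspace-"

    // lookup stands in for the DB, label, and scan sources; each
    // returns ("", nil) when it has no answer.
    type lookup func(ctx context.Context, botID string) (string, error)

    // resolveContainerID tries each source in order (DB, label, scan)
    // and falls back to the new naming scheme, so legacy "mcp-" and new
    // "workspace-" containers both resolve through one path.
    func resolveContainerID(ctx context.Context, botID string, sources []lookup) (string, error) {
        for _, src := range sources {
            id, err := src(ctx, botID)
            if err != nil {
                return "", err
            }
            if id != "" {
                return id, nil
            }
        }
        return ContainerPrefix + botID, nil
    }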

* fix(web): show progress during data backup phase of container recreate

The recreate flow (delete with preserve_data + create with restore_data)
blocked on the DELETE call while backing up /data with no progress
indication. Add a 'preserving' phase to the progress component so
users see "Backing up data..." (zh: 正在备份数据...) instead of an
unexplained hang.

* chore: remove [MYDEBUG] debug logging

Clean up all 112 temporary debug log statements added during the
legacy container migration investigation. Kept only meaningful
warn-level logs for non-fatal errors (network teardown, rename
failures).
2026-03-18 15:19:09 +08:00


package handlers

import (
	"bufio"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"log/slog"
	"net/http"
	"sort"
	"strings"
	"sync"
	"sync/atomic"
	"time"

	"github.com/containerd/errdefs"
	"github.com/labstack/echo/v4"
	"github.com/memohai/memoh/internal/accounts"
	"github.com/memohai/memoh/internal/bots"
	"github.com/memohai/memoh/internal/config"
	ctr "github.com/memohai/memoh/internal/containerd"
	"github.com/memohai/memoh/internal/mcp"
	"github.com/memohai/memoh/internal/policy"
	"github.com/memohai/memoh/internal/workspace"
)
type ContainerdHandler struct {
	manager          *workspace.Manager
	cfg              config.WorkspaceConfig
	containerBackend string
	logger           *slog.Logger
	toolGateway      *mcp.ToolGatewayService
	mcpSess          map[string]*mcpSession
	mcpStdioMu       sync.Mutex
	mcpStdioSess     map[string]*mcpStdioSession
	botService       *bots.Service
	accountService   *accounts.Service
	policyService    *policy.Service
}

type CreateContainerRequest struct {
	Snapshotter string `json:"snapshotter,omitempty"`
	RestoreData bool   `json:"restore_data,omitempty"`
	Image       string `json:"image,omitempty"`
}

type CreateContainerResponse struct {
	ContainerID      string `json:"container_id"`
	Image            string `json:"image"`
	Snapshotter      string `json:"snapshotter"`
	Started          bool   `json:"started"`
	DataRestored     bool   `json:"data_restored"`
	HasPreservedData bool   `json:"has_preserved_data"`
}
// codesync(container-create-stream): keep these SSE payloads in sync with
// packages/sdk/src/container-stream.ts.
type createContainerPullingEvent struct {
	Type  string `json:"type"`
	Image string `json:"image"`
}

type createContainerPullProgressEvent struct {
	Type   string            `json:"type"`
	Layers []ctr.LayerStatus `json:"layers"`
}

type createContainerCreatingEvent struct {
	Type string `json:"type"`
}

type createContainerCompleteEvent struct {
	Type      string                  `json:"type"`
	Container CreateContainerResponse `json:"container"`
}

type createContainerRestoringEvent struct {
	Type string `json:"type"`
}

type createContainerErrorEvent struct {
	Type    string `json:"type"`
	Message string `json:"message"`
}

type GetContainerResponse struct {
	ContainerID      string    `json:"container_id"`
	Image            string    `json:"image"`
	Status           string    `json:"status"`
	Namespace        string    `json:"namespace"`
	ContainerPath    string    `json:"container_path"`
	TaskRunning      bool      `json:"task_running"`
	HasPreservedData bool      `json:"has_preserved_data"`
	Legacy           bool      `json:"legacy"`
	CreatedAt        time.Time `json:"created_at"`
	UpdatedAt        time.Time `json:"updated_at"`
}

type RollbackRequest struct {
	Version int `json:"version"`
}

type CreateSnapshotRequest struct {
	SnapshotName string `json:"snapshot_name"`
}

type CreateSnapshotResponse struct {
	ContainerID         string `json:"container_id"`
	SnapshotName        string `json:"snapshot_name"`
	RuntimeSnapshotName string `json:"runtime_snapshot_name"`
	DisplayName         string `json:"display_name"`
	Snapshotter         string `json:"snapshotter"`
	Version             int    `json:"version"`
	Source              string `json:"source"`
}

type SnapshotInfo struct {
	Snapshotter string            `json:"snapshotter"`
	Name        string            `json:"name"`
	DisplayName string            `json:"display_name,omitempty"`
	RuntimeName string            `json:"runtime_snapshot_name"`
	Parent      string            `json:"parent,omitempty"`
	Kind        string            `json:"kind"`
	CreatedAt   time.Time         `json:"created_at,omitempty"`
	UpdatedAt   time.Time         `json:"updated_at,omitempty"`
	Labels      map[string]string `json:"labels,omitempty"`
	Source      string            `json:"source"`
	Managed     bool              `json:"managed"`
	Version     *int              `json:"version,omitempty"`
}

type ListSnapshotsResponse struct {
	Snapshotter string         `json:"snapshotter"`
	Snapshots   []SnapshotInfo `json:"snapshots"`
}
func NewContainerdHandler(log *slog.Logger, manager *workspace.Manager, cfg config.WorkspaceConfig, containerBackend string, botService *bots.Service, accountService *accounts.Service, policyService *policy.Service) *ContainerdHandler {
	h := &ContainerdHandler{
		manager:          manager,
		cfg:              cfg,
		containerBackend: containerBackend,
		logger:           log.With(slog.String("handler", "containerd")),
		mcpSess:          make(map[string]*mcpSession),
		mcpStdioSess:     make(map[string]*mcpStdioSession),
		botService:       botService,
		accountService:   accountService,
		policyService:    policyService,
	}
	return h
}
func (h *ContainerdHandler) Register(e *echo.Echo) {
	group := e.Group("/bots/:bot_id/container")
	group.POST("", h.CreateContainer)
	group.GET("", h.GetContainer)
	group.DELETE("", h.DeleteContainer)
	group.POST("/start", h.StartContainer)
	group.POST("/stop", h.StopContainer)
	group.POST("/snapshots", h.CreateSnapshot)
	group.GET("/snapshots", h.ListSnapshots)
	group.POST("/snapshots/rollback", h.RollbackSnapshot)
	group.POST("/data/export", h.ExportContainerData)
	group.POST("/data/import", h.ImportContainerData)
	group.POST("/data/restore", h.RestorePreservedData)
	group.GET("/skills", h.ListSkills)
	group.POST("/skills", h.UpsertSkills)
	group.DELETE("/skills", h.DeleteSkills)

	// Terminal routes
	group.GET("/terminal", h.GetTerminalInfo)
	group.GET("/terminal/ws", h.HandleTerminalWS)

	// File manager routes
	group.GET("/fs", h.FSStat)
	group.GET("/fs/list", h.FSList)
	group.GET("/fs/read", h.FSRead)
	group.GET("/fs/download", h.FSDownload)
	group.POST("/fs/write", h.FSWrite)
	group.POST("/fs/upload", h.FSUpload)
	group.POST("/fs/mkdir", h.FSMkdir)
	group.POST("/fs/delete", h.FSDelete)
	group.POST("/fs/rename", h.FSRename)

	root := e.Group("/bots/:bot_id")
	root.POST("/mcp-stdio", h.CreateMCPStdio)
	root.POST("/mcp-stdio/:connection_id", h.HandleMCPStdio)
	root.POST("/tools", h.HandleMCPTools)
}
// CreateContainer godoc
// @Summary Create and start workspace container for bot
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Param payload body CreateContainerRequest true "Create container payload"
// @Success 200 {object} CreateContainerResponse "SSE stream of container creation events"
// @Failure 400 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container [post].
func (h *ContainerdHandler) CreateContainer(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	var req CreateContainerRequest
	if err := c.Bind(&req); err != nil {
		return echo.NewHTTPError(http.StatusBadRequest, err.Error())
	}
	// Image override lets administrators specify a custom base image.
	// NOTE(saas): if this becomes a multi-tenant SaaS, image override must be
	// validated against an allowlist to prevent SSRF and resource abuse.
	ctx := c.Request().Context()
	imageOverride := strings.TrimSpace(req.Image)
	image, err := h.manager.ResolveWorkspaceImage(ctx, botID)
	if err != nil {
		h.logger.Error("resolve workspace image failed",
			slog.String("bot_id", botID), slog.Any("error", err))
		// The response is not committed yet, so surface a real error
		// instead of silently returning an empty 200.
		return echo.NewHTTPError(http.StatusInternalServerError, "resolve workspace image failed")
	}
	if imageOverride != "" {
		image = config.NormalizeImageRef(imageOverride)
	}
	snapshotter := strings.TrimSpace(req.Snapshotter)
	if snapshotter == "" {
		snapshotter = h.cfg.Snapshotter
	}
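	// Probe for streaming support before any write: once the 200 header
	// is committed, a later failure could no longer change the status code.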
	flusher, ok := c.Response().Writer.(http.Flusher)
	if !ok {
		return echo.NewHTTPError(http.StatusInternalServerError, "streaming not supported")
	}
	c.Response().Header().Set(echo.HeaderContentType, "text/event-stream")
	c.Response().Header().Set(echo.HeaderCacheControl, "no-cache")
	c.Response().Header().Set(echo.HeaderConnection, "keep-alive")
	c.Response().WriteHeader(http.StatusOK)
	writer := bufio.NewWriter(c.Response().Writer)
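	// send serializes writes to the shared SSE writer: pull progress
	// callbacks can fire concurrently with the main handler flow.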
	var mu sync.Mutex
	send := func(payload any) {
		mu.Lock()
		defer mu.Unlock()
		data, err := json.Marshal(payload)
		if err != nil {
			return
		}
		_ = writeSSEData(writer, flusher, string(data))
	}
	sendError := func(msg string) {
		send(createContainerErrorEvent{Type: "error", Message: msg})
	}

	// Phase 1: Pull image with progress
	send(createContainerPullingEvent{Type: "pulling", Image: image})
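	// Atomic guard: drop any pull_progress callback that arrives after
	// PullImage has returned, so a late event cannot trail the next phase.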
	var pullDone atomic.Bool
	_, pullErr := h.manager.PullImage(ctx, image, &ctr.PullImageOptions{
		Unpack:      true,
		Snapshotter: snapshotter,
		OnProgress: func(p ctr.PullProgress) {
			if pullDone.Load() {
				return
			}
			send(createContainerPullProgressEvent{Type: "pull_progress", Layers: p.Layers})
		},
	})
	pullDone.Store(true)
	if pullErr != nil {
		h.logger.Error("image pull failed",
			slog.String("image", image), slog.Any("error", pullErr))
		sendError("image pull failed: " + pullErr.Error())
		return nil
	}

	// Phase 2: Create container (image is local, should be fast)
	send(createContainerCreatingEvent{Type: "creating"})
	// Notify the client before starting if data migration will happen,
	// since restoring a large /data volume can take a while.
	if h.manager.HasPreservedData(botID) {
		send(createContainerRestoringEvent{Type: "restoring"})
	}
	if err := h.manager.StartWithResolvedImage(ctx, botID, image); err != nil {
		h.logger.Error("container start failed",
			slog.String("bot_id", botID), slog.Any("error", err))
		sendError("container start failed: " + err.Error())
		return nil
	}
	if err := h.manager.RememberWorkspaceImage(ctx, botID, image); err != nil {
		h.logger.Warn("remember workspace image failed",
			slog.String("bot_id", botID), slog.String("image", image), slog.Any("error", err))
	}
	containerID, err := h.manager.ContainerID(ctx, botID)
	if err != nil {
		h.logger.Error("container ID resolution failed after start",
			slog.String("bot_id", botID), slog.Any("error", err))
		sendError("container ID resolution failed: " + err.Error())
		return nil
	}
	dataRestored := false
	if req.RestoreData && h.manager.HasPreservedData(botID) {
		if err := h.manager.RestorePreservedData(ctx, botID); err != nil {
			h.logger.Error("restore preserved data failed",
				slog.String("bot_id", botID), slog.Any("error", err))
			sendError("restore preserved data failed: " + err.Error())
			return nil
		}
		dataRestored = true
	}
	h.manager.RecordContainerRunning(ctx, botID, containerID, image)

	// Phase 3: Complete
	send(createContainerCompleteEvent{
		Type: "complete",
		Container: CreateContainerResponse{
			ContainerID:      containerID,
			Image:            image,
			Snapshotter:      snapshotter,
			Started:          true,
			DataRestored:     dataRestored,
			HasPreservedData: h.manager.HasPreservedData(botID),
		},
	})
	return nil
}
// GetContainer godoc
// @Summary Get container info for bot
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Success 200 {object} GetContainerResponse
// @Failure 404 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container [get].
func (h *ContainerdHandler) GetContainer(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	status, err := h.manager.GetContainerInfo(c.Request().Context(), botID)
	if err != nil {
		if errors.Is(err, workspace.ErrContainerNotFound) {
			return echo.NewHTTPError(http.StatusNotFound, "container not found for bot")
		}
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.JSON(http.StatusOK, GetContainerResponse{
		ContainerID:      status.ContainerID,
		Image:            status.Image,
		Status:           status.Status,
		Namespace:        status.Namespace,
		ContainerPath:    status.ContainerPath,
		TaskRunning:      status.TaskRunning,
		HasPreservedData: status.HasPreservedData,
		Legacy:           status.Legacy,
		CreatedAt:        status.CreatedAt,
		UpdatedAt:        status.UpdatedAt,
	})
}
// DeleteContainer godoc
// @Summary Delete workspace container for bot
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Param preserve_data query bool false "Export /data before deletion"
// @Success 204
// @Failure 404 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container [delete].
func (h *ContainerdHandler) DeleteContainer(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	preserveData := c.QueryParam("preserve_data") == "true"
	if err := h.manager.CleanupBotContainer(c.Request().Context(), botID, preserveData); err != nil {
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.NoContent(http.StatusNoContent)
}
// StartContainer godoc
// @Summary Start container task for bot
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Success 200 {object} object
// @Failure 404 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container/start [post].
func (h *ContainerdHandler) StartContainer(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if err := h.manager.EnsureRunning(c.Request().Context(), botID); err != nil {
		if errors.Is(err, workspace.ErrContainerNotFound) {
			return echo.NewHTTPError(http.StatusNotFound, "container not found for bot")
		}
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.JSON(http.StatusOK, map[string]bool{"started": true})
}

// StopContainer godoc
// @Summary Stop container task for bot
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Success 200 {object} object
// @Failure 404 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container/stop [post].
func (h *ContainerdHandler) StopContainer(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if err := h.manager.StopBot(c.Request().Context(), botID); err != nil {
		if errors.Is(err, workspace.ErrContainerNotFound) {
			return echo.NewHTTPError(http.StatusNotFound, "container not found for bot")
		}
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.JSON(http.StatusOK, map[string]bool{"stopped": true})
}
// CreateSnapshot godoc
// @Summary Create container snapshot for bot
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Param payload body CreateSnapshotRequest true "Create snapshot payload"
// @Success 200 {object} CreateSnapshotResponse
// @Failure 404 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Failure 501 {object} ErrorResponse "Snapshots currently not supported on this backend"
// @Router /bots/{bot_id}/container/snapshots [post].
func (h *ContainerdHandler) CreateSnapshot(c echo.Context) error {
	if h.containerBackend == "apple" {
		return echo.NewHTTPError(http.StatusNotImplemented, "snapshots currently not supported on Apple Container backend")
	}
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if h.manager == nil {
		return echo.NewHTTPError(http.StatusInternalServerError, "snapshot manager not configured")
	}
	var req CreateSnapshotRequest
	if err := c.Bind(&req); err != nil {
		return echo.NewHTTPError(http.StatusBadRequest, err.Error())
	}
	created, err := h.manager.CreateSnapshot(c.Request().Context(), botID, req.SnapshotName, workspace.SnapshotSourceManual)
	if err != nil {
		if errdefs.IsNotFound(err) {
			return echo.NewHTTPError(http.StatusNotFound, "container not found")
		}
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.JSON(http.StatusOK, CreateSnapshotResponse{
		ContainerID:         created.ContainerID,
		SnapshotName:        created.SnapshotName,
		RuntimeSnapshotName: created.RuntimeSnapshotName,
		DisplayName:         created.DisplayName,
		Snapshotter:         created.Snapshotter,
		Version:             created.Version,
		Source:              workspace.SnapshotSourceManual,
	})
}
// ListSnapshots godoc
// @Summary List snapshots
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Param snapshotter query string false "Snapshotter name"
// @Success 200 {object} ListSnapshotsResponse
// @Failure 400 {object} ErrorResponse "Snapshotter does not match container snapshotter"
// @Failure 404 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Failure 501 {object} ErrorResponse "Snapshots currently not supported on this backend"
// @Router /bots/{bot_id}/container/snapshots [get].
func (h *ContainerdHandler) ListSnapshots(c echo.Context) error {
	if h.containerBackend == "apple" {
		return echo.NewHTTPError(http.StatusNotImplemented, "snapshots currently not supported on Apple Container backend")
	}
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if h.manager == nil {
		return echo.NewHTTPError(http.StatusInternalServerError, "snapshot manager not configured")
	}
	data, err := h.manager.ListBotSnapshotData(c.Request().Context(), botID)
	if err != nil {
		if errdefs.IsNotFound(err) {
			return echo.NewHTTPError(http.StatusNotFound, "container not found")
		}
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	if req := strings.TrimSpace(c.QueryParam("snapshotter")); req != "" && req != data.Snapshotter {
		return echo.NewHTTPError(http.StatusBadRequest, "snapshotter does not match container snapshotter")
	}
	snapshotKey := strings.TrimSpace(data.Info.SnapshotKey)
	if snapshotKey == "" {
		return echo.NewHTTPError(http.StatusInternalServerError, "container snapshot key is empty")
	}
	runtimeByName := make(map[string]ctr.SnapshotInfo, len(data.RuntimeSnapshots))
	for _, info := range data.RuntimeSnapshots {
		name := strings.TrimSpace(info.Name)
		if name == "" {
			continue
		}
		runtimeByName[name] = info
	}
	lineage, ok := snapshotLineage(snapshotKey, data.RuntimeSnapshots)
	if !ok {
		h.logger.Warn("container snapshot chain root not found",
			slog.String("container_id", data.ContainerID),
			slog.String("snapshotter", data.Snapshotter),
			slog.String("snapshot_key", snapshotKey),
		)
		return echo.NewHTTPError(http.StatusInternalServerError, "container snapshot chain not found")
	}
	items := make([]SnapshotInfo, 0, len(lineage)+len(data.ManagedMeta))
	seen := make(map[string]struct{}, len(lineage)+len(data.ManagedMeta))
	appendRuntime := func(runtimeInfo ctr.SnapshotInfo, fallbackSource string, meta *workspace.ManagedSnapshotMeta) {
		source := fallbackSource
		managed := false
		var version *int
		displayName := ""
		if meta != nil {
			if meta.Source != "" {
				source = meta.Source
			}
			managed = true
			version = meta.Version
			displayName = strings.TrimSpace(meta.DisplayName)
		}
		name := displayName
		if name == "" {
			if version != nil {
				name = fmt.Sprintf("Version %d", *version)
			} else {
				name = runtimeInfo.Name
			}
		}
		items = append(items, SnapshotInfo{
			Snapshotter: data.Snapshotter,
			Name:        name,
			DisplayName: displayName,
			RuntimeName: runtimeInfo.Name,
			Parent:      runtimeInfo.Parent,
			Kind:        runtimeInfo.Kind,
			CreatedAt:   runtimeInfo.Created,
			UpdatedAt:   runtimeInfo.Updated,
			Labels:      runtimeInfo.Labels,
			Source:      source,
			Managed:     managed,
			Version:     version,
		})
		seen[strings.TrimSpace(runtimeInfo.Name)] = struct{}{}
	}
	for _, runtimeInfo := range lineage {
		name := strings.TrimSpace(runtimeInfo.Name)
		if meta, hasMeta := data.ManagedMeta[name]; hasMeta {
			appendRuntime(runtimeInfo, "image_layer", &meta)
			continue
		}
		appendRuntime(runtimeInfo, "image_layer", nil)
	}
	for name, meta := range data.ManagedMeta {
		if _, exists := seen[name]; exists {
			continue
		}
		runtimeInfo, exists := runtimeByName[name]
		if !exists {
			h.logger.Warn("managed snapshot not found in runtime",
				slog.String("container_id", data.ContainerID),
				slog.String("snapshot_name", name),
				slog.String("snapshotter", data.Snapshotter),
			)
			continue
		}
		appendRuntime(runtimeInfo, "managed", &meta)
	}
	sort.Slice(items, func(i, j int) bool {
		if items[i].CreatedAt.Equal(items[j].CreatedAt) {
			return items[i].Name < items[j].Name
		}
		return items[i].CreatedAt.Before(items[j].CreatedAt)
	})
	return c.JSON(http.StatusOK, ListSnapshotsResponse{
		Snapshotter: data.Snapshotter,
		Snapshots:   items,
	})
}
// RollbackSnapshot godoc
// @Summary Rollback container to a previous snapshot version
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Param payload body RollbackRequest true "Rollback payload"
// @Success 200 {object} object
// @Failure 400 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container/snapshots/rollback [post].
func (h *ContainerdHandler) RollbackSnapshot(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if h.manager == nil {
		return echo.NewHTTPError(http.StatusInternalServerError, "manager not configured")
	}
	var req RollbackRequest
	if err := c.Bind(&req); err != nil {
		return echo.NewHTTPError(http.StatusBadRequest, "invalid request body")
	}
	if req.Version < 1 {
		return echo.NewHTTPError(http.StatusBadRequest, "version must be >= 1")
	}
	if err := h.manager.RollbackVersion(c.Request().Context(), botID, req.Version); err != nil {
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.JSON(http.StatusOK, map[string]any{"rolled_back_to": req.Version})
}
// ExportContainerData godoc
// @Summary Export container /data as a tar.gz archive
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Produce application/gzip
// @Success 200 {file} file
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container/data/export [post].
func (h *ContainerdHandler) ExportContainerData(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if h.manager == nil {
		return echo.NewHTTPError(http.StatusInternalServerError, "manager not configured")
	}
	reader, err := h.manager.ExportData(c.Request().Context(), botID)
	if err != nil {
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	defer func() { _ = reader.Close() }()
	c.Response().Header().Set("Content-Type", "application/gzip")
	c.Response().Header().Set("Content-Disposition", `attachment; filename="`+botID+`-data.tar.gz"`)
	c.Response().WriteHeader(http.StatusOK)
	_, err = io.Copy(c.Response(), reader)
	return err
}
// ImportContainerData godoc
// @Summary Import a tar.gz archive into container /data
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Accept multipart/form-data
// @Param file formData file true "tar.gz archive"
// @Success 200 {object} object
// @Failure 400 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container/data/import [post].
func (h *ContainerdHandler) ImportContainerData(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if h.manager == nil {
		return echo.NewHTTPError(http.StatusInternalServerError, "manager not configured")
	}
	file, err := c.FormFile("file")
	if err != nil {
		return echo.NewHTTPError(http.StatusBadRequest, "file is required")
	}
	src, err := file.Open()
	if err != nil {
		return echo.NewHTTPError(http.StatusBadRequest, "failed to open uploaded file")
	}
	defer func() { _ = src.Close() }()
	if err := h.manager.ImportData(c.Request().Context(), botID, src); err != nil {
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.JSON(http.StatusOK, map[string]bool{"imported": true})
}
// RestorePreservedData godoc
// @Summary Restore previously preserved data into container
// @Tags containerd
// @Param bot_id path string true "Bot ID"
// @Success 200 {object} object
// @Failure 404 {object} ErrorResponse
// @Failure 500 {object} ErrorResponse
// @Router /bots/{bot_id}/container/data/restore [post].
func (h *ContainerdHandler) RestorePreservedData(c echo.Context) error {
	botID, err := h.requireBotAccess(c)
	if err != nil {
		return err
	}
	if h.manager == nil {
		return echo.NewHTTPError(http.StatusInternalServerError, "manager not configured")
	}
	if !h.manager.HasPreservedData(botID) {
		return echo.NewHTTPError(http.StatusNotFound, "no preserved data found")
	}
	if err := h.manager.RestorePreservedData(c.Request().Context(), botID); err != nil {
		return echo.NewHTTPError(http.StatusInternalServerError, err.Error())
	}
	return c.JSON(http.StatusOK, map[string]bool{"restored": true})
}
func snapshotLineage(root string, all []ctr.SnapshotInfo) ([]ctr.SnapshotInfo, bool) {
	root = strings.TrimSpace(root)
	if root == "" {
		return nil, false
	}
	index := make(map[string]ctr.SnapshotInfo, len(all))
	for _, info := range all {
		name := strings.TrimSpace(info.Name)
		if name == "" {
			continue
		}
		index[name] = info
	}
	if _, ok := index[root]; !ok {
		return nil, false
	}
	lineage := make([]ctr.SnapshotInfo, 0, len(index))
	visited := make(map[string]struct{}, len(index))
	current := root
	for current != "" {
		if _, seen := visited[current]; seen {
			break
		}
		info, ok := index[current]
		if !ok {
			break
		}
		lineage = append(lineage, info)
		visited[current] = struct{}{}
		current = strings.TrimSpace(info.Parent)
	}
	return lineage, true
}
// ---------- auth helpers ----------

// requireBotAccess extracts bot_id from path, validates user auth, and authorizes bot access.
func (h *ContainerdHandler) requireBotAccess(c echo.Context) (string, error) {
	channelIdentityID, err := h.requireChannelIdentityID(c)
	if err != nil {
		return "", err
	}
	botID := strings.TrimSpace(c.Param("bot_id"))
	if botID == "" {
		return "", echo.NewHTTPError(http.StatusBadRequest, "bot id is required")
	}
	if _, err := h.authorizeBotAccess(c.Request().Context(), channelIdentityID, botID); err != nil {
		return "", err
	}
	return botID, nil
}

func (*ContainerdHandler) requireChannelIdentityID(c echo.Context) (string, error) {
	return RequireChannelIdentityID(c)
}

func (h *ContainerdHandler) authorizeBotAccess(ctx context.Context, channelIdentityID, botID string) (bots.Bot, error) {
	return AuthorizeBotAccess(ctx, h.botService, h.accountService, channelIdentityID, botID)
}

// requireBotAccessWithGuest is like requireBotAccess but also allows guest access
// via ACL when the caller explicitly opts into guest-compatible access.
func (h *ContainerdHandler) requireBotAccessWithGuest(c echo.Context) (string, error) {
	channelIdentityID, err := h.requireChannelIdentityID(c)
	if err != nil {
		return "", err
	}
	botID := strings.TrimSpace(c.Param("bot_id"))
	if botID == "" {
		return "", echo.NewHTTPError(http.StatusBadRequest, "bot id is required")
	}
	if _, err := AuthorizeBotAccess(c.Request().Context(), h.botService, h.accountService, channelIdentityID, botID); err != nil {
		return "", err
	}
	return botID, nil
}