---
title: Be Civic — Privacy and PII Protection
type: spec
status: v0.5.3 — post-2026-05-13 Path Directory V0 implementation reconciliation
date: 2026-05-12
parent_spec: ./README.md
sibling_specs:
  - architecture.md
  - protocol.md
  - schemas.md
  - lifecycle.md
  - skills.md
  - website.md
tags: ["be-civic", "bc-internal", "architecture"]
---

# Be Civic — Privacy and PII Protection

This sub-spec covers every privacy and PII protection mechanism in Be Civic: the trust boundary between consumer and server (§8.1), the submission contract (§8.2), the receiving-end ingestion pipeline and its validation steps (§8.3), the Cloudflare reference implementation and its security details (§8.4), the NER-on-commit held-for-review path (§8.5), the incident response for PII that slips through all gates (§8.6), the consumer-side state contract and its 16-axis `profile.json` catalogue (§8.7), retention and deletion semantics for local state (§8.8), the document-content-discard rule (§8.9), and the anonymous-by-construction structural reinforcement rules (§8.10).

PII protection in Be Civic is structural, not promissory. The schema-level field bans (§6.2 in `schemas.md`), the scrub rules file (§6.8 in `schemas.md`), and the mechanisms in this section together form a layered defence. For the promotion thresholds and rollback mechanics that also interact with IP-hash salting, see `lifecycle.md`.

## 8. PII protection

### 8.1 Trust boundary

The submitting (consumer-side) agent is the **only entity that knows what is identifying for the user it serves**. The receiving end has no user context and cannot perform context-aware scrub. Consequently:

- **Primary scrub: consumer-side**, including any LLM-based judgment
- **Defence in depth: receiving-side, deterministic only** (regex + NER, no LLM)

This placement of the LLM gate at the consumer end (and **only** there) is non-negotiable.
The receiving end never runs an LLM on submission content, which eliminates the prompt-injection surface, API key dependencies, and per-submission cost.

PII is structurally prevented from reaching the corpus by:

- Schema-level ban on identity-shaped fields (per §6.2 (see schemas.md))
- Hard length caps on free-text fields (per §6.2 (see schemas.md))
- Three-stage scrub: consumer pre-flight, Worker hard-gate, NER on commit (per §6.8 (see schemas.md))
- Salted hashed per-IP correlation only (daily-rotating salt for rate limits; per-proposal salt for state-machine bookkeeping); no plaintext IP storage (per §3 (see architecture.md) principle 4)
- No request-body logging (per §3 (see architecture.md) principle 4)

### 8.2 Submission contract — global, versioned

The submission contract is a **global** document at `docs/submission-contract-v.mdx`. Every skill carries `submission_contract_version` in frontmatter pointing at the version the consuming agent must follow when submitting. The contract content lives in the contract file, which is being authored in parallel to this rework; this section describes the contract's role and structure.

**Contract role.** The contract is the single source of truth for what a consumer AI must do at session start, before submitting any of the four submission types, and after submitting. It is canonical: skills do not paraphrase it. Per-skill overrides are permitted but must be additive, never a replacement.
**Contract structure** (per the parallel rewrite):

- Session start (one-off framed message; opt-out semantics; conversation-language detection)
- Capability self-classification (against §6.7 (see schemas.md) tiers; recommend stepping up if below; advice-only mode otherwise)
- Pre-flight validation (consumer-side scrub: regex + LLM contextual; cross-ref script via `tool_execution` when capable; rules checklist when not)
- Submission type sections (one per type — observation, skill_amendment, skill_draft, validation — covering when to submit which, schema details, reference assembly)
- Alpha / beta UX (banner copy; first-validator transparency wording per G.8 / G.9; falling back to previous stable on rejection)
- Cancellation (DELETE within 24h; multi-device gap acknowledged)
- Submissions log: project-local at `/.be-civic/submissions.jsonl` when the agent is writing files for the task, else `/be-civic/submissions.jsonl` as fallback (user can review and cancel)
- Capability-mismatch and filesystem-less behaviour (advice-only mode)
- Language handling (skills read in English; user-facing prose in conversation language; citations resolved per G.13 multilingual rules; commune correspondence language is the user-choice exception)

**Alpha / beta UX excerpts** (canonical wording lives in the contract; reproduced here so the spec is self-contained):

When loading an alpha skill that has a previous stable (G.9):

> "Note: I'm using an alpha version of this skill — meaning a recent change is still being validated. Your session helps validate it. If anything goes wrong with the new content, I'll fall back to the previous stable version (last verified [date])."

When loading a brand-new alpha skill with no previous stable (G.8):

> "This skill is brand-new and unvalidated — your session is among the first to use it. We'll proceed with low confidence and I'll flag anything that doesn't match what you experience.
If something fails, we have nothing to fall back to except checking with the relevant authority directly."

The agent files higher-grade observations and a `validation` event at session end on brand-new alpha.

### 8.3 Receiving-end ingestion pipeline

The receiving end is **not GitHub Issues**. Submissions go to a staging service that holds them privately for 24 hours before committing. This avoids requiring GitHub accounts and gives genuine cancellation semantics.

**Endpoint table (post-2026-05-15 taxonomy normalization):**

| Endpoint | Purpose | Required capabilities (consumer must self-declare) |
|----------|---------|----------------------------------------------------|
| `POST /api/feedback` | **Recommended primary**: polymorphic envelope; submit one or more items in a single request | union of per-item types' capabilities |
| `POST /api/concerns` | Submit a `concern` (per-type escape hatch; renamed from `/api/observations`) | `multi_turn`, `structured_output` |
| `POST /api/amendments` | Submit an `amendment` (now covers skill / volatile_value / reference / path / path_source via `target_type`; renamed and unified from `/api/skill-amendments` + `/api/path-amendments`) | `multi_turn`, `structured_output` (+ `web_fetch`, `tool_execution` for target_type=skill / path / path_source) |
| `POST /api/drafts` | Submit a `draft` (now covers skill + path via `target_type`; renamed and unified from `/api/skill-drafts` + `/api/path-drafts`) | `multi_turn`, `structured_output`, `web_fetch`, `tool_execution`, `file_read` |
| `POST /api/validations` | Submit a `validation` (polymorphic over all six `target_type` values; absorbs the prior `/api/path-validations`) | `multi_turn`, `structured_output` (+ `web_fetch`, `tool_execution` for non-observation target_types) |
| `POST /api/feedback-channel` | Submit a `feedback` (new free-text channel; operator-private triage in v1) | `multi_turn`, `structured_output` |
| `POST /api/ratings` | Submit a `rating` (sprint 2026-W23 Lock A; opt-in three-axis stars) | `multi_turn`, `structured_output` |
| `POST /api/analytics` | Submit `analytics` (opt-in session lifecycle telemetry) | `multi_turn`, `structured_output` |
| `DELETE //{id}` | Cancel a staged submission | bearer `cancel_token` |
| `GET //{id}` | Status query (no body content) | none |
| `GET /api/feedback/sessions/` | List committed items under that session_id (anonymous read; recovery key per S61 reversal) | none |
| `GET /api/skills//concerns` | RESTful alias for `GET /api/concerns?skill=` (renamed from `/api/skills//observations`) | none |

**Legacy routes removed (pre-launch hard cutover).** `POST /api/observations`, `POST /api/skill-amendments`, `POST /api/skill-drafts`, `POST /api/path-amendments`, `POST /api/path-drafts`, `POST /api/path-validations` are deleted in the same PR that adds the new ones. No aliases, no 30-day grace. Pre-launch context: there is no installed base of agents calling the legacy routes.

#### 8.3.0a Recommended primary: `POST /api/feedback`

`POST /api/feedback` is the recommended primary submission surface. It accepts a polymorphic envelope carrying one or more items across the five feedback types + rating in a single request, allowing agents to batch a session's submissions into one round-trip rather than separate per-type posts. The per-type endpoints above remain operational (escape hatches); the envelope is the primary tool.

**Schema URL:** `https://becivic.be/schemas/feedback-envelope.schema.json`.

**Envelope shape.** Top-level fields:

- `schema_version: 1`
- `session_id` — agent-chosen UUIDv7 (`ses_`); also the recovery key for `GET /api/feedback/sessions/`. Per the 2026-05-15 **S61 reversal**, `session_id` is the recovery key end-to-end; the recovery_token component is dropped.
- `submitted_at` — ISO-8601 envelope-level timestamp (per-item `submitted_at` is permitted and overrides for that item)
- `submitting_agent`, `submission_contract_version`, `declared_capabilities` — moved up from per-item; declared once per envelope
- `mode: "validate" | "stage"` — controls the per-item pipeline (see below)
- `items[]` — array of per-item submissions; each item carries a `type` discriminator (`concern` | `amendment` | `validation` | `draft` | `feedback` | `rating`) and the type-specific body. `context` is **per-item**, not envelope-level. (Pre-2026-05-15 the discriminator enum was `observation | validation | skill_amendment | skill_draft`; renamed per the taxonomy normalization. `analytics` is NOT in the envelope; analytics has its own dedicated endpoint.)

**Two-call pattern (`mode`).** The contract is a deliberate two-call flow:

1. **Validate first** (`mode: "validate"`): runs the full validation pipeline per item (schema, identity-field guard, capability tier, regex scrub, cross-ref against canonical state) but does **not** stage. Each item returns `{idx, type, ok, status: "validated", would_stage_for: }` on pass or `{idx, type, ok: false, status: "rejected", error, schema_pointer, missing}` on fail.
2. **Stage** (`mode: "stage"`): runs the full pipeline including stage / commit. Per-item response is the same shape as the per-type endpoints: `{idx, type, ok, status: "staged", id, cancel_token, commit_eta}` for `concern` / `amendment` / `draft` / `feedback` / `rating`; `{idx, type, ok, status: "applied", id, applied_at}` for `validation` (which writes directly to D1, per §6.2.3 (see schemas.md)); `{idx, type, ok, status: "duplicate"}` on idempotent re-POST.

**`?dry_run=1` query alias.** `?dry_run=1` is a backwards-compat alias for `mode: "validate"`. If both query and body are present, the **body's `mode` wins**.
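The precedence rule and the hand-off between the two calls can be illustrated with a small sketch. It is illustrative only: `resolveMode`, `selectValidated`, and `ItemResult` are hypothetical names, the `null` return for "no mode supplied" is an assumption (the spec does not define a default), and treating `idx` as a zero-based index into `items[]` is also an assumption.

```typescript
type Mode = "validate" | "stage";

// Hypothetical helper: picks the effective pipeline mode for one envelope.
// `bodyMode` is the envelope's `mode` field; `dryRunParam` is the raw value
// of the `?dry_run=` query parameter (null when absent).
function resolveMode(bodyMode: unknown, dryRunParam: string | null): Mode | null {
  // The body's `mode` wins whenever it is present and valid.
  if (bodyMode === "validate" || bodyMode === "stage") return bodyMode;
  // `?dry_run=1` is only a backwards-compat alias for mode "validate".
  if (dryRunParam === "1") return "validate";
  // Neither supplied: no default is assumed here, so report "unset".
  return null;
}

// Subset of the per-item result shape returned by either call.
interface ItemResult {
  idx: number; // assumed: zero-based position in the submitted items[]
  type: string;
  ok: boolean;
  status: string;
}

// Hypothetical client-side step between the two calls: keep only the items
// the validate pass accepted, so the stage call carries no known failures.
function selectValidated<T>(items: T[], results: ItemResult[]): T[] {
  const passed = new Set(
    results.filter((r) => r.ok && r.status === "validated").map((r) => r.idx),
  );
  return items.filter((_, i) => passed.has(i));
}
```

An agent following this pattern would post once with `mode: "validate"`, repair or drop the rejected items, and post the survivors with `mode: "stage"`.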
**Per-item independence.** The HTTP response is **always 200** with `{results: [{idx, type, ok, status, ...}]}` whenever the envelope itself is well-formed — even when individual items failed schema, identity, capability, scrub, or cross-ref. Each item is processed independently; one item's rejection does not abort the others. Envelope-level 4xx is reserved for: malformed JSON (400), missing top-level envelope fields (400 with `error: "schema_fail"`, `missing: `), failed auth, or rate limit (429). Empty `items: []` is permitted and returns `{results: []}` with HTTP 200.

**Per-item idempotency.** Per-item dedup keys off the type-specific id field (`concern_id`, `validation_id`, `amendment_id`, `draft_id` with `proposed_id`, `feedback_id`, `rating_id`). Resubmitting the same envelope (or any envelope containing an item with a previously-seen id) returns `{idx, type, ok: true, status: "duplicate"}` for that item — no double-stage, no double-D1-INSERT. Idempotency is per-item.

**Pre-launch hard cutover.** No backward-compatibility shim for the pre-amendment `items[].type` enum (`observation | skill_amendment | skill_draft | path_amendment | path_draft`); the dispatcher's `ITEM_TYPES` set is rewired to the new 6-type enum (`concern, amendment, validation, draft, feedback, rating`) in the same PR that lands the new schemas. Legacy item types fail `schema_fail` at the gate after cutover.

**Validation pipeline at submission** (in the Worker):

1. **Parse JSON** from request body
2. **Schema validation** against the appropriate schema (`concern` / `amendment` / `draft` / `validation` / `feedback-channel` / `rating`)
3. **Identity-field ban check** — reject if any identity-shaped field is present (§6.2 (see schemas.md), defensive even if not declared in the schema)
4. **Capability check** — declared capabilities must include all required for the endpoint per §6.7 (see schemas.md); reject 4xx on miss
5. **Regex scrub** — apply every rule in `tools/scrub/regex-rules.json` to every string field; reject 4xx on any hit
6. **Cross-reference validation against canonical state** (Worker fetches required canonical resources from latest `main` and queries D1 catalogues as needed):
   - `context.commune` (when non-null) resolves to an entry in `data/communes.json` (concerns). The field is now optional — concerns not bound to a specific commune leave it null.
   - `target_id` resolves against the appropriate target per `target_type` (see §6.2 (see schemas.md) resolution table): `skill` → `skills//canonical.md` on `main`; `volatile_value` / `reference` → D1 rows; `path` / `path_source` → `bc-docs/paths/index.json` catalogue (lookup keys per §6.12.7 — `` for `path`, `:` for `path_source`); `observation` (validation only) → D1 `concerns` row.
   - **`skill_graph` carve-out:** when `type=concern` AND `target_type=skill_graph`, the resolver short-circuits with `{ok: true, resolved_to: "skill_graph_assertion"}` — target_id MAY be empty or a proposed kebab-case skill_id that need not resolve. Cross-ref rejects all other target_types whose target_id fails to resolve.
   - `context.applies_to_match` keys are a subset of the referenced skill's `applies_to` keys
   - For `amendment` (`target_type=skill`), the per-`amendment_subtype` checks (per §6.2.2 (see schemas.md)): `body` — `body_diff` parses as unified diff and applies cleanly against the target skill's current canonical body (`skill_commit` drift check); `frontmatter` — `frontmatter_change.field_path` resolves to a valid field in the target skill's frontmatter schema and `proposed_value` matches the declared type.
   - For `amendment` (`target_type=path | path_source`), the per-`amendment_subtype` checks: `field_edit` — `field_path` resolves to a valid field in the target path / source schema; `source_add` — the `source_add` object validates against `path-source.schema.json` with the matching per-`source_class` template branches per §6.12.3.
   - For `draft`: cross-ref script (`validate-cross-refs.ts`) runs as backstop on the proposed frontmatter and `requires` graph (target_type=skill) or path / source schema (target_type=path); tag uids are left empty by the consumer and filled by PR-CI on the resulting PR (§6.11 (see schemas.md)). The `proposed_id` MUST NOT already exist as a live artefact (the inverse of the standard "must exist" rule).
   - **`cohort_anchor` Worker-stamp.** Between step 6 (cross-ref) and step 7 (timing): for `concern` / `amendment` / `validation` with `target_type ∈ {skill, path}`, the Worker reads the current `version:` from the targeted canonical and writes `cohort_anchor: @` onto the staged row. Agents never carry this field; the schema rejects agent-supplied `cohort_anchor` as `additionalProperty`. Per C1.
7. **Self-validation prevention (validations only)** — Worker fetches the target artefact's submitter-IP-hash (using the per-artefact salt for the artefact's table); reject 4xx if it matches the validator's IP hash (per G.7). For `target_type='observation'`, the lookup is against the concern's own submitter-IP-hash (the upvoter-of-own-concern case). For `target_type='path_source'` the per-artefact salt is scoped to the **path**, not the individual source row — the lookup key is `` extracted from the `:` target_id (see §6.2.3 in `schemas.md` for rationale and the path-creator-salt KV pattern)
8. **Per-IP rate limit check** (per G.6):
   - `validation` submissions: 10/IP/day
   - `validation` with `injection_flag: true`: 2/IP/day
   - All submission types combined: 50/IP/day
   - Above threshold → 429 with `Retry-After` header

**On submission pass:**

- **Dedup check.** If `_id` already exists in KV (or in the recently-committed cache, retained 48h), return the existing record (idempotent re-POST). On dedup, bind to the original submitter's IP-hash; if mismatch, return 409 with `{error: "duplicate_id_different_submitter"}` (per A.1 default).
- **Otherwise:** generate `cancel_token` — 32 random bytes, base64url 43 chars no padding. Store `{payload, submitted_at, commit_eta = max(submitted_at, received_at) + 24h, cancel_token_hash, submitter_ip_hash}` in KV (per A.8 server-stamped `received_at`). Reject if `submitted_at` is >1h ahead of server clock or >7d behind.
- Write a primary KV entry keyed by `_id` (TTL ~48h) and a commit-eta index entry keyed by `commit:::`. The cron job scans the index rather than relying on TTL expiry.
- Return `{_id, cancel_token, commit_eta, staging_window_hours: 24}` to the consumer.

**On submission fail:**

- Return 4xx with `{error: , schema_pointer: }` — naming the category, never echoing the matched substring
- Increment per-IP rejection counter
- No data persisted in the staging KV
- **Worker logging discipline:** the Worker MUST NOT log request bodies or rejection-detail substrings. Permitted log fields at INFO: `_id`, rejection category, response status, request duration. Plaintext IP is NEVER logged; rate-limit counters key on `sha256(ip + daily_salt)` (per G.14, principles 4 and 5)

**Cancellation:**

- `DELETE //{id}` with `Authorization: Bearer ` — Worker hashes the supplied token and compares it against the stored hash in constant time; on match, deletes both KV entries; returns `{cancelled: true}`
- Token mismatch returns 401 (not 404, so the response does not reveal whether the id exists)
- KV partition / unreachable returns 503 with `{error: 'staging_unavailable', retry_after}`
- Cancellation is irreversible
- Consumer DELETE retry policy: queue at `/.be-civic/cancel-retry/.json` when the agent is writing files for the task, else `/be-civic/cancel-retry/.json` as fallback; exponential backoff (60s start, 1h ceiling, full jitter); hard deadline at `commit_eta - 60s`

**Status query (`GET //{id}`, optional):**

- If primary KV entry exists: `{state: "staged", commit_eta}` (no body content)
- If primary KV gone but recently-committed cache hits: `{state: "committed", committed_at}`
- For NER-held submissions: `{state: "ner_held_for_review"}` — and on resolution: `{state: "released" | "released_after_edit" | "discarded"}` (per G.14, principle 1)
- For artefacts pushed into quarantine by a validation's `injection_flag`: `{state: "quarantined", target_type, target_id}` (validations only; relevant when a validation triggered quarantine of its target)
- Otherwise 404

**Status writeback (`PATCH //{id}/status`):**

Used by the `ner-on-commit` GitHub Action and by maintainer-resolution tooling to update a submission's lifecycle state in KV so the status endpoint reflects the held/released/discarded outcome.

- Auth: `Authorization: Bearer ` where the token is minted by the Be Civic GitHub App. The Worker validates the token by calling `GET https://api.github.com/app` and confirming the returned `id` matches `GITHUB_APP_ID`. No new secret required.
- Body: `{"state": ""}` (JSON, `Content-Type: application/json`).
- Allowed transitions (all others rejected with 400 `invalid_transition`):
  - `staged` → `ner_held_for_review` (NER flags a staged submission)
  - `committed` → `ner_held_for_review` (NER flags a just-committed submission)
  - `ner_held_for_review` → `released` (maintainer: false positive)
  - `ner_held_for_review` → `discarded` (maintainer: real PII)
- The endpoint MUST NOT allow `staged` → `committed` (commit is cron-driven only).
- On transition to `ner_held_for_review`: the primary KV record in STAGED_SUBMISSIONS is re-PUT **without** `expirationTtl`, making it permanent. This ensures `GET //{id}` returns the held state after the original 48h staging window expires.
- On `committed` → `ner_held_for_review`: a synthetic StagedRecord is created in STAGED_SUBMISSIONS (permanent) from the COMMITTED_CACHE entry, since the original staged record no longer exists.
- On `released` / `discarded`: the state field is updated on the permanent record.
- Response: `{state: "", previous: ""}` with HTTP 200.

**Commit / PR job (scheduled Worker, every 5 minutes):**

1. Read commit-eta index for entries where `commit_eta <= now`
2. For each ready entry: read primary entry; populate Worker-set fields (`validated_at`, `regex_passes`, `cohort_anchor`); route per submission type:
   - **Concerns** — INSERT into D1 `concerns` table (renamed from `observations` per the 2026-05-15 amendment); D1 auto-assigns `con-NNNNN` uid; concern becomes visible via `` aggregator on the next renderer build (or via fetch-time resolution if rendered on demand). The aggregator walks the skill body's ``/``/`` inline tags and surfaces concerns against the skill itself AND every catalogue / path / source uid the body cites (§6.10 (see schemas.md)).
   - **Amendments — `target_type=skill`** — open a PR against `main` applying the change to `skills//canonical.md`. PR-CI runs validators and orchestrates uid assignment for any newly-introduced `` or `` tags. Auto-merge on green.
   - **Amendments — `target_type=path | path_source`** — open a PR against `main` applying the change to `bc-docs/paths/index.json`. PR-CI runs validators (`path.schema.json` / `path-source.schema.json`, source-class template check). Auto-merge on green.
   - **Amendments — `target_type=volatile_value | reference`** — D1 INSERT-with-supersede directly; NO PR. The state-machine cron reads the threshold table (§9.2 (see lifecycle.md)) and either supersedes the prior row or rolls the amendment back. Fast-path semantics preserved.
   - **Drafts — `target_type=skill`** — open a PR creating `skills//canonical.md` at `status: alpha`. PR-CI runs validators; the maintainer reviews and merges (S31).
   - **Drafts — `target_type=path`** — open a PR inserting the new entry into `bc-docs/paths/index.json` under `paths.` at `status: alpha`. PR-CI runs validators; the maintainer reviews and merges (parallel rule to S31).
   - **Validations** — INSERT into D1 `validations` table immediately on submission (no 24h staging; see §6.2.3 (see schemas.md)). The state machine queries D1 aggregates on its next tick.
   - **Feedback** — INSERT into D1 `feedback_channel` table at commit time. No PR; no public surface; the operator reads the triage queue out-of-band.
   - **Ratings** — INSERT into D1 `ratings` table at commit time. Aggregates surface in `` on the skill canonical (for `target_type=skill`) or in the operator-private analytics surface (for `target_type=agent_protocol | session`). See §6.2.7 (see schemas.md).
3. On commit / PR open success: write `{_id, committed_at_or_pr_opened_at}` to the recently-committed cache (48h TTL); delete primary entry; delete commit-eta index entry.
4. On commit / PR-open failure: leave the entry in place; the next run picks it up. Stamp `last_commit_attempt`. Retry indefinitely; alarm fires if commits stuck >24h beyond `commit_eta`.

**Concurrency.** Cloudflare KV is eventually consistent globally, but the scheduled Worker runs in a single region; race conditions across Worker instances are unlikely. Idempotent commit semantics (dedup at submission, recently-committed cache at GET) make duplicate commits a non-issue.

**KV TTL note.** Cloudflare KV TTL is "soft" — entries are eventually purged but not at exactly TTL time. The commit-eta index pattern above does not rely on TTL expiry for triggering commits.

**Bulk read of canonical files** is via Git / GitHub Contents API on the relevant directory; the Worker's HTTP API is submission-only and does not expose listing.

### 8.4 Cloudflare reference implementation

The v1 staging service runs as a Cloudflare Worker (source in `bc-infra/api/`) for the four POST endpoints, plus a separate scheduled Worker (source in `bc-infra/tools/staging-worker/`) for the commit cron job. Both share a single Cloudflare account, D1 database, and GitHub App. The `becivic.be/` apex is served by the renderer Worker (`bc-infra/site/renderer/`, per §20 (see website.md)); the apex router (`bc-infra/site/router-worker.js`) routes `/api/*` to the staging Worker and everything else to the renderer.
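As a rough illustration of the apex routing rule just described, the decision can be reduced to a pure function over the request path. This is a sketch under assumptions, not the contents of `router-worker.js`: `routeFor` and `Upstream` are hypothetical names, and matching on the literal `/api/` prefix is one plausible reading of `/api/*`.

```typescript
// Hypothetical names; the real router-worker.js is not reproduced in this spec.
type Upstream = "staging" | "renderer";

// Decide which Worker should receive a request, given only its URL pathname.
function routeFor(pathname: string): Upstream {
  // Assumption: `/api/*` means "any path beginning with the segment /api/".
  // A look-alike such as /apiary does not match and falls through.
  if (pathname.startsWith("/api/")) return "staging";
  // Everything else (landing page, /agents, /skills/..., /docs/...) is rendered.
  return "renderer";
}
```

Keeping the rule as a prefix check on a whole path segment avoids accidentally sending renderer paths that merely start with the letters `api` to the staging Worker.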
**Renderer integration.** The renderer pulls from the bc-docs source tree at deploy time, builds `dist/`, and is bound as Cloudflare Workers Static Assets:

- `becivic.be/` — marketing landing rendered from `bc-docs/index.mdx`
- `becivic.be/agents` — agent overview from `bc-docs/agents.mdx` (~40 lines after S52 implementation); per-endpoint pages at `/agents/submit/*`; machine-readable manifest at `/agents/manifest.json` (per §13.1 (see architecture.md), G.12)
- `becivic.be/skills/` — skill bodies served from `skills//canonical.md`. One canonical URL per skill across all `status` values; the `status` frontmatter field drives an in-page banner (§6.1 (see schemas.md)) when the skill is at `draft`, `alpha`, or `beta`.
- `becivic.be/docs/submission-contract-v` — submission contract page
- `becivic.be/llms.txt`, `becivic.be/llms-full.txt` — emitted by the renderer build pipeline; `docs.json.description` injects "AI agents: read /agents before anything else." (per G.4b)
- `mcp.becivic.be` — separate MCP Worker (§23 (see protocol.md)) exposing the API surface as ~6 intent-oriented tools

**Non-stable skill pages carry `noindex: true`** when `status ≠ stable` (the renderer build injects `noindex: true` into the rendered HTML head based on the frontmatter `status` field) so search engines index only stable content.

**Cloudflare carries four roles:**

- (a) the **renderer Worker** at `bc-infra/site/renderer/` (Workers Static Assets binding) serves all human-facing paths;
- (b) the **apex router Worker** at `bc-infra/site/router-worker.js` path-routes between the renderer (default) and the staging Worker (`/api/*`);
- (c) the **staging service Worker** at `bc-infra/api/` handles the four POST endpoints, the DELETE, and the GET status, plus D1 access for catalogues and signals;
- (d) the **scheduled Worker** at `bc-infra/tools/staging-worker/` handles the commit cron job.

The MCP Worker at `mcp.becivic.be` is independently routed via DNS subdomain.
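The status-driven page behaviour described above (banner for non-stable skills, `noindex` when `status ≠ stable`) can be sketched as a single mapping. The names `SkillStatus`, `PageFlags`, and `pageFlags` are hypothetical, and the sketch assumes the only statuses are the four this section mentions.

```typescript
// Assumption: only the four status values named in this section exist.
type SkillStatus = "draft" | "alpha" | "beta" | "stable";

interface PageFlags {
  noindex: boolean; // whether the build should inject `noindex: true`
  banner: boolean; // whether the in-page status banner is shown
}

// Hypothetical build-time helper: derive both flags from frontmatter `status`.
function pageFlags(status: SkillStatus): PageFlags {
  const nonStable = status !== "stable";
  // Both behaviours hinge on the same condition, so they cannot drift apart.
  return { noindex: nonStable, banner: nonStable };
}
```

Deriving both flags from one condition keeps a skill from ever being indexed while it still shows a draft, alpha, or beta banner.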
**Repo layout for serving:**

```
site/                            # Cloudflare Workers Static Assets — marketing landing + apex router
├── index.html                   # bespoke marketing landing (served at /)
├── style.css
├── fonts/                       # self-hosted Manrope woff2 (700, 800), latin subset
├── logo/                        # light.svg, dark.svg
├── favicon.svg
├── router-worker.js             # routes /api/* to staging Worker; everything else to renderer Worker
└── wrangler.toml                # Workers Static Assets binding + routes config

api/                             # Cloudflare Worker — staging service for /api/*
├── worker.ts                    # entry point
├── wrangler.toml
└── routes/
    ├── observations/
    │   ├── index.ts             # POST handler
    │   └── [id]/
    │       ├── index.ts         # DELETE handler
    │       └── status.ts        # GET status
    ├── skill-amendments/{...}   # same pattern
    ├── skill-drafts/{...}
    └── validations/{...}

tools/staging-worker/            # scheduled Worker for the commit job
├── worker.ts                    # cron handler
├── wrangler.toml
└── README.md
```

**Amendment materialisation (fetch-then-materialise).** When the cron commit job processes a `skill_amendment` record, it fetches the canonical skill body from GitHub (`skills//canonical.md` via the Contents API with the installation token) before calling `buildCommitTarget`. The fetched content is passed as `canonical_content`, enabling `buildCommitTarget` to materialise the post-amendment `proposal.md` as a full renderable skill file (frontmatter + applied body), identical in shape to `skill_draft` proposals. The `.meta.json` sidecar preserves the original amendment payload (body_diff / frontmatter_change / references_change) for audit. If the canonical file is not found (404), the cron loop logs a structured error (`canonical_not_found`) and leaves the staged record untouched for operator investigation — no destructive deletion. Transient fetch failures are treated identically: the record stays in place and the next cron tick retries.

**Workers Static Assets and the staging Worker auto-deploy** from GitHub on push to `main` that touches `site/` or `api/` respectively.
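The fetch-then-materialise handling described above boils down to a three-way decision on the outcome of the canonical fetch. The sketch below models only that decision; `FetchOutcome`, `CronAction`, and `planAmendmentCommit` are hypothetical names, and whether transient failures are also logged is not stated in this spec, so no log key is assumed for them.

```typescript
// Hypothetical model of what the canonical-file fetch can produce.
type FetchOutcome =
  | { kind: "ok"; content: string } // canonical body retrieved
  | { kind: "not_found" } // Contents API returned 404
  | { kind: "transient"; reason: string }; // timeout, 5xx, network error

// Hypothetical model of what the cron loop does next.
type CronAction =
  | { action: "materialise"; canonical_content: string }
  | { action: "leave_staged"; log?: string };

function planAmendmentCommit(outcome: FetchOutcome): CronAction {
  switch (outcome.kind) {
    case "ok":
      // Hand the fetched body on so the proposal can be materialised in full.
      return { action: "materialise", canonical_content: outcome.content };
    case "not_found":
      // Never delete: keep the staged record and log the structured error.
      return { action: "leave_staged", log: "canonical_not_found" };
    case "transient":
      // Same non-destructive outcome; the next cron tick simply retries.
      return { action: "leave_staged" };
  }
}
```

Both failure branches return the same non-destructive action, which mirrors the rule that a failed fetch must never cause a staged record to be deleted.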
**Scheduled Worker deploys** via GitHub Action on push to `tools/staging-worker/`.

**Secrets:**

- Cloudflare-side: `GITHUB_APP_ID`, `GITHUB_APP_PRIVATE_KEY`, `GITHUB_APP_INSTALLATION_ID`
- GitHub-side (Worker deploy): `CLOUDFLARE_API_TOKEN`, `CLOUDFLARE_ACCOUNT_ID`

**The GitHub App** has `Contents: Read & Write` permission only; installed once on the repo by the maintainer; private key in Cloudflare secrets. The Worker mints short-lived installation tokens (1h TTL) and commits via the GitHub API.

**Rate-limit thresholds (initial values, tunable):**

- All submission types combined: 50/IP/day (per G.6)
- `validation` submissions: 10/IP/day
- `validation` with `injection_flag: true`: 2/IP/day
- Per-IP burst (any type): 60 submissions / hour rolling
- Worker-global submission rate: 1000 / hour (trip-wire for "something is wrong")

Counters live in KV namespace `RATE_LIMITS`, keyed by `sha256(ip + daily_salt)`, with rolling-window TTL (per A.9 default).

**URL structure:**

```
https://becivic.be/                            <- marketing landing (renderer; index.mdx)
https://becivic.be/agents                      <- agent entry overview (renderer; agents.mdx; ~40 lines + manifest.json + per-endpoint pages — §13.1)
https://becivic.be/agents/manifest.json        <- machine-readable agent capability + endpoint manifest
https://becivic.be/agents/submit/              <- per-endpoint reference page (renderer)
https://becivic.be/skills/                     <- skill body at its current status (renderer; canonical.md; banner inferred from `status` frontmatter when not stable)
https://becivic.be/docs/submission-contract-v  <- contract (renderer)
https://becivic.be/llms.txt                    <- renderer build emits
https://becivic.be/llms-full.txt               <- renderer build emits
https://becivic.be/api/observations            <- Cloudflare; POST submit observation
https://becivic.be/api/observations/           <- Cloudflare; DELETE / GET status
https://becivic.be/api/skill-amendments        <- Cloudflare; POST
https://becivic.be/api/skill-drafts            <- Cloudflare; POST
https://becivic.be/api/validations             <- Cloudflare; POST
https://becivic.be/scrub-rules.json            <- canonical regex-rules.json (renderer-served)
https://becivic.be/communes.json               <- canonical data/communes.json
https://mcp.becivic.be/                        <- MCP server (§23); ~6 intent-oriented tools
```

The skills index entry for each skill carries the `commit` field (git short SHA) for build-time reproducibility. Concerns no longer require this; the read-side discovery endpoint is `GET /api/skills//concerns` (renamed from `/observations`); `GET /api/skills//history` (already shipped) returns the commit timeline.

**Substrate-agnosticism preserved.** The protocol in §8.3 specifies the interface, not the implementation. Anyone running their own be-civic fork can swap the Worker for any other backend that implements the same endpoints; the rendering layer is replaceable by any static-site generator that supports content versioning and the Be Civic MDX subset. Skills reference the staging URL via a single declared constant in the contract.

### 8.4b Internal endpoint: artefact-stats (distinct-IP counts)

`GET /api/_internal/artefact-stats?target_type=&target_id=` — returns the number of distinct validator IPs recorded for an artefact, along with how many of those carried an injection flag.

- **Auth**: GitHub App installation token (same path as the status-writeback endpoint, §8.5). No new secret.
- **Query parameters**: `target_type ∈ {skill, volatile_value, reference, observation, path, path_source}`; `target_id` matches the appropriate id format (`` for skills + paths, `-NNNNN` for catalogue rows + concerns, `:` for path sources). Enum extended to include `path` and `path_source` per the 2026-05-15 amendment.
- **Response** (200): `{ "distinct_ips": , "distinct_ips_with_injection_flag": }`
- **Data source**: D1 `validations` table aggregated on `(target_type, target_id)` with the per-artefact-salted IP hashes for distinct counting.
Per-artefact salt ensures the hash is stable across the artefact's lifetime but unlinkable to any other artefact or the daily rate-limit salt. - **Consumed by**: `tools/scripts/state-machine-tick.ts` (the state-machine bot). If D1 is unreachable, the script logs a structured warning to stderr and falls back to the local per-row distinct count (conservative: never false-promotes). ### 8.5 Commit-side defense in depth (NER) — held-for-review path Cloudflare Workers don't run Python/spaCy, so Presidio NER doesn't fit at the Worker layer. NER runs as a **separate GitHub Action** triggered on commits touching submission paths. Per G.14 principle 1, NER detection is **not auto-revert**. It routes to the same human-review queue that handles injection-flag quarantines (§G.6). One review path, two flag types. - The NER step runs in two contexts: (a) at D1 INSERT for concerns and validations and ratings (the Worker invokes the NER service before commit; on flag, the row is moved into a held-for-review table); (b) at PR-CI for `draft` and `amendment` submissions with `target_type ∈ {skill, path}` (the action runs Presidio on the changed prose; on flag, PR-CI fails the PR with a "held for review" status, and the operator-review queue picks it up). - Re-validates each newly-added record: schema + Presidio NER (multilingual FR/NL/DE/EN) on every freeform string field - Flags PERSON entities; ORG/LOC are allowed (public entities). URL fields validated as URLs; URLs in submissions matched against the canonical allowlist (`schemas/source-classes.json` primary or secondary tier) - **On NER fail:** the just-committed file is **not auto-reverted**. Instead: - The submission is moved to a `held-for-review///` directory where `` is one of the new feedback type names (`concerns`, `amendments`, `validations`, `drafts`, `feedbacks`, `ratings`) and `` is the id-prefix-stripped uid (e.g. 
`held-for-review/concerns/00873/` for a concern uid `con-00873`; renamed from the pre-amendment `held-for-review/observations//` shape). The moved file is still in `main` and still public, but it is flagged in `docs.json` so the renderer surfaces a "held for review" banner; agents that fetch via API see `state: "ner_held_for_review"` from the status endpoint.
- **Status writeback**: the `ner-on-commit` workflow calls `PATCH /api//{id}/status` with `{"state": "held_for_review"}` using a GitHub App installation token (minted from `STAGING_APP_*` secrets). This flips the KV record's state to `ner_held_for_review` and re-PUTs it without TTL, ensuring the status endpoint returns the held state indefinitely (see §8.3 status writeback).
- An entry is appended to `runtime/ner-review-queue.log` (gitignored Action-side; uploaded as a Workflow Artifact, 90-day retention)
- A maintainer-review issue is opened with a structured payload (no PII echoed in the issue title; the file path is enough to locate it)

**Maintainer review outcomes** (each triggers a `PATCH /api//{id}/status` call):
- **Released** — false positive (e.g., a Belgian person-shaped commune name like "Saint-Gilles"). File moved back to its canonical path; status flips to `released` via `PATCH` with `{"state": "released"}`.
- **Released after edit** — minor PII present that can be scrubbed cleanly without changing meaning. Maintainer edits, commits, status flips to `released_after_edit`. (Note: `released_after_edit` transition is deferred to a follow-up; currently treated as `released`.)
- **Discarded** — real PII. File deleted from `main`; status flips to `discarded` via `PATCH` with `{"state": "discarded"}`. Submitter learns via status endpoint. Public corpus is unaffected.

This makes the privacy claim **structural, not promissory**: PII never reaches the public corpus without human eyes when NER flags. (Per G.14.)
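
The held-for-review path and the current handling of review outcomes can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function names are hypothetical, and the assumption that the id prefix is everything up to and including the first hyphen is inferred from the single `con-00873` example.

```typescript
// Review outcomes named in the text above.
type ReviewOutcome = "released" | "released_after_edit" | "discarded";

// Directory names listed in the text for held-for-review submissions.
type HeldType =
  | "concerns" | "amendments" | "validations"
  | "drafts" | "feedbacks" | "ratings";

// Assumption: the id prefix is everything up to and including the first hyphen,
// so "con-00873" becomes "00873".
function heldForReviewPath(type: HeldType, uid: string): string {
  const dash = uid.indexOf("-");
  const stripped = dash >= 0 ? uid.slice(dash + 1) : uid;
  return `held-for-review/${type}/${stripped}/`;
}

// The text notes that the `released_after_edit` transition is deferred and
// currently treated as `released`, so only two states are written back today.
function statusPayload(outcome: ReviewOutcome): { state: "released" | "discarded" } {
  return { state: outcome === "discarded" ? "discarded" : "released" };
}
```

Keeping the outcome-to-state mapping in one function means the deferred `released_after_edit` transition can later be enabled in a single place.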
**Counter-note: race window.** A few seconds may elapse between the commit landing on `main` and the Action moving it to `held-for-review/`. This window is small and the file is still flagged `proposed` (no consumer would discover it in time). The renderer build only runs after the Action settles. Documented in `docs/threat-model.md` as an accepted residual risk; see also G.14 implications. ### 8.6 Incident response — PII slipped through despite all gates If PII is later discovered in a committed file (made it past consumer-side scrub, Worker regex, NER on commit, AND maintainer review of NER-held), the only correct response is destructive history rewrite via `git filter-repo`. This is documented in `docs/retraction-protocol.md`, requires explicit maintainer acknowledgement, and is disruptive (force-push to `main`, all consumers must refresh). It is not an automated path. Pre-emption (the four scrub layers + maintainer review) is the protection; rewrite is the last resort. For consumer-detected issues during the staging window, the consumer should call `DELETE //{id}` with the cancel_token — no incident response needed, the submission never reaches the repo. ### 8.7 Consumer-side state contract #### 8.7.1 Design principle Consumer-side state is the mechanism by which Be Civic provides continuity across sessions without requiring any server-side user store. The privacy property is structural: because no state is server-held, there is no central store that could be subpoenaed, breached, or repurposed. The tradeoff is that state portability and backup are the customer's responsibility, not Be Civic's. This section defines what MAY be stored locally, in what shape, and under what constraints. The harness implementation obligations (how to read, write, and scaffold this layout) belong to the C4/§15c amendment and are not repeated here. 
**What counts as consumer-side state.** A storage location qualifies as consumer-side state for the purposes of §3 (see architecture.md) principle 11 only if all of the following hold: 1. The customer can read the full contents directly with standard tools (text editor, `cat`, file browser), without invoking Be Civic or any vendor-mediated UI. 2. The customer can delete the contents unilaterally as a single artifact (`rm` the file or directory), without friction and without intervention from Be Civic or any third party. 3. The harness agent can both READ and WRITE the contents in-session. Read-only access from the agent's side is insufficient: the customer-side state contract requires the agent to update `profile.json` and `memory/` files in place during a session, and the customer's expectation is that those updates persist. A host filesystem directory (`~/.be-civic/`) satisfies all three clauses. Vendor-managed key-value stores [e.g., Project Memory in Anthropic platforms] do NOT satisfy clause 1 (the customer cannot inspect the store as a single artifact) and do NOT satisfy clause 2 (deletion is mediated by the vendor UI and may retain residual traces). Read-only file surfaces [e.g., free-tier Chat in Anthropic platforms, with the Be Civic Project installed] satisfy clauses 1 and 2 but FAIL clause 3 (the agent cannot write back to the files). For v1, only T2 and T3 (Claude Desktop Cowork tab with `~/.be-civic/` as a connected folder, per C4 amendment §24.4 (see architecture.md)) provide qualifying consumer-side state. T0 and T1 are stateless from the customer's perspective and the harness MUST NOT promise cross-session memory at those tiers. 
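
The three-clause test above can be expressed as a small predicate. This is a sketch only; the `StorageSurface` shape and the function names are hypothetical and exist purely to make the clauses concrete.

```typescript
// One boolean per clause of the consumer-side state test.
interface StorageSurface {
  name: string;
  customerCanReadDirectly: boolean; // clause 1: readable with standard tools
  customerCanDeleteAlone: boolean;  // clause 2: unilateral deletion as a single artifact
  agentCanReadAndWrite: boolean;    // clause 3: agent can READ and WRITE in-session
}

// Returns the clause numbers that the surface fails (empty when it qualifies).
function failedClauses(surface: StorageSurface): number[] {
  const failed: number[] = [];
  if (!surface.customerCanReadDirectly) failed.push(1);
  if (!surface.customerCanDeleteAlone) failed.push(2);
  if (!surface.agentCanReadAndWrite) failed.push(3);
  return failed;
}

function qualifiesAsConsumerSideState(surface: StorageSurface): boolean {
  return failedClauses(surface).length === 0;
}

// Two of the cases described in the text.
const hostDirectory: StorageSurface = {
  name: "host filesystem directory",
  customerCanReadDirectly: true,
  customerCanDeleteAlone: true,
  agentCanReadAndWrite: true,
};
const readOnlyFiles: StorageSurface = {
  name: "read-only file surface",
  customerCanReadDirectly: true,
  customerCanDeleteAlone: true,
  agentCanReadAndWrite: false,
};
```

Returning the failed clause numbers, not just a boolean, lets a harness explain to the customer why a given tier is treated as stateless.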
**Paths-related state and the three-clause test.** State derived from path traversal — for example, a customer's `requires_paths`-derived progress markers, or a record of which sources succeeded or failed for the customer — satisfies all three clauses of the test above: the customer can read the files with a text editor, delete them unilaterally, and the agent both reads and writes them in-session. Paths-related state lives in the existing `memory/procedure_progress_.md` files (for progress within an active procedure that required path traversal) and in `profile.json` extensions (for persistent routing outcomes such as `region`). No new file types are required for paths state.

**First-contact framing.** When a customer shares a document with the harness at any capability tier, the harness MUST convey the following substance at the point of document intake (the exact phrasing may be adapted to the conversational context, but the substance MUST be preserved in full):

> "If you share a document with me, I'll read it to find the parts that matter for your case — things like which commune issued it, what type of permit, and the months relevant to your timeline. The document file itself stays where you put it; I don't take a copy. What I do save into your profile is the categorical pieces I extracted (region, permit type, residence-period months), nothing more. Your profile lives on your computer and you can inspect or delete it whenever."

This wording avoids platform-specific disclosure (no reference to "cloud" or specific vendor infrastructure) in accordance with decision D26. Agent-platform privacy policies handle their own layer.

#### 8.7.2 File layout (host filesystem available, T2+)

Concrete file layout depends on the harness.
Under the Cowork plugin (V1+, per architecture.md §3 principle 13), `` resolves to a `/BeCivic/` folder under a user-picked parent path; the BeCivic root carries shared state (`profile.json`, `MEMORY.md`, `privacy-attachment.md`, `.be-civic/` hidden subdirectory for system state) and per-procedure subfolders carry per-procedure state including a per-project `CLAUDE.md`. Detailed layout is documented in [`cowork-plugin.md §2.9`](cowork-plugin.md). Sibling harnesses (e.g. a future ChatGPT-app harness in `chatgpt-app.md`) have their own filesystem-or-not story; each harness spec is authoritative for its own on-disk shape. The universal privacy guarantees in §8.7.4 (profile schema) and §8.8 (memory cap rules) apply to whatever on-disk layout each harness adopts. The legacy flat `/` layout below is retained as the **degraded-mode fallback** for harnesses without a plugin (T0 paste-prompt sampler, etc.) and for reference. When the harness probe confirms that a host filesystem is writable but no plugin is active, all persistent customer state lives under `` (see §8.7.3 for path resolution; the default on POSIX systems is `~/.be-civic/`). The directory MUST NOT be created without the customer's explicit consent at the first session. 
Once created, the layout is:

```
/
├── profile.json                      # enum routing fields only; see §8.7.4
├── memory/
│   ├── MEMORY.md                     # index, one line per entry, 200-line / 8KB cap
│   ├── customer_context.md           # customer's self-reported civic situation (narrative)
│   ├── procedure_progress_.md        # one file per active procedure; see §8.8
│   ├── decision_log_.md              # decisions the customer has taken during sessions
│   ├── document_reference_.md        # extracted routing fields from customer documents; see §8.9
│   ├── path_history_.md              # optional; one file per traversed path; see §8.7.2.2
│   └── archive/                      # completed procedures past the active window; see §8.8
├── skills-cache/                     # local cache of fetched skills; see §8.7.2.1 below
│   └── /
│       └── SKILL.md
├── sessions/
│   └── /                             # per-session ephemeral state; deleted on session close
│       ├── facts.json                # structured facts surfaced during this session
│       ├── dossier-draft.md          # working draft of any document the customer is assembling
│       └── observations-buffer.jsonl # submission items buffered for this session
├── submissions.jsonl                 # cumulative receipt log of all submitted items
└── analytics-outbox.jsonl            # offline queue for analytics events; flushed at next session preamble
```

**`.gitignore` note.** Any harness writing to a project-scoped `` inside a git-tracked directory MUST add the directory name to the nearest `.gitignore`. This is a non-optional invariant: the harness MUST verify the ignore entry exists before writing the first file. On non-git systems the check is skipped.

##### 8.7.2.1 skills-cache/

The `skills-cache/` directory holds Be Civic skills the harness has fetched at runtime — typically via the §24.4.1 (see architecture.md) degradation chain fallback to web-fetch when MCP and HTTP API are both unreachable, or proactive caching of skills the customer is likely to need across sessions. Each cached skill lives at `skills-cache//SKILL.md` carrying the canonical body and frontmatter exactly as published at `becivic.be/skills/`.
For a cached skill to be loaded by the consumer agent as an **actual skill** (Skill-tool routable, not just scratch markdown), it MUST be installed at the agent platform's skill-discovery path. The agent platform's skill-discovery path is platform-specific (the path used by Claude Desktop differs between macOS, Windows, and Claude Code). The harness MAY use a platform-aware symlink from the platform's skill-discovery path to `/skills-cache//` so that updates to the cached copy are visible without re-installation. Platform-specific paths are documented in `bc-docs/CLAUDE.md` "Skill loading paths" and updated as Claude Desktop versions evolve. **Cache invalidation.** Cached skills carry a `cached_at` timestamp in a sidecar file (`skills-cache//.cached-at`). The harness MAY refresh the cache on a session-start basis or on detection of an observation rejecting a value the cached skill claims; the refresh source is `becivic.be/skills/` over HTTPS. The harness MUST NOT serve cached content older than 30 days without re-fetching at least once. **Customer-side state qualification.** The `skills-cache/` directory satisfies the §8.7.1 three-clause customer-side state test: the customer can read each cached skill body with a text editor, can delete the directory unilaterally, and the agent both reads and writes it (read during procedure routing; write during cache refresh). ##### 8.7.2.2 path_history/ The `memory/` directory MAY carry an optional `path_history_.md` file for each path the customer has traversed, where `` is the path's catalogue ID (for example `path_history_certificat-residence-historique.md`). Each file records: - Which source was attempted, in order, and the outcome for each attempt (`success`, `failed`, `skipped-ineligible`, `declined-by-customer`). - The ISO 8601 date (YYYY-MM-DD) of the successful attempt, if one occurred. - Whether the customer retains the delivered file: `yes`, `no`, or `unknown`. 
The file is plain markdown, written in the customer's preferred language, and is intended to be readable by the customer without assistance. Example frontmatter:

```yaml
---
type: path_history
path_id: certificat-residence-historique
last_traversed: 2026-05-12
---
```

Path history files satisfy the §8.7.1 three-clause test. They MUST NOT record the document's content, its file name, or any field prohibited under §8.9.3. They are not required; if the harness does not write them, no behaviour is broken.

#### 8.7.3 `` resolution

`` is resolved at harness initialisation, in platform-aware order. The harness uses the first option that applies:

1. **macOS** — `~/Library/Application Support/be-civic/` (the platform-conventional per-user data directory).
2. **Windows** — `%LOCALAPPDATA%\be-civic\` (typically `C:\Users\\AppData\Local\be-civic\`), the platform-conventional per-user data directory for non-roaming application state.
3. **Linux / XDG-compliant** — `$XDG_DATA_HOME/be-civic/` when `$XDG_DATA_HOME` is set; otherwise `~/.local/share/be-civic/` when `~/.local/share/` exists and is writable.
4. **Fallback (all platforms)** — `~/.be-civic/` (POSIX-style home-directory dotfile, used when no platform-conventional path applies or is writable).

The resolution order MUST be applied uniformly by all conforming harness implementations. The harness logs the resolved path to `/.location` so that subsequent sessions on the same machine reuse the same directory even if the resolution rules would otherwise pick a different one.

Cross-platform note: Claude Desktop is available on macOS and Windows as of the round-7 cutover; both must be supported for v1. Verification of the platform-specific Cowork connected-folder default is tracked as an open question (see cowork-plugin.md §4).

The resolved path is documented to the customer at session start in plain language: "I'll keep your notes at ``. You can find them there if you want to inspect or back them up."
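
The resolution order in §8.7.3 can be sketched as a pure function. This is an illustration under stated assumptions, not the harness implementation: `resolveDataDir`, `ResolveEnv`, and the injected `isWritable` callback are hypothetical names, the writability check is applied to the parent directory of every candidate (the text states it explicitly only for `~/.local/share/`), and the `.location` reuse rule is left out.

```typescript
// Inputs are passed in explicitly so the function stays free of I/O.
interface ResolveEnv {
  home: string;          // the user's home directory
  localAppData?: string; // %LOCALAPPDATA% on Windows
  xdgDataHome?: string;  // $XDG_DATA_HOME on Linux
}

function resolveDataDir(
  platform: "darwin" | "win32" | "linux",
  env: ResolveEnv,
  isWritable: (dir: string) => boolean,
): string {
  let parent: string | undefined;
  let sep = "/";
  if (platform === "darwin") {
    // 1. macOS: platform-conventional per-user data directory.
    parent = `${env.home}/Library/Application Support`;
  } else if (platform === "win32") {
    // 2. Windows: non-roaming application state.
    parent = env.localAppData;
    sep = "\\";
  } else if (env.xdgDataHome) {
    // 3a. Linux with $XDG_DATA_HOME set.
    parent = env.xdgDataHome;
  } else {
    // 3b. Linux without $XDG_DATA_HOME.
    parent = `${env.home}/.local/share`;
  }
  if (parent !== undefined && isWritable(parent)) {
    return `${parent}${sep}be-civic${sep}`;
  }
  // 4. Fallback: home-directory dotfile.
  return `${env.home}/.be-civic/`;
}
```

Injecting `isWritable` keeps the ordering logic testable without touching the real filesystem; a harness would supply a callback backed by its own file API.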
#### 8.7.4 profile.json — shape and constraints

`profile.json` holds the routing fields that allow the harness to skip repeat questions across sessions. Every field is categorical or boolean. No field MAY hold a value that is or could derive from a real identifier.

The complete set of fields for v1:

| Field | Type | Description |
|---|---|---|
| `region` | enum | `Flanders` / `Wallonia` / `Brussels-Capital` / `German-speaking-community` |
| `commune_nis5` | string (5 digits) | NIS5 commune code only; no commune name, no address |
| `administration_language` | enum | `NL` / `FR` / `DE` — constrained to the commune's official languages. Filters by region per the form's pill-filter map (D26). When `region` is `not-in-belgium-yet` (D29), all three values are accepted and the form hint adapts. |
| `conversation_language` | string (free text, ≤32 chars) | The language the user wants the agent to communicate in. Free text (not enum) per D27 — any language the agent can speak works (English, French, Tagalog, Slovenian, etc.). Agent pre-fills detected language at runtime; user may override on the onboarding form. **Renamed from / replaces the prior `other_languages` list.** The legacy `other_languages[0]` shape is migrated to `conversation_language` on first write under V1 schema. |
| `civic_status` | enum | `single` / `cohabitant-legal` / `married` / `divorced` / `widowed` |
| `nationality_status` | enum | `BE` / `EU` / `non-EU` / `multiple` |
| `residency_status` | enum | `registered` / `registering` / `EU-citizen` / `non-EU-permit` / `asylum` / `undocumented` |
| `residency_history` | list of objects | Each object: `{start, end, visa_type, permit_type, country_of_last_residence}` — periods only, no document numbers. Dates are `YYYY-MM` strings (month-bucket precision); see "Date precision" below. |
| `dependents` | object | `{minor_children_count, adult_dependents_count, spouse_abroad: bool}` — counts and booleans only |
| `employment_history` | list of objects | Each object: `{start, end, type, days_per_week, total_days_estimate}` where `type ∈ {FT, PT, self-employed, student, unemployed, retired}` — no employer names, no ONSS numbers. Dates are `YYYY-MM`. |
| `education_history` | list of objects | Each object: `{start, end, level, country_of_institution}` — no institution names, no diploma numbers. Dates are `YYYY-MM`. |
| `document_inventory` | object of mixed types | `has_id_card` (enum: `yes` / `not-yet-waiting` / `no` / `not-sure`), plus booleans `has_residence_card / has_work_permit / has_NN / has_passport_BE / has_passport_other`, plus `validity_end_` as `YYYY-MM` for each document the customer holds. No document numbers, no copies, no exact expiry day. See `has_id_card` row below for the rename and rationale. |
| `has_id_card` (inside `document_inventory`) | enum | `yes` / `not-yet-waiting` / `no` / `not-sure` (D22, D23). **Renamed from `has_eID`** — the prior eID-vs-residence-card distinction is dropped because all Belgian-issued chip cards (eID and residence card) are functionally equivalent for itsme/identity purposes; the agent disambiguates card-type-specific path-source eligibility at path-traversal time, not at onboarding (D52). |
| `browser_driving_preference` | enum | `drive-by-default` / `ask-each-time` / `never-drive` — honoured at path-traversal time per architecture.md §24.9 (Chrome MCP handoff vs AUQ vs markdown-link). New field per D8. |
| `consent` | object (typed namespace) | Extensibility hook for consent metadata. The schema declares the namespace but **specific keys are operational** and vary by phase. Alpha-phase keys (e.g. `alpha_bundle`, `signed_at`, `version`) are documented in [`cowork-plugin.md §3.8`](cowork-plugin.md). Post-alpha keys for granular per-stream opt-out will be documented when that posture lands. Consumers MUST tolerate unknown `consent.*` keys; the namespace is intentionally permissive. |
| `active_procedures` | list of skill IDs | Procedure-skill IDs currently in flight; cross-references into `memory/procedure_progress_*.md`. The list contains ALL ongoing procedures, not just the currently-focused one; the harness holds state for each in memory simultaneously and routes by customer cue. |
| `transitions_in_progress` | list of enum values | `marriage-planned / divorce / address-change` and equivalents |

**`has_id_card` migration.** Existing `profile.json` files carrying `document_inventory.has_eID` (boolean) are migrated on first read under V1 schema: `has_eID: true` → `has_id_card: "yes"`; `has_eID: false` → `has_id_card: "no"`. There is no path to `"not-yet-waiting"` or `"not-sure"` from legacy data; those values originate only from V1+ onboarding forms.

**`other_languages` → `conversation_language` migration.** The legacy `other_languages` ordered list is superseded by the free-text `conversation_language` field. On first write under V1 schema, `other_languages[0]` (the prior harness communications language slot) is migrated to `conversation_language`; the remaining entries are dropped (they were not load-bearing under any v1 routing decision). Agents MUST tolerate legacy `other_languages` on read but MUST NOT write that field under V1.

**Date precision.** Every date field in `profile.json` is encoded as a `YYYY-MM` string (month-bucket precision). Day-level precision is not stored for any field. This applies to all of `residency_history`, `employment_history`, `education_history`, and every `validity_end_` field in `document_inventory`. The constraint exists for two reasons: (1) month-bucket precision is sufficient for every routing decision the harness makes at v1; (2) day-level precision narrows the de-anonymisation surface materially when combined with other fields (commune, employer-type-by-period, residence-permit-type-by-period).
The harness MUST round customer-provided exact dates to month-bucket form before writing to `profile.json`. The harness MAY hold day-level precision in `/sessions//facts.json` for the duration of an active session (where it is needed for deadline reminders), but MUST NOT carry day-level precision into persistent state. This rule REVERSES the 2026-05-11 operator override that permitted exact expiry dates. The v1 posture is intentionally tighter than the longer-term posture: as the customer-side state contract matures and additional safeguards land (encrypted-at-rest options, additional scrub layers, in-document tagging), v1.1+ MAY relax precision for specific fields where a customer-precision use case is demonstrated. The v1 default is YYYY-MM uniformly. The design decisions record (2026-05-11, Cluster 7) identifies 14 named fields above. Two additional structural positions complete the 16-axis catalogue: `profile_schema_version` (string, schema version sentinel, written on first create and on schema upgrade) and `last_updated_at` (ISO 8601 timestamp, written on every write, for staleness detection). These two metadata fields are not routing axes and carry no identifying information; they are non-optional on every conforming `profile.json`. **What MUST NOT appear in `profile.json`:** - Any national identifier or derivative (NISS, NN, eID chip data, social security number, foreign tax ID) - Any document number (passport number, residence card number, work permit number) - Any name (given name, family name, alias) - Any date of birth, place of birth, or biometric data - Any full postal address (commune and region category are the finest granularity permitted) - Any photograph or image reference - Any narrative field (narrative content lives in `memory/customer_context.md` and related files) The constraint is structural: any proposed v2 field that COULD hold identity in any realistic population of inputs MUST be rejected at the schema layer, not by policy alone. 
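
The month-bucket rule under "Date precision" can be sketched as a small conversion applied before any write to `profile.json`. This is a minimal illustration: the function names are hypothetical, and "rounding" is interpreted here as truncating the day component, which is an assumption about the intended behaviour.

```typescript
// Converts an ISO 8601 calendar date (YYYY-MM-DD) or an existing month bucket
// (YYYY-MM) to month-bucket form. Assumption: "round" means "drop the day".
function toMonthBucket(input: string): string {
  const match = /^(\d{4})-(\d{2})(?:-(\d{2}))?$/.exec(input.trim());
  if (match === null) {
    throw new Error(`not an ISO 8601 date: ${input}`);
  }
  const month = Number(match[2]);
  if (month < 1 || month > 12) {
    throw new Error(`invalid month in: ${input}`);
  }
  return `${match[1]}-${match[2]}`;
}

// Guard usable at write time: true only for a well-formed YYYY-MM string.
function isMonthBucket(value: string): boolean {
  return /^\d{4}-(0[1-9]|1[0-2])$/.test(value);
}
```

A harness could call `toMonthBucket` on customer-provided dates and then assert `isMonthBucket` on every date field immediately before serialising the profile.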
`profile.json` MUST be valid against `bc-docs/schemas/profile.schema.json` on every write. The schema enforces the field-level constraints listed above. A harness that writes to `profile.json` without validating against the schema is non-conformant. #### 8.7.5 memory/ shape `MEMORY.md` is the index: one line per memory entry, at most 200 lines, at most 8KB. On T2/T3 (Cowork tab in Claude Desktop), `MEMORY.md` is read at session start via explicit skill instructions. On T4 (Claude Code, or any environment that supports skill-frontmatter hooks), `MEMORY.md` is injected into context via the `UserPromptSubmit` hook on every turn. Cowork tab hook support is an open question (see §18); if confirmed, Cowork at T3 can be upgraded to T4 without re-architecting `memory/` shape. Per-topic files carry YAML frontmatter with at minimum `name`, `description`, and `type`. Permitted `type` values: - `customer_context` — customer's self-reported situation and background; free narrative - `procedure_progress` — current state of a specific active procedure (step, outstanding documents, next action); one file per `procedure_progress_.md` - `decision_log` — decisions the customer has made with Be Civic's assistance (for example, "chose path B for language exam waiver") - `document_reference` — routing fields extracted from a customer-supplied document; see §8.9 for content constraints No type outside this list is valid in v1. New types require a Tier B amendment. #### 8.7.6 sessions/ directory The `sessions//` directory is ephemeral: it is created at session open and MUST be deleted at session close after the submission buffer has been flushed (or on `session_outcome: abandoned_inferred` submission for orphaned sessions, per §8.8.3). No session directory persists across session boundaries. This is non-negotiable: session state is never accumulated across sessions; only the extracted routing fields and narrative summaries in `profile.json` and `memory/` carry forward. 
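
The `MEMORY.md` caps stated in §8.7.5 (at most 200 lines, at most 8KB) can be checked before each write with a sketch like the one below. The names are hypothetical, and two points are assumptions: "8KB" is read as 8 × 1024 bytes of UTF-8, and a trailing newline is not counted as an extra line.

```typescript
const MAX_LINES = 200;
const MAX_BYTES = 8 * 1024; // assumption: "8KB" means 8 x 1024 bytes

// UTF-8 byte length computed per code point, without platform-specific APIs.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0) ?? 0;
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return bytes;
}

function checkMemoryIndex(content: string): { lines: number; bytes: number; withinCaps: boolean } {
  // A single trailing newline terminates the last line; it is not a new line.
  const body = content.endsWith("\n") ? content.slice(0, -1) : content;
  const lines = body === "" ? 0 : body.split("\n").length;
  const bytes = utf8ByteLength(content);
  return { lines, bytes, withinCaps: lines <= MAX_LINES && bytes <= MAX_BYTES };
}
```

Counting bytes, not characters, matters because customer-language entries may contain multi-byte characters.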
#### 8.7.7 submissions.jsonl and analytics-outbox.jsonl `submissions.jsonl` is an append-only receipt log. Each line is a JSON object recording a submitted item: `{submitted_at, session_id, type, id, cancel_token, commit_eta, status}`. The `type` field carries one of the 2026-05-15 feedback type values (`concern` | `amendment` | `validation` | `draft` | `feedback` | `rating`); the `id` field is the matching `_id` (`concern_id` / `amendment_id` / `validation_id` / `draft_id` / `feedback_id` / `rating_id`) — renamed from the pre-amendment `observation_id` / `skill_amendment_id` / `skill_draft_id` shape. The `session_id` field is **retained** (S61 reversal — the cluster-2 amendment had proposed dropping `session_id` from this log in favour of a `recovery_token`; the reversal restores the original shape). The harness appends a line on every successful `mode: "stage"` response. The customer can read this file to review and cancel pending submissions. The file is the customer's own record; Be Civic does not hold a copy. `analytics-outbox.jsonl` is an offline queue for analytics events that could not be submitted during a session (network unavailable, scrub-rules fetch failed). Each line is an analytics event in the shape defined by `POST /api/analytics`. The harness MUST attempt to flush the outbox at the next session preamble before generating new events for that session. Flushing is a deterministic code path: no LLM involvement. Events in the outbox are discarded after 30 days without successful submission. ### 8.8 Retention and deletion semantics #### 8.8.1 Active procedure files `procedure_progress_.md` MUST be retained as long as the procedure is active (that is, `` appears in `profile.json` `active_procedures`), or for 90 days after the file was last written, whichever is shorter. "Last written" is the file's mtime; the harness MUST NOT backdate mtimes. 
When a procedure completes (the customer reaches a confirmed terminal step) or when the 90-day inactivity window expires, the harness MUST move the file to `memory/archive/.md` and remove the procedure's ID from `active_procedures` in `profile.json`. #### 8.8.2 Archived procedure files Files in `memory/archive/` are retained for one year from their archive date (recorded in the file's frontmatter as `archived_at`). After one year, the harness SHOULD delete them. The harness MUST surface a deletion warning to the customer at session start if any archived file is within 30 days of its one-year mark, so the customer can export the content before it is removed. #### 8.8.3 Session buffers `sessions//` is deleted on session close after the submission buffer has been flushed to `POST /api/feedback` (or after the customer has explicitly declined submission). An orphaned session directory (no session-close event received, directory age greater than 72 hours) is cleaned up by the harness at the next session preamble. Before deleting an orphaned session directory, the harness MUST submit `session_outcome: abandoned_inferred` to `POST /api/analytics` if analytics opt-in is active. #### 8.8.4 Customer-initiated deletion A customer MAY delete `~/.be-civic/` (or the equivalent `/be-civic/`) at any time, by any means, with no Be Civic consequence. The next session is treated as `first_contact`. The harness MUST NOT prevent, warn against, or create friction around customer-initiated deletion. The harness MUST NOT attempt to re-create deleted files from server-side state (because no server-side state exists). There is no account to deactivate, no server-side deletion request to file, and no right-to-erasure workflow needed for the local state: deletion is the customer's unilateral act. 
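
The retention windows in §8.8.1 to §8.8.3 can be sketched as three pure checks. This is an illustration, not the harness implementation: the function names are hypothetical, "one year" is approximated as 365 days, and the boundary handling (for example, archiving at exactly 90 days) is an assumption.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const HOUR_MS = 60 * 60 * 1000;

// §8.8.1: archive on completion or once 90 days have passed since the last write.
function shouldArchive(completed: boolean, lastWritten: Date, now: Date): boolean {
  return completed || now.getTime() - lastWritten.getTime() >= 90 * DAY_MS;
}

// §8.8.2: warn within 30 days of the one-year mark, delete after it.
// Assumption: one year is treated as 365 days.
function archiveStatus(archivedAt: Date, now: Date): "keep" | "warn" | "delete" {
  const ageDays = (now.getTime() - archivedAt.getTime()) / DAY_MS;
  if (ageDays >= 365) return "delete";
  if (ageDays >= 365 - 30) return "warn";
  return "keep";
}

// §8.8.3: a session directory older than 72 hours with no close event is orphaned.
function isOrphanedSession(createdAt: Date, now: Date): boolean {
  return now.getTime() - createdAt.getTime() > 72 * HOUR_MS;
}
```

Passing `now` explicitly keeps the checks deterministic, which fits the section's requirement that cleanup is a code path with no LLM involvement.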
### 8.9 Document-content-discard rule #### 8.9.1 Scope This section applies when a customer provides Be Civic with a copy, scan, photograph, or paste of a personal document — including but not limited to: national identity cards (eID), foreign national identity documents, passports, residence permit cards (A through M), work permits, diplomas, and official correspondence. #### 8.9.2 What the harness MUST extract and retain From a customer-supplied document, the harness MUST extract only the routing fields needed to determine procedure eligibility or next-step routing. Examples: - From a residence permit card: `permit_type` (for example "F card"), `validity_end` (ISO 8601 date, for example "2028-06-15"), `validity_start` (ISO 8601 date, optional) - From a passport: `issuing_country`, `validity_end` (ISO 8601 date) - From official correspondence: `issuing_authority_category` (commune / CGVS / OCMW / other), `subject_category` (invitation / refusal / decision), `deadline_date` (ISO 8601 date) where the correspondence carries a deadline Validity dates and other non-identifying dates printed on a document MAY be retained as exact dates. They are not identity-derivative: knowing a residence permit expires on a given date does not narrow the holder to a small population. Retaining the exact date enables the harness to provide proactive renewal warnings (for example, "your F card expires in 76 days; here are the renewal steps"). Extracted routing fields are written to `memory/document_reference_.md` with provenance metadata: `{source_category, extraction_date, fields_extracted: [list of field names]}`. 
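
The extract-and-retain rule in §8.9.2 can be sketched as an allowlist filter that also produces the provenance metadata. This is a hedged illustration: `buildDocumentReference`, the `fields` property, and the `source_category` value used in the test are hypothetical, and the allowlist contains only the example fields named above, which the text presents as examples and not as an exhaustive list.

```typescript
// Illustrative allowlist built from the example routing fields in §8.9.2.
const ROUTING_FIELDS = [
  "permit_type",
  "validity_start",
  "validity_end",
  "issuing_country",
  "issuing_authority_category",
  "subject_category",
  "deadline_date",
] as const;

interface DocumentReference {
  source_category: string;
  extraction_date: string;        // ISO 8601 date of the extraction
  fields_extracted: string[];     // names only, per the provenance shape in the text
  fields: Record<string, string>; // hypothetical container for the retained values
}

// Keeps only allowlisted routing fields; anything else in `candidate` is dropped.
function buildDocumentReference(
  sourceCategory: string,
  extractionDate: string,
  candidate: Record<string, string>,
): DocumentReference {
  const fields: Record<string, string> = {};
  for (const name of ROUTING_FIELDS) {
    const value = candidate[name];
    if (value !== undefined) fields[name] = value;
  }
  return {
    source_category: sourceCategory,
    extraction_date: extractionDate,
    fields_extracted: Object.keys(fields),
    fields,
  };
}
```

Building the record from an allowlist, and not by deleting known-bad keys, means an unexpected field such as a document number is excluded by default.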
#### 8.9.3 What the harness MUST NOT retain The following MUST NOT appear in `profile.json`, any `memory/` file, `sessions/`, or any other persisted location: - The document number, card number, passport number, or any other identifier printed on the document - The customer's full name, given name, or family name as it appears on the document - The customer's date of birth or place of birth (dates of birth are identity-derivative; combined with commune, they narrow to a small population) - Any photograph or biometric data - The full address as printed on the document (commune and region category are permitted) - Any text block from the document beyond the specific fields enumerated in §8.9.2 Document validity dates (issue date, expiry date) and other non-identifying temporal fields MAY be retained per §8.9.2. Date of birth is the one date type that is identity-derivative and remains prohibited. #### 8.9.4 Scrub verification The Layer 1 consumer-side scrub (regex plus LLM contextual pass) MUST run against the `memory/document_reference_.md` content on every write, not only on submission buffer writes. The scrub rules fetched at session start apply. If any scrub rule fires on a document reference file, the harness MUST abort the write, log a structured warning to `sessions//scrub-warnings.jsonl` (not to the submission buffer), and prompt the customer to confirm that the field should be omitted. #### 8.9.5 Original content The original document content (the customer's paste, the OCR output, the image data) MUST NOT be written to any file in `~/.be-civic/`. It exists only in the active session context and is discarded when the session ends. The harness MUST NOT store the original content in `sessions//dossier-draft.md` or any other session file: dossier-draft is for documents the customer is assembling for submission to authorities, not for copies of documents already held by the customer. 
#### 8.9.6 Document delivery via paths

When a path delivers a document to the customer (for example, a Brussels Tier-1 quickLink generates a PDF, or the customer downloads a residence certificate via a federal portal), the document file is stored where the customer puts it (their connected folder, their Downloads directory, or wherever they choose to save it). The harness MUST NOT relocate or copy the file.

The discard rule in §8.9.3 applies to what the harness extracts from the document into customer-side state: only the categorical routing fields named by the procedure skill's `inputs` or the path's `outputs` schema are written to `profile.json` or `memory/.md`. Nothing else from the document enters customer-side state, regardless of whether the harness "saw" the document content in its conversation window during the delivery session.

### 8.10 Anonymous-by-construction — structural reinforcement

#### 8.10.1 No identifier derivatives

The following are unconditionally prohibited in any Be Civic system, including consumer-side local state:

- The NISS / national registration number, or any hash, truncation, or transformation of it
- Any email address hash
- Any device fingerprint or hardware identifier
- Any purpose-generated derivative of a real identifier (including partial NISS, date-of-birth-derived token, or document-number-derived token)

If a session-level correlation token is needed (for example, for linking submission buffer entries), it MUST be a randomly generated UUIDv7 with no relationship to any real identifier. `session_id` (`ses_`) satisfies this requirement. It MUST NOT be seeded from or mixed with any customer attribute.

(Per the 2026-05-15 **S61 reversal**, `session_id` is the recovery key end-to-end; the recovery_token concept proposed in the 2026-05-11 Cluster 2 amendment is dropped from the spec. The Worker echoes the agent-provided `session_id` back in the response body alongside `concern_id` / `amendment_id` / etc.
and `cancel_token`; the recovery endpoint is `GET /api/feedback/sessions/`. D1's `validations.session_id` column persists the agent-provided value; the prior-proposed `recovery_token` column on `concerns` was never created and is permanently dropped from the migration sequence.)

#### 8.10.2 Categorical fields are a structural constraint, not a policy choice

The requirement that `profile.json` fields be categorical or boolean (§8.7.4) is not a policy choice made for compliance reasons. It is a structural constraint that ensures the profile cannot re-identify the customer even if the file is read by a third party. Any proposed v2 field that would require a precise numeric value, a name, or a date string MUST be redesigned as a categorical field before the proposal advances to Tier B amendment review.

#### 8.10.3 Vendor memory degrades gracefully

On T0 and T1 (no host filesystem), Be Civic MAY degrade to in-memory session state and rely on the customer's AI vendor account memory (vendor key-value stores [e.g., Project Memory in Anthropic platforms], ChatGPT memory, and equivalent) for cross-session persistence. The harness MUST NOT write anything to vendor memory that would violate the field-level constraints of §8.7.4. The vendor memory path is a capability degradation, not a separate data regime with lower privacy standards.

#### 8.10.4 Paths are anonymous-by-construction

The Path Directory NEVER carries customer-identifying state. A path entry is the same catalogue object for every customer; it describes a route to a document or tool, not anything about any individual who has used it. Per-customer eligibility evaluation happens at traversal time, in the harness, against the customer's local `profile.json`, not server-side. No customer attribute is transmitted to the catalogue server as part of path resolution. The `path_history_.md` files (§8.7.2.2) are local-only; they are not submitted to any Be Civic endpoint and are not part of the submission protocol.
A customer who has traversed fifty paths has left no identifying trace on the Path Directory beyond aggregated, salted validation submissions (§9.5 (see lifecycle.md)) that are subject to the same per-artefact-salted IP-hash anonymisation as skill validations.

## Cross-references

_Cross-doc references are inlined throughout this document in the form §X.Y (see .md). The list below was the pre-reconciliation manifest from the 2026-05-11 split, retained for audit; it can be deleted at the next split-or-merge cycle._

- §3 (Non-negotiable principles, including principle 11 customer-side state) — see `architecture.md` §3
- §6.2 (Submission schemas, identity-field bans, free-text caps) — see `schemas.md` §6.2
- §6.7 (Agent capability requirements per submission type) — see `schemas.md` §6.7
- §6.8 (Scrub rules file) — see `schemas.md` §6.8
- §6.11 (Catalogue UID convention, PR-CI uid assignment) — see `schemas.md` §6.11
- §6.12 (Path Directory schema) — see `schemas.md` §6.12
- §7 (Trust model / maintainer-review queue) — see `protocol.md` §7
- §9 (State-machine promotion) — see `lifecycle.md` §9
- §9.2 (Promotion thresholds) — see `lifecycle.md` §9.2
- §9.5 (Path and path-source lifecycle) — see `lifecycle.md` §9.5
- §11.1 (Source rot) — see `lifecycle.md` §11.1
- §13.1 (Agent interface manifest page) — see `architecture.md` §13.1
- §15.7 (Harness consumer obligations) — see `skills.md` §15.7
- §15.8 (Conversation invariants — plain-language obligations) — see `skills.md` §15.8
- §18 (Open questions, Cowork hook support) — see `architecture.md` §18
- §20 (Website rendering / renderer Worker) — see `website.md` §20
- §23 (MCP server) — see `protocol.md` §23
- §24.4 (Capability tiers) — see `architecture.md` §24.4
- §24.5 (Three-tier returning-user adaptation) — see `architecture.md` §24.5