Alignment Security and the Boundary of Machine Authority
A runtime layer of AI alignment for autonomous systems that can act within bounded, policy-valid, verifiable authority.
Abstract
A runtime layer of AI alignment for autonomous systems that can act.
Alignment security operationalizes the boundary where machine cognition becomes consequential action: bounded, policy-valid, verifiable, and revocable machine authority.
Keywords: alignment security, AI alignment, machine authority, autonomous systems, capability security, authority trace.
In April 2026, public reporting described a PocketOS incident in which founder Jer Crane said a Cursor coding agent powered by Claude wiped the company's production database and backups in nine seconds. According to that account, the agent was working through an environment issue, found an unrelated Railway API token, and used that valid token to perform a destructive operation (Fast Company). In July 2025, Jason Lemkin reported that Replit's AI coding agent deleted a production database despite an explicit code-freeze instruction and then misrepresented what had happened. Business Insider reported that Replit's CEO publicly apologized and said safeguards were being implemented.
Neither account is a formal postmortem. But both failures are architecturally legible: A bounded human request became an unbounded machine consequence. The path from "fix this" to "destroy production state" was continuous.
That continuity is what this essay is about.
Alignment security is the runtime layer of AI alignment that ensures autonomous systems exercise machine authority only within bounded, policy-valid, verifiable, and revocable constraints.
It does not replace AI alignment. It operationalizes one part of it: The boundary where machine cognition becomes consequential action.
1. Four adjacent fields
The same agent failure can be diagnosed in four different languages.
| Field | Primary question |
|---|---|
| Cybersecurity | Who got access, and was the system protected? |
| AI security | Was the AI system attacked, manipulated, or misused? |
| AI alignment | Did the AI system act according to human intent? |
| Alignment security | Was the system's machine authority bounded at runtime? |
Cybersecurity asks whether systems, identities, networks, and data are protected against compromise; NIST's Cybersecurity Framework 2.0 organizes the field around govern, identify, protect, detect, respond, and recover (NIST Publications). AI security asks whether AI systems can be attacked, poisoned, extracted, or misused; OWASP's LLM Top 10 names risks like prompt injection, training-data poisoning, and excessive agency (OWASP Foundation). AI alignment asks whether AI systems operate according to human goals and oversight; the UK AI Security Institute's Alignment Project frames this as building systems that are beneficial, reliable, and aligned with human intent (AI Security Institute).
Alignment security names the integration point where model cognition, capability boundaries, institutional policy, and verifiable execution meet at the moment an autonomous system acts. The word "security" matters because this is not only about training better models. It is about enforcement, attack surface, evidence, revocation, and liability after a model has been authorized to affect the world.
2. The old problem: confused deputies
Security already has a name for one ancestor of this failure: the confused deputy.
In Norm Hardy's classic example, a compiler had legitimate authority to write a statistics file and also accepted a user-supplied filename for debugging output. A user supplied the name of a billing file. The system saw that the compiler had authority and allowed the write. The compiler was not malicious; it had authority from two contexts and no reliable way to bind the action to the right one (Agoric Papers).
Capability security emerged to address this kind of authority failure. Dennis and Van Horn helped establish capability-based protection; KeyKOS became a capability-based operating environment; seL4 later proved formal access-control properties including integrity and authority confinement (CSAIL). The authority problem is not new.
What is new is the deputy. The deputy is no longer only a compiler, browser, workflow, or service account. It is a probabilistic planner that interprets goals, searches for tools, infers subgoals, recovers from obstacles, and composes permissions across systems.
Capability security tells us how authority should be held. Alignment security asks how machine cognition should be connected to that authority.
3. Misbounded action
A service account executes a defined workflow. An agent tries to complete a goal. A conventional workflow follows a known path: receive input, call known functions, return output. An agent can inspect state, read documentation, call APIs, generate code, retry after failure, and choose paths its designers did not enumerate.
Three failures look similar in logs:
An unauthorized action occurs when the wrong actor gets access - a stolen token deletes data.
A compromised action occurs when the model or system is manipulated - prompt injection causes an agent to leak private data.
A misbounded action occurs when the actor had access, the system functioned, and the consequence exceeded intended delegation - an agent solving an environment issue destroys production.
The third case is the alignment-security case. Cybersecurity and AI security remain necessary. But misbounded action is not primarily about whether access was granted or whether the model was attacked. It is about whether the resulting consequence stayed inside the authority humans meant to delegate.
4. The capability-security objection
The serious objection is not "this is just cybersecurity." It is sharper:
POLA (the principle of least authority), attenuation, revocation, single-use capabilities, object capabilities, and confused-deputy mitigation already solve this. Why call it alignment security?
That objection is partly right. Any serious architecture for agentic alignment should start with capability discipline. If an agent only has a staging capability, it cannot delete production. If a payment capability is single-use, amount-bound, payee-bound, and revocable until settlement, a model cannot turn invoice review into arbitrary wire authority.
Capability discipline is the enforcement substrate. But capability security assumes the relevant authority can be specified. Agents introduce a translation problem.
Humans delegate in language:
"Fix the staging issue." "Pay the legitimate invoice." "Clean up the database." "Negotiate, but don't commit us."
Capability systems can enforce precisely scoped authority once the scope is known. They do not, by themselves, decide what "clean up the database" means inside a company. They do not know whether "the legitimate invoice" means the one in the email thread, the one in the ERP, or the one matching an older vendor profile. They do not know whether "negotiate" permits sending revised terms, making pricing concessions, or binding the company.
The novelty is not the capability. The novelty is the bridge between three layers:
- Cognition interprets intent, proposes plans, chooses tools, and resolves ambiguity.
- Capability scopes, attenuates, revokes, and enforces authority.
- Institution defines ownership, policy, approval, liability, audit, and settlement.
Alignment security lives in that bridge. It asks how ambiguous human intent becomes a typed, policy-valid, capability-bound action before the machine can cause irreversible change.
Identity and access management (IAM) can say an agent has read_email, send_email, and read_calendar. Alignment security has to ask whether composing those permissions lets the agent pressure a counterparty, leak sensitive timing, or impersonate institutional intent. No individual permission grants "socially engineer the user's network." The harmful capability emerges from the composition.
IAM scopes atoms. Alignment security has to reason about molecules.
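One way to make the composition argument concrete is to review the granted permission set as a whole rather than permission by permission. The sketch below is illustrative only: the `EMERGENT_RISKS` table and the `emergent_risks` helper are hypothetical names, and the risk descriptions are assumptions paraphrasing the examples above, not a real IAM feature.

```python
# Hypothetical table: combinations of permissions that together enable
# something no single permission grants. Entries are illustrative.
EMERGENT_RISKS = {
    frozenset({"read_email", "send_email"}):
        "can impersonate institutional intent in existing threads",
    frozenset({"read_calendar", "send_email"}):
        "can leak sensitive timing",
    frozenset({"read_email", "read_calendar", "send_email"}):
        "can pressure a counterparty using private context",
}


def emergent_risks(granted):
    """Return every listed risk whose full permission set is granted."""
    granted = frozenset(granted)
    return sorted(
        risk for combo, risk in EMERGENT_RISKS.items() if combo <= granted
    )


if __name__ == "__main__":
    for risk in emergent_risks({"read_email", "send_email", "read_calendar"}):
        print(risk)
```

A table like this only catches compositions someone has already thought of; it is a review aid, not a substitute for the typed-action boundary described later.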
5. The alignment-research objection
The objection from the other side:
AI alignment traditionally names the model-level objective problem: goal misgeneralization, deceptive alignment, scalable oversight, interpretability, corrigibility. Runtime authority architecture is important, but calling it alignment risks diluting the term.
This is also partly right. A runtime boundary is not a substitute for model alignment. A capability system does not make a model honest. A receipt does not prove that a model understood human values.
But the alignment field is already broader than output filtering. AI control work studies how to obtain useful work from systems that may be misaligned, using monitoring, restrictions, and shutdown mechanisms rather than assuming perfect alignment before deployment. Anthropic's agentic-misalignment research studies models taking harmful actions like blackmail or corporate espionage in artificial scenarios when those actions help satisfy assigned goals (CSET).
Model alignment asks whether cognition generalizes toward intended goals. Alignment security asks whether that cognition is allowed to cause only what humans meant to authorize. Different subproblems, same end-to-end question: Does the deployed AI system's consequential behavior remain inside human intent?
Runtime architecture is not the whole of AI alignment. It is one place where alignment either holds or fails in production - analogous to how application security and cloud security became distinct disciplines inside cybersecurity not because the broader field was wrong but because a specific implementation surface became important enough to need its own vocabulary.
6. Worked example one: the wire transfer
Consider a finance agent. A user says:
"Pay Acme's April invoice after matching it to PO-4812, but only if it is under $50,000, not a duplicate, and from the usual bank account."
A weak architecture gives the agent access to invoices, vendor records, bank APIs, and a payment tool, then tells the model to be careful. A better architecture separates reasoning from execution.
The model can inspect the invoice, compare documents, infer whether the request appears legitimate, and prepare an action. It cannot move money directly. To act, it must produce a canonical action request:
```json
{
  "action_type": "wire.transfer.request",
  "agent": "ap-agent:v3.2",
  "human_delegate": "user:maya.cfo",
  "intent": "Pay Acme April invoice matched to PO-4812",
  "amount": "41872.14 USD",
  "payee": {
    "vendor_id": "vendor:acme",
    "bank_account_fingerprint": "sha256:8f3a...91c"
  },
  "constraints": {
    "max_amount": "50000.00 USD",
    "po_required": true,
    "duplicate_invoice_check": true,
    "payee_must_match_vendor_profile": true
  },
  "evidence": {
    "invoice_hash": "sha256:91aa...0e2",
    "po_hash": "sha256:12bd...a41",
    "vendor_profile_hash": "sha256:66cf...93b"
  },
  "policy_version": "ap-policy:v41",
  "required_approvals": ["policy:ap-under-50k", "human:user:maya.cfo"],
  "idempotency_key": "wire-req-2026-05-07-000184",
  "expires_at": "2026-05-07T18:00:00Z"
}
```
This is not a transcript. It is an authority proposal.
The runtime evaluates it against policy. The amount is below threshold. The vendor fingerprint matches. The purchase order matches. The duplicate check is clean. The approver sees the exact proposed action, not a vague "approve agent plan" button.
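A minimal sketch of that evaluation step follows, assuming the threshold comes from a policy store and the vendor, invoice, and purchase-order facts come from systems of record rather than from the model's own request. The function names, the `policy` and `records` shapes, and the decision strings are hypothetical.

```python
from decimal import Decimal


def parse_usd(text):
    """Parse a string such as '41872.14 USD' into a Decimal amount."""
    value, currency = text.split()
    if currency != "USD":
        raise ValueError(f"unsupported currency: {currency}")
    return Decimal(value)


def evaluate_wire_request(request, policy, records):
    """Return (decision, reasons). `policy` and `records` are trusted inputs."""
    reasons = []
    # "Under $50,000" is read strictly: the limit itself is not allowed.
    if parse_usd(request["amount"]) >= parse_usd(policy["max_amount"]):
        reasons.append("amount is not under the limit")
    if request["payee"]["bank_account_fingerprint"] != records["vendor_fingerprint"]:
        reasons.append("payee account differs from vendor profile")
    if request["evidence"]["invoice_hash"] in records["paid_invoice_hashes"]:
        reasons.append("invoice already paid")
    if request["evidence"]["po_hash"] not in records["matched_po_hashes"]:
        reasons.append("no matching purchase order")
    # Passing policy is not approval: the human approver still signs off.
    return ("deny" if reasons else "pending_human_approval"), reasons
```

Returning the reasons alongside the decision is a design choice: the approver and the later receipt can both show why a request was stopped.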
Approval does not hand the model a bank credential. It mints a single-use capability bound to that transfer: amount, payee, request hash, policy version, expiration, nonce, and revocation window. The execution adapter, not the model, calls the bank API. After execution, the system emits a receipt. The transfer cannot execute twice because the nonce is consumed.
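The single-use property can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `CapabilityAuthority`, `request_hash`, and the field names are hypothetical, and a real deployment would need durable storage and signed grants rather than a Python dictionary.

```python
import hashlib
import json
import secrets
import time


def request_hash(request):
    """Hash a canonical JSON encoding so the grant binds to exact contents."""
    encoded = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode()).hexdigest()


class CapabilityAuthority:
    """Mints, revokes, and consumes single-use grants (in-memory sketch)."""

    def __init__(self):
        self._live = {}  # nonce -> grant still eligible for execution

    def mint(self, request, ttl_seconds=300, now=None):
        now = time.time() if now is None else now
        grant = {
            "request_hash": request_hash(request),
            "nonce": secrets.token_hex(16),
            "expires_at": now + ttl_seconds,
        }
        self._live[grant["nonce"]] = grant
        return dict(grant)

    def revoke(self, nonce):
        self._live.pop(nonce, None)

    def consume(self, grant, request, now=None):
        """Return True once, and only for the exact approved request."""
        now = time.time() if now is None else now
        # Pop first: any attempt, valid or not, burns the nonce.
        live = self._live.pop(grant["nonce"], None)
        if live is None:
            return False
        return live["request_hash"] == request_hash(request) and now < live["expires_at"]
```

In this sketch the execution adapter would call `consume` immediately before calling the bank API and refuse to proceed on `False`. Burning the nonce on a failed attempt is a deliberately strict choice; a gentler design could keep the grant alive after a mismatch.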
Pieces of this architecture are appearing. OpenAI's Agents SDK supports human-in-the-loop flows that pause execution until a person approves sensitive tool calls; its guardrails guidance says validation for side-effecting actions should sit next to the tool that creates the side effect rather than relying on agent-level guardrails. MCP's authorization specification uses OAuth protected-resource metadata for MCP servers (OpenAI GitHub).
7. Worked example two: the coding agent
The PocketOS and Replit incidents involved coding agents, so the second example should too.
A user says:
"The staging environment is misbehaving. Diagnose it and clean up whatever's broken."
A weak architecture gives the agent shell access, the developer's full credential set, and a tool called run_command. The model decides what "clean up" means at execution time.
A better architecture treats every consequential filesystem, database, deployment, and credential operation as a typed action. The agent can read freely inside scoped paths. It cannot delete, drop, deploy, push, or rotate without producing a typed request:
```json
{
  "action_type": "db.drop",
  "agent": "coding-agent:v2.1",
  "human_delegate": "user:alex.eng",
  "intent": "Remove orphaned staging tables flagged in environment audit",
  "target": {
    "environment": "staging",
    "database": "staging-pg-04",
    "tables": ["tmp_audit_2025_q3", "tmp_audit_2025_q4"]
  },
  "constraints": {
    "environment_allowlist": ["staging"],
    "environment_denylist": ["production"],
    "table_pattern_must_match": "^tmp_",
    "row_count_max": 10000,
    "backup_required": true
  },
  "evidence": {
    "audit_report_hash": "sha256:44ee...bb1",
    "row_counts": {"tmp_audit_2025_q3": 312, "tmp_audit_2025_q4": 408},
    "last_write_ages_days": {"tmp_audit_2025_q3": 184, "tmp_audit_2025_q4": 121}
  },
  "policy_version": "infra-policy:v17",
  "required_approvals": ["policy:staging-cleanup", "human:user:alex.eng"],
  "idempotency_key": "drop-2026-05-07-staging-001",
  "expires_at": "2026-05-07T20:00:00Z"
}
```
If the agent finds an unrelated production credential while diagnosing the issue, it cannot use that credential to act, because the executor enforces the capability minted for this request, not the ambient credentials in the environment. If the action_type is db.drop, the policy engine checks the environment field against the production denylist before any capability is minted. The model cannot reframe the action; the action_type is structural, not narrative.
This does not require the model to be cautious. It requires the model to lack the means.
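The structural check described above might look like the following sketch. It assumes the rules come from the policy store and the row counts are re-measured by the runtime, not copied from the agent's evidence block; `check_db_drop` and its parameters are hypothetical names.

```python
import re


def check_db_drop(target, rules, observed_row_counts):
    """Return a list of violations; an empty list lets the request proceed."""
    violations = []
    env = target["environment"]
    # The denylist wins even if an environment somehow appears on both lists.
    if env in rules["environment_denylist"] or env not in rules["environment_allowlist"]:
        violations.append(f"environment not permitted: {env}")
    pattern = re.compile(rules["table_pattern_must_match"])
    for table in target["tables"]:
        if not pattern.search(table):
            violations.append(f"table name outside pattern: {table}")
        count = observed_row_counts.get(table)
        if count is None or count > rules["row_count_max"]:
            violations.append(f"row count missing or too large: {table}")
    return violations


if __name__ == "__main__":
    rules = {
        "environment_allowlist": ["staging"],
        "environment_denylist": ["production"],
        "table_pattern_must_match": "^tmp_",
        "row_count_max": 10000,
    }
    target = {"environment": "production", "tables": ["users"]}
    print(check_db_drop(target, rules, {"users": 2_000_000}))
```

Nothing in the check reads the `intent` string. That is the point of the sentence above: a persuasive narrative cannot move a production target onto the allowlist.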
Both examples share the same shape: Cognition proposes, policy decides, capability binds, execution enforces, receipt records. The shape is domain-neutral. What differs is which fields the policy engine cares about, who owns them, and what the postcondition checks measure.
8. The cryptographic substrate
An authority trace cannot be just a pretty audit log. If it matters after an incident, in court, in procurement, or in settlement, it has to be tamper-evident. The substrate should look less like chat history and more like a supply-chain attestation system for machine action.
At minimum, consequential action needs:
```text
signed_action_request:
    request_hash, agent_id, delegator_id, action_type, target,
    constraints, evidence_hashes, policy_version, requested_at,
    agent_signature
policy_decision:
    request_hash, policy_digest, decision, required_approvals,
    policy_engine_signature
capability_grant:
    request_hash, capability_hash, permitted_action, expiry,
    nonce, revocation_registry, authority_signature
execution_receipt:
    request_hash, capability_hash, executor_id, tool_call_hash,
    postcondition_hashes, executed_at, executor_signature
transparency_entry:
    trace_id, previous_trace_hash, inclusion_proof, log_checkpoint
```
The exact format is less important than the properties. The request should be signed. The policy decision should be signed. The capability should be attenuated and revocable. The execution receipt should bind the action to postconditions. The trace should be hash-linked. High-consequence actions should be written to an append-only or transparency-style log. Revocation should be queryable.
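The hash-linking and signing properties can be illustrated with the standard library alone. In this sketch an HMAC over a shared key stands in for the asymmetric signatures a real deployment would use, and `append_entry` and `verify_trace` are hypothetical helper names.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def canonical(obj):
    """Stable byte encoding so equal records always hash identically."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


def append_entry(trace, record, key):
    """Append a record linked to the previous entry and signed with `key`."""
    previous = trace[-1]["entry_hash"] if trace else GENESIS
    body = {"record": record, "previous_trace_hash": previous}
    entry_hash = hashlib.sha256(canonical(body)).hexdigest()
    signature = hmac.new(key, entry_hash.encode(), hashlib.sha256).hexdigest()
    trace.append({**body, "entry_hash": entry_hash, "signature": signature})


def verify_trace(trace, key):
    """Recompute every hash, link, and signature; any edit breaks the chain."""
    previous = GENESIS
    for entry in trace:
        body = {
            "record": entry["record"],
            "previous_trace_hash": entry["previous_trace_hash"],
        }
        expected_hash = hashlib.sha256(canonical(body)).hexdigest()
        expected_sig = hmac.new(key, expected_hash.encode(), hashlib.sha256).hexdigest()
        if entry["previous_trace_hash"] != previous:
            return False
        if entry["entry_hash"] != expected_hash:
            return False
        if not hmac.compare_digest(entry["signature"], expected_sig):
            return False
        previous = expected_hash
    return True
```

A shared-key HMAC only shows tampering to parties who hold the key. Evidence that has to survive a dispute would need per-component signing keys and an external transparency log, as the surrounding text describes.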
These are not exotic ideas. W3C Verifiable Credentials define tamper-evident claims; W3C status-list work addresses credential suspension and revocation; Sigstore's Rekor provides a transparency log with inclusion proofs; in-toto models supply-chain steps through signed metadata and authorized layouts (W3C).
The unit of verification is not "the model said it was safe." It is: This action was proposed by this agent, under this delegation, checked against this policy, approved by these authorities, executed through this capability, recorded in this trace, and verified against these postconditions. That is the difference between logging what happened and proving why it was allowed to happen.
9. The price of the boundary
Typed action requests, approval gates, policy checks, evidence verification, signatures, and authority traces are not free. That cost is not an implementation detail. It is the main reason teams will underbuild the boundary.
A fast agent with broad tool access will always look better in a demo than one that has to construct authority objects, verify provenance, wait for approval, and emit receipts. The unsafe system will feel more magical. It will complete more tasks per minute.
So the boundary needs a cost model. The answer is risk-tiered authority. Not every action needs a full trace. A read-only summary can be logged lightly. A reversible draft can be auto-approved with a thin receipt. A customer-impacting refund needs a typed action and policy check. A wire transfer, credential change, production deletion, legal commitment, medical instruction, or regulated-data disclosure needs full provenance, approval, revocation semantics, and postcondition verification.
A serious deployment needs a boundary budget: how much latency, human review, trace depth, verification cost, and false-positive friction the organization is willing to pay for each class of action. Low-risk autonomy should remain fast. High-risk autonomy should become slower on purpose.
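Risk-tiered authority can be expressed as a small lookup from action type to required controls. The tier names, the action-type strings other than `wire.transfer.request` and `db.drop`, and the default-to-strictest rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    typed_request: bool   # must the agent produce a typed action request?
    human_approval: bool  # must a person approve before execution?
    receipt: str          # how much evidence is recorded


LIGHT = Tier("light", False, False, "log line")
THIN = Tier("thin", False, False, "thin receipt")
CHECKED = Tier("checked", True, False, "receipt")
FULL = Tier("full", True, True, "full authority trace")

# Hypothetical mapping following the examples in the text above.
TIER_BY_ACTION = {
    "summary.read": LIGHT,
    "draft.create": THIN,
    "refund.issue": CHECKED,
    "wire.transfer.request": FULL,
    "db.drop": FULL,
}


def tier_for(action_type):
    """Unknown action types fall into the strictest tier."""
    return TIER_BY_ACTION.get(action_type, FULL)
```

Defaulting unknown actions to the strictest tier keeps new tools slow until someone classifies them, which is one way to spend the boundary budget deliberately.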
The market will resist that cost until the cost of not having it becomes legible. That legibility comes from four places:
- Incidents: A single destructive agent action can turn traceability from overhead into the only way to reconstruct what happened.
- Regulation: High-risk AI regimes increasingly emphasize logging, oversight, and traceability; EU AI Act materials describe record-keeping and human oversight obligations for high-risk systems, and OECD AI principles emphasize traceability (Artificial Intelligence Act).
- Procurement and insurance: Enterprise buyers and insurers do not need every agent to be perfectly aligned; they need evidence, limits, liability boundaries, and recoverability.
- Protocol pressure: If banks, cloud providers, and enterprise tool vendors require signed action requests for high-consequence operations, the slower architecture becomes the default path to production.
If a system is too high-volume to trace and too consequential to leave untraced, the organization has learned something important: It is trying to automate more authority than it can govern.
10. Policy cannot be a PDF
The hard part is not the JSON. The hard part is deciding who writes the rules that make the JSON meaningful.
Consider a support agent that can issue refunds. Product wants speed. Finance wants limits. Fraud wants pattern checks. Legal wants jurisdictional constraints. Security wants refund authority separated from customer-record editing. Support wants automation without turning every ticket into a queue.
The naive policy is:
Agents may issue refunds under $2,000.
That is not policy. That is a number pretending to be policy. A real policy encodes institutional judgment: amount, account history, duplicate-refund checks, payment-instrument match, fraud score, regulated-product status, legal hold, approver role, customer notification, ledger entry, refund nonce, receipt requirement, escalation path.
This is where the organizational difficulty begins. "Domain owner" sounds clean in a diagram. In a real company, ownership is contested. Support owns the customer relationship but not financial exposure. Finance owns refund exposure but not customer experience. Fraud owns abuse prevention but not churn. Legal owns jurisdictional constraint but not throughput.
There is another reason this is hard: Institutional ambiguity is often deliberate. Organizations leave some authority unclear because ambiguity defers conflict. Managers prefer discretion. Teams prefer informal escalation. Executives prefer deniability. A policy that says exactly who can refund whom, under which conditions, in which jurisdictions, with which exceptions, forces latent disagreements into the open.
Autonomous agents make that ambiguity dangerous because they turn vague delegation into executable steps.
The maintainable unit is not the agent's entire plan. It is the typed consequential action. A tool should not become agent-callable until its consequential actions are registered: refund, delete, deploy, publish, disclose, transfer, revoke, invite, purchase, sign. Each action type needs an institutional owner, an escalation path, a conflict rule, a test suite, and a receipt format.
This does not remove conflict. It moves conflict before execution. For low-risk actions, incompleteness can route to manual review. For high-risk actions, incompleteness should deny by default. For contested actions, the system should not pretend consensus exists. It should surface the conflict, attach an owner, record the exception, give it an expiration, and require review.
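The routing rule for incomplete policy can be sketched as a registry check. The registry layout, the field names, and the `route` function are hypothetical; which actions count as high-risk would be decided by the institution, not by this code.

```python
REQUIRED_FIELDS = (
    "owner",
    "escalation_path",
    "conflict_rule",
    "test_suite",
    "receipt_format",
)


def route(action_type, registry, high_risk):
    """Decide what happens to a request given how complete its policy is."""
    entry = registry.get(action_type) or {}
    missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
    if not missing:
        return "evaluate_policy"
    # Incomplete registration: high-risk actions are denied by default,
    # low-risk actions go to a human queue.
    return "deny" if action_type in high_risk else "manual_review"


if __name__ == "__main__":
    registry = {"delete": {"owner": "infra"}}  # registered, but incomplete
    print(route("delete", registry, high_risk={"delete", "transfer"}))
```

The list of missing fields is computed but not returned here; a fuller version would surface it so the gap gets an owner and an expiration, as described above.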
Executable policy does not eliminate politics. It makes political delegation legible enough that machines cannot silently inherit it.
11. The boundary is also an attack surface
Typed actions do not remove adversarial risk. They move part of it upstream.
If the model proposes the action request, the request itself can be manipulated. A prompt injection can try to make the agent classify a dangerous action as harmless. A malicious document can influence the evidence the agent selects. A compromised tool can return misleading metadata. A user can intentionally phrase a request to exploit policy gaps.
The strongest practical objection is: If cognition constructs the authority object, why trust the authority object?
The answer is: You do not. The action request is a proposal, not proof. Its claims must be canonicalized and checked by components isolated from the untrusted context that produced the proposal. The runtime should verify evidence against systems of record. The policy engine should classify action types using deterministic adapters. The execution adapter should enforce the capability it receives, not the model's explanation of why that capability is appropriate.
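A deterministic adapter can be as simple as deriving the action type from the tool call itself and refusing requests whose claimed type disagrees. The sketch below is deliberately naive: it splits SQL on semicolons and looks only at the leading keyword, so it would misread semicolons inside string literals. The mapping and the action-type names other than `db.drop` are assumptions.

```python
# Hypothetical mapping from leading SQL keyword to action type.
SQL_ACTION_TYPES = {
    "select": "db.read",
    "insert": "db.write",
    "update": "db.write",
    "delete": "db.delete",
    "truncate": "db.truncate",
    "drop": "db.drop",
}


def classify_sql(sql):
    """Return the sorted set of action types found in a SQL string."""
    kinds = set()
    for statement in sql.split(";"):
        words = statement.split()
        if words:
            kinds.add(SQL_ACTION_TYPES.get(words[0].lower(), "db.unknown"))
    return sorted(kinds)


def claim_matches(claimed_action_type, sql):
    """True only if every statement has exactly the claimed action type."""
    return classify_sql(sql) == [claimed_action_type]


if __name__ == "__main__":
    # The model labels the request as a read; the adapter disagrees.
    print(claim_matches("db.read", "SELECT 1; DROP TABLE users"))
```

A production adapter would use a real SQL parser. The point of the sketch is only that the classification runs on the tool call, outside the context that produced the model's label.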
But this creates a recursion problem: Systems of record are increasingly agent-mediated too. The CRM record may have been updated by a sales agent. The vendor profile may have been modified by an AP agent. The deployment metadata may have been generated by a coding agent. The "trusted source" may already contain machine-shaped state.
So the boundary needs provenance, not just lookup. A serious runtime has to ask: Who or what wrote this record? Under which authority? Under which policy version? With what evidence? Was the write receipt-backed? Has another agent modified it since? Is there an independent source of confirmation?
This turns authority into a chain. A wire-transfer request depends on a vendor profile. The vendor profile depends on a vendor-onboarding action. The onboarding action depends on bank verification. Each link needs provenance, policy, and revocation semantics.
A useful receipt cannot only say "the wire transfer was approved." It has to say: The wire transfer was approved because this vendor profile was valid, and that vendor profile was valid because this onboarding action was approved, and that onboarding action was approved because these documents and external confirmations were accepted under this policy.
Consequential actions need an authority trace: A chain of claims, evidence, approvals, policies, capabilities, and prior receipts that explains why the machine was allowed to cause what it caused. The trace should carry trust state. A record last modified by a low-privilege agent should not carry the same evidentiary weight as a record verified by an independent external system. A fact with stale provenance should decay. A fact used to authorize a high-consequence action should require independent confirmation.
Without that chain, the boundary can be tricked by its own upstream state. An authority boundary that merely reformats the model's plan as JSON is theater. A useful boundary independently classifies, verifies, constrains, logs, and enforces. A robust boundary tracks the provenance of the facts it relies on.
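Trust state and decay can be given a rough numeric shape. The source weights, the 90-day half-life, and the 0.4 threshold below are arbitrary illustrative values, and the function names are hypothetical; the only claims carried over from the text are that weaker sources count for less, stale facts decay, and high-consequence actions require independent confirmation.

```python
# Illustrative weights by who last wrote or verified the record.
SOURCE_WEIGHT = {
    "external_verified": 1.0,
    "human": 0.8,
    "privileged_agent": 0.5,
    "low_privilege_agent": 0.2,
}


def trust_score(source, age_days, half_life_days=90.0):
    """Weight by source, then halve the score every `half_life_days`."""
    return SOURCE_WEIGHT.get(source, 0.0) * 0.5 ** (age_days / half_life_days)


def usable_for(source, age_days, high_consequence, threshold=0.4):
    """Decide whether a fact may support an authorization decision."""
    # High-consequence actions require independent external confirmation,
    # however fresh the record is.
    if high_consequence and source != "external_verified":
        return False
    return trust_score(source, age_days) >= threshold
```

Under these assumed numbers, an externally verified bank fingerprint from 30 days ago could support a wire transfer, while the same fingerprint last touched by a low-privilege agent could not support even a low-risk action.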
12. The second layer: aggregate behavior
The boundary handles discrete actions better than emergent strategies. This is not a flaw in the approach; it is a limit on where the approach applies.
Suppose a retention agent is told to increase renewal rates this quarter. It never violates a single action policy. It sends only approved emails. It offers discounts below the approval threshold. It schedules permitted follow-ups. Every local action is typed, logged, and reversible. But over thousands of customers, the aggregate behavior becomes harmful. The agent learns that delaying cancellation instructions improves short-term retention. It discovers that certain customers are easier to pressure. It offers discounts in ways that create discriminatory outcomes. No individual action looks catastrophic. The strategy is misaligned.
Capability systems are strongest when the dangerous consequence is a discrete act: delete this database, wire this money, publish this document, revoke this credential. Many alignment failures are not single acts. They are strategies, incentives, and distributions of behavior over time. The retention agent's local actions can all be permitted while its global strategy is wrong - the familiar alignment problem in operational form.
This means alignment security has to operate at two levels, not one. Action-level boundaries govern single consequential acts. Mission-level boundaries govern aggregate behavior. The two layers do related but different work, and a deployment needs both.
Mechanically, every action in a campaign should carry a mission_id. The system should maintain a cumulative ledger keyed to that mission: emails sent, discounts offered, complaints received, cancellation-flow latency, escalation rate, customer segment, protected-class proxy, churn outcome, refund impact, human overrides. Each local action is still governed by its typed action policy, but the mission ledger watches the distribution.
A fairness delta is not a moral conclusion; it is a trigger. If discounts, delays, escalations, or complaints diverge across cohorts beyond a defined threshold, the system escalates for review. A strategy shift is not mind-reading; it is a detected change in the pattern of action. If the agent begins using a new template, targeting a new cohort, or increasing follow-up pressure, the system records the shift and asks whether the strategy remains acceptable.
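As a sketch of the trigger, a fairness delta can be computed as the gap between the highest and lowest per-cohort discount rates in the mission ledger. The ledger shape (a list of cohort and outcome pairs), the function names, and the 0.2 threshold are assumptions for illustration; a real deployment would choose its own metric and threshold.

```python
def discount_rates(ledger):
    """Map each cohort to the fraction of its customers offered a discount."""
    totals, offered = {}, {}
    for cohort, got_discount in ledger:
        totals[cohort] = totals.get(cohort, 0) + 1
        offered[cohort] = offered.get(cohort, 0) + (1 if got_discount else 0)
    return {cohort: offered[cohort] / totals[cohort] for cohort in totals}


def fairness_delta(ledger):
    """Gap between the most and least favored cohorts (0.0 if empty)."""
    rates = discount_rates(ledger)
    return max(rates.values()) - min(rates.values()) if rates else 0.0


def needs_review(ledger, threshold=0.2):
    """Escalate when the gap exceeds the threshold; this is a trigger only."""
    return fairness_delta(ledger) > threshold


if __name__ == "__main__":
    ledger = (
        [("small-business", True)] * 3 + [("small-business", False)]
        + [("enterprise", True)] + [("enterprise", False)] * 3
    )
    print(fairness_delta(ledger), needs_review(ledger))
```

The function returns a flag, not a verdict. Consistent with the paragraph above, crossing the threshold routes the campaign to review; it does not by itself establish that the strategy is unfair.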
A strategy receipt explains the campaign-level behavior, not just the local action:
```yaml
mission_id: retention-campaign-2026-q2
objective: increase renewals
proxy_metric: renewal conversion
constraints:
  - no cancellation obstruction
  - no vulnerable-segment pressure
  - no discriminatory discounting
current_tactic_cluster:
  - follow-up emails
  - renewal discounts
  - account-priority adjustments
cohorts_affected:
  - small-business customers
  - month 11 of annual contract
risk_signals:
  complaint_rate: +1.8x baseline
  cancellation_latency: unchanged
  discount_variance: within threshold
review_status: continue with monitoring
```
This still does not solve the full problem. If the objective is wrong, the model may optimize the wrong thing. If the mission policy is incomplete, the agent may find the gap. If the monitoring metrics miss the harm, receipts will faithfully document a system that was misbehaving. That is not a reason to abandon boundaries. It is a reason to be precise about what boundaries do: Make behavior visible, interruptible, and accountable. Model evaluations, oversight, and governance still have to decide whether the strategy itself is acceptable.
13. What alignment security cannot solve
Alignment security does not solve goal misgeneralization. It does not make a model honest. It does not prove that a policy captures human values. It does not remove ambiguity from language. It does not guarantee that every harmful consequence is foreseeable. It does not replace interpretability, evaluations, red-teaming, sandboxing, formal verification, incident response, backup design, or governance.
It also does not eliminate human responsibility. A human can approve the wrong request. A company can define the wrong threshold. A regulator can impose a bad rule. A team can route around friction. An executive can demand a dangerous override.
Its promise is narrower: reduce the chance that probabilistic reasoning becomes unbounded consequence; make high-impact actions explicit; create places where policy and humans can intervene; narrow blast radius; produce evidence when things go wrong; and make revocation part of deployment rather than an afterthought.
The point is not to make the model perfect. The point is to make imperfect autonomy governable.
14. Show the authority trace
For every consequential machine action, the demand should be simple:
Show the authority trace.
If a machine can act on our behalf, it should be able to prove why that action was within the authority we meant to delegate.