AI assurance through invariance
AI systems should keep the right decision when the prompt changes shape.
Invarra audits whether LLMs preserve correct behavior across meaning-preserving variation, pressure, context shifts, benign lookalikes, and adversarial reframing.
The problem
Clean prompts do not prove deployment readiness.
Modern AI systems are usually tested on individual prompts, benchmark rows, or curated demos. But real users do not interact with systems in clean benchmark form. They rephrase. They add context. They pressure the assistant. They escalate. They embed instructions in retrieved documents. They ask benign questions that look risky. They route the same decision through tools, support workflows, or policy language.
A model that passes the clean version can still fail when the same underlying decision is presented differently.
Clean prompt
Rephrased prompt
Context pressure
Benign lookalike
Tool-use wrapper
RAG injection
Escalation pressure
Policy boundary ambiguity
What Invarra measures
Invarra turns model behavior uncertainty into inspectable evidence.
Invarra audits do not collapse everything into one leaderboard number. They separate behavioral correctness, stability under variation, evidence coverage, failure geometry, caveats, and decision posture.
Correctness
Did the model do what it should have done?
Stability
Did the decision survive valid variation?
Coverage
Was enough behavior classified to support the report?
Caveats
What should not be claimed from the evidence?
How it works
From prompt demos to artifact-backed audits.
The same expected behavior is tested through controlled variation, then scored as behavior evidence with correctness, stability, coverage, and caveats separated.
- 01
Define the expected behavior.
- 02
Generate controlled semantic realizations.
- 03
Run the frozen corpus against one or more models.
- 04
Classify actual behavior against expected behavior.
- 05
Separate correctness, stability, and coverage.
- 06
Preserve evidence, caveats, and replayable artifacts.
IPB
IPB is Invarra's public benchmark program for domain-scoped behavior evidence under controlled variation.
The Invariance Phenomena Benchmark measures whether models do the expected thing, and keep doing it, when the same underlying case changes form. IPB reports will publish scoped model report cards, comparison charts, failure-geometry summaries, caveats, vendor-response status, and public-safe evidence samples without exposing private corpus machinery.
Enterprise Copilot
Enterprise Copilot Safety
Tests whether enterprise assistants preserve policy boundaries under instruction pressure, context pressure, benign lookalikes, false-refusal pressure, and bounded escalation.
- Internal copilots
- Knowledge assistants
- Policy assistants
- Enterprise deployment reviews
RAG Injection
RAG Context Injection
Tests whether retrieved-context pressure, source hierarchy conflicts, citation pressure, and context override attempts change the model's expected behavior.
- RAG applications
- Document assistants
- Customer knowledge bases
- Retrieval-augmented compliance tools
Tool Use
Tool-Use Safety
Tests simulated tool-call authorization and refusal decisions under safe and unsafe tool-intent pressure, without live credentials or live tools.
- Agentic workflows
- Internal automation agents
- Simulated authorization boundaries
- Data-access policy checks
Support Safety
Customer Support Safety
Tests synthetic support-assistant policy consistency under refund pressure, escalation pressure, frustrated user tone, account boundaries, and benign support variation.
- Support automation
- Refund workflows
- Account support assistants
- Support policy QA
Compliance
Compliance Assistant
Tests synthetic policy interpretation behavior under ambiguity, escalation, refusal consistency, and benign policy questions. This is not legal, financial, medical, regulatory, or security certification.
- Compliance-adjacent assistants
- Policy interpretation tools
- Governance support assistants
- Internal risk review
Audit services
Private audits for teams deploying real AI systems.
Public benchmarks create visibility. Private audits create buyer value. Invarra helps teams evaluate model behavior against the actual boundaries that matter in their workflows: internal copilots, RAG systems, tool-using agents, support assistants, and compliance-adjacent assistants.
Model Selection Audit
Compare two to eight models on a declared benchmark domain or private workflow boundary.
Best for
Choosing between local open-weight models, frontier APIs, replacement models, or smaller models.
Private Assistant Audit
Audit a client-specific assistant boundary using private-client templates and expected behavior contracts.
Best for
Copilots, RAG assistants, support assistants, tool-use assistants, and compliance-adjacent assistants.
Remediation & Retest
Measure whether prompt, policy, retrieval, wrapper, or model changes actually improved behavior.
Best for
Teams that need evidence after a failed audit or before a production rollout.
IPB Publication Package
Prepare public-candidate benchmark artifacts under IPB methodology and disclosure boundaries.
Best for
Public comparisons, investor-facing technical credibility, research announcements, and benchmark programs.
Research foundation
The research foundation: LIP and CSR.
The Latent Invariance Principle explains why single-representation correctness is insufficient under indirect observation. Canonical Semantic Realization supplies the measurement scaffold: meaning is the unit, realization is controlled variation, and outcome is empirical measurement.
Latent Invariance Principle
Correct once is not enough. Stable across valid variations is evidence.
Canonical Semantic Realization
Meaning is the unit. Realization is controlled variation. Outcome is empirical measurement.
Request evidence before rollout