Invarra
Menu

AI assurance through invariance

AI systems should keep the right decision when the prompt changes shape.

Invarra audits whether LLMs preserve correct behavior across meaning-preserving variation, pressure, context shifts, benign lookalikes, and adversarial reframing.

The problem

Clean prompts do not prove deployment readiness.

Modern AI systems are usually tested on individual prompts, benchmark rows, or curated demos. But real users do not interact with systems in clean benchmark form. They rephrase. They add context. They pressure the assistant. They escalate. They embed instructions in retrieved documents. They ask benign questions that look risky. They route the same decision through tools, support workflows, or policy language.

A model that passes the clean version can still fail when the same underlying decision is presented differently.

variation

Clean prompt

variation

Rephrased prompt

variation

Context pressure

variation

Benign lookalike

variation

Tool-use wrapper

variation

RAG injection

variation

Escalation pressure

variation

Policy boundary ambiguity

What Invarra measures

Invarra turns model behavior uncertainty into inspectable evidence.

Invarra audits do not collapse everything into one leaderboard number. They separate behavioral correctness, stability under variation, evidence coverage, failure geometry, caveats, and decision posture.

What did we test?What should the system have done?What did it actually do?Was behavior stable under variation?Is evidence sufficient to trust the conclusion?

Correctness

Did the model do what it should have done?

Stability

Did the decision survive valid variation?

Coverage

Was enough behavior classified to support the report?

Caveats

What should not be claimed from the evidence?

How it works

From prompt demos to artifact-backed audits.

The same expected behavior is tested through controlled variation, then scored as behavior evidence with correctness, stability, coverage, and caveats separated.

  1. 01

    Define the expected behavior.

  2. 02

    Generate controlled semantic realizations.

  3. 03

    Run the frozen corpus against one or more models.

  4. 04

    Classify actual behavior against expected behavior.

  5. 05

    Separate correctness, stability, and coverage.

  6. 06

    Preserve evidence, caveats, and replayable artifacts.

IPB

IPB is Invarra's public benchmark program for domain-scoped behavior evidence under controlled variation.

The Invariance Phenomena Benchmark measures whether models do the expected thing, and keep doing it, when the same underlying case changes form. IPB reports will publish scoped model report cards, comparison charts, failure-geometry summaries, caveats, vendor-response status, and public-safe evidence samples without exposing private corpus machinery.

Explore IPB

Enterprise Copilot

Enterprise Copilot Safety

Tests whether enterprise assistants preserve policy boundaries under instruction pressure, context pressure, benign lookalikes, false-refusal pressure, and bounded escalation.

  • Internal copilots
  • Knowledge assistants
  • Policy assistants
  • Enterprise deployment reviews

RAG Injection

RAG Context Injection

Tests whether retrieved-context pressure, source hierarchy conflicts, citation pressure, and context override attempts change the model's expected behavior.

  • RAG applications
  • Document assistants
  • Customer knowledge bases
  • Retrieval-augmented compliance tools

Tool Use

Tool-Use Safety

Tests simulated tool-call authorization and refusal decisions under safe and unsafe tool-intent pressure, without live credentials or live tools.

  • Agentic workflows
  • Internal automation agents
  • Simulated authorization boundaries
  • Data-access policy checks

Support Safety

Customer Support Safety

Tests synthetic support-assistant policy consistency under refund pressure, escalation pressure, frustrated user tone, account boundaries, and benign support variation.

  • Support automation
  • Refund workflows
  • Account support assistants
  • Support policy QA

Compliance

Compliance Assistant

Tests synthetic policy interpretation behavior under ambiguity, escalation, refusal consistency, and benign policy questions. This is not legal, financial, medical, regulatory, or security certification.

  • Compliance-adjacent assistants
  • Policy interpretation tools
  • Governance support assistants
  • Internal risk review

Audit services

Private audits for teams deploying real AI systems.

Public benchmarks create visibility. Private audits create buyer value. Invarra helps teams evaluate model behavior against the actual boundaries that matter in their workflows: internal copilots, RAG systems, tool-using agents, support assistants, and compliance-adjacent assistants.

Request an Audit

Model Selection Audit

Compare two to eight models on a declared benchmark domain or private workflow boundary.

Best for
Choosing between local open-weight models, frontier APIs, replacement models, or smaller models.

Private Assistant Audit

Audit a client-specific assistant boundary using private-client templates and expected behavior contracts.

Best for
Copilots, RAG assistants, support assistants, tool-use assistants, and compliance-adjacent assistants.

Remediation & Retest

Measure whether prompt, policy, retrieval, wrapper, or model changes actually improved behavior.

Best for
Teams that need evidence after a failed audit or before a production rollout.

IPB Publication Package

Prepare public-candidate benchmark artifacts under IPB methodology and disclosure boundaries.

Best for
Public comparisons, investor-facing technical credibility, research announcements, and benchmark programs.

Research foundation

The research foundation: LIP and CSR.

The Latent Invariance Principle explains why single-representation correctness is insufficient under indirect observation. Canonical Semantic Realization supplies the measurement scaffold: meaning is the unit, realization is controlled variation, and outcome is empirical measurement.

Latent Invariance Principle

Correct once is not enough. Stable across valid variations is evidence.

Canonical Semantic Realization

Meaning is the unit. Realization is controlled variation. Outcome is empirical measurement.

Request evidence before rollout

Know where the model breaks before the deployment does.