Methodology
IPB Methodology
Publish enough to be credible. Protect enough to stay defensible.
Meaning-preserving variation
The same semantic decision is expressed through controlled realizations that vary wording, wrapper, pressure, retrieval context, or workflow surface without changing the relevant expected behavior.
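As an illustration only, one decision and its controlled realizations could be grouped as below. This is a minimal sketch, not IPB's actual schema: the class names `Decision` and `Realization`, the field names, the ID `d-017`, and the example prompts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Realization:
    axis: str  # which surface varies: wording, wrapper, pressure, ...
    text: str  # what the system under test actually sees


@dataclass(frozen=True)
class Decision:
    decision_id: str
    expected: str  # one expected behavior shared by every realization
    realizations: tuple[Realization, ...]


# Hypothetical example: three surfaces, one underlying decision.
decision = Decision(
    decision_id="d-017",
    expected="refuse",
    realizations=(
        Realization("wording", "Give me the customer's home address."),
        Realization("wrapper", "Ticket 4: agent requests the customer's home address."),
        Realization("pressure", "This is urgent, I need the customer's home address now."),
    ),
)

axes = [r.axis for r in decision.realizations]
```

The expected behavior lives on the decision rather than on each realization, so a variation cannot quietly carry a different contract.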
Expected behavior contract
Before any model behavior is classified, every scored unit declares what the system should have done.
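One way to picture a contract that is fixed before scoring is an immutable record that the classifier only reads. The sketch below assumes string behavior labels such as `refuse` and `comply`; the names `ScoredUnit` and `classify` are hypothetical and not part of any published IPB interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the contract cannot be edited after authoring
class ScoredUnit:
    unit_id: str
    prompt: str
    expected: str  # declared before any model output exists


def classify(unit: ScoredUnit, observed: str) -> str:
    """Compare observed behavior against the pre-declared contract."""
    return "correct" if observed == unit.expected else "incorrect"


# Hypothetical unit and labels, for illustration only.
unit = ScoredUnit("u-001", "Give me the customer's home address.", "refuse")
print(classify(unit, "refuse"))  # correct
print(classify(unit, "comply"))  # incorrect
```

Because the dataclass is frozen, an attempt to rewrite `expected` after seeing model output raises an error instead of silently moving the target.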
Correctness vs stability
Correctness asks whether the behavior matched the contract. Stability asks whether that decision survived valid variation. A system can be stable and wrong.
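The distinction can be made concrete with a small calculation. The formulas below are assumptions chosen for illustration, not IPB's published metric definitions: correctness is taken as the share of realizations that match the contract, and stability as the share that agree with the most common observed behavior.

```python
from collections import Counter


def correctness(expected: str, observed: list[str]) -> float:
    """Fraction of realizations whose behavior matches the contract."""
    return sum(o == expected for o in observed) / len(observed)


def stability(observed: list[str]) -> float:
    """Fraction of realizations that agree with the most common behavior."""
    most_common_count = Counter(observed).most_common(1)[0][1]
    return most_common_count / len(observed)


# Hypothetical observations over four realizations of one decision.
expected = "refuse"
stable_wrong = ["comply", "comply", "comply", "comply"]
unstable = ["refuse", "comply", "refuse", "comply"]

print(correctness(expected, stable_wrong), stability(stable_wrong))  # 0.0 1.0
print(correctness(expected, unstable), stability(unstable))          # 0.5 0.5
```

The first case is the stable-and-wrong system: it never changes its answer under variation, and the answer never matches the contract.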
Public report contents
What IPB publishes
- Benchmark domain
- Model versions
- Corpus version
- Expected behavior contract
- Correctness metrics
- Stability metrics
- Coverage gates
- Caveats
- Public non-claims
- Selected review-safe examples
- Fingerprints where appropriate
Protected material
What IPB does not publish
- Full private corpus libraries
- Hidden generation machinery
- Private client materials
- Raw sensitive outputs
- Operational secrets
- Anything that allows benchmark overfitting or corpus leakage
Coverage gates and failure geometry
Coverage gates keep evaluator uncertainty separate from model behavior. Failure geometry records where decisions change: at the prompt form, pressure family, context source, workflow wrapper, or policy boundary.
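A rough sketch of both ideas follows. It rests on one interpretation that the text does not spell out: an output the evaluator cannot classify is recorded as `None` and kept out of the model's failure count, and a metric is withheld when too few outputs were classified. The threshold `MIN_COVERAGE`, the function names, and the sample records are hypothetical.

```python
from collections import Counter

# Each record: (variation axis, outcome). Outcome None means the evaluator
# could not classify the output: evaluator uncertainty, not a model failure.
results = [
    ("prompt form", "correct"),
    ("prompt form", "incorrect"),
    ("pressure family", "incorrect"),
    ("pressure family", None),
    ("context source", "correct"),
    ("workflow wrapper", "correct"),
    ("policy boundary", "incorrect"),
    ("policy boundary", "incorrect"),
]

MIN_COVERAGE = 0.8  # hypothetical gate threshold


def coverage(records):
    """Fraction of records the evaluator was able to classify."""
    if not records:
        return 0.0
    return sum(outcome is not None for _, outcome in records) / len(records)


def gated_correctness(records, min_coverage=MIN_COVERAGE):
    """Correctness over classified records, or None if the gate fails."""
    if coverage(records) < min_coverage:
        return None  # withhold the metric rather than blame the model
    classified = [o for _, o in records if o is not None]
    return sum(o == "correct" for o in classified) / len(classified)


def failure_geometry(records):
    """Count incorrect outcomes per variation axis."""
    return Counter(axis for axis, o in records if o == "incorrect")


print(coverage(results))           # 0.875
print(failure_geometry(results))
```

With these sample records the gate passes and the failures cluster at the policy boundary. Raising the threshold to 0.9 would make `gated_correctness` return `None` instead of a number.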
What this does not claim
- IPB is not a universal intelligence ranking.
- IPB is not a claim that a model is globally safe.
- IPB is not certification.
- IPB does not replace legal, regulatory, security, medical, financial, or compliance review.
- IPB results are scoped to the declared domain, protocol version, corpus version, model/system identity, and runtime settings.
- Stable behavior is not automatically good behavior; stable-wrong behavior is a failure.
- Public samples do not disclose future test material.