NDOR vs GPT-5.4 vs Claude Sonnet 4.6
Decision Intelligence compared against general-purpose reasoning models
Independent side-by-side testing showed materially different reasoning behaviour under identical conditions. The same document, objective, and constraints were given to NDOR and to two general-purpose reasoning models, GPT-5.4 and Claude Sonnet 4.6. The findings below summarise what differed and why it matters when the output informs a real commercial decision.
How NDOR reasons differently
For each dimension, the benchmark recorded what was identified, how it was reasoned about, and how the recommendation was expressed.
Clause Interaction Analysis
Identify how multiple clauses in a single agreement interact to create combined operational exposure that no individual clause expresses on its own.
NDOR: Mapped how indemnity, limitation-of-liability, and termination-for-convenience clauses combined into a single operational exposure greater than any clause in isolation.
GPT-5.4: Listed each of the three clauses individually with risk commentary, but did not trace the interaction chain that produced the combined exposure.
Claude Sonnet 4.6: Identified the strongest individual clause and discussed its implications, but did not connect it to the other two clauses that compounded the exposure.
SLA Analysis
Evaluate whether service-level commitments are materially enforceable, or whether definitional carve-outs and maintenance-window wording weaken the headline number.
NDOR: Identified that the maintenance-window definition combined with the planned-outage exclusion silently weakened the 99.5% uptime commitment to an effective 96.8% under realistic operating assumptions.
GPT-5.4: Flagged the SLA as containing weaknesses and noted maintenance-window language, but treated it as a routine carve-out rather than a structural loophole.
Claude Sonnet 4.6: Noted the SLA could be tightened and called out the maintenance-window definition, but did not quantify the effective uptime degradation.
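The arithmetic behind this kind of finding can be sketched briefly. The example below is a minimal illustration, assuming the uptime commitment is measured only over hours that fall outside the excluded windows. The 30-day period and the 20 excluded hours are hypothetical inputs chosen for illustration; they are not the benchmark's own operating assumptions, which the summary above does not state. Only the 99.5% headline figure comes from the text.

```python
def effective_uptime(headline_sla: float, period_hours: float, excluded_hours: float) -> float:
    """Worst-case availability over the full period when excluded hours
    (maintenance windows, planned outages) do not count against the SLA."""
    if not 0.0 <= headline_sla <= 1.0:
        raise ValueError("headline_sla must be a fraction between 0 and 1")
    if not 0.0 <= excluded_hours <= period_hours:
        raise ValueError("excluded_hours must lie within the period")
    # The commitment only applies to the hours that are actually measured.
    measured_hours = period_hours - excluded_hours
    return headline_sla * measured_hours / period_hours


# Hypothetical inputs, for illustration only.
PERIOD_HOURS = 30 * 24    # assumed 30-day measurement period
EXCLUDED_HOURS = 20.0     # assumed maintenance plus planned-outage hours

print(f"{effective_uptime(0.995, PERIOD_HOURS, EXCLUDED_HOURS):.2%}")
```

Under these assumed inputs the effective figure falls well below the headline number, which is the pattern the finding describes: the wider the excluded windows, the less the headline percentage guarantees.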
Recommendation Quality
Translate findings into prioritised, operationally specific mitigation guidance a counterparty can act on during negotiation.
NDOR: Produced sequenced mitigation guidance: redline ordering, fallback positions, and the precise textual replacements required for each weak clause.
GPT-5.4: Produced thorough recommendations addressing each finding, but without sequencing, fallback paths, or proposed redline text.
Claude Sonnet 4.6: Produced cautious recommendations that surfaced the issues but hedged on the most material structural changes.
Evidence Grounding
Anchor every finding to a specific clause reference and quoted text so the analysis is defensible during stakeholder review.
NDOR: Every finding cited a clause number and an exact textual extract from the source document.
GPT-5.4: Findings frequently cited section numbers but paraphrased the underlying text, weakening defensibility.
Claude Sonnet 4.6: Findings cited the source carefully but referenced clauses by description rather than by structured reference number.
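One way to picture the grounding requirement is as a mechanical check: a finding counts as grounded only if it names a clause number that exists and quotes text that appears verbatim in that clause. The sketch below is an illustration under that assumption, not a description of how NDOR works internally. The `Finding` type, the `is_grounded` helper, and the sample clause text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    clause_ref: str   # structured reference number, e.g. "7.2"
    quote: str        # extract claimed to be verbatim from the clause
    summary: str      # the analyst's commentary on the extract


def normalise(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def is_grounded(finding: Finding, clauses: dict[str, str]) -> bool:
    """True only if the clause exists and contains the quote verbatim."""
    clause_text = clauses.get(finding.clause_ref)
    quote = normalise(finding.quote)
    if clause_text is None or not quote:
        return False
    return quote in normalise(clause_text)


# Hypothetical clause text, for illustration only.
clauses = {
    "7.2": "The Supplier may suspend the Services during any "
           "Maintenance Window without liability.",
}
exact = Finding("7.2", "during any Maintenance Window without liability",
                "Suspension right carries no liability.")
paraphrased = Finding("7.2", "the supplier can pause services for maintenance",
                      "Suspension right carries no liability.")

print(is_grounded(exact, clauses))        # True
print(is_grounded(paraphrased, clauses))  # False
```

A paraphrase fails the check even when its meaning is accurate, which is the distinction this dimension draws between citing a section and quoting it.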
Reasoning Traceability
Surface the intermediate reasoning steps so the conclusion can be audited rather than accepted on trust.
NDOR: Each conclusion exposed the reasoning chain: assumption → clause reference → operational consequence → recommended mitigation.
GPT-5.4: Conclusions were presented as flat statements alongside the clauses they referenced, without traceable intermediate reasoning steps.
Claude Sonnet 4.6: Reasoning was visible at the paragraph level but not structured into discrete auditable steps.
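The four-link chain named above (assumption, clause reference, operational consequence, recommended mitigation) can be represented as a small record so that a reviewer can see at a glance which links are present. This is a minimal sketch for illustration. The `ReasoningStep` type, its helpers, and the sample content are hypothetical and do not reflect NDOR's actual output format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReasoningStep:
    assumption: str
    clause_reference: str
    operational_consequence: str
    recommended_mitigation: str


def missing_links(step: ReasoningStep) -> list[str]:
    """Names of any links in the chain that were left blank."""
    return [f.name for f in fields(step) if not getattr(step, f.name).strip()]


def render(step: ReasoningStep) -> str:
    """Print the chain in the order a reviewer would audit it."""
    return " -> ".join(getattr(step, f.name) for f in fields(step))


# Hypothetical content, for illustration only.
step = ReasoningStep(
    assumption="Maintenance is scheduled at the supplier's discretion",
    clause_reference="Clause 7.2",
    operational_consequence="Excluded hours reduce effective availability",
    recommended_mitigation="Cap maintenance hours per month",
)
flat = ReasoningStep("", "Clause 7.2", "", "Tighten the SLA")

print(render(step))
print(missing_links(flat))  # ['assumption', 'operational_consequence']
```

A conclusion stated flat next to a clause would populate only some of these fields, which is what makes it harder to audit.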
“Closer to transaction-advisory and strategic risk-review quality than a standard AI review response.”
— Comparative evaluation summary
How the benchmark was conducted
- Source document
A short-form vendor service agreement for technology consulting and software development services, governed by English law. It contains SLA, IP ownership, limitation-of-liability, indemnification, and UK GDPR data-protection clauses.
- Objective
Identify risk signals, problematic clauses, structural imbalance, and produce mitigation guidance.
- Conditions
Identical prompt, identical document, identical context window for all three systems.
- Comparators
NDOR is the system being benchmarked. The two external comparator engines evaluated alongside it were GPT-5.4 and Claude Sonnet 4.6, each run with the same prompt and the same source material as NDOR.
- Evaluation
Scored across five structured dimensions of decision-relevant reasoning, not free-form quality impressions.
What this benchmark is not
- A general intelligence comparison between AI providers.
- A claim that one model is uniformly better than another.
- A replacement for qualified professional review on regulated matters.
- A statement about adversarial AI competition.
NDOR is positioned alongside general-purpose AI, not against it. The benchmark exists to show how a decision-intelligence system reasons differently when the output is destined to inform a real commercial decision.
Run the same kind of analysis on a document of your choice.
The benchmark above used a vendor service agreement. NDOR applies the same structured validation workflow to contracts, models, proposals, and reports.
20 free credits included · No subscription required to begin
This benchmark reflects a controlled, single-document comparison conducted under identical conditions. It illustrates how the two approaches differ in reasoning structure on this category of material. Specific findings will vary by document, by objective, and by model version.