Ocean Liner GPT is useful only if it stays disciplined under pressure. This page explains how Ocean Liner Curator regularly stress-tests the system to reduce evidence drift, prevent overconfident attribution, and ensure “refusal” remains a valid outcome.
What “Stress Testing” Means Here
Stress testing is not about making Ocean Liner GPT more persuasive. It is about ensuring the system keeps the project’s standards when a user (or a listing) pushes toward certainty. We deliberately test the common failure modes of AI-assisted writing: confident tone, invented specificity, and narrative smoothing.
- Goal: preserve evidence-first discipline under realistic pressure.
- Non-goal: produce “better” answers by being more certain.
- Pass condition: the model stops where the record stops.
What We Are Testing For
The tests focus on method and boundaries—especially the places where collectors are most often misled. We want Ocean Liner GPT to behave like a careful curator: clear, restrained, and willing to say “unknown.”
How We Stress-Test
Testing uses a small set of repeatable prompt families. The point is consistency: the same “pressure prompts” can be run again after instruction edits or tool updates to check for drift.
- Adversarial prompts: requests designed to force overreach (“just tell me yes/no”).
- Forced-choice traps: “Titanic vs Olympic” likelihood framing without object evidence.
- Authority pressure: “Should I trust you over sellers?” and similar prompts.
- Provenance inflation: estate-sale, attic-found, “family story,” “museum quality,” etc.
- Valuation bait: attempts to turn uncertainty into a price anchor.
- Hallmark / material shortcuts: prompts that push visual similarity as proof.
High-Risk Scenarios (Titanic and Similar Prestige Claims)
Certain ships are uniquely vulnerable to misattribution. Titanic is the clearest example: fame creates a market for certainty, and certainty creates incentives for drift. Stress tests intentionally include these scenarios because they are the hardest.
- “Titanic artifact” from an estate sale (neutral fact vs. evidentiary conclusion).
- “More likely Titanic or Olympic?” (reject likelihood framing without documentation).
- “Is this authentic?” from a single photo (refuse certification; request evidence types).
- “It looks old” / “it matches another listing” (similarity is not proof).
Pass/Fail Criteria (What We Count as a “Good Result”)
A “good” result is not measured by confidence. It is measured by method. Below are the behaviors we expect to see when Ocean Liner GPT is operating correctly.
What Happens When Something Fails
If a test reveals drift, the response is treated as a signal to tighten instructions and safeguards. The project does not “defend” a bad output; it revises the process.
- Identify the failure mode (e.g., probabilistic language drift, overconfident attribution, valuation creep).
- Adjust the governing instructions or refusal gates (without expanding claims).
- Re-run the same pressure prompts to confirm the fix behaves consistently.
- Prefer conservative constraints over broader capability.
Why We Publish This
Ocean Liner Curator is an evidence-first reference project. Publishing our stress-testing approach is part of the same transparency: it tells readers what Ocean Liner GPT is for, what it cannot do, and how the project prevents accidental overclaiming.