Search Benchmarks You Can Rerun

Quality is a habit

A worse index does not ship.

Search quality has to be more than a claim. Every new index is scored against public test sets and Argand's own human-judged searches before it can reach users.

The gate

Before any new index reaches users, it has to match or beat the previous build on two measures: nDCG@10 (how high good results appear) and MRR (how often the best result lands at the top). If either drops by more than half a percent, the build is rejected automatically. A worse index does not get to touch users.

nDCG@10 ≥ prev · MRR ≥ prev · fail → blocked
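The gate can be sketched in a few lines. This is an illustration, not Argand's actual check: the `Scores` type and the function names are made up for the example, and it assumes "half a percent" means a relative drop against the previous build.

```python
from dataclasses import dataclass

# Assumption: "half a percent" is a relative drop against the previous build.
MAX_RELATIVE_DROP = 0.005


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for the two gated metrics of one build."""
    ndcg_at_10: float
    mrr: float


def relative_drop(prev: float, new: float) -> float:
    """Fraction by which `new` fell below `prev`; zero or negative means no drop."""
    if prev <= 0:
        return 0.0
    return (prev - new) / prev


def gate(prev: Scores, new: Scores) -> bool:
    """Return True if the new build may ship, False if it is blocked."""
    return (
        relative_drop(prev.ndcg_at_10, new.ndcg_at_10) <= MAX_RELATIVE_DROP
        and relative_drop(prev.mrr, new.mrr) <= MAX_RELATIVE_DROP
    )
```

Under these assumptions, a build is blocked as soon as either metric falls more than 0.5% below the previous build, even if the other one improves.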

Round 1 — BM25

Round one — exact-term matching

Public test set	Argand	Reference	Better by
NFCorpus (medical IR · 323 queries)	0.323	0.322	+0.001
SciFact (scientific claim verification · 300 queries)	0.687	0.679	+0.008
FiQA (finance Q&A · 648 queries)	0.247	0.236	+0.011

nDCG@10: how high the best answers appear in the top ten. Higher is better, max 1.0. Test sets are public BEIR collections.
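For readers who want to see what sits behind a top-ten score, here is a minimal sketch of nDCG@10 for a single query. It assumes the table reports nDCG@10 (the gate metric) and uses the common linear-gain form, where each result's relevance is divided by log2(rank + 1). The function names are illustrative and not taken from Argand's code.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the engine returned them.
    qrels: dict of doc id -> graded relevance (0 = not relevant).
    """
    # Relevance of what was actually returned, in returned order.
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # Best possible ordering: the most relevant judged documents first.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With one relevant document, returning it first scores 1.0 and returning it second scores about 0.631. A test set's score is the mean of this value over all of its queries.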

Round 2 — Rerank

Round two — the rerank lift

A slower model re-reads the top candidates against the original question. On SciFact, the score rises from 0.687 to 0.742. That is the second step doing its job.

Round two: the re-read

Then it reads the top results more closely.

Matching keywords is fast but rough. A second, small model re-reads the top results and reorders them by how well they actually answer the question. Think of it as a second librarian reading the shortlist with your actual question in mind. It was never trained or tuned on these tests. The lift below is what that one extra pass buys.
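The two-step shape described above can be sketched as follows. This is a toy under stated assumptions, not Argand's pipeline: `fast_score` stands in for keyword matching, `slow_score` stands in for the reranking model, and `rerank_k` is a made-up parameter (only a retrieve depth of 100 appears in the benchmark command).

```python
from typing import Callable, List, Sequence


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    fast_score: Callable[[str, str], float],
    slow_score: Callable[[str, str], float],
    retrieve_k: int = 100,
    rerank_k: int = 10,
) -> List[str]:
    """Rank all docs with the cheap scorer, then reorder only the top slice."""
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:retrieve_k]
    # The slow scorer only ever sees the shortlist, which keeps the extra pass cheap.
    head = sorted(candidates[:rerank_k], key=lambda d: slow_score(query, d), reverse=True)
    return head + candidates[rerank_k:]


# Toy scorers, purely illustrative.
def term_overlap(query: str, doc: str) -> float:
    """Count shared words; a crude stand-in for exact-term matching."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """1.0 if the whole query appears as a phrase; a stand-in for a closer read."""
    return 1.0 if query.lower() in doc.lower() else 0.0
```

The design point is that the expensive scorer runs on a handful of candidates rather than the whole corpus, so its cost stays bounded no matter how large the index is.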

The right answer, first try

How often a right answer lands in the very first slot. On finance questions it goes from about one in five to nearly one in two.

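[Chart: share of queries with a right answer in the first slot, before and after reranking, for FiQA, ArguAna, and SciFact. Values not captured.]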
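One plausible way to compute a first-slot rate, alongside the MRR used by the gate, is sketched below. The names are illustrative, and treating "a right answer in the very first slot" as a simple hit-at-1 rate is an assumption; the page may use a different variant such as nDCG@1.

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids, k=10):
    """1/rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def hit_at_1(ranked_doc_ids, relevant_ids):
    """1.0 if the first result is relevant, else 0.0."""
    return 1.0 if ranked_doc_ids and ranked_doc_ids[0] in relevant_ids else 0.0


def mean(values):
    """Average over queries; an empty run scores 0.0."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

For two queries where the right answer lands first in one and second in the other, the first-slot rate is 0.5 and the MRR is 0.75.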

Overall quality across the top ten results. The score asks how high the right answers landed.

Test set	Lift
FiQA (finance questions)	+0.173
ArguAna (counter-arguments)	+0.179
SciFact (scientific claims)	+0.056
NFCorpus (medical search)	+0.015

68M beats 568M

Small enough for a potato. This reranker has 68 million parameters and still beats a model eight times its size on the science test, 0.742 to 0.732. Tiny and better.

No graphics card

The version that runs on a plain CPU reproduces the original model to the decimal. Nothing heavy sits in the path of your search.

Reranking does not help everywhere. On one set, scientific document retrieval, it came out a touch worse. Across these public tests it lifts quality by 0.084 on average. We measured all of it, including the part that did not move.

How the numbers are reproducible. What's checkable today vs. at launch.

A benchmark is only useful if another person can rerun it. What you can verify today: BEIR provides the test collections, the answer keys, and the scoring method. The "Reference" column uses a public retrieval engine on that same data, so anyone can check the baseline.

The engine itself, the part that produces the "Argand" column, is still being finished and the source is private. It goes public at launch. The public launch window is summer 2026, with the exact date still fluid; from that day on the command below will print the same scores from the same public test sets on a regular laptop. The command is here now so the method is on record before the source is. If the table ever changes, the command and the test data have to explain why.

cd ~/argand

ORT_DYLIB_PATH=$(pwd)/lib/onnxruntime-linux-x64-gpu-1.26.0/lib/libonnxruntime.so \
./target/release/beir_pipeline_bench \
  --corpus  eval/datasets/nfcorpus/corpus.jsonl \
  --queries eval/datasets/nfcorpus/queries.jsonl \
  --qrels   eval/datasets/nfcorpus/qrels/test.tsv \
  --retrieve-k 100 --no-rerank \
  --out eval/results/beir_nfcorpus_bm25only.json

The result file is a single JSON with nDCG@1, nDCG@5, nDCG@10, MRR@10, Recall@100, and MAP. The numbers in the table above are lifted verbatim from runs of this command.
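Once result files exist, comparing two runs is a small script. The sketch below assumes each file is a flat JSON object whose keys are spelled exactly like the metric names listed above; the real schema is not published yet, so treat the key names and both function names as placeholders.

```python
import json
from pathlib import Path

# Assumption: the result JSON uses these exact strings as top-level keys.
METRICS = ("nDCG@1", "nDCG@5", "nDCG@10", "MRR@10", "Recall@100", "MAP")


def load_metrics(path):
    """Read a result file and keep only the six metrics named above."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {name: float(data[name]) for name in METRICS if name in data}


def lift(baseline, candidate):
    """Per-metric difference, rounded to three decimals like the tables."""
    return {
        name: round(candidate[name] - baseline[name], 3)
        for name in baseline
        if name in candidate
    }
```

Calling `lift(load_metrics(a), load_metrics(b))` on a keyword-only run and a reranked run of the same test set would give the kind of per-metric difference reported in the rerank section.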

BEIR canonical datasets are from the UKP Darmstadt mirror. SciFact and FiQA swap in their own corpus / queries / qrels paths. The SPLADE-v3 row adds --splade-model models/splade/splade-v3-8bit.onnx and drops --no-rerank. Every metric we cite has a corresponding stored JSON in eval/results/.
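The swaps described above can be captured in a small helper that builds the argument list for one run. This is a sketch: it assumes every test set follows the same `eval/datasets/<name>/` layout as the NFCorpus command, which the page does not confirm for other sets, and `bench_args` is a made-up name. The flags themselves are the ones shown on this page.

```python
def bench_args(dataset, out_path, rerank=False, splade_model=None):
    """Build the argv list for one benchmark run.

    dataset: directory name under eval/datasets/ (layout assumed, see above).
    out_path: where the result JSON should be written.
    """
    root = f"eval/datasets/{dataset}"
    args = [
        "./target/release/beir_pipeline_bench",
        "--corpus", f"{root}/corpus.jsonl",
        "--queries", f"{root}/queries.jsonl",
        "--qrels", f"{root}/qrels/test.tsv",
        "--retrieve-k", "100",
    ]
    if splade_model is not None:
        args += ["--splade-model", splade_model]
    if not rerank:
        # Keyword-only runs skip the second pass.
        args.append("--no-rerank")
    args += ["--out", out_path]
    return args
```

The list can be passed to `subprocess.run` with `ORT_DYLIB_PATH` set in the environment, as in the shell command above.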

How we compare

Every search engine makes different trade-offs.

Here is what each one actually does.

Feature	Google	Bing	Kagi	DDG
Generated answer boxes
Owns its own index
Drives traffic to source sites		No	Partial	Mostly
Behavioural ad targeting			No	No
Anonymous searches end with the request		No	Configurable	Yes
Business model		Behavioural ads	Subscription	Contextual ads
Built in Rust	n/a	n/a	n/a	n/a
Open source

Updated June 5, 2026. Argand is in active development, so its values reflect design intent and current implementation. Competitor notes are based on public documentation: Google AI in Search, Bing generative search, DuckDuckGo Search Assist, Kagi search sources.