Quality is a habit

A worse index does not ship.

Search quality has to be more than a claim. Every new index is tested against public question sets and against Argand's own human-judged searches. If the new build makes results worse, it does not reach users.

Round one: exact-term matching

Public test set                                          Argand   Reference   Better by
NFCorpus (medical IR · 323 queries)                      0.323    0.322       +0.001
SciFact (scientific claim verification · 300 queries)    0.687    0.679       +0.008
FiQA (finance Q&A · 648 queries)                         0.247    0.236       +0.011

The score asks a plain question: when the engine shows its top ten results, how high did the best answers appear? Higher is better, and the maximum is 1.0. The test sets are public BEIR collections, which academic search researchers use as a common yardstick.
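For readers who want the arithmetic behind a score like this, here is a minimal sketch. It assumes the table metric is nDCG@10; the result file described further down lists nDCG@10, but this page does not name the table metric outright. The function names are illustrative and not part of Argand, and gains are taken as the raw relevance labels, which is one common convention.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for a single query (assumed metric, see the note above).

    ranked_doc_ids: document ids in the order the engine returned them.
    relevance: dict of document id -> relevance label from the answer key.
    """
    # Gains actually earned by the top-k results; unjudged documents count as 0.
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # Best possible ordering: the highest labels first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A benchmark score would then be the mean of `ndcg_at_k` over all queries in a test set. A ranking that puts every judged answer in the best possible order scores 1.0, which is why 1.0 is the ceiling.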

The table above is round one: exact-term matching, measured against the standard engine researchers use as a baseline. Argand does not stop there. Round two is a careful re-read of the best matches. On SciFact, for example, the score rises from 0.687 after keyword search to 0.742 after the re-read. That is not a contradiction; it is the second step doing its job.

Round two: the re-read

Then it reads the top results more closely.

Matching keywords is fast but rough. A second, small model re-reads the top results and reorders them by how well they actually answer the question. Think of it as a second librarian reading the shortlist with your actual question in mind. It was never trained or tuned on these tests. The lift below is what that one extra pass buys.
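The shape of that second pass can be sketched in a few lines. This is an illustration under stated assumptions, not Argand's code: `rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer counts shared words where the real system would call its small model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],
    score: Callable[[str, str], float],
    top_n: int = 10,
) -> List[str]:
    """Reorder first-stage candidates by a second-stage relevance score.

    candidates: (doc_id, text) pairs from keyword retrieval, best first.
    score: any function that rates how well a text answers the query.
    """
    scored = [
        # -pos keeps the first-stage order when two scores tie.
        (score(query, text), -pos, doc_id)
        for pos, (doc_id, text) in enumerate(candidates)
    ]
    scored.sort(reverse=True)
    return [doc_id for _, _, doc_id in scored[:top_n]]


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer (assumption): number of distinct words shared with the query."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))
```

The design point is that the expensive scorer only ever sees the shortlist, so its cost scales with the size of the shortlist rather than the size of the index.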

The right answer, first try

How often a right answer lands in the very first slot. On finance questions it goes from about one in five to nearly one in two.

[Chart: share of searches with a right answer in the first slot, before and after the re-read, for FiQA, ArguAna, and SciFact]
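"Right answer in the very first slot" can be counted directly. The sketch below is one plausible reading of that measure, not a confirmed description of how the chart was produced; the function name and input shapes are assumptions.

```python
def first_slot_hit_rate(rankings, relevant):
    """Fraction of queries whose top-ranked result is marked relevant.

    rankings: dict of query id -> list of document ids, best first.
    relevant: dict of query id -> set of document ids judged relevant.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1
        for query_id, docs in rankings.items()
        if docs and docs[0] in relevant.get(query_id, set())
    )
    return hits / len(rankings)
```

On this reading, "about one in five" corresponds to a rate near 0.2 and "nearly one in two" to a rate just under 0.5.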

Overall quality across the top ten results. The score asks how high the right answers landed.

FiQA (finance questions)        +0.173
ArguAna (counter-arguments)     +0.179
SciFact (scientific claims)     +0.056
NFCorpus (medical search)       +0.015

68M beats 568M

Small enough for a potato. This reranker has 68 million parameters and still beats a model eight times its size on the science test, 0.742 to 0.732. Tiny and better.

No graphics card

The version that runs on a plain CPU reproduces the original model to the decimal. Nothing heavy sits in the path of your search.

Reranking does not help everywhere. On one set, scientific document retrieval, it came out a touch worse. Across these public tests it lifts quality by 0.084 on average. We measured all of it, including the part that did not move.
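The average can be sanity-checked against the four lifts listed earlier. The sketch assumes the 0.084 figure is a plain mean over exactly five sets: the four listed ones plus the single set that came out worse. That is an inference from this page, not a published breakdown, and the implied value is only as precise as the rounded inputs.

```python
# Lifts quoted on this page for the four listed test sets.
listed_lifts = {"FiQA": 0.173, "ArguAna": 0.179, "SciFact": 0.056, "NFCorpus": 0.015}
reported_mean = 0.084

# Assumption: the mean covers the four listed sets plus one unlisted set.
n_sets = len(listed_lifts) + 1

# Lift the unlisted set would need for the reported mean to hold.
implied_fifth = reported_mean * n_sets - sum(listed_lifts.values())

# For comparison: the mean over the four listed sets alone.
mean_of_listed = sum(listed_lifts.values()) / len(listed_lifts)

print(f"implied lift on the unlisted set: {implied_fifth:+.3f}")
print(f"mean over the four listed sets:   {mean_of_listed:+.3f}")
```

Under that assumption the unlisted set comes out slightly negative, which is consistent with "a touch worse", and the average is lower than the mean of the four listed lifts because that set is included rather than left out.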

How the numbers are reproducible

What's checkable today vs. at launch.

A benchmark is only useful if another person can rerun it. What you can verify today: BEIR provides the test collections, the answer keys, and the scoring method. The "Reference" column uses a public retrieval engine on that same data, so anyone can check the baseline.

The engine itself, the part that produces the "Argand" column, is still being finished and the source is private. It goes public at launch. The public launch window is summer 2026, with the exact date still fluid; from that day on the command below will print the same scores from the same public test sets on a regular laptop. The command is here now so the method is on record before the source is. If the table ever changes, the command and the test data have to explain why.

cd ~/argand

ORT_DYLIB_PATH=$(pwd)/lib/onnxruntime-linux-x64-gpu-1.26.0/lib/libonnxruntime.so \
./target/release/beir_pipeline_bench \
  --corpus  eval/datasets/nfcorpus/corpus.jsonl \
  --queries eval/datasets/nfcorpus/queries.jsonl \
  --qrels   eval/datasets/nfcorpus/qrels/test.tsv \
  --retrieve-k 100 --no-rerank \
  --out eval/results/beir_nfcorpus_bm25only.json

The result file is a single JSON with nDCG@1, nDCG@5, nDCG@10, MRR@10, Recall@100, and MAP. The numbers in the table above are lifted verbatim from runs of this command.
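Once a result file exists, checking it against a published number is a short script. The sketch below is hedged heavily: the page lists which metrics the JSON contains but not its schema, so the flat layout and the key name `"nDCG@10"` are assumptions, as are the helper names.

```python
import json


def load_results(path):
    """Read one result file, assumed here to be a flat JSON object of metric -> value."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def metric_matches(results, key, published, places=3):
    """True if the stored metric equals the published figure after rounding.

    key: metric name as it appears in the file (hypothetical, e.g. "nDCG@10").
    published: the number quoted in the table, e.g. 0.323.
    """
    return round(float(results[key]), places) == round(published, places)
```

For example, `metric_matches(load_results("eval/results/beir_nfcorpus_bm25only.json"), "nDCG@10", 0.323)` would be the check for the NFCorpus row, if the file uses that key.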

BEIR canonical datasets are from the UKP Darmstadt mirror. SciFact and FiQA swap in their own corpus / queries / qrels paths. The SPLADE-v3 row adds --splade-model models/splade/splade-v3-8bit.onnx and drops --no-rerank. Every metric we cite has a corresponding stored JSON in eval/results/.
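The answer keys are plain text and easy to inspect by hand. As a minimal sketch, BEIR qrels files are tab-separated with a `query-id`, `corpus-id`, `score` header row; the parser name below is illustrative and not part of Argand.

```python
import csv


def parse_qrels(lines):
    """Parse BEIR-style qrels rows into {query_id: {doc_id: relevance}}.

    lines: any iterable of text lines, such as an open file object.
    """
    qrels = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip the header row and anything that is not a three-column record.
        if len(row) != 3 or row[0] == "query-id":
            continue
        query_id, doc_id, score = row
        qrels.setdefault(query_id, {})[doc_id] = int(score)
    return qrels
```

Passing an open handle on `eval/datasets/nfcorpus/qrels/test.tsv` gives the answer key that a scoring function would consume.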

How we compare

Every search engine makes different trade-offs.

Here is what each one actually does.

Feature                                  Google     Bing             Kagi          DDG             Argand
Generated answer boxes                   [missing]  [missing]        [missing]     [missing]       [missing]
Owns its own index                       [missing]  [missing]        [missing]     [missing]       [missing]
Drives traffic to source sites           [missing]  No               Partial       Mostly          [missing]
Behavioural ad targeting                 [missing]  [missing]        No            No              [missing]
Anonymous searches end with the request  [missing]  No               Configurable  Yes             [missing]
Business model                           [missing]  Behavioural ads  Subscription  Contextual ads  [missing]
Built in Rust                            n/a        n/a              n/a           n/a             [missing]
Open source                              [missing]  [missing]        [missing]     [missing]       [missing]

Updated June 5, 2026. Argand is in active development, so its values reflect design intent and current implementation. Competitor notes are based on public documentation: Google AI in Search, Bing generative search, DuckDuckGo Search Assist, Kagi search sources.