Measured, not vibed

How do we know it's any good? We measured.

Search researchers have public test sets: a list of real queries plus a list of the right answers for each one. You run your engine on the queries, score the answers, get a number. The higher the number, the better the engine. We ran Argand's text-matching engine on three of these public tests, and put its scores next to those of the reference engine that researchers compare against. Argand matches it or beats it on all three. No fancy hardware required.

Step one: the plain keyword search

Public test set                                        Argand   Reference   Better by
NFCorpus (medical IR, 323 queries)                     0.323    0.322       +0.001
SciFact (scientific claim verification, 300 queries)   0.687    0.679       +0.008
FiQA (finance Q&A, 648 queries)                        0.247    0.236       +0.011

Scores are nDCG@10, the standard search-quality score. Higher is better, max 1.0. All runs were on a regular CPU, no graphics card needed. The test sets are the public BEIR collections that academic search researchers use as the common yardstick.
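For readers who want to see what goes into a top-ten quality score of this kind, here is a minimal sketch of nDCG@10 for a single query, assuming the linear-gain form used by trec_eval-style tools. The function names and the toy judgments are illustrative only and are not taken from Argand's code. A benchmark score would be the mean of this value over every query in a test set.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    # Rank 0 is the top slot; its gain is divided by log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the engine returned them.
    qrels: dict mapping document id -> graded relevance (0 = not relevant).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(gains, k) / ideal_dcg

# Toy judgments and rankings, made up for illustration (not from any BEIR set).
toy_qrels = {"d1": 2, "d2": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], toy_qrels)   # best possible order
swapped = ndcg_at_k(["d3", "d2", "d1"], toy_qrels)   # relevant docs pushed down
print(perfect, swapped)
```

The perfect ordering scores 1.0, and pushing the relevant documents down the list lowers the score, which is why the metric rewards getting right answers near the top.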

The table above is step one: plain keyword matching, lined up against the standard engine researchers compare against. Argand does not stop there. Step two, below, is a careful re-read of the best matches, and that is where the scores climb. It is the same tests measured at two points, so a set like SciFact appears in both: 0.687 after the keyword search, 0.742 after the re-read. Not a contradiction, just the second step doing its job.

Step two: the re-read

Then it reads the top results more closely.

Matching keywords is fast but rough. A second, small model re-reads the top results and reorders them by how well they actually answer the question. It was never trained or tuned on these tests. The lift below is what that one extra pass buys.
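The two-step shape described above can be sketched as follows. This is a toy under stated assumptions, not Argand's pipeline: `keyword_score` is a crude term-count stand-in for the real keyword engine, `toy_rerank_score` is a hypothetical placeholder for the small re-reading model, and the corpus is made up.

```python
def keyword_score(query, text):
    """Stand-in first stage: count query-term occurrences (not Argand's scorer)."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def toy_rerank_score(query, text):
    """Hypothetical second stage: fraction of distinct query terms covered."""
    terms = set(query.lower().split())
    return len(terms & set(text.lower().split())) / len(terms)

def retrieve_then_rerank(query, corpus, rerank_score, retrieve_k=100, top_n=10):
    # Stage 1: cheap scoring over the whole corpus, keep only retrieve_k candidates.
    candidates = sorted(
        corpus, key=lambda doc_id: keyword_score(query, corpus[doc_id]), reverse=True
    )[:retrieve_k]
    # Stage 2: costlier scoring, applied to the shortlisted candidates only.
    reranked = sorted(
        candidates, key=lambda doc_id: rerank_score(query, corpus[doc_id]), reverse=True
    )
    return reranked[:top_n]

# Made-up corpus: "a" repeats one keyword, "b" actually covers the question.
toy_corpus = {
    "a": "rates rates rates rates",
    "b": "how interest rates affect bond prices",
    "c": "gardening tips",
}
result = retrieve_then_rerank(
    "interest rates bond", toy_corpus, toy_rerank_score, retrieve_k=2
)
print(result)
```

In this toy run the keyword stage ranks "a" first because it repeats one term, and the second pass moves "b" ahead because it covers the whole question. The expensive scorer only ever sees the shortlist, which is what keeps the extra pass cheap.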

The right answer, first try

How often a right answer lands in the very first slot. On finance questions it goes from about one in five to nearly one in two.

[Chart: share of queries with a right answer in the first slot, keyword search vs. re-read, for FiQA, ArguAna, and SciFact]
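A "right answer in the very first slot" rate can be computed along these lines. This is a sketch of one plausible definition (the share of queries whose top result is judged relevant); the page does not say exactly which variant was used, and the function name and toy data are illustrative.

```python
def first_slot_hit_rate(runs, qrels):
    """Share of queries whose top-ranked document is judged relevant.

    runs:  dict mapping query id -> list of document ids, best first.
    qrels: dict mapping query id -> {document id: relevance grade}.
    """
    if not runs:
        return 0.0
    hits = 0
    for query_id, ranked in runs.items():
        judged = qrels.get(query_id, {})
        if ranked and judged.get(ranked[0], 0) > 0:
            hits += 1
    return hits / len(runs)

# Made-up rankings and judgments, for illustration only.
toy_runs = {
    "q1": ["d1", "d2"],   # right answer first
    "q2": ["d9", "d3"],   # right answer second
    "q3": ["d4"],         # right answer first
    "q4": ["d7", "d5"],   # right answer second
}
toy_judgments = {
    "q1": {"d1": 1},
    "q2": {"d3": 1},
    "q3": {"d4": 1},
    "q4": {"d5": 1},
}
rate = first_slot_hit_rate(toy_runs, toy_judgments)
print(rate)
```

Running the same function on the rankings before and after the re-read would give the two percentages that such a chart compares.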

Overall quality, top ten results. Higher is better, max 1.0.

Test set                       Lift from the re-read
FiQA (finance questions)       +0.173
ArguAna (counter-arguments)    +0.179
SciFact (scientific claims)    +0.056
NFCorpus (medical search)      +0.015

68M beats 568M

Small enough for a potato. This reranker has 68 million parameters and still beats a model eight times its size on the science test, 0.742 to 0.732. Tiny and better.

No graphics card

The version that runs on a plain CPU reproduces the original model to the decimal. Nothing heavy sits in the path of your search.

Reranking does not help everywhere. On one set, scientific document retrieval, it came out a touch worse. Across these public tests it lifts quality by 0.084 on average. We measured all of it, including the part that did not move.

How the numbers are reproducible. What's checkable today vs. at launch.

A benchmark needs three things to be reproducible: the test data, the scoring method, and the engine. The first two are public today. The test data is the BEIR collection that academic researchers use as the common yardstick. The scoring method is the standard one they all share. The "Reference" column above uses a publicly available retrieval engine on that same public data, so anyone can rerun it today and check that the reference number is honest.

The engine itself, the part that produces the "Argand" column, is still being finished and the source is private. It goes public at launch (planned mid-2026), and from that day on the command below will print the same scores from the same public test sets, on a regular laptop. We're showing the command in advance so the method is on record before the source is. If the table above ever changes without the command and the test data changing, that's a tell.

cd ~/argand

ORT_DYLIB_PATH=$(pwd)/lib/onnxruntime-linux-x64-gpu-1.26.0/lib/libonnxruntime.so \
./target/release/beir_pipeline_bench \
  --corpus  eval/datasets/nfcorpus/corpus.jsonl \
  --queries eval/datasets/nfcorpus/queries.jsonl \
  --qrels   eval/datasets/nfcorpus/qrels/test.tsv \
  --retrieve-k 100 --no-rerank \
  --out eval/results/beir_nfcorpus_bm25only.json

The result file is a single JSON with nDCG@1, nDCG@5, nDCG@10, MRR@10, Recall@100, and MAP. The numbers in the table above are lifted verbatim from runs of this command.
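Once a result file exists, checking it against a published three-decimal score could look like the sketch below. The JSON key name is an assumption based on the metric names listed above; the real file layout is not public yet and may differ. The sample payload is made up for illustration.

```python
import json

def check_score(result_json, metric, published, tolerance=0.0005):
    """Return True if the stored metric agrees with a published 3-decimal score.

    result_json: text of the result file.
    metric:      key to look up, e.g. "nDCG@10" (assumed key name).
    published:   the score as printed in a table, e.g. 0.323.
    """
    results = json.loads(result_json)
    return abs(results[metric] - published) <= tolerance

# Illustrative payload only; real key names and values may differ.
sample = '{"nDCG@10": 0.3231}'
matches = check_score(sample, "nDCG@10", 0.323)
mismatch = check_score(sample, "nDCG@10", 0.322)
print(matches, mismatch)
```

With a real file, the text would come from reading the path passed to --out rather than from an inline string.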

BEIR canonical datasets are from the UKP Darmstadt mirror. SciFact and FiQA swap in their own corpus / queries / qrels paths. The SPLADE-v3 row adds --splade-model models/splade/splade-v3-8bit.onnx and drops --no-rerank. Every metric we cite has a corresponding stored JSON in eval/results/.
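The qrels file is the answer key: one row per query and document pair, with a relevance grade. A minimal loader might look like this, assuming the tab-separated layout with a `query-id`, `corpus-id`, `score` header row used in the public BEIR distribution. The function name and the sample rows are illustrative.

```python
import csv
import io

def load_qrels(tsv_text):
    """Parse BEIR-style qrels text into {query id: {document id: grade}}."""
    qrels = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Group judgments by query so a scorer can look them up per query.
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels

# Made-up rows in the assumed layout, for illustration only.
sample_tsv = "query-id\tcorpus-id\tscore\nq1\td1\t2\nq1\td2\t1\nq2\td7\t1\n"
loaded = load_qrels(sample_tsv)
print(loaded)
```

For a file on disk, the text would be read from the path given to --qrels before being passed to the loader.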

How we compare

Every search engine makes different trade-offs.

Here is what each one actually does.

Feature                          Google   Bing              Kagi           DDG              Argand
AI-generated answers             [?]      [?]               [?]            No               [?]
Owns its own index               [?]      [?]               [?]            [?]              [?]
Drives traffic to source sites   [?]      No                Partial        Mostly           [?]
Behavioural ad targeting         [?]      [?]               No             No               [?]
Zero retention for searches      [?]      No                Configurable   Yes              [?]
Business model                   [?]      Behavioural ads   Subscription   Contextual ads   [?]
Built in Rust                    n/a      n/a               n/a            n/a              [?]
Open source                      [?]      [?]               [?]            [?]              [?]

* Argand is in active development. These values reflect the design intent and current implementation. Claims about competitors are based on public documentation and policy pages.