Quality is a habit
Benchmarks you can rerun.
A worse index does not ship. Argand's quality claims are measured on public test sets with known answer keys and on human-judged searches that protect the production index.
Search quality has to be more than a claim. Every new index is tested against public question sets and against Argand's own human-judged searches. If the new build makes results worse, it does not reach users.
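The ship-or-hold rule can be pictured as a plain comparison of scores. The following is a minimal sketch, assuming each build is summarized as one score per test set. The function name `should_ship`, the `tolerance` parameter, and the dictionary layout are hypothetical illustrations, not Argand's actual release tooling.

```python
from typing import Dict, List, Tuple


def should_ship(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Tuple[bool, List[str]]:
    """Return (ok, regressions) for a candidate index build.

    A test set counts as a regression when the candidate's score is
    lower than the baseline's by more than `tolerance`, or when the
    candidate has no score for that test set at all.
    """
    regressions = []
    for test_set, old_score in sorted(baseline.items()):
        new_score = candidate.get(test_set)
        if new_score is None or new_score < old_score - tolerance:
            regressions.append(test_set)
    # The build is acceptable only if nothing regressed.
    return (not regressions, regressions)
```

The second return value lists the test sets that got worse, so a blocked build can say which results need attention.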
Round one: exact-term matching
The score asks a plain question: when the engine shows its top ten results, how high do the best answers appear? Higher is better, and the maximum is 1.0. The test sets are public BEIR collections, which academic search researchers use as a common yardstick.
The table above is round one: exact-term matching, measured against the standard engine researchers use as a baseline. Argand does not stop there. Round two is a careful re-read of the best matches. On SciFact, for example, the score rises from 0.687 after keyword search to 0.742 after the re-read. That is not a contradiction; it is the second step doing its job.
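The result files list nDCG@10 among their metrics. Assuming that is the score described in this round, the sketch below shows how such a number could be computed for a single query. It uses the linear-gain form of nDCG, which is an assumption here; the function names and the `relevance` dictionary layout are illustrative and not taken from Argand's code.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: later positions count for less."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_ids: document ids in the order the engine returned them.
    relevance:  answer key for this query, mapping doc id to a grade
                (documents missing from the key count as 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    # The ideal ranking puts the highest-graded documents first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A query whose only right answer sits in first place scores 1.0; the same answer in second place scores about 0.63, which is why the metric rewards putting the best answers higher. The figure reported for a test set would be the average over all of its queries.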
Round two: the re-read
Then it reads the top results more closely.
Matching keywords is fast but rough. A second, small model re-reads the top results and reorders them by how well they actually answer the question. Think of it as a second librarian reading the shortlist with your actual question in mind. It was never trained or tuned on these tests. The lift below is what that one extra pass buys.
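The two-step flow can be sketched in a few lines. This is a minimal illustration, assuming the first stage hands over `(doc_id, keyword_score)` pairs and that `score_fn` stands in for the small re-reading model. Both names, and the `depth` parameter, are hypothetical; the real reranker scores a query and a document together.

```python
from typing import Callable, List, Tuple


def rerank_top(
    candidates: List[Tuple[str, float]],
    score_fn: Callable[[str], float],
    depth: int = 10,
) -> List[str]:
    """Reorder only the top `depth` keyword matches by a second score.

    candidates: (doc_id, keyword_score) pairs from the first stage.
    score_fn:   placeholder for the second model; returns how well a
                document answers the question (higher is better).
    """
    # Stage one: order everything by the fast keyword score.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    head, tail = ordered[:depth], ordered[depth:]
    # Stage two: re-read the shortlist and reorder it.
    rescored = sorted(head, key=lambda c: score_fn(c[0]), reverse=True)
    # Documents below the shortlist keep their keyword order.
    return [doc_id for doc_id, _ in rescored + tail]
```

Only the shortlist is re-scored, so the slower model runs on a handful of documents per search rather than on the whole index.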
The right answer, first try
How often a right answer lands in the very first slot. On finance questions it goes from about one in five to nearly one in two.
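Read literally, this is a top-1 hit rate. The result files report nDCG@1, which gives the same number when the answer key is binary (a document is either right or not). The sketch below assumes that binary case; the function name and the input layout are illustrative only.

```python
from typing import Dict, List, Set


def hit_rate_at_1(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> float:
    """Share of queries whose first result is a right answer.

    rankings: query id -> document ids in returned order.
    relevant: query id -> set of document ids judged correct.
    Queries with no judged right answer are left out of the average.
    """
    judged = [q for q in rankings if relevant.get(q)]
    if not judged:
        return 0.0
    hits = sum(
        1 for q in judged if rankings[q] and rankings[q][0] in relevant[q]
    )
    return hits / len(judged)
```

A value of 0.5 would mean the first slot holds a right answer for one query in two.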
Overall quality across the top ten results. The score asks how high the right answers landed.
Small enough for a potato. This reranker has 68 million parameters and still beats a model eight times its size on the science test, 0.742 to 0.732. Tiny and better.
The version that runs on a plain CPU reproduces the original model to the decimal. Nothing heavy sits in the path of your search.
Reranking does not help everywhere. On one set, scientific document retrieval, it came out a touch worse. Across these public tests it lifts quality by 0.084 on average. We measured all of it, including the part that did not move.
How the numbers are reproducible
What's checkable today vs. at launch.
A benchmark is only useful if another person can rerun it. What you can verify today: BEIR provides the test collections, the answer keys, and the scoring method. The "Reference" column uses a public retrieval engine on that same data, so anyone can check the baseline.
The engine itself, the part that produces the "Argand" column, is still being finished and the source is private. It goes public at launch. The public launch window is summer 2026, with the exact date still fluid; from that day on the command below will print the same scores from the same public test sets on a regular laptop. The command is here now so the method is on record before the source is. If the table ever changes, the command and the test data have to explain why.
cd ~/argand
ORT_DYLIB_PATH="$(pwd)/lib/onnxruntime-linux-x64-gpu-1.26.0/lib/libonnxruntime.so" \
  ./target/release/beir_pipeline_bench \
  --corpus eval/datasets/nfcorpus/corpus.jsonl \
  --queries eval/datasets/nfcorpus/queries.jsonl \
  --qrels eval/datasets/nfcorpus/qrels/test.tsv \
  --retrieve-k 100 --no-rerank \
  --out eval/results/beir_nfcorpus_bm25only.json
The result file is a single JSON file with nDCG@1, nDCG@5, nDCG@10, MRR@10, Recall@100, and MAP. The numbers in the table above are lifted verbatim from runs of this command.
BEIR canonical datasets are from the UKP Darmstadt mirror.
SciFact and FiQA swap in their own corpus / queries / qrels paths.
The SPLADE-v3 row adds --splade-model models/splade/splade-v3-8bit.onnx and drops --no-rerank.
Every metric we cite has a corresponding stored JSON in eval/results/.
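Once two result files exist, the lift between them is a subtraction. The sketch below is a minimal example, assuming each result file is a flat JSON object whose keys are the metric names listed above (for example `"nDCG@10"`). That layout is a guess, since the file format is not published yet, and the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Union


def load_metrics(path: Union[str, Path]) -> Dict[str, float]:
    """Read one result file and keep only its numeric entries.

    Assumes a flat JSON object such as {"nDCG@10": 0.687, ...};
    the real file layout may differ.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    return {
        name: float(value)
        for name, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def lift(before: Dict[str, float], after: Dict[str, float], metric: str = "nDCG@10") -> float:
    """Difference in one metric between two runs, rounded to 3 places."""
    return round(after[metric] - before[metric], 3)
```

Fed the two SciFact figures quoted earlier, 0.687 after keyword search and 0.742 after the re-read, `lift` would report 0.055.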
How we compare
Every search engine makes different trade-offs.
Here is what each one actually does.
| Feature | Google | Bing | Kagi | DDG | Argand |
|---|---|---|---|---|---|
| Generated answer boxes | | | | | |
| Owns its own index | | | | | |
| Drives traffic to source sites | | No | Partial | Mostly | |
| Behavioural ad targeting | | | No | No | |
| Anonymous searches end with the request | | No | Configurable | Yes | |
| Business model | | Behavioural ads | Subscription | Contextual ads | |
| Built in Rust | n/a | n/a | n/a | n/a | |
| Open source | | | | | |
Updated June 5, 2026. Argand is in active development, so its values reflect design intent and current implementation. Competitor notes are based on public documentation: Google AI in Search, Bing generative search, DuckDuckGo Search Assist, Kagi search sources.