Teardown · exa

EXA

Category: Search API for AI
Last round: $85M · 2025

Benchmark
Lightspeed
NVIDIA

UX wrapper

Public web index + custom embedding search + API endpoints.

Public data / API layer

Common Crawl · Public

GitHub public repos · Public

arXiv · Public

Wikipedia · Public

Publisher licensing deals · Licensed

Proprietary web crawl index · Scraped

Internal replication score

Easy

0.86

Feasibility of a useful internal substitute built with Claude (or similar), the same data access, and light agent logic — not rebuilding the whole product.

IRS = 0.30·D + 0.25·L + 0.20·O + 0.15·R + 0.10·S
This record · 86%

D · Data accessibility
weight 0.30 · score 0.95
- 1.0: mostly customer-owned / public / standard third-party sources
- 0.5: mixed accessibility
- 0.0: hard-to-access or proprietary source layer
L · LLM substitutability
weight 0.25 · score 0.85
- 1.0: mostly retrieve / prompt / cite / summarize / classify / compare
- 0.5: mixed standard + custom behavior
- 0.0: strongly custom model behavior (fine-tunes on proprietary data, etc.)
O · Output simplicity
weight 0.20 · score 0.90
- 1.0: straightforward internal work product (memo, list, reply, SQL query)
- 0.5: moderately specialized
- 0.0: highly specialized (e.g. FDA-graded clinical text)
R · Review / risk tolerance
weight 0.15 · score 0.85
- 1.0: internal use with human review is acceptable
- 0.5: moderate risk
- 0.0: very low tolerance for error (e.g. external legal filings)
S · Surface complexity
weight 0.10 · score 0.50 (inverse: higher means less surface dependence)
- 1.0: a simple internal shell is enough
- 0.5: polished workflow matters somewhat
- 0.0: product surface / rollout / trust posture is central to value

Labels: Easy ≥ 0.67 · Medium ≥ 0.34 · Hard < 0.34

Missing factor rows use heuristics from wrapper scores. Editorial heuristic, not investment advice.
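
As a quick check on the arithmetic, the weighted sum can be reproduced in a few lines. This is a minimal sketch that uses only the weights, factor scores, and label thresholds listed above; the function names `irs` and `label` are illustrative and not part of the source. With this record's scores the sum comes to about 0.855, which is consistent with the 0.86 / 86% shown.

```python
# Weights come from the IRS formula; scores are this record's factor values.
WEIGHTS = {"D": 0.30, "L": 0.25, "O": 0.20, "R": 0.15, "S": 0.10}
SCORES = {"D": 0.95, "L": 0.85, "O": 0.90, "R": 0.85, "S": 0.50}


def irs(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the five factor scores."""
    return sum(weights[k] * scores[k] for k in weights)


def label(score: float) -> str:
    """Map a score to the Easy / Medium / Hard bands."""
    if score >= 0.67:
        return "Easy"
    if score >= 0.34:
        return "Medium"
    return "Hard"


score = irs(SCORES)
print(f"IRS = {score:.3f} -> {label(score)}")
```

Because S is defined as inverse (higher means less surface dependence), it enters the sum with a positive weight like the other factors.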

Recreate the workflow inside your org.

Internal build

Build it yourself

Same public web data, embedding retrieval, and frontier LLM extraction. What it lacks is polish and agent integrations.

Internal use only. Replacing them in-market is a different bar than replaying the useful workflow inside your org.

01 · Connectors & flow

Common Crawl

GitHub public repos

arXiv

Wikipedia

Publisher licensing deals

Proprietary web crawl index

Internal build map

Data in: Connectors

Agent layer: Planner · Tools + retrieval · Reasoning model

Logic: LLM API · embedding search · retrieve · highlight extract · summarize · structured output · not custom weights

Outputs: Internal search · Answer · Citations
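
The retrieval half of the build map (embedding search, retrieve, highlight extract, structured output) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the two-document `INDEX`, the `example.com` URLs, the hand-written 3-dimensional vectors standing in for real embeddings, and the keyword-based `highlight` function standing in for a proper extractor. The summarize step, which would call an LLM API, is left out.

```python
import math
import re

# Toy index of (title, url, text, embedding). URLs and vectors are placeholders;
# a real build would store vectors produced by an embedding model.
INDEX = [
    ("Chip export rules", "https://example.com/chips",
     "New export rules target advanced chips. Weather was mild.", [0.9, 0.1, 0.0]),
    ("Garden tips", "https://example.com/garden",
     "Tomatoes need sun. Water them daily.", [0.0, 0.2, 0.9]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, k=1):
    """Rank indexed pages by similarity to the query embedding; keep the top k."""
    scored = [(cosine(query_vec, doc[3]), doc) for doc in INDEX]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def highlight(text, query_terms):
    """Keep only sentences that mention a query term (crude highlight extraction)."""
    terms = {t.lower() for t in query_terms}
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if terms & set(re.findall(r"\w+", s.lower()))]


def search(query_vec, query_terms, k=1):
    """Return structured results: title, URL, relevance score, highlights."""
    return [
        {"title": doc[0], "url": doc[1], "score": round(score, 3),
         "highlights": highlight(doc[2], query_terms)}
        for score, doc in retrieve(query_vec, k)
    ]


results = search([1.0, 0.0, 0.0], ["chips", "export"])
print(results)
```

Returning excerpts rather than full pages keeps the payload passed to the reasoning model small, which is the point of the highlight step.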

02 · Claude / agent prompt

Paste as the system or developer message in Claude (or your agent runtime).

// Web search assistant for internal use

You are a web search assistant inside [YOUR_COMPANY]. You help [YOUR_TEAM] retrieve and synthesize information from the public web using ONLY accessible data sources: Common Crawl snapshots, GitHub public repos, arXiv papers, Wikipedia dumps, and standard web APIs.

## What you must do

1. Retrieve first: query the company's internal web index (built from Common Crawl + curated sources) using semantic search over embeddings. Return ranked URLs with relevance scores.
2. Extract efficiently: use highlights or token-efficient extraction to surface only the relevant excerpts from each page. Avoid returning full HTML.
3. Cite rigorously: every claim must link back to a specific URL and excerpt. Use structured citations with page title, URL, and quoted text.
4. Scope: handle queries about current events, technical documentation, company research, and people/company metadata, i.e. anything findable on the public web.

## What you are not

Not a replacement for subscription databases or licensed corpora. Internal use only; no external distribution of results.

## Refusal

Refuse if the query requires paywalled content, non-public APIs, or data the company does not have crawl rights to. Ask for more context if the query is ambiguous.

## Safety

Internal posture: results are for research and analysis, not external publication. Human review required before using extracted data in customer-facing products.
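
The "cite rigorously" rule in the prompt can be backed by a check outside the model. The sketch below is one possible shape, not a prescribed implementation: the `Citation` and `Claim` types, the `unsupported_claims` function, and the `example.com` data are all hypothetical. It flags any claim that has no citation, or whose quoted text does not appear verbatim in the retrieved page for the cited URL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    title: str   # page title
    url: str     # source URL
    quote: str   # excerpt quoted from the page


@dataclass(frozen=True)
class Claim:
    text: str
    citations: tuple = ()


def unsupported_claims(claims, pages):
    """Return claims with no citation, or with a quote missing from the cited page.

    `pages` maps URL -> retrieved page text.
    """
    problems = []
    for claim in claims:
        supported = bool(claim.citations) and all(
            c.quote and c.quote in pages.get(c.url, "") for c in claim.citations
        )
        if not supported:
            problems.append(claim)
    return problems


# Placeholder data for illustration only.
pages = {"https://example.com/a": "The index is rebuilt weekly from public crawls."}
good = Claim("The index is rebuilt weekly.",
             (Citation("Page A", "https://example.com/a", "rebuilt weekly"),))
bad_quote = Claim("The index is rebuilt daily.",
                  (Citation("Page A", "https://example.com/a", "rebuilt daily"),))
no_cite = Claim("The index covers everything.")

flagged = unsupported_claims([good, bad_quote, no_cite], pages)
print([c.text for c in flagged])
```

A verbatim substring match is strict by design; it will reject paraphrased quotes, which suits the human-review posture described in the prompt.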

03 · Result

latest news about semiconductor trade policy

Common Crawl + news site index

Recent articles from Reuters, Bloomberg, CNBC — all publicly crawled and semantically ranked.