Our goal is to know what federal contractors will do before they act—and turn that intelligence into advantage for American industry.
The U.S. federal government is the largest buyer on Earth. Each year, it awards over $700 billion in contracts — more than the GDP of Sweden. For the thousands of companies competing for this business, the central question is deceptively simple: "If I bid on this contract, will I win?"
The stakes are high. Preparing a federal proposal costs real money — sometimes tens of thousands of dollars for a routine bid, sometimes millions for a major defense program. Companies pour resources into capture management, compliance documentation, and pricing strategies, often with no idea whether they're the frontrunner or a long shot.
Historically, answering the "will I win?" question required intuition, relationships, and educated guesses. Incumbent contractors had an advantage because they knew their own win rates. Everyone else was flying blind.
Anvil o1 changes this. It replaces guesswork with prediction — using machine learning trained on millions of historical contract awards to forecast who will win future solicitations.
Every federal contract has two distinct moments in time, recorded in two separate government databases: the solicitation, posted publicly on SAM.gov, and the award, recorded in FPDS.
Here's the problem: these two systems don't talk to each other. There's no official field that links an award back to its original solicitation. The government publishes "we want to buy 500 connectors" in one place and "we bought 500 connectors from Acme Corp for $50,000" in another place, but never explicitly says "this award fulfilled that solicitation."
This disconnect creates an opportunity. If you can connect them — match each award to its original solicitation — you create something powerful: a labeled dataset where the input is "what the government asked for" and the label is "who won."
That's supervised learning. And that's exactly what Anvil o1 does.
We didn't pick the Defense Logistics Agency at random. DLA is the ideal starting point for building a contract prediction model:
DLA is the Pentagon's supply chain manager. It purchases everything the military needs: fuel, food, clothing, medical supplies, and — most importantly — spare parts. Millions of spare parts. DLA's 16+ million contracts over the past decade give us a massive training corpus.
DLA contracts follow predictable patterns. Most are for commodity items with National Stock Numbers (NSNs). The model can learn "when DLA needs Type X connectors, Vendor Y usually wins."
The same items get purchased repeatedly. DLA might buy the same O-ring NSN dozens of times per year. This repetition is gold for machine learning.
DLA contracts are typically fixed-price awards with a clear winner. The strategic logic: prove the approach works on DLA's high-volume procurements, then expand to other agencies.
Model Design, Approach, and Overview
Anvil o1 treats contract prediction as a learning-to-rank problem. Given a solicitation, the model scores every vendor in the database and returns a ranked list of likely winners. Under the hood, it's a gradient boosting model (LightGBM) trained with LambdaRank — an algorithm originally developed for search engine ranking that optimizes directly for putting the right answer near the top. The model learns from 90,000 historical DLA contract awards, extracting patterns from hand-crafted features: TF-IDF vectors from solicitation text, vendor win rates by product category, historical pricing, and geographic signals. When a new solicitation arrives, o1 computes these features, runs them through the trained model, and outputs probability scores for each vendor. The winner is in the top 10 predictions 56% of the time — 12,800x better than random.
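To make the mechanics concrete, here is a minimal sketch of training a LightGBM LambdaRank ranker on solicitation-vendor pairs. The feature names and file path are illustrative assumptions, not Anvil's production pipeline.

```python
# Minimal LightGBM LambdaRank sketch (illustrative; not the production pipeline).
# Assumes a table of (solicitation, candidate vendor) rows with a binary "won"
# label and precomputed features such as TF-IDF similarity and vendor win rates.
import lightgbm as lgb
import pandas as pd

df = pd.read_parquet("linked_pairs_features.parquet")  # hypothetical file

feature_cols = [
    "tfidf_similarity",      # text match between solicitation and vendor history
    "vendor_psc_win_rate",   # vendor win rate in this product category
    "vendor_nsn_wins",       # prior wins on this exact NSN
    "office_vendor_awards",  # awards from this contracting office to this vendor
]

# LambdaRank needs group sizes: one group per solicitation,
# containing all candidate vendors scored for it.
df = df.sort_values("solicitation_id")
group_sizes = df.groupby("solicitation_id", sort=False).size().to_list()

train_set = lgb.Dataset(df[feature_cols], label=df["won"], group=group_sizes)

params = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [10],    # optimize for getting the winner into the top 10
    "learning_rate": 0.05,
    "num_leaves": 63,
}
ranker = lgb.train(params, train_set, num_boost_round=500)

# At prediction time: score every candidate vendor for a new solicitation
# and keep the 10 highest-scoring ones.
```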
Anvil o2 explores whether a fine-tuned language model can add value as an analyst layer on top of traditional ranking. We fine-tune Mistral 7B on 90,000 DLA contracts using QLoRA (4-bit quantization + LoRA adapters), teaching the model to generate predictions with explanations. The hypothesis: LLMs might identify semantic patterns in solicitation text that TF-IDF features miss. The limitation is structural: an LLM alone can't reliably predict specific vendors — it has no live database, can't validate eligibility, and its knowledge freezes at training time. Where o2 shows promise is as a reasoning layer: given o1's ranked shortlist, o2 can explain why certain vendors are strong candidates, surface uncertainty, and make predictions interpretable. The explanation is generated alongside the prediction (not post-hoc), but should be treated as a rationale rather than causal proof.
Anvil o3 combines the best of both approaches into a unified architecture. o3a takes the LightGBM ranker to the next level: more training data, richer feature engineering (CPARS performance ratings, subcontractor networks, protest history), and hyperparameter optimization targeting near-perfect recall in the top-50 candidates. The goal is a ranker that almost never misses the winner. o3b is a fine-tuned LLM (Mistral or Llama) purpose-built for analyst explanations — given o3a's shortlist, it generates human-readable reasoning about why each candidate is strong or weak, surfaces risk factors, and provides confidence calibration. The combination: o3a handles candidate generation with maximum recall, o3b handles interpretation with maximum clarity. Ranking + Reasoning. Prediction + Explanation. Anvil o3.
We ingest the complete DLA contract archive from FPDS, the government's official record of contract spending. Each award record includes the winning vendor, the dollar amount, product and industry codes (PSC and NAICS), the contracting office, and the award date.
This data is public. Anyone can download it from USASpending.gov or query the FPDS API. The raw information isn't the competitive advantage — the linkage is.
Separately, we pull solicitations from SAM.gov (formerly FedBizOpps). Each solicitation contains the product description, quantities and delivery terms, product and industry codes, and any set-aside restrictions.
Solicitation text varies in quality. Some are detailed, others sparse. The model has to work with what it gets.
This is the hard part — and where Anvil's proprietary value lies.
Awards and solicitations don't share a common identifier. You can't just join on a key. Instead, the connection has to be inferred from overlapping signals: matching item identifiers and product codes, the same contracting office, plausible timing between posting and award, and quantities that line up.
This is a probabilistic matching problem. Not every link is certain. We apply confidence thresholds and validate a sample manually to ensure quality.
The result: ~90,000 high-confidence linked pairs where we know both what the government asked for and who won.
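A simplified sketch of what this kind of probabilistic matching can look like is below. The fields, weights, and confidence threshold are illustrative assumptions rather than Anvil's actual linkage rules.

```python
# Simplified solicitation-to-award matching sketch (weights and threshold illustrative).
from datetime import timedelta

def match_score(solicitation, award):
    """Score how likely an FPDS award fulfills a SAM.gov solicitation."""
    score = 0.0
    # Same item identifier (e.g., NSN) is the strongest signal.
    if solicitation["nsn"] and solicitation["nsn"] == award["nsn"]:
        score += 0.5
    # Same product category and contracting office.
    if solicitation["psc"] == award["psc"]:
        score += 0.2
    if solicitation["office_id"] == award["office_id"]:
        score += 0.2
    # The award date should fall within a plausible window after posting.
    lag = award["award_date"] - solicitation["posted_date"]
    if timedelta(days=0) <= lag <= timedelta(days=180):
        score += 0.1
    return score

def link(solicitation, candidate_awards, threshold=0.8):
    """Return the best-matching award only if it clears the confidence threshold."""
    best = max(candidate_awards, key=lambda a: match_score(solicitation, a), default=None)
    if best is not None and match_score(solicitation, best) >= threshold:
        return best
    return None  # no high-confidence link; drop the pair rather than guess
```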
o2 builds on the same DLA solicitation-award pairs that powered o1.
The difference isn't the data — it's what we do with it. o2 transforms these pairs into a format that teaches a language model to reason about procurement.
For each training example, we compute the winning vendor's historical context: prior wins on the same items, wins in the same product category, and their overall track record with DLA.
This context teaches the model why certain vendors win — incumbency, category expertise, and overall DLA relationships.
Each example becomes an instruction-tuning prompt in Alpaca format: an instruction defining the analyst task, an input carrying the solicitation details and vendor context, and an output naming the predicted winner with reasoning and a confidence level.
The result: ~85,000 instruction-tuning examples that teach Mistral 7B to predict DLA contract winners with explicit reasoning.
Each training example is a JSON record:
| Field | Description |
|---|---|
| text | The raw solicitation text — product description, quantities, delivery terms |
| psc | Product Service Code — "25" means Vehicular Equipment Components |
| naics | Industry classification — "336390" is Other Motor Vehicle Parts Manufacturing |
| set_aside | Socioeconomic restriction, if any (small business, SDVOSB, 8(a), HUBZone) |
| vendor | The company that won — this is our prediction target |
| amount | Dollar value of the award |
The vendor field is what we're trying to predict. Given a new solicitation, which vendor will win?
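For concreteness, a record in this format might look like the following; the values are invented for illustration.

```python
# Illustrative linked training record (values invented for illustration).
example = {
    "text": "Brake shoe set, vehicular, qty 500, FOB destination, "
            "delivery 120 days after award",
    "psc": "25",            # Vehicular Equipment Components
    "naics": "336390",      # Other Motor Vehicle Parts Manufacturing
    "set_aside": "Total Small Business Set-Aside",
    "vendor": "Acme Corp",  # the prediction target
    "amount": 50000.00,
}
```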
To understand what 56.2% top-10 accuracy means in practice, we need to establish a baseline.
DLA has awarded contracts to approximately 227,000 unique vendors over the training period. These range from Lockheed Martin to one-person machine shops. Most vendors win only a handful of contracts; a small number win thousands.
If you picked 10 vendors at random from this pool, what's the probability that the actual winner is among them?
That's 10 ÷ 227,000, or roughly 0.004%. Essentially zero.
Anvil's top-10 accuracy is 56.2%. The winner is in the model's top 10 predictions more than half the time.
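The arithmetic behind that comparison, using the numbers above:

```python
# Random top-10 baseline vs. model lift.
vendors = 227_000
baseline = 10 / vendors        # ≈ 0.000044, i.e. about 0.004%
model_top10 = 0.562
lift = model_top10 / baseline  # ≈ 12,800x better than random
```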
The right question isn't "is 56% good enough?" — it's "can we improve it?" We track performance against the random baseline and prior model versions to ensure each iteration moves the needle.
We've identified two primary areas for future gains:
1. Better tail handling: Many contracts are won by vendors with limited history. Enhanced feature engineering for sparse data and transfer learning from related vendors can help.
2. Richer input signals: Adding past performance ratings, subcontracting patterns, and SAM.gov capability statements could give the model more to work with.
The current model is a baseline. Each of these improvements is a concrete step on the roadmap.
o2 uses Alpaca-style instruction tuning format for chat fine-tuning:
| Field | Description |
|---|---|
| instruction | System prompt defining the analyst role and prediction task |
| input | Structured solicitation details with PSC, NAICS, set-aside, text, and top category vendors |
| output | Predicted winner with explicit reasoning and confidence level |
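As a concrete illustration of this format, a single training example might look like the sketch below; the wording and numbers are invented, not drawn from the actual training set.

```python
# Illustrative Alpaca-format training example (contents invented).
alpaca_example = {
    "instruction": (
        "You are a federal procurement analyst. Given a DLA solicitation and "
        "historical context, predict the most likely winning vendor and explain why."
    ),
    "input": (
        "PSC: 25 (Vehicular Equipment Components)\n"
        "NAICS: 336390\n"
        "Set-aside: Total Small Business\n"
        "Solicitation: Brake shoe set, vehicular, qty 500, FOB destination.\n"
        "Top vendors in this category: Acme Corp (38 wins), Vendor B (21 wins), "
        "Vendor C (9 wins)."
    ),
    "output": (
        "Predicted winner: Acme Corp\n"
        "- Most wins in PSC 25 over the training window\n"
        "- Eligible under the small business set-aside\n"
        "- Prior awards from this contracting office\n"
        "Confidence: High"
    ),
}
```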
The model learns to generate predictions with explanations — not just who will win, but why they're likely to win based on category expertise and DLA relationships.
o1 uses gradient boosting on tabular features. o2 takes a fundamentally different approach: fine-tune a language model to reason about procurement like an expert analyst.
1. Reasoning patterns: The model learns to articulate why incumbents win, why PSC expertise matters, and how set-asides constrain competition.
2. Context integration: By including top vendors in each category as part of the input, the model learns competitive dynamics within markets.
3. Confidence calibration: Training outputs include confidence levels (High/Medium/Low) based on signal strength.
4. Natural language understanding: The model processes solicitation text directly, capturing nuances that feature extraction might miss.
The PSC (Product Service Code) is one of the strongest predictors. Federal procurement is highly segmented — vendors specialize. A company that makes O-rings doesn't bid on turbine blades.
Set-aside status dramatically narrows the candidate pool. If a solicitation is marked "Total Small Business Set-Aside (FAR 19.5)," large contractors like Lockheed or Boeing are ineligible.
The model uses set-aside as a hard filter — certain vendors become impossible predictions.
For each vendor in the training data, we compute historical win statistics: total DLA wins, wins within each product category (PSC), wins on the exact NSN, and wins with each contracting office.
Historical winners tend to be future winners. Incumbency is real.
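A minimal pandas sketch of how these statistics can be computed from the award archive; the column names and file path are assumptions.

```python
# Sketch: vendor history features from the award archive (column names assumed).
import pandas as pd

awards = pd.read_parquet("dla_awards.parquet")  # hypothetical file

# Overall wins and wins per product category for each vendor.
vendor_wins = awards.groupby("vendor").size().rename("total_wins")
vendor_psc_wins = (
    awards.groupby(["vendor", "psc"]).size().rename("psc_wins").reset_index()
)

# Incumbency on the exact item: how many times has this vendor won this NSN?
vendor_nsn_wins = (
    awards.groupby(["vendor", "nsn"]).size().rename("nsn_wins").reset_index()
)
```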
The solicitation text contains signal: product terminology, quantities, and requirements language, which o1 captures through TF-IDF features.
Contracting offices develop relationships with vendors. An office that has awarded 50 contracts to Vendor X in the past two years is more likely to award the 51st.
The model learns office-vendor affinities.
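A short sketch of the same idea for office-vendor affinity, again with assumed column names.

```python
# Sketch: office-vendor affinity feature (column names assumed).
import pandas as pd

awards = pd.read_parquet("dla_awards.parquet")  # hypothetical file

# Prior awards from each contracting office to each vendor, used as a
# feature such as office_vendor_awards in the ranker.
office_affinity = (
    awards.groupby(["office_id", "vendor"])
    .size()
    .rename("office_vendor_awards")
    .reset_index()
)
```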
Unlike o1, o2 doesn't extract tabular features. The language model processes raw solicitation text directly.
The transformer architecture learns its own internal representations of what matters in solicitation text.
Instead of computing features at prediction time, vendor statistics are embedded directly in the training prompts as natural language, for example a vendor's win count in the relevant PSC category.
The model learns to interpret these statistics as part of natural language input.
Each prompt follows a consistent structure that the model learns to parse: PSC and NAICS codes, set-aside status, the solicitation text, and the top vendors in the category.
Consistent formatting helps the model learn what signals to attend to.
Training outputs include structured reasoning that the model learns to generate: the predicted winner, the evidence behind the pick, and a confidence level.
Each prediction includes a confidence tier (High/Medium/Low) derived from signal strength.
The model learns to associate confidence with the strength of supporting evidence.
LightGBM gradient boosting with LambdaRank
Mistral 7B fine-tuned with QLoRA
Anvil o1 uses a gradient boosting ensemble — specifically, LightGBM optimized for ranking (LambdaRank objectives).
Why not deep learning? For tabular data with mixed features, gradient boosting still beats neural networks. It offers fast training, strong performance on heterogeneous and sparse features, native handling of missing values, and interpretable feature importances.
We frame this as learning to rank, not classification. For each solicitation, generate a ranked list of vendors. Evaluate success by whether the winner appears in the top K.
The model learns a scoring function: given (solicitation features, vendor features), output a relevance score.
We don't score all 227K vendors. First, we filter to a candidate pool: vendors with prior wins in the solicitation's PSC category or on the exact NSN, vendors eligible under any set-aside restriction, and vendors with a history at the issuing contracting office.
This reduces candidates to 500-5,000 vendors, which the model ranks.
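A simplified sketch of candidate pooling followed by ranking; the filter logic and helper names are illustrative, not the production rules.

```python
# Simplified candidate generation + ranking sketch (filters and helpers illustrative).
import numpy as np

def candidate_pool(solicitation, vendors):
    """Narrow ~227K vendors down to a few hundred plausible candidates."""
    pool = []
    for v in vendors:
        # Hard filter: a small business set-aside removes ineligible vendors outright.
        if solicitation["set_aside"] == "small_business" and not v["is_small_business"]:
            continue
        # Keep vendors with history in this PSC category, on this exact NSN,
        # or with this contracting office.
        if (
            solicitation["psc"] in v["psc_wins"]
            or solicitation["nsn"] in v["nsn_wins"]
            or solicitation["office_id"] in v["office_awards"]
        ):
            pool.append(v)
    return pool

def rank_vendors(solicitation, vendors, featurize, ranker, k=10):
    """Score the candidate pool with the trained ranker and return the top k."""
    pool = candidate_pool(solicitation, vendors)
    features = np.array([featurize(solicitation, v) for v in pool])
    scores = ranker.predict(features)
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```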
mistralai/Mistral-7B-Instruct-v0.3
o2 is built on Mistral 7B Instruct v0.3 — a 7 billion parameter language model optimized for instruction-following.
Mistral's sliding window attention and efficient architecture make it suitable for the structured prompts in our training data.
We use QLoRA (Quantized Low-Rank Adaptation) for efficient fine-tuning: the base model is quantized to 4 bits, and small low-rank adapter matrices are trained on top while the base weights stay frozen.
QLoRA lets us fine-tune a 7B model on a single A100 GPU with ~20GB VRAM.
Training runs for 1 epoch of supervised fine-tuning (SFT) over the ~85,000 instruction-tuning examples.
Training takes ~4-6 hours on A100 (40GB) or ~10-12 hours on T4.
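A minimal configuration sketch using Hugging Face transformers and peft follows. The specific LoRA hyperparameters (rank, alpha, target modules) are illustrative choices, and the trainer wiring is omitted because it varies by library version.

```python
# QLoRA setup sketch for Mistral-7B-Instruct-v0.3 (hyperparameters illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit NF4 quantization keeps the frozen base model small enough for one GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Small trainable LoRA adapters on the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, one epoch of supervised fine-tuning over the ~85K Alpaca-format
# examples can be run with a standard SFT trainer.
```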
At inference time, the fine-tuned model generates predictions with reasoning:
The model outputs structured predictions with vendor names, reasoning bullets, and confidence levels — ready for human review or downstream processing.
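A minimal generation sketch, assuming the fine-tuned model and tokenizer from the previous sketch and a prompt that mirrors the training format (the prompt contents are invented):

```python
# Inference sketch: generate a prediction with reasoning (prompt text illustrative).
prompt = (
    "You are a federal procurement analyst. Given a DLA solicitation and "
    "historical context, predict the most likely winning vendor and explain why.\n\n"
    "PSC: 25 (Vehicular Equipment Components)\n"
    "Set-aside: Total Small Business\n"
    "Solicitation: Brake shoe set, vehicular, qty 500, FOB destination.\n"
    "Top vendors in this category: Acme Corp (38 wins), Vendor B (21 wins).\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,  # deterministic output for reproducible predictions
)
# Decode only the newly generated tokens (the prediction and its reasoning).
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```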
The single strongest predictor is "has this vendor won this exact NSN before?" If yes, they're the favorite. Federal procurement is sticky — agencies prefer known quantities.
About 23% of DLA contracts have small business set-asides, and performance varies with set-aside status.
Set-asides make prediction easier by constraining the candidate pool.
Certain categories are more predictable than others:
| PSC | Category | Top-10 Accuracy |
|---|---|---|
| 59 | Electrical Components | 68% |
| 53 | Hardware & Abrasives | 61% |
| 16 | Aircraft Components | 54% |
| 84 | Clothing & Textiles | 49% |
The model can't observe bid prices. DLA uses LPTA evaluation — price often decides. This creates a theoretical ceiling on accuracy around 70-80%.
When a vendor wins their first contract in a PSC category, the model rarely predicts them. About 8% of awards go to first-time winners, which puts a floor on our error rate and a ceiling on achievable accuracy.
Five ways to understand how Anvil o1 predicts federal contract winners.
Projected performance improvements for Anvil o2.
Does the model know what it knows?
A well-calibrated model's confidence scores reflect true probabilities. When Anvil o1 predicts a vendor has a 30% chance of winning, they should win approximately 30% of the time.
The diagonal dashed line represents perfect calibration. Points close to this line indicate the model's confidence scores are reliable and actionable.
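One way to compute such a reliability curve with scikit-learn, assuming test-set arrays of predicted win probabilities and actual outcomes (the file names are placeholders):

```python
# Calibration check sketch: do predicted win probabilities match reality?
import numpy as np
from sklearn.calibration import calibration_curve

# y_true: 1 if the vendor won, 0 otherwise; y_prob: predicted win probability.
y_true = np.load("test_labels.npy")  # hypothetical test-set arrays
y_prob = np.load("test_probs.npy")

frac_wins, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
for p, f in zip(mean_pred, frac_wins):
    print(f"predicted ~{p:.0%} -> actually won {f:.0%} of the time")
```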
How much signal is in the top predictions?
This chart answers: "If I only review the model's top X% of predictions, what percentage of actual winners will I capture?"
The steeper the curve rises above the diagonal baseline, the better the model concentrates winners at the top of its rankings.
See predictions vs. actual outcomes
These are real solicitations from our test set. For each, we show Anvil's top-5 predicted vendors with confidence scores, and mark the actual winner.
The green checkmark indicates who actually won the contract. Notice how often the winner appears in the top 5 predictions.
How much better than random?
Lift measures how many times more likely you are to find a winner using model predictions versus random selection, at each decile of the ranked list.
A lift of 5x in the first decile means the top 10% of predictions contain 5 times more winners than you'd expect by chance.
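A sketch of the lift-by-decile computation, reusing the test-set arrays from the calibration sketch above:

```python
# Lift-by-decile sketch: how concentrated are winners at the top of the ranking?
import pandas as pd

results = pd.DataFrame({"score": y_prob, "won": y_true})
# Rank first so ties don't break the decile split, then cut into 10 equal buckets.
results["decile"] = pd.qcut(results["score"].rank(method="first"), 10, labels=False)

overall_rate = results["won"].mean()
lift = results.groupby("decile")["won"].mean() / overall_rate
print(lift.sort_index(ascending=False))  # decile 9 holds the top-scored 10% of predictions
```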
Higher confidence = higher win rate
We group predictions by confidence level and measure the actual win rate in each bucket. This shows that confidence scores are meaningful—high confidence predictions really do win more often.
This is actionable intelligence: focus resources on opportunities where Anvil shows high confidence.
o2 generates predictions with natural language reasoning — a fundamentally different approach than o1's ranking scores.
o2 generates reasoning alongside each prediction — why the vendor is likely to win based on PSC expertise, DLA history, and set-aside eligibility.
The model sees top vendors in each PSC category as part of the input, learning competitive dynamics within markets.
Each prediction includes a confidence level (High/Medium/Low) derived from the strength of supporting evidence.
Output is structured but human-readable — ready for analyst review without needing to interpret numeric scores.
Built on Mistral's pretrained knowledge of language, contracts, and business — fine-tuned specifically for DLA procurement.
Incumbent contractors have always had an advantage — they know their win rates, their competitors, their agency relationships. New entrants are at an information disadvantage. Anvil changes this. A startup entering federal contracting can now see the competitive landscape with the same fidelity as a 20-year incumbent. This is democratizing.
If contractors can better predict their odds, they'll bid more selectively. This means fewer low-quality bids (good for agencies), more competitive bids on winnable opportunities (bad for incumbents), and better resource allocation industry-wide.
If vendors can see that a particular contracting office awards 80% of its business to one vendor, that's interesting. Maybe justified, maybe worth scrutinizing. The model makes patterns visible that were previously buried in millions of transaction records.
The proof-of-concept worked on DLA. The next step is generalizing to other agencies.
Each agency has its own procurement culture. Models will likely need per-agency training, at least initially.
Current features are primarily structured data. We're leaving signal on the table: past performance ratings (CPARS), subcontracting relationships, protest histories, and SAM.gov capability statements.
Currently, the model is trained offline on historical data. The vision is real-time scoring: the moment a solicitation posts on SAM.gov, Anvil ranks vendors and pushes alerts. This requires live SAM.gov monitoring, low-latency inference, and push notification infrastructure. It's engineering, not science — the hard ML work is done.
The biggest limitation is not knowing bid prices. If we could model price distributions — "Vendor X typically bids 15% above cost in this category" — we could predict winners even more accurately.
Price data isn't public, but some vendors might share their bid histories in exchange for insights. This creates a data network effect: the more vendors participate, the better the model gets for everyone.
The model is trained on DLA data. It reflects DLA's procurement patterns, DLA's vendor base, DLA's contracting offices. Predictions for non-DLA opportunities should be treated skeptically until we train agency-specific models.
DLA buys commodities — parts, supplies, consumables. The model won't work for major weapons systems (different evaluation, different dynamics), professional services (subjective evaluation criteria), or R&D contracts (unpredictable by nature).
We're good at predicting who wins the O-ring contract. We're not trying to predict who wins the next fighter jet program.
ML models learn from history. If procurement patterns shift — new vendors enter, incumbents exit, policy changes — the model's accuracy degrades until retrained. We retrain quarterly to stay current.
56% top-10 accuracy means 44% of the time, the winner isn't in the top 10. The model provides probabilistic guidance, not certainty. Treat predictions as one input among many in bid/no-bid decisions, not as gospel.
We predict who's likely to win based on structural factors. We can't predict who will submit the lowest price. For LPTA competitions, price often decides — and that's outside our visibility.
Anvil o1 solves a data integration problem that unlocks predictive capability. By mining DLA's archive of 16+ million contract awards and linking roughly 90,000 of them to their original solicitations with high confidence, we create supervised training data that teaches a model to answer: "Given what the government is asking for, who will win?"
Current performance: 56% top-10 accuracy. This is a starting point, not a ceiling. o2 adds natural language reasoning on top of o1's rankings — but it's the analyst layer, not the prediction engine.
Anvil o3 is the model we actually want in production: a system that combines a best-in-class ranker with a best-in-class analyst layer.
o3a is the "perfect ranker" project — o1 taken to the next level:
The goal is simple: get the right winner into the shortlist as often as possible.
o3b is a fine-tuned LLM designed to operate on top of o3a: given o3a's shortlist, it generates human-readable reasoning about why each candidate is strong or weak, surfaces risk factors, and provides confidence calibration.
Together, o3a + o3b create a system that is both accurate in its rankings and clear in its explanations.
That's Anvil o3: a contract prediction engine that behaves like a model and communicates like an expert.