Anvil - Federal Contract Intelligence

$700 Billion

The Problem

The Stakes i

The U.S. federal government is the largest buyer on Earth. Each year, it awards over $700 billion in contracts — more than the GDP of Sweden. For the thousands of companies competing for this business, the central question is deceptively simple: "If I bid on this contract, will I win?"

The Cost i

The stakes are high. Preparing a federal proposal costs real money — sometimes tens of thousands of dollars for a routine bid, sometimes millions for a major defense program. Companies pour resources into capture management, compliance documentation, and pricing strategies, often with no idea whether they're the frontrunner or a long shot.

The Solution i

Historically, answering the "will I win?" question required intuition, relationships, and educated guesses. Incumbent contractors had an advantage because they knew their own win rates. Everyone else was flying blind.

Anvil o1 changes this. It replaces guesswork with prediction — using machine learning trained on millions of historical contract awards to forecast who will win future solicitations.

Two Databases

The Core Insight

Labeled Data i

Every federal contract has two distinct moments in time, recorded in two separate government databases:

The solicitation — when the government posts a request for bids on SAM.gov, describing what it wants to buy
The award — when the government picks a winner, recorded weeks or months later in FPDS (Federal Procurement Data System)

Here's the problem: these two systems don't talk to each other. There's no official field that links an award back to its original solicitation. The government publishes "we want to buy 500 connectors" in one place and "we bought 500 connectors from Acme Corp for $50,000" in another place, but never explicitly says "this award fulfilled that solicitation."

This disconnect creates an opportunity. If you can connect them — match each award to its original solicitation — you create something powerful: a labeled dataset where the input is "what the government asked for" and the label is "who won."

That's supervised learning. And that's exactly what Anvil o1 does.

16M+ Contracts

Why Start with the DLA?

We didn't pick the Defense Logistics Agency at random. DLA is the ideal starting point for building a contract prediction model:

Volume i

DLA is the Pentagon's supply chain manager. It purchases everything the military needs: fuel, food, clothing, medical supplies, and — most importantly — spare parts. Millions of spare parts. DLA's 16+ million contracts over the past decade give us a massive training corpus.

Standardization i

DLA contracts follow predictable patterns. Most are for commodity items with National Stock Numbers (NSNs). The model can learn "when DLA needs Type X connectors, Vendor Y usually wins."

Repeatability i

The same items get purchased repeatedly. DLA might buy the same O-ring NSN dozens of times per year. This repetition is gold for machine learning.

Clear Outcomes i

DLA contracts are typically fixed-price awards with a clear winner. The strategic logic: prove the approach works on DLA's high-volume procurements, then expand to other agencies.

Anvil o1, o2 Thinking, o3, and o4

Model Design, Approach, and Overview

Anvil o1: Learning to Rank with LightGBM

Anvil o1 treats contract prediction as a learning-to-rank problem. Given a solicitation, the model scores every vendor in the database and returns a ranked list of likely winners. Under the hood, it's a gradient boosting model (LightGBM) trained with LambdaRank — an algorithm originally developed for search engine ranking that optimizes directly for putting the right answer near the top. The model learns from 90,000 historical DLA contract awards, extracting patterns from hand-crafted features: TF-IDF vectors from solicitation text, vendor win rates by product category, historical pricing, and geographic signals. When a new solicitation arrives, o1 computes these features, runs them through the trained model, and outputs probability scores for each vendor. The winner is in the top 10 predictions 56% of the time — 12,800x better than random.

Anvil o2 Thinking: Instruction-Tuned LLM Reasoning

Anvil o2 explores whether a fine-tuned language model can add value as an analyst layer on top of traditional ranking. We fine-tune Mistral 7B on 90,000 DLA contracts using QLoRA (4-bit quantization + LoRA adapters), teaching the model to generate predictions with explanations. The hypothesis: LLMs might identify semantic patterns in solicitation text that TF-IDF features miss. The limitation is structural: an LLM alone can't reliably predict specific vendors — it has no live database, can't validate eligibility, and its knowledge freezes at training time. Where o2 shows promise is as a reasoning layer: given o1's ranked shortlist, o2 can explain why certain vendors are strong candidates, surface uncertainty, and make predictions interpretable. The explanation is generated alongside the prediction (not post-hoc), but should be treated as a rationale rather than causal proof.

Anvil o3 Roadmap: The Full Stack

Anvil o3 combines the best of both approaches into a unified architecture. o3a takes the LightGBM ranker to the next level: more training data, richer feature engineering (CPARS performance ratings, subcontractor networks, protest history), and hyperparameter optimization targeting near-perfect recall in the top-50 candidates. The goal is a ranker that almost never misses the winner. o3b is a fine-tuned LLM (Mistral or Llama) purpose-built for analyst explanations — given o3a's shortlist, it generates human-readable reasoning about why each candidate is strong or weak, surfaces risk factors, and provides confidence calibration. The combination: o3a handles candidate generation with maximum recall, o3b handles interpretation with maximum clarity. Ranking + Reasoning. Prediction + Explanation. Anvil o3.

Anvil o4: DoD-Wide Prediction at Scale

Anvil o4 expands beyond DLA to cover the entire Department of Defense. Trained on 2.3 million DoD contracts from FPDS-NG (FY2020-2024), o4 uses a two-stage prediction pipeline. First, it parses the Notice ID prefix to identify the contracting office (e.g., SPE7M225T3623 → DLA office SPE7M2), achieving 85% coverage on live SAM.gov opportunities. Second, it retrieves vendors with historical wins at that office+PSC combination and ranks them using 12 features: wins at office+PSC, total contract value, wins by office, wins by PSC, overall volume, win share percentage, and recency signals. The model achieves 49% recall @10 on 338K test contracts — with near-perfect recall (92-100%) in specialized markets with 5-25 vendors. O4 prioritizes coverage and scalability: it can generate predictions for any DoD opportunity without requiring exact solicitation text matching.

The Data Pipeline

Key difference: o2 uses the same 85K DLA pairs as o1, but restructures them for language model training. Each solicitation-award pair becomes an instruction-tuning example where the model learns to predict winners while generating explicit reasoning about why they win.

Key difference: o4 bypasses solicitation linking entirely. Instead of matching SAM.gov postings to FPDS awards, o4 trains directly on 2.3M FPDS records — extracting office codes from contract IDs and building lookup tables of vendor wins by office+PSC. No text matching required.

Step 1

Collect the awards

We ingest the complete DLA contract archive from FPDS — the government's official record of contract spending. This includes:

16.1 million contract actions spanning multiple fiscal years
Vendor names and DUNS/UEI identifiers
Dollar amounts (base value, options exercised, modifications)
Product Service Codes (PSC) — a taxonomy of what's being purchased
NAICS codes — industry classifications
Contracting office identifiers
Award dates and performance periods
Set-aside designations (small business, veteran-owned, 8(a), etc.)

This data is public. Anyone can download it from USASpending.gov or query the FPDS API. The raw information isn't the competitive advantage — the linkage is.

Step 2

Collect the solicitations

Separately, we pull solicitations from SAM.gov (formerly FedBizOpps). Each solicitation contains:

The description of what's being purchased
Quantity and unit of issue
Delivery location and timeline
Set-aside restrictions
Approved source lists (when applicable)
Attachments (specifications, drawings, SOWs)
Response deadline

Solicitation text varies in quality. Some are detailed, others sparse. The model has to work with what it gets.

Step 3

Link them together

This is the hard part — and where Anvil's proprietary value lies.

Awards and solicitations don't share a common identifier. You can't just join on a key. Instead, you have to infer the connection through:

NSN matching: If the solicitation mentions NSN 5935010398902 and an award two months later references the same NSN, they're probably linked
Timing: Awards typically follow solicitations by 30-90 days
Dollar amount correlation: A solicitation for "estimated value $50K" that matches an award for $48,500
Contracting office: The same office that posted the solicitation issues the award
Textual similarity: The product descriptions should align

This is a probabilistic matching problem. Not every link is certain. We apply confidence thresholds and validate a sample manually to ensure quality.

The result: ~90,000 high-confidence linked pairs where we know both what the government asked for and who won.

Why only 90K from 16M contracts? Not all awards have matching solicitations (some are modifications, options, or sole-source awards). Some solicitations are too vague to match confidently. We prioritize precision over recall — better to have 90K clean examples than 500K noisy ones.

Step 1

Start with o1's linked pairs

o2 builds on the same DLA solicitation-award pairs that powered o1:

~90,000 linked pairs from DLA contract history
Same high-confidence matching as o1
Solicitation text, PSC, NAICS, set-aside, and winning vendor

The difference isn't the data — it's what we do with it. o2 transforms these pairs into a format that teaches a language model to reason about procurement.

Step 2

Enrich with vendor context

For each training example, we compute the winning vendor's historical context:

PSC wins: How many contracts they've won in this product category
PSC rank: Their position among all vendors in this category (1st, 5th, 50th)
Total DLA wins: Overall contract count across all categories
Experience level: Major contractor (1000+), established (100+), or emerging

This context teaches the model why certain vendors win — incumbency, category expertise, and overall DLA relationships.

Step 3

Convert to instruction format

Each example becomes an instruction-tuning prompt in Alpaca format:

Instruction: "You are an expert federal procurement analyst..."
Input: Solicitation details (PSC, NAICS, set-aside, text, top vendors in category)
Output: Predicted winner with reasoning and confidence level

The result: ~85,000 instruction-tuning examples that teach Mistral 7B to predict DLA contract winners with explicit reasoning.

Why instruction tuning? Unlike o1's tabular approach, o2 learns to generate predictions with explanations. The model can articulate why a vendor is likely to win, making predictions more interpretable and actionable.

Step 1

Ingest all DoD contracts

We pull the complete DoD contract archive from FPDS-NG — not just DLA, but all military branches:

2.3 million contract actions from FY2020-2024
All DoD components: Army, Navy, Air Force, Marines, DLA, and more
Vendor UEI identifiers for reliable entity resolution
Contract IDs (PIIDs) that encode office codes
PSC codes, award amounts, and dates

Step 2

Extract office codes from contract IDs

The contract ID (PIID) prefix identifies the contracting office:

This yields 1,001 unique office codes across DoD.

Step 3

Build office+PSC lookup tables

For each (office, PSC) combination, we aggregate vendor history:

Which vendors have won contracts at this office for this PSC?
How many times? For what total value?
When was their most recent win?
What's their win share vs. competitors?

The result: 79,920 office+PSC pairs with complete vendor history.

Why this works: Procurement relationships persist. An office that bought valves from Vendor X last quarter will likely buy from them again. O4 captures these patterns at scale without requiring solicitation text.

What the Training Data Looks Like

Key difference: o1 trains on flat JSON records with tabular features. o2 uses Alpaca-style instruction/input/output format — the model reads a prompt describing the solicitation and learns to generate a prediction with reasoning. This teaches the model to think like a procurement analyst, not just classify.

Key difference: o4 trains on aggregated lookup tables — not individual solicitations. Each (office, PSC, vendor) combination becomes a training row with 12 features: win counts, contract values, recency, and win share. The model learns to rank vendors using LambdaRank optimization.

Data Format

Each training example is a JSON record:

{ "text": "25--BOX,AMMUNITION STOW Proposed procurement for NSN 2541015263462 BOX,AMMUNITION STOW: Line 0001 Qty 154 UI EA Deliver To: By: 0069 DAYS ADO...", "psc": "25", "naics": "336390", "set_aside": "", "vendor": "OSHKOSH DEFENSE", "amount": 250000.0 }

Field	Description
text	The raw solicitation text — product description, quantities, delivery terms
psc	Product Service Code — "25" means Vehicular Equipment Components
naics	Industry classification — "336390" is Other Motor Vehicle Parts Manufacturing
set_aside	Socioeconomic restriction, if any (small business, SDVOSB, 8(a), HUBZone)
vendor	The company that won — this is our prediction target
amount	Dollar value of the award

The vendor field is what we're trying to predict. Given a new solicitation, which vendor will win?

The Math

Understanding the Numbers

To understand what 56.2% top-10 accuracy means in practice, we need to establish a baseline.

The vendor universe

DLA has awarded contracts to approximately 227,000 unique vendors over the training period. These range from Lockheed Martin to one-person machine shops. Most vendors win only a handful of contracts; a small number win thousands.

Random baseline

If you picked 10 vendors at random from this pool, what's the probability that the actual winner is among them?

Random probability

P(winner in top 10 | random) = 10 / 227,000 ≈ 0.0044%

That's less than half a percent. Essentially zero.

What Anvil achieves

Anvil's top-10 accuracy is 56.2%. The winner is in the model's top 10 predictions more than half the time.

Lift calculation

Lift = 56.2% / 0.0044% = ~12,800x better than random

18.3%

Top-1 Accuracy

41.7%

Top-5 Accuracy

56.2%

Top-10 Accuracy

84.1%

Top-50 Accuracy

Measuring progress

The right question isn't "is 56% good enough?" — it's "can we improve it?" We track performance against the random baseline and prior model versions to ensure each iteration moves the needle.

Where improvement is possible

We've identified two primary areas for future gains:

1. Better tail handling: Many contracts are won by vendors with limited history. Enhanced feature engineering for sparse data and transfer learning from related vendors can help.

2. Richer input signals: Adding past performance ratings, subcontracting patterns, and SAM.gov capability statements could give the model more to work with.

The current model is a baseline. Each of these improvements is a concrete step on the roadmap.

Data Format

o2 uses Alpaca-style instruction tuning format for chat fine-tuning:

{ "instruction": "You are an expert federal procurement analyst specializing in Defense Logistics Agency (DLA) contracts...", "input": "## Solicitation Details\n**Product Category:** PSC 59 - Electrical Components\n**NAICS Code:** 334417\n**Set-Aside:** Full and Open Competition\n\n## Solicitation Text\n59--CONNECTOR,PLUG,ELEC for NSN 5935015219249...\n\n## Top Vendors in This Category\n1. DCX-CHOL (1,247 wins)\n2. TTI (847 wins)...", "output": "**Predicted Winner:** TTI\n\n**Reasoning:**\n- TTI is ranked #2 in PSC 59 with 847 historical wins\n- As a major DLA contractor with 3,241 total wins...\n\n**Confidence:** Medium-High" }

Field	Description
instruction	System prompt defining the analyst role and prediction task
input	Structured solicitation details with PSC, NAICS, set-aside, text, and top category vendors
output	Predicted winner with explicit reasoning and confidence level

The model learns to generate predictions with explanations — not just who will win, but why they're likely to win based on category expertise and DLA relationships.

The Approach

Why LLM Fine-Tuning?

o1 uses gradient boosting on tabular features. o2 takes a fundamentally different approach: fine-tune a language model to reason about procurement like an expert analyst.

85K

Training Examples

Model Parameters

4-bit

Quantization

2048

Max Seq Length

What o2 learns differently

1. Reasoning patterns: The model learns to articulate why incumbents win, why PSC expertise matters, and how set-asides constrain competition.

2. Context integration: By including top vendors in each category as part of the input, the model learns competitive dynamics within markets.

3. Confidence calibration: Training outputs include confidence levels (High/Medium/Low) based on signal strength.

4. Natural language understanding: The model processes solicitation text directly, capturing nuances that feature extraction might miss.

Data Format

Each training example is a (query, vendor, features, label) tuple:

{ "office_code": "SPE7M2", "psc": "4820", "vendor_uei": "ABC123XYZ456", "wins_at_office_psc": 47, "value_at_office_psc": 2450000.0, "wins_at_office": 312, "wins_at_psc": 89, "total_wins": 1247, "days_since_last_win": 45, "is_most_recent_winner": 1, "label": 1 }

Field	Description
office_code	6-character prefix from contract ID (PIID)
psc	Product Service Code from FPDS
wins_at_office_psc	Vendor wins at this office for this PSC
value_at_office_psc	Total contract value at office+PSC
wins_at_office	Vendor wins at this office (any PSC)
wins_at_psc	Vendor wins for this PSC (any office)
days_since_last_win	Recency signal — how fresh is the relationship?
label	1 if vendor won this contract, 0 otherwise

The label column trains the ranker. For each contract, one vendor gets label=1 (winner), others get label=0.

The Math

Understanding O4's Numbers

O4 operates on a different scale than O1. The candidate pool is smaller (vendors with history at this office+PSC), but the overall coverage is much larger.

Training scale

O4 trains on 2.3 million contracts from FY2020-2024 — 25x more than O1's 90K DLA pairs. This covers all DoD components.

Office+PSC coverage

We identify 79,920 unique office+PSC combinations with enough history to make predictions. For each, we know which vendors have won and how often.

Coverage calculation

85% of live SAM.gov opportunities can be mapped to known office+PSC patterns

What O4 achieves

49%

Recall @10

81%

Recall @50

85%

Coverage

2.3M

Training Contracts

Tradeoffs vs O1

Lower top-10 recall: O4 achieves 49% vs O1's 56% — but O4 operates across all DoD, not just DLA's specialized categories.

Higher coverage: O4 can generate predictions for 85% of opportunities vs O1's DLA-only scope.

No text dependency: O4 works with just Notice ID and PSC description — no solicitation text parsing required.

Feature Engineering

Key difference: o1 requires hand-crafted features — TF-IDF vectors, vendor win counts, PSC encodings. o2 eliminates this entirely. The transformer learns its own representations from raw text. Vendor statistics are embedded in natural language prompts, letting the model learn what signals matter without explicit feature engineering.

Key difference: o4 uses 12 structured features focused on office+PSC patterns rather than text analysis. Unlike o1's TF-IDF approach, o4 relies entirely on historical procurement relationships — vendor win counts, contract values, and recency signals at the contracting office level. This enables predictions without parsing solicitation text.

Product Category

The PSC (Product Service Code) is one of the strongest predictors. Federal procurement is highly segmented — vendors specialize. A company that makes O-rings doesn't bid on turbine blades.

The model learns: "When PSC = 59 (Electrical Components) and NAICS = 334417 (Electronic Connector Mfg), these 50 vendors account for 80% of wins."

Set-Aside Constraints

Set-aside status dramatically narrows the candidate pool. If a solicitation is marked "Total Small Business Set-Aside (FAR 19.5)," large contractors like Lockheed or Boeing are ineligible.

The model uses set-aside as a hard filter — certain vendors become impossible predictions.

Vendor History

For each vendor in the training data, we compute:

Total wins in this PSC category
Win rate (wins / total opportunities)
Recency — days since last win
Average deal size
Geographic concentration

Historical winners tend to be future winners. Incumbency is real.

Text Features

The solicitation text contains signal:

NSN patterns: Approved sources for certain NSNs
Delivery location: Office-specific vendor preferences
Timeline: Rush vs standard delivery
Quantity: Manufacturers vs distributors

Agency Behavior

Contracting offices develop relationships with vendors. An office that has awarded 50 contracts to Vendor X in the past two years is more likely to award the 51st.

The model learns office-vendor affinities.

No Feature Extraction

Unlike o1, o2 doesn't extract tabular features. The language model processes raw solicitation text directly.

o1: text → TF-IDF → features → LightGBM → rank o2: text → Mistral 7B → prediction + reasoning

The transformer architecture learns its own internal representations of what matters in solicitation text.

In-Context Vendor Stats

Instead of computing features at prediction time, vendor statistics are embedded in the training prompts:

PSC leaderboard: Top 5 vendors in the category with win counts
Winner context: The winning vendor's PSC rank, total wins, experience tier

The model learns to interpret these statistics as part of natural language input.

Structured Prompt Design

Each prompt follows a consistent structure that the model learns to parse:

Solicitation Details: PSC, NAICS, set-aside, estimated value
Solicitation Text: The raw description (up to 2000 chars)
Competitive Context: Top vendors in category

Consistent formatting helps the model learn what signals to attend to.

Learned Reasoning

Training outputs include structured reasoning that the model learns to generate:

Category expertise signals ("ranked #2 in PSC 59 with 847 wins")
Overall DLA presence ("major contractor with 3,241 total wins")
Set-aside considerations ("meets Small Business eligibility")
NSN-specific signals ("repeat procurement with incumbent advantage")

Confidence Levels

Each prediction includes a confidence tier derived from signal strength:

High: PSC rank #1 + 1000+ total wins Medium-High: PSC top-5 or 100+ total wins Medium: Some PSC or DLA history Low: Limited historical signal

The model learns to associate confidence with the strength of supporting evidence.

Office+PSC Features

The core signal: how often has this vendor won at this office for this product category?

Vendor wins at office + PSC combination
Vendor total contract value at office + PSC
Win share percentage within the candidate pool

Office-level patterns capture procurement relationships that persist across contracts.

Broader History

Beyond the specific office+PSC, O4 considers overall experience:

Office wins: Total wins at this office (any PSC)
PSC wins: Total wins for this PSC (any office)
DoD wins: Overall contract count and value

Vendors with broader DoD relationships have proven delivery capabilities.

Recency Signals

Procurement relationships evolve over time:

Days since last win: More recent wins indicate active relationships
Is most recent winner: Binary flag for the last vendor to win this office+PSC

Recent winners are more likely to win again — they're known quantities with proven performance.

No Text Processing

Unlike O1, O4 doesn't analyze solicitation text. The input is:

Notice ID: SPE7M225T3623 → Office: SPE7M2 PSC Description: "VALVES, NONPOWERED" → PSC: 4820

This makes O4 faster and more robust — it works even with minimal opportunity details.

Pool Size Effects

Feature effectiveness varies by market size:

Pool Size	Recall @10	Why
5-10 vendors	100%	Small markets — features identify the few players
11-25 vendors	92%	Specialized — clear leaders emerge
26-50 vendors	77%	Competitive — multiple viable options
51-100 vendors	48%	Commodity — harder to differentiate

Model Architecture

LightGBM gradient boosting with LambdaRank

Mistral 7B fine-tuned with QLoRA

Two-stage pipeline: Office mapping + LightGBM ranking

Key difference: o1 uses gradient boosting (LightGBM) — fast, interpretable, and strong on tabular data. o2 uses a 7B parameter language model (Mistral) fine-tuned with QLoRA. The tradeoff: o2 is slower at inference but can process raw text and generate explanations. We chose Mistral for its efficiency and instruction-following capability.

Key difference: o4 scales beyond DLA to cover all DoD. It uses a two-stage pipeline: (1) parse Notice ID prefix to identify contracting office, (2) rank vendors with historical wins at that office+PSC combination. Unlike o1's text-matching approach, o4 achieves 85% coverage on live SAM.gov opportunities by leveraging procurement office patterns from 2.3M DoD contracts.

Gradient Boosting

Anvil o1 uses a gradient boosting ensemble — specifically, LightGBM optimized for ranking (LambdaRank objectives).

Why not deep learning? For tabular data with mixed features, gradient boosting still beats neural networks. It offers:

Interpretability: Inspect feature importances
Speed: Fast training and inference
Robustness: Less prone to overfitting

Learning to Rank

We frame this as learning to rank, not classification. For each solicitation, generate a ranked list of vendors. Evaluate success by whether the winner appears in the top K.

The model learns a scoring function: given (solicitation features, vendor features), output a relevance score.

Candidate Generation

We don't score all 227K vendors. First, filter to a candidate pool:

PSC category (vendors who've won here)
Set-aside eligibility
Active status (won in past 3 years)

This reduces candidates to 500-5,000 vendors, which the model ranks.

Built on Mistral 7B

mistralai/Mistral-7B-Instruct-v0.3

Mistral 7B Instruct

o2 is built on Mistral 7B Instruct v0.3 — a 7 billion parameter language model optimized for instruction-following.

Model: mistralai/Mistral-7B-Instruct-v0.3 Parameters: 7B Context: 32K tokens (we use 2048 max)

Mistral's sliding window attention and efficient architecture make it suitable for the structured prompts in our training data.

QLoRA Fine-Tuning

We use QLoRA (Quantized Low-Rank Adaptation) for efficient fine-tuning:

4-bit quantization: NF4 with double quantization, bfloat16 compute
LoRA rank: r=64, alpha=16, dropout=0.1
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

QLoRA lets us fine-tune a 7B model on a single A100 GPU with ~20GB VRAM.

Training Configuration

Training runs for 1 epoch with supervised fine-tuning (SFT):

Batch size: 4 per device × 4 gradient accumulation = 16 effective
Learning rate: 2e-4 with cosine scheduler
Warmup: 3% of training steps
Optimizer: Paged AdamW 32-bit

Training takes ~4-6 hours on A100 (40GB) or ~10-12 hours on T4.

Inference

At inference time, the fine-tuned model generates predictions with reasoning:

Temperature: 0.7 | Top-p: 0.9 | Max tokens: 512

The model outputs structured predictions with vendor names, reasoning bullets, and confidence levels — ready for human review or downstream processing.

Two-Stage Pipeline

O4 uses a two-stage prediction pipeline designed for scalability across all DoD opportunities:

Stage 1: Office code mapping from Notice ID prefix
Stage 2: LightGBM ranking of candidate vendors

This approach achieves 85% coverage on live SAM.gov opportunities without requiring solicitation text matching.

Office Code Mapping

The Notice ID prefix encodes the contracting office:

SPE7M225T3623 → Office: SPE7M2 (DLA Aviation) FA864225Q0147 → Office: FA8642 (Air Force)

Combined with PSC description mapping, this identifies the office + product category for each opportunity.

Candidate Ranking

For each opportunity, O4 retrieves vendors with historical wins at that office+PSC combination and ranks them using 12 features:

Vendor wins at office + PSC combination
Vendor total contract value at office + PSC
Vendor wins at office (any PSC)
Vendor wins for PSC (any office)
Overall vendor win count and value
Win share percentage within candidate pool
Days since last win (recency)
Is most recent winner flag

Training Scale

O4 trains on 2.3 million DoD contracts from FPDS-NG (FY2020-2024):

1,001 office codes extracted from PIID prefixes
2,197 PSC codes for products and services
79,920 office+PSC pairs with historical patterns

This is 25x more training data than O1's 90K DLA contracts.

PSC	Category	Top-10
59	Electrical Components	68%
53	Hardware & Abrasives	61%
16	Aircraft Components	54%
84	Clothing & Textiles	49%

Model Performance

Five ways to understand how Anvil o1 predicts federal contract winners.

Projected performance improvements for Anvil o2.

Validated on 338K DoD contracts from FY2024.

Key difference: o1 outputs numeric rankings — a sorted list of vendors with probability scores. o2 outputs natural language predictions with reasoning. Instead of "Vendor A: 34.2%", you get "Vendor A is likely to win because they're ranked #2 in this PSC with 847 prior wins." This makes predictions actionable without interpretation.

Key difference: o4 scales to 25x more training data (2.3M contracts vs 90K) and achieves 85% coverage on live opportunities. The tradeoff: slightly lower top-10 recall (49% vs 56%) but much higher recall at @50 (81%) and near-universal coverage across DoD — not just DLA.

Calibration Curve

Does the model know what it knows?

A well-calibrated model's confidence scores reflect true probabilities. When Anvil o1 predicts a vendor has a 30% chance of winning, they should win approximately 30% of the time.

The diagonal dashed line represents perfect calibration. Points close to this line indicate the model's confidence scores are reliable and actionable.

Key insight: Anvil o1 is slightly overconfident at high probabilities but well-calibrated overall, with a Brier score of 0.089.

Predicted Probability

Actual Win Rate

Cumulative Gains

How much signal is in the top predictions?

This chart answers: "If I only review the model's top X% of predictions, what percentage of actual winners will I capture?"

The steeper the curve rises above the diagonal baseline, the better the model concentrates winners at the top of its rankings.

Key insight: Reviewing just the top 20% of Anvil's predictions captures 58% of all contract winners.

% of Predictions Reviewed

% of Winners Captured

Live Examples

See predictions vs. actual outcomes

These are real solicitations from our test set. For each, we show Anvil's top-5 predicted vendors with confidence scores, and mark the actual winner.

The green checkmark indicates who actually won the contract. Notice how often the winner appears in the top 5 predictions.

Key insight: In 56.2% of cases, the actual winner appears in Anvil's top-10 predictions.

SPRWA1-24-Q-0127 Electrical Connector, Receptacle

DCX-CHOL ENTERPRISES 29.9% ✓

EMPIRE AVIONICS CORP. 6.8%

PAR DEFENSE INDUSTRIES 6.3%

R & M GOVERNMENT SERVICES 5.2%

JANELS INDUSTRIES INC 4.5%

Outcome: Winner ranked #1 with 29.9% confidence

Lift Chart

How much better than random?

Lift measures how many times more likely you are to find a winner using model predictions versus random selection, at each decile of the ranked list.

A lift of 5x in the first decile means the top 10% of predictions contain 5 times more winners than you'd expect by chance.

Key insight: The first decile shows 12.8x lift—vendors in the top 10% of predictions win 12.8x more often than random.

Decile (sorted by predicted probability)

Lift vs. Random

Confidence Buckets

Higher confidence = higher win rate

We group predictions by confidence level and measure the actual win rate in each bucket. This shows that confidence scores are meaningful—high confidence predictions really do win more often.

This is actionable intelligence: focus resources on opportunities where Anvil shows high confidence.

Key insight: Predictions with >40% confidence have a 61% actual win rate, vs. 8% for predictions under 10%.

<10%

19%

10-20%

34%

20-30%

47%

30-40%

61%

>40%

Predicted Confidence → Actual Win Rate

A Different Kind of Prediction

o2 generates predictions with natural language reasoning — a fundamentally different approach than o1's ranking scores.

Base Model

Mistral 7B

Instruct v0.3

The foundation we build on. Mistral 7B is an open-source language model with 7 billion parameters, trained by Mistral AI. The "Instruct" variant is pre-tuned to follow instructions — we fine-tune it further for DLA procurement.

Training Data

85K

instruction examples

Each example pairs a real DLA solicitation with its actual winning vendor. The model learns to predict winners by studying 85,000 historical outcomes — what the solicitation looked like, and who ultimately won.

Fine-tuning

QLoRA

4-bit quantized

A memory-efficient training technique. QLoRA freezes most of the model and trains small "adapter" layers (~1% of parameters) using 4-bit precision. This lets us fine-tune a 7B model on a single GPU instead of needing a cluster.

Training Time

~6 hrs

on A100 GPU

How long it takes to train o2 on NVIDIA's A100 — the industry-standard GPU for AI training. One epoch through 85K examples. We run on Google Colab Pro+, which costs ~$50/month for A100 access.

What's Different from o1

Explainable Predictions

o2 generates reasoning alongside each prediction — why the vendor is likely to win based on PSC expertise, DLA history, and set-aside eligibility.

Context-Aware

The model sees top vendors in each PSC category as part of the input, learning competitive dynamics within markets.

Confidence Calibration

Each prediction includes a confidence level (High/Medium/Low) derived from the strength of supporting evidence.

Natural Language Interface

Output is structured but human-readable — ready for analyst review without needing to interpret numeric scores.

Transfer Learning

Built on Mistral's pretrained knowledge of language, contracts, and business — fine-tuned specifically for DLA procurement.

Recall at K

How often does the winner appear in the top K predictions?

Recall@K measures the percentage of contracts where the actual winner appears in O4's top K predictions. Higher K means more vendors to review but better coverage.

The curve shows diminishing returns — the first 10 predictions capture nearly half of winners, while the next 40 add another 32%.

Key insight: 81% recall at @50 means 4 out of 5 winners appear in the top 50 predictions — strong coverage for analyst review.

K (Top Predictions)

Recall (%)

Performance by Pool Size

How does recall vary with market competitiveness?

O4's performance varies dramatically based on how many vendors compete in each office+PSC combination. Small, specialized markets are highly predictable; large commodity markets are harder.

This explains why overall recall (49%) understates performance in specialized niches where O4 achieves near-perfect accuracy.

Key insight: In markets with 5-25 vendors, O4 achieves 92-100% recall @10. The bulk of "misses" come from commodity markets with 50+ vendors.

Vendor Pool Size

Recall @10 (%)

Coverage Analysis

What percentage of opportunities can O4 predict?

Coverage measures how many live SAM.gov opportunities O4 can generate predictions for. This depends on whether we can map the Notice ID to a known office code and PSC combination.

85% coverage means O4 works for the vast majority of DoD opportunities — a significant improvement over O1's DLA-only scope.

Key insight: The 15% uncovered opportunities typically have novel office codes or PSC combinations not seen in training data (FY2020-2023).

Feature Importance

Which signals drive O4's predictions?

O4 uses 12 features to rank vendors. LightGBM's built-in feature importance shows which signals contribute most to accurate rankings.

Office+PSC-specific features dominate — who has won at this specific office for this specific product category matters more than overall vendor size.

Key insight: Recency matters. "Days since last win" and "is most recent winner" together account for 25% of predictive signal — procurement relationships are time-sensitive.

Importance Score

Feature

Use Cases

Who Benefits

For Contractors: Bid/No-Bid Intelligence

A contractor sees a solicitation and asks: "What are my odds?" If Anvil ranks them #2 out of 500 candidates, that's a strong signal to bid. If they're ranked #200, maybe save the proposal dollars for a better opportunity.

For Contractors: Competitive Positioning

Beyond "should I bid?", the model reveals "who am I competing against?" Seeing that Vendor X is the frontrunner tells you something. Maybe sharpen your price, or consider teaming instead of competing.

For Investors: Pipeline Modeling

Defense contractors are publicly traded. Their stock prices depend on expected future revenues. An investor analyzing Raytheon might ask: "How likely are they to win the contracts currently in their pipeline?"

For M&A: Due Diligence

Acquiring a federal contractor? You want to know: "How defensible is their revenue? Are they winning because they're good, or because of relationships that might not transfer?"

Implications

Information asymmetry gets flattened

Incumbent contractors have always had an advantage — they know their win rates, their competitors, their agency relationships. New entrants are at an information disadvantage. Anvil changes this. A startup entering federal contracting can now see the competitive landscape with the same fidelity as a 20-year incumbent. This is democratizing.

Proposal economics shift

If contractors can better predict their odds, they'll bid more selectively. This means fewer low-quality bids (good for agencies), more competitive bids on winnable opportunities (bad for incumbents), and better resource allocation industry-wide.

Transparency creates pressure

If vendors can see that a particular contracting office awards 80% of its business to one vendor, that's interesting. Maybe justified, maybe worth scrutinizing. The model makes patterns visible that were previously buried in millions of transaction records.

The Road Ahead

Expanding Beyond DLA

The proof-of-concept worked on DLA. The next step is generalizing to other agencies:

GSA (General Services Administration): Government's general-purpose buyer. More diverse categories, different dynamics.
VA (Veterans Affairs): Large healthcare procurement. Medical supplies, equipment, services.
DoD components beyond DLA: Army, Navy, Air Force direct contracts. Larger dollar values, more complex evaluations.

Each agency has its own procurement culture. Models will likely need per-agency training, at least initially.

More Signal

Current features are primarily structured data. We're leaving signal on the table:

Full solicitation text: Using transformer models (BERT, etc.) to extract deeper textual understanding
Attachments: Many solicitations include PDFs with detailed specs. We don't currently parse these.
Vendor financials: SAM.gov registration data, revenue estimates, employee counts
Protest history: Has a vendor protested awards in this category? Are they litigious?

Real-Time & Price

Real-time prediction

Currently, the model is trained offline on historical data. The vision is real-time scoring: the moment a solicitation posts on SAM.gov, Anvil ranks vendors and pushes alerts. This requires live SAM.gov monitoring, low-latency inference, and push notification infrastructure. It's engineering, not science — the hard ML work is done.

Price modeling (the holy grail)

The biggest limitation is not knowing bid prices. If we could model price distributions — "Vendor X typically bids 15% above cost in this category" — we could predict winners even more accurately.

Price data isn't public, but some vendors might share their bid histories in exchange for insights. This creates a data network effect: the more vendors participate, the better the model gets for everyone.

Summary

Where We Are

Anvil o1 solves a data integration problem that unlocks predictive capability. By linking 16 million DLA contract awards to their original solicitations, we create supervised training data that teaches a model to answer: "Given what the government is asking for, who will win?"

Current performance: 56% top-10 accuracy. This is a starting point, not a ceiling. o2 adds natural language reasoning on top of o1's rankings — but it's the analyst layer, not the prediction engine.

Anvil o4 scales prediction beyond DLA to cover the entire Department of Defense. By parsing office codes from contract IDs and building lookup tables from 2.3M historical awards, o4 answers: "Given this contracting office and product category, which vendors are most likely to win?"

Current performance: 81% recall @50 across 338K test contracts, with 85% coverage on live SAM.gov opportunities. O4 trades some top-10 precision for massive scale — covering all DoD, not just DLA.

What's Next — Anvil o3

Anvil o3 is the model we actually want in production: a system that combines a best-in-class ranker with a best-in-class analyst layer.

o3a — The Ranker

o3a is the "perfect ranker" project — o1 taken to the next level:

More linked data, higher-quality labels
Stronger feature engineering: incumbency, category expertise, set-asides, repeat NSN behavior, contracting office patterns
Candidate generation + ranking tuned for maximum Recall@K and Top-K accuracy

The goal is simple: get the right winner into the shortlist as often as possible.

o3b — The Analyst

o3b is a fine-tuned LLM designed to operate on top of o3a:

Consumes the solicitation + o3a's top candidates with vendor context
Produces a structured prediction, rationale, and confidence
Stays grounded in provided evidence — no hallucinated vendor facts

Anvil o3 — Stacked

Together, o3a + o3b create a system that is both:

Predictive — ranking performance you can measure
Explainable — analyst-grade explanations you can use

That's Anvil o3: a contract prediction engine that behaves like a model and communicates like an expert.

What O4 Enables

O4 is a foundation for DoD-wide intelligence. By covering 85% of live opportunities across all military branches, it unlocks use cases that O1's DLA-only scope couldn't address.

Scale

O4 processes opportunities from Army, Navy, Air Force, Marines, and DLA — not just one agency. This means:

Comprehensive competitive intelligence across DoD
Cross-agency vendor pattern detection
Portfolio-level win probability modeling

Speed

O4 generates predictions without parsing solicitation text — just Notice ID and PSC description. This enables:

Real-time predictions on new SAM.gov postings
Batch scoring of entire opportunity pipelines
Integration with procurement monitoring tools

What's Next

O4 provides the coverage foundation. Future work focuses on precision:

O5: Combine O4's office patterns with O1's text analysis for best-of-both-worlds accuracy
Set-aside integration: Add small business eligibility filters to improve recall in restricted competitions
Agency expansion: Extend beyond DoD to GSA, VA, and civilian agencies

O4 proves the pattern: procurement relationships are predictable at scale.

Anvil o1.
Predicting federal contract winners.

Anvil o2.
Predicting federal contract winners.

Anvil o4.
DoD contract intelligence at scale.

$700 Billion

The Problem

Two Databases

The Core Insight

16M+ Contracts

Why Start with the DLA?

Anvil o1, o2 Thinking, o3, and o4

Anvil o1: Learning to Rank with LightGBM

Anvil o2 Thinking: Instruction-Tuned LLM Reasoning

Anvil o3 Roadmap: The Full Stack

Anvil o4: DoD-Wide Prediction at Scale

The Data Pipeline

What the Training Data Looks Like

Feature Engineering

Model Architecture

Built on Mistral 7B

What We Discovered

Model Performance

Calibration Curve

Cumulative Gains

Live Examples

Lift Chart

Confidence Buckets

A Different Kind of Prediction

What's Different from o1

Recall at K

Performance by Pool Size

Coverage Analysis

Feature Importance

Use Cases

The Road Ahead

Limitations

Summary

Anvil o1.Predicting federal contract winners.

Anvil o2.Predicting federal contract winners.

Anvil o4.DoD contract intelligence at scale.

$700 Billion

The Problem

Two Databases

The Core Insight

16M+ Contracts

Why Start with the DLA?

Anvil o1, o2 Thinking, o3, and o4

Anvil o1: Learning to Rank with LightGBM

Anvil o2 Thinking: Instruction-Tuned LLM Reasoning

Anvil o3 Roadmap: The Full Stack

Anvil o4: DoD-Wide Prediction at Scale

The Data Pipeline

What the Training Data Looks Like

Feature Engineering

Model Architecture

Built on Mistral 7B

What We Discovered

Model Performance

Calibration Curve

Cumulative Gains

Live Examples

Lift Chart

Confidence Buckets

A Different Kind of Prediction

What's Different from o1

Recall at K

Performance by Pool Size

Coverage Analysis

Feature Importance

Use Cases

The Road Ahead

Limitations

Summary

Anvil o1.
Predicting federal contract winners.

Anvil o2.
Predicting federal contract winners.

Anvil o4.
DoD contract intelligence at scale.