Tech Stack
Architecture
Data Pipeline
Roster extraction
The starting point is the IIHF website, which publishes roster PDFs for all 12 nations competing in the 2026 Olympic men's hockey tournament. Each PDF lists player names, positions, jersey numbers, and current club affiliations.
AI-assisted parsing
Rather than writing a custom PDF parser, I uploaded all 12 roster PDFs into Claude to extract the data into structured CSVs. This took care of the tricky parts in one pass: inconsistent table layouts, names with diacritics, and position formatting differences across countries.
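The extracted CSVs can then be sanity-checked before enrichment. A minimal sketch, assuming a flat one-row-per-player layout (the column names and sample row here are illustrative, not the actual extraction schema):

```python
import csv
from io import StringIO

# Hypothetical column layout for the extracted roster CSVs; the real
# field names produced by the extraction step may differ.
EXPECTED_COLUMNS = ["name", "position", "jersey_number", "club", "country"]

def validate_roster_csv(text: str) -> list[dict]:
    """Parse one extracted roster CSV and sanity-check its shape."""
    rows = list(csv.DictReader(StringIO(text)))
    for row in rows:
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        if not row["jersey_number"].isdigit():
            raise ValueError(f"bad jersey number: {row['jersey_number']!r}")
    return rows

sample = """name,position,jersey_number,club,country
Leon Draisaitl,F,29,Edmonton Oilers,GER
"""
print(validate_roster_csv(sample)[0]["club"])  # → Edmonton Oilers
```

A quick check like this catches the usual extraction failure modes (dropped columns, merged cells turning numbers into text) before anything hits the API or the database.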
Salary enrichment
A Python script reads the extracted CSVs and queries the CapWages API to pull each NHL player's 2025-26 contract details: cap hit, AAV, base salary, signing bonuses, trade clauses, and expiry status.
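The enrichment step looks roughly like the sketch below. The endpoint path, base URL, and response shape are assumptions for illustration only; the real CapWages API fields will differ, so treat this as the pattern rather than the actual integration:

```python
import json
from urllib.request import urlopen

# Hypothetical base URL; the real CapWages endpoint is not shown here.
API_BASE = "https://api.capwages.example"

def extract_contract(payload: dict, season: str = "2025-26") -> dict:
    """Pick out the contract fields the pipeline stores for one season.

    The payload structure (a top-level dict with a "seasons" list) is an
    assumed shape, not the documented CapWages response format.
    """
    by_season = {s["season"]: s for s in payload.get("seasons", [])}
    s = by_season.get(season, {})
    return {
        "cap_hit": s.get("cap_hit"),
        "aav": payload.get("aav"),
        "base_salary": s.get("base_salary"),
        "signing_bonus": s.get("signing_bonus"),
        "clause": s.get("clause"),
        "expiry_status": payload.get("expiry_status"),
    }

def fetch_contract(slug: str) -> dict:
    """Fetch one player's contract data and reduce it to the stored fields."""
    with urlopen(f"{API_BASE}/players/{slug}", timeout=10) as resp:
        return extract_contract(json.load(resp))
```

Keeping the field-mapping logic in a pure function (`extract_contract`) separate from the HTTP call makes the season lookup easy to test without hitting the API.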
Database load
That same script loads everything into Supabase (PostgreSQL) across 7 normalized tables. Two database views handle the salary aggregation and roster joins on the backend, keeping the frontend queries simple.
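The normalization step before the load can be sketched as a function that splits one enriched CSV row into per-table payloads. The table and column names below are illustrative placeholders, not the actual 7-table schema:

```python
# Split one enriched roster row into per-table insert payloads.
# Table names ("players", "roster_entries", "contracts") and columns
# are assumptions for illustration; the real schema has 7 tables.
def normalize_row(row: dict) -> dict:
    return {
        "players": {
            "name": row["name"],
            "position": row["position"],
        },
        "roster_entries": {
            "player_name": row["name"],
            "country": row["country"],
            "jersey_number": int(row["jersey_number"]),
            "olympic_year": 2026,
        },
        "contracts": {
            "player_name": row["name"],
            "season": "2025-26",
            "cap_hit": row.get("cap_hit"),
        },
    }
```

The actual insert can then go through the Supabase Python client (e.g. `supabase.table(name).upsert(payload).execute()`), with each payload landing in its own table and the two views doing the joins at read time.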
Extending to Past Olympics
The database was designed with multiple Olympic years in mind from the start. Every player, roster entry, and contract season is scoped to a specific year, so adding a past Olympics is mostly a matter of populating the same tables with that year's data.
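The year scoping works the same way in miniature: a past Games is just more rows with a different year value, and every query filters on it. (Field names here are illustrative, not the real schema.)

```python
# A toy version of the year-scoped roster table: one shared table,
# with olympic_year distinguishing the Games. Column names are
# illustrative, not the actual schema.
rosters = [
    {"player": "A", "country": "CAN", "olympic_year": 2026},
    {"player": "B", "country": "CAN", "olympic_year": 2014},
]

def roster_for_year(rows: list[dict], year: int) -> list[dict]:
    """Filter roster rows down to a single Olympics."""
    return [r for r in rows if r["olympic_year"] == year]

print(len(roster_for_year(rosters, 2026)))  # → 1
```

In the real database the same filter is an `olympic_year` predicate in the views, so the frontend never has to know which Games a row came from.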
The hard part is sourcing that data. The 2026 pipeline worked cleanly because the IIHF publishes current roster PDFs and CapWages has up-to-date contract information. For older Olympics, there's no single source that covers both rosters and salary data, so each year will likely need its own ingestion script pulling from a different mix of sources.