When Your Labels Come from a Writers' Room
Lessons from building Ringside Analytics, an ML system trained on 482K pro wrestling matches
TL;DR
I built an end-to-end ML system on a public dataset of 482,166 professional wrestling matches spanning forty years and six promotions. XGBoost lands at 0.718 AUC on a temporal hold-out, meaningfully above a coin flip and meaningfully above the favored-wrestler baseline. The more interesting result is a 23-point AUC gap between validation and test that turns out to be structural, not methodological. This post walks through the data, the model, and what the gap is teaching us about prediction problems where the label is generated by human creative work.
Everything is public: dataset on Kaggle (CC0), model on Hugging Face (Apache 2.0), source on GitHub.
Why this dataset is unusual
Most public ML datasets come with one of two label types. The first is a measured physical quantity: house prices, image content, sensor readings, anything you can put a unit on. The second is a behavioral response: clicks, ratings, churn, the kind of label that records what a user actually did.
Pro wrestling occupies a stranger niche. Match outcomes are decided in advance by booking writers and executed by performers. The label is neither measurement nor response. It is the recorded artifact of a creative decision, made by humans, before the cameras rolled.
That single property reshapes the entire ML pipeline downstream of it. I want to walk through how, because I think the lessons generalize past wrestling to any system where humans script the outcomes.
The data
The canonical dataset is built from two public sources. The primary one is Cagematch.net, a community-maintained pro-wrestling encyclopedia with detailed match cards going back to roughly 1985. The secondary source is the alexdiresta WWE/WWF Kaggle dump, which I used for cross-validation and to fill in pre-1990 results.
A bespoke Python scraper ingests Cagematch HTML; an ETL pipeline normalizes match cards; an alias resolver collapses identities (Dwayne Johnson → Rocky Maivia → The Rock); match types map onto a fixed enum; and natural-key deduplication catches duplicates on (date, venue, canonical participant tuple).
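A minimal sketch of the alias-collapse and dedup steps, assuming illustrative column names and a hand-maintained alias table (the real resolver and schema may differ):

```python
import pandas as pd

# Illustrative alias table and column names -- assumptions, not the project's actual code.
ALIASES = {"Rocky Maivia": "The Rock", "Dwayne Johnson": "The Rock"}

def canonicalize(name: str) -> str:
    """Map any known ring name onto one canonical identity."""
    return ALIASES.get(name, name)

def dedupe_matches(matches: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows on the natural key (date, venue, canonical participant tuple)."""
    out = matches.copy()
    out["participant_key"] = out["participants"].apply(
        lambda names: tuple(sorted(canonicalize(n) for n in names))
    )
    return out.drop_duplicates(subset=["event_date", "venue", "participant_key"])
```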
The published schema is nine relational tables. The fact table is match_participants: one row per (match, wrestler), keyed to dimensions for event date, promotion, opponents, and storyline state. It carries the result column, which is the prediction target.
Headline counts: 482,166 matches, 731,133 wrestler-match participations, 35,064 events, 12,814 wrestlers, six promotions, 1980 to present.
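For orientation, here is roughly how a modeling frame could be assembled from the fact table; the file names and the event_id join key are assumptions about the published schema:

```python
import pandas as pd

# Assumed file names and join key; only match_participants and its result column
# are taken directly from the schema description above.
participants = pd.read_csv("match_participants.csv")  # one row per (match, wrestler)
events = pd.read_csv("events.csv")                     # event date, promotion, venue

rows = participants.merge(events, on="event_id", how="left")
y = rows["result"]  # the prediction target: the booked outcome
```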
The kayfabe problem
Kayfabe is wrestling jargon for the convention of presenting scripted events as real. The data-modeling translation is precise: the prediction target is generated by a writers' room rather than by athletic competition. Three consequences follow directly.
Athletic skill cannot be learned. Nothing in the dataset measures it. Two equally-booked wrestlers produce indistinguishable feature vectors, regardless of who is the more skilled performer.
Booking patterns can be learned. Writers escalate pushes until a planned cool-down. Faces tend to win in title matches at PPVs. Wrestlers returning from injury tend to win their first match back. These templates are visible across enough matches and the model picks them up.
The label is autocorrelated. A wrestler on a winning streak is being booked toward a payoff. Their next-match outcome depends on the previous five. The data has memory.
The clearest empirical evidence is the career win-rate distribution. If outcomes were random athletic results, we would expect a roughly normal distribution centered at 0.5. What we observe is bimodal: a heavy left lobe of wrestlers with career win rates near zero (jobbers), a heavy right lobe with rates near one (stars), and a thin middle. Roughly 18% of qualifying wrestlers have career win rates above 0.7. Roughly 14% are below 0.3. A coin flip would predict 5% in each tail.
If your label is generated by a writers' room, the marginal distribution of your target is the first place to look. Bimodality with fat tails is the kayfabe signature.
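Checking that signature takes a few lines. A sketch against the fact-table columns used above; the qualifying threshold of 50 career matches is my guess, since the post doesn't state the cutoff:

```python
import pandas as pd

def kayfabe_check(rows: pd.DataFrame, min_matches: int = 50) -> pd.Series:
    """Career win-rate distribution per wrestler, restricted to wrestlers with
    at least min_matches recorded matches (threshold is an assumption)."""
    per_wrestler = rows.groupby("wrestler_id")["result"].agg(["mean", "count"])
    rates = per_wrestler.loc[per_wrestler["count"] >= min_matches, "mean"]
    print(f"win rate > 0.7: {(rates > 0.7).mean():.1%}")
    print(f"win rate < 0.3: {(rates < 0.3).mean():.1%}")
    return rates
```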
Features, splits, and models
I engineered 35 features per (match, wrestler) row, all computed from pre-match state to avoid label leakage; a leakage-safe sketch of one such feature follows the list. They cluster into seven families:
Recent form (5 features): rolling win rates over 30/90/365 days, current win streak, current loss streak.
Event context (4): is_ppv, is_title_match, card position, event tier.
Match type (9): is_singles, is_royal_rumble, is_ladder, is_cage, and similar flags plus a per-match-type historical win rate.
Title proximity (3): is_champion, defenses, days since title match.
Career phase (3): years active, matches in last 90 days, days since last match.
Head-to-head and alignment (8): h2h win rate, alignment, face/heel flags, recent turn count, face/heel matchup.
Match quality and card momentum (3): rolling rating, card-position momentum.
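Here is what "computed from pre-match state" looks like for the rolling win-rate features: a sketch, assuming the column names from the earlier snippets and a datetime event_date, in which each row only ever sees that wrestler's results strictly before the current match:

```python
import pandas as pd

def rolling_win_rate(rows: pd.DataFrame, window_days: int) -> pd.Series:
    """Leakage-safe rolling win rate: closed='left' excludes the current match,
    so a row's own result never feeds its own feature."""
    rows = rows.sort_values(["wrestler_id", "event_date"])

    def per_wrestler(g: pd.DataFrame) -> pd.Series:
        r = g.set_index("event_date")["result"]            # event_date must be datetime
        rate = r.rolling(f"{window_days}D", closed="left").mean()
        return pd.Series(rate.to_numpy(), index=g.index)   # realign to original rows

    return rows.groupby("wrestler_id", group_keys=False).apply(per_wrestler)

# e.g. rows["win_rate_90d"] = rolling_win_rate(rows, 90)
```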
Splits are temporal, not random k-fold. Matches before 2024 form training (~590K rows), 2024 forms validation (~38K), 2025 onward forms test (~5K). A random shuffle would scatter adjacent matches across the train/test boundary, leaking storyline state. We will see why that matters in the next section.
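The split itself is a few lines of date filtering on the same frame (rows here is the assumed feature table from the sketches above):

```python
import pandas as pd

# Temporal split as described: pre-2024 trains, 2024 validates, 2025 onward tests.
dates = pd.to_datetime(rows["event_date"])
train = rows[dates < "2024-01-01"]
val   = rows[(dates >= "2024-01-01") & (dates < "2025-01-01")]
test  = rows[dates >= "2025-01-01"]
```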
Two model architectures: a logistic-regression baseline using scikit-learn defaults, and a primary XGBoost model with 300 trees, max depth 6, learning rate 0.1, and early stopping on validation log-loss. Hyperparameters were minimally tuned. The goal is honest evaluation, not score-chasing.
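The XGBoost configuration maps almost directly onto the scikit-learn wrapper. A sketch; feature_cols stands in for the 35 engineered feature names, and the early-stopping patience of 20 rounds is my assumption, since the post doesn't state it:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=300,          # 300 trees
    max_depth=6,
    learning_rate=0.1,
    eval_metric="logloss",     # early stopping monitors validation log-loss
    early_stopping_rounds=20,  # patience value assumed
)
model.fit(
    train[feature_cols], train["result"],
    eval_set=[(val[feature_cols], val["result"])],
    verbose=False,
)
```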
Results, and the gap that matters
On the test set, XGBoost reaches 0.718 AUC with 0.662 accuracy and 0.636 log loss. Logistic regression reaches 0.698 AUC. Both meaningfully exceed the 0.50 coin-flip baseline and the 0.62 favored-wrestler-always-wins baseline computable from career win rates alone.
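Computing those numbers, including the favored-wrestler baseline from career win rates, looks roughly like this; how the original baseline was built is an assumption (career rates from training data only, unseen wrestlers scored at 0.5):

```python
from sklearn.metrics import roc_auc_score, accuracy_score, log_loss

proba = model.predict_proba(test[feature_cols])[:, 1]
print("AUC:     ", roc_auc_score(test["result"], proba))
print("Accuracy:", accuracy_score(test["result"], (proba > 0.5).astype(int)))
print("Log loss:", log_loss(test["result"], proba))

# Favored-wrestler baseline: score each row by that wrestler's career win rate
# computed from training data only; wrestlers unseen in training get 0.5.
career = train.groupby("wrestler_id")["result"].mean()
baseline = test["wrestler_id"].map(career).fillna(0.5)
print("Favored-wrestler AUC:", roc_auc_score(test["result"], baseline))
```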
On the validation set, the same XGBoost model reaches 0.952 AUC. A 23-point gap.
My first interpretation was overfitting. I tried regularization sweeps, capped tree depth, increased early-stopping patience, sampled features more aggressively. None of it closed the gap. Eventually I ran the diagnostic that should have been my first move: an ablation removing the streak features and the days-since-last-match feature.
Test AUC collapsed from 0.718 to 0.541. Validation AUC also fell, but proportionally less. The model's signal is concentrated almost entirely in booking-momentum features. The remaining 32 features round to noise individually.
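The ablation is simply a retrain on a reduced column list; the feature names below are illustrative placeholders for the streak and days-since-last-match columns:

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

momentum = ["win_streak", "loss_streak", "days_since_last_match"]  # illustrative names
ablated_cols = [c for c in feature_cols if c not in momentum]

ablated = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                        eval_metric="logloss", early_stopping_rounds=20)
ablated.fit(train[ablated_cols], train["result"],
            eval_set=[(val[ablated_cols], val["result"])], verbose=False)

print("Ablated test AUC:",
      roc_auc_score(test["result"], ablated.predict_proba(test[ablated_cols])[:, 1]))
```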
That ablation reframed the gap. It isn't conventional overfitting. It's that storylines persist across calendar boundaries: Roman Reigns in March is informative about Roman Reigns in April, in a way he is not informative about a different wrestler in some unrelated 2025 feud. Validation in 2024 sees the tail end of training-era storylines. Test in 2025 sees a turned-over creative landscape with new alignments and new pushes.
Once I accepted that, the gap stopped being a bug to fix and became a property to report. Closing it would have required either leakage (random splits) or pretending storylines don't have temporal structure (they obviously do).
What the model is actually learning
Aggregated across the test set, XGBoost feature importances tell a clean story. Current win streak carries 31% of total importance. Current loss streak carries 22%. Days since last match carries 13%. Head-to-head win rate adds 9%. The is_royal_rumble flag adds 3%; that match format has a famously non-uniform winner distribution that the model exploits.
Everything else combined accounts for 22%, with no single remaining feature breaking 2%.
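Pulling those shares out of the trained model is a two-liner; gain-based importance is my assumption about which importance type the percentages refer to:

```python
import pandas as pd

gain = pd.Series(model.get_booster().get_score(importance_type="gain"))
shares = (gain / gain.sum()).sort_values(ascending=False)  # normalize to importance shares
print(shares.head(10))
```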
In aggregate the model is a booking-pattern detector. It learns that wrestlers on multi-match win streaks tend to keep winning until a planned reversal, that wrestlers returning from time off tend to win their comeback match, that heels in title matches at PPVs lose more often than chance suggests. These are observations about human writers, not about athletic competition. The model is correctly identifying that booking decisions follow narrative templates, and making predictions consistent with those templates.
Limits
Selection bias toward televised matches. House-show and indie results are sparsely covered; the model generalizes best to PPV and weekly TV.
Era drift. A 1987 Hogan match and a 2025 Cody Rhodes match are scored by the same model, but booking philosophy has shifted considerably between them. Era-aware modeling could plausibly recover some of the val-to-test gap.
Gender imbalance. Women's-division match counts are smaller, especially before 2015. Expect wider error bars on women's predictions.
Per-match calibration is loose. The model identifies trends in booking decisions, not crisp probabilities. The model card explicitly disclaims betting use, and I want to repeat that here: this is analytics infrastructure, not a wagering tool.
What I'd build next
Three directions show real promise.
Match-rating regression. Cagematch's rating column captures crowd response, which is a real human signal rather than a writer's whim. Regressing on rating instead of classifying on outcome reframes the problem to one with a less-corrupted target. Same data, different ceiling.
Era-aware modeling. Explicit era embeddings (Attitude / Modern WWE / AEW) or per-era ensembles could test how much of the val/test gap is attributable to booking-style drift across decades.
Storyline NLP. Match descriptions and event names contain narrative context (feud arcs, debut setups, title chases) that the current numeric features cannot capture. Embedding storyline text with a small language model and feeding it into the classifier is the most promising path to AUC gains beyond ~0.75.
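A minimal sketch of that direction, assuming a sentence-transformers encoder and a hypothetical match_description text column, neither of which is part of the current pipeline:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Encode free-text storyline context and concatenate the vectors with the numeric features.
encoder = SentenceTransformer("all-MiniLM-L6-v2")          # model choice is an assumption
text_emb = encoder.encode(rows["match_description"].fillna("").tolist())
X_aug = np.hstack([rows[feature_cols].to_numpy(), text_emb])
```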
Why this matters past wrestling
The kayfabe problem isn't unique to scripted athletics. Any system where humans decide outcomes through narrative templates has the same shape: A/B test winners chosen by a head of growth, hiring panels, content moderation, anything where someone is escalating toward a planned payoff. Streak features dominate. Random splits leak. Validation performance overstates real-world performance until enough time passes that the storyline context turns over.
The most useful artifact from this project isn't the model. It's the documented gap. When the prediction target is a human decision, traditional validation underestimates true error in proportion to how much state persists across the temporal split, and the only honest move is to report the gap rather than tune around it.
All resources are public under permissive licenses. The dataset is on Kaggle and mirrored on Hugging Face. The trained model is on Hugging Face as well. A starter notebook gets you to a working prediction in about ten minutes. Source code is on GitHub.
If you're working on prediction problems where the labels come from human decisions and you want to compare notes on validation strategies, find me on Substack. I write about pattern recognition in messy systems at The Chaos Translator.