AFG  ·  Operational Guidebook
Compendium May 2026
A Wayfair Explainer Series Compendium  ·  Translated for AFG

The operational guidebook for American Furniture Group.

Seventeen chapters, forty-plus interactive figures, three full-length editorial volumes, a strategic meta-analysis, a Lindy audit of post-2020 frontier developments, and a fourteen-initiative roadmap — synthesized from a decade of Wayfair's published technical writing into one document a new executive can read, share, and teach with.

Synthesized From: 14 Articles + 2 Editorial Volumes
Interactive Figures: 40+ embedded inline
Reading Time: 4.5 hours · cover-to-cover
Audience: AFG Tech Leadership
§ 00.A

Table of Contents.

Part I  ·  Foundations
Ch. 01 Every classifier is, secretly, an encoding problem. § 01
Ch. 02 A line drawn with conviction, and every line it could have been. § 02
Ch. 03 When the data refuses to balance itself. § 03
Ch. 04 The trouble with class labels, and the quiet power of comparison. § 04
Part II  ·  Causal Inference & Experimentation
Ch. 05 How do you measure the effect of a single price? § 05
Ch. 06 The experiment you can't run, and how to run it anyway. § 06
Part III  ·  Recommender Systems
Ch. 07 Predicting taste from the company you keep. § 07
Ch. 08 Many models, one shelf. § 08
Ch. 09 Knowing whether your model is actually better. § 09
Part IV  ·  Computer Vision
Ch. 10 Where the object actually is. § 10
Ch. 11 A bottle, three angles, and the open problem beneath every product page. § 11
Ch. 12 To teach a model the world, first simulate it. § 12
Ch. 13 The model decays. The humans keep it honest. § 13
Part V  ·  Catalog Operations
Ch. 14 How Wayfair teaches a catalog of millions to describe itself. § 14
Part VI  ·  Strategic Synthesis
Ch. 15 The shape of data science at Wayfair — and the AFG roadmap. § 15
Part VII  ·  The Lindy Audit
Ch. 16 The Lindy test — what endured and what was quietly replaced. § 16
Ch. 17 The roadmap, re-examined in light of the audit. § 17
App. A Glossary  ·  key terms used throughout § A
App. B Reading paths  ·  six suggested orderings by role § B
§ 00.B

Executive Summary.

This guidebook collects nine years of Wayfair's published technical writing — fourteen Explainer-Series articles plus a longitudinal meta-analysis — and reorganizes it as an operational handbook for American Furniture Group. Every original article is embedded in full with all interactive figures functional. Every chapter is preceded by an editorial frame that does three things: states what Wayfair built (in plain language), translates it for AFG's scale, and supplies discussion prompts for the team meeting where this gets reviewed.

The guidebook is organized into seven parts. Part I covers the mathematical foundations every other technique sits on. Part II covers causal inference and experimentation — the methods that tell you whether anything you ship is actually working. Part III covers recommender systems, including the offline metrics that decide which model gets shipped. Part IV covers computer vision: pose, geometry, simulation, and the human-in-the-loop pattern. Part V covers catalog operations — the unglamorous foundation under everything else. Part VI is the strategic synthesis: the longitudinal shape of what Wayfair has built and the fourteen-initiative AFG roadmap. Part VII is the Lindy audit: a nine-year retrospective separating the patterns that survived the post-2020 frontier from the implementations that didn't, and a revised roadmap that incorporates the audit's findings.

The single most important observation across the whole corpus is about sequencing. Read the first fifteen chapters in order and a clear pattern emerges: the companies that struggle in this space generally try to start from the visible work — the generative-AI room visualizer, the conversational shopping assistant, the transformer recommender — without the search index, the catalog hygiene, or the incrementality measurement underneath. Wayfair's published trajectory, read carefully, is a long argument for the opposite ordering. Catalog and search first. Platform debt second. Recommender on top of that. Causal inference layered in so you can tell when any of it is working. Generative AI absorbed last, into the parts of the stack where the human cost was already highest.

The meta-analysis in Chapter 15 turns this observation into a phased roadmap of fourteen specific initiatives, organized by horizon: five Quick Wins (0–6 months), five Compounding Bets (6–18 months), and four Differentiators (18+ months). Every initiative has a method, a data requirement, a team size, and a payback window. The Quick Wins are designed to be self-funding; the Compounding Bets multiply on the foundation the Quick Wins build; the Differentiators only become possible once the first two phases are in place.

Three Questions This Guidebook Should Help Your Team Answer
  1. What are we building right now that — by Wayfair's own published evidence — is the wrong thing for our scale?
  2. Where in our roadmap are we tempted to skip the foundation (catalog hygiene, search relevance, MLOps) and jump straight to the visible work?
  3. If we adopted the fourteen-initiative roadmap as-is, which of our current commitments would we have to kill to free up the team capacity?
Part I

Foundations.

The mathematical primitives every other technique sits on top of — information theory, Bayesian inference, the rare-event problem, and pairwise learning.

CH. 01
Foundations · The Mathematics of Uncertainty

Every classifier is, secretly, an encoding problem.

What Wayfair Built

Wayfair's piece reframes machine-learning classification through Shannon's lens: a label is just a very short message, and training a model is a question of how short the message can get. Entropy measures the information content of the label distribution; cross-entropy measures the cost of describing reality using the model's predictions; KL divergence is the gap between the two. Every loss function you've ever optimized is, on this view, a measure of wasted bits.

Translation for AFG

For an AFG team, this is the lens to evaluate any classification problem in the catalog — product-category prediction, return-likelihood scoring, fraud detection, image-tagging. Before training a model, ask: what is the entropy of the label distribution we're trying to predict? A 50/50 problem has 1 bit of entropy; a 99-class furniture taxonomy where one class accounts for 80% of items has much less. Practical rule: if your model accuracy is high but only matches the naive prior, you've learned nothing — measure information gain, not raw accuracy. Build dashboards that report cross-entropy loss alongside accuracy on every classifier in production.
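
Worked Example  ·  Illustrative Sketch

A minimal sketch of the dashboard rule above, in plain numpy. The 95/5 class split and the prior-only "model" are illustrative values, not AFG data; the point is that accuracy and information gain can tell opposite stories about the same classifier.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (bits) of a discrete label distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def cross_entropy_bits(y_onehot, y_prob, eps=1e-12):
    """Average cost (bits) of describing the true labels with the model's probabilities."""
    return float(-(y_onehot * np.log2(y_prob + eps)).sum(axis=1).mean())

# Imbalanced two-class problem: 95% of items belong to class 0.
prior = np.array([0.95, 0.05])
print("label entropy:", round(entropy_bits(prior), 3), "bits")   # ~0.286 bits

# A "classifier" that always predicts the prior is 95% accurate but adds no information:
# its cross-entropy matches the label entropy, so information gain is ~0 (up to sampling noise).
rng = np.random.default_rng(0)
y = (rng.random(50_000) < prior[1]).astype(int)
y_onehot = np.eye(2)[y]
prior_model = np.tile(prior, (len(y), 1))

ce = cross_entropy_bits(y_onehot, prior_model)
print("accuracy of majority-class guessing:", round(float((y == 0).mean()), 3))
print("cross-entropy:", round(ce, 3), "bits")
print("information gain:", round(entropy_bits(prior) - ce, 3), "bits")
```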

Discussion Prompts  ·  For The Team Meeting
  1. Which of our existing classification models would look worse if we replaced accuracy with information gain on the dashboard?
  2. Do any of our team's metrics reward models for memorizing the prior rather than learning structure?
  3. Where in our catalog ops do we have a low-entropy distribution we're treating as if it were uniform?
Source Document  ·  Wayfair Explainer Series Ch. 01  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 02
Foundations · Distributions, Not Verdicts

A line drawn with conviction, and every line it could have been.

What Wayfair Built

The Bayesian framing replaces a single best classifier with a distribution over classifiers — and replaces a single prediction with a distribution over predictions. The interactive in this article lets you watch the posterior collapse as data accumulates. The practical payoff is calibrated uncertainty: when the model says "80% likely category A," you can actually trust the 80%.

Translation for AFG

Most retailers leave money on the table by treating model outputs as point estimates. For AFG, the highest-value Bayesian wins are in three places: (1) demand forecasting, where uncertainty bands drive inventory buffers, (2) pricing decisions, where the cost of being wrong is asymmetric, and (3) fraud scoring, where the threshold on confidence determines the false-positive rate. Don't try to make every model Bayesian — start with the three above. Tools: PyMC, Stan, or NumPyro for the modeling layer; conformal prediction (cheaper, frequentist) when full Bayesian inference is overkill.
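
Worked Example  ·  Illustrative Sketch

A minimal sketch of the cheaper, frequentist option named above — split-conformal intervals in plain numpy rather than the MAPIE or PyMC APIs. The calibration numbers are illustrative; the pattern is: hold out recent actuals, take a quantile of the absolute residuals, and wrap every new point forecast in that band.

```python
import numpy as np

def split_conformal_interval(cal_y_true, cal_y_pred, new_pred, alpha=0.1):
    """Return a (lo, hi) interval around new_pred with roughly (1 - alpha) coverage,
    calibrated on held-out residuals."""
    residuals = np.abs(np.asarray(cal_y_true) - np.asarray(cal_y_pred))
    n = len(residuals)
    # standard split-conformal quantile adjustment: (1 - alpha) * (n + 1) / n
    q = np.quantile(residuals, min(1.0, (1 - alpha) * (n + 1) / n))
    return new_pred - q, new_pred + q

# Usage: point forecasts from any model plus a calibration slice of recent actuals
# gives an uncertainty band you can size inventory buffers against.
cal_true = np.array([120, 95, 150, 80, 110, 140, 70, 130])   # recent actual weekly demand
cal_pred = np.array([110, 100, 140, 90, 105, 150, 75, 120])  # the model's forecasts for those weeks
lo, hi = split_conformal_interval(cal_true, cal_pred, new_pred=100.0, alpha=0.2)
print(f"80%-coverage interval for next week's demand: [{lo:.0f}, {hi:.0f}]")
```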

Discussion Prompts  ·  For The Team Meeting
  1. Where in our operations does a 50% confident prediction get treated the same as a 95% confident one?
  2. Which decisions does our team make that have asymmetric costs — and does our model output reflect that?
  3. What would a "calibrated confidence" SLA look like for our most important model?
Source Document  ·  Wayfair Explainer Series Ch. 02  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 03
Foundations · The Rare Event Problem

When the data refuses to balance itself.

What Wayfair Built

Real catalogs are radically imbalanced — fraud, returns, defects, conversions, the rare-but-expensive events all sit on the long tail of the label distribution. Wayfair's piece walks through the standard playbook: stratified sampling, class-weighting in the loss, oversampling (SMOTE), undersampling, threshold tuning. The hidden lesson is that which technique works depends on what you're optimizing for — recall on the rare class is a different problem than calibrated probabilities.

Translation for AFG

Almost every interesting AFG classifier is, at heart, an imbalanced-data problem. The first decision isn't which balancing technique to use — it's what metric you're actually optimizing. Returns prediction at a 5% base rate? Optimize PR-AUC, not ROC-AUC. Fraud at 0.5%? Use cost-weighted precision-at-k, where k is your review-team capacity. Production-grade rule: keep the imbalance in training (don't oversample) but tune the threshold on a held-out validation set — oversampling destroys probability calibration in ways that hurt downstream decision-making.
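
Worked Example  ·  Illustrative Sketch

A minimal sketch of the capacity-aware rule above, in plain numpy. The 0.5% base rate, 20,000 orders per day, and 100-case review capacity are illustrative assumptions; the two functions are the whole pattern — pick the threshold from review capacity, then report precision among the cases the team will actually see.

```python
import numpy as np

def threshold_for_capacity(val_scores, daily_volume, review_capacity):
    """Pick the score cutoff so the expected number of flagged cases per day
    matches the review team's capacity."""
    flag_rate = review_capacity / daily_volume
    return np.quantile(val_scores, 1.0 - flag_rate)

def precision_at_k(val_scores, val_labels, k):
    """Precision among the k highest-scoring validation cases."""
    top_k = np.argsort(val_scores)[::-1][:k]
    return float(np.mean(val_labels[top_k]))

# Illustrative validation set: ~20k orders, 0.5% fraud base rate, a noisy score.
rng = np.random.default_rng(0)
val_labels = (rng.random(20_000) < 0.005).astype(int)
val_scores = rng.random(20_000) * 0.5 + val_labels * rng.random(20_000) * 0.5

cutoff = threshold_for_capacity(val_scores, daily_volume=20_000, review_capacity=100)
print("alert threshold:", round(float(cutoff), 3))
print("precision@100:", round(precision_at_k(val_scores, val_labels, k=100), 3))
```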

Discussion Prompts  ·  For The Team Meeting
  1. What is the actual base rate of every rare-event classifier we run, and is the metric we report on it appropriate?
  2. Are any of our oversampled models outputting probabilities that downstream systems treat as calibrated?
  3. Do we have a single "capacity-aware" precision-at-k metric for any decision-support model?
Source Document  ·  Wayfair Explainer Series Ch. 03  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 04
Foundations · Pairwise > Absolute

The trouble with class labels, and the quiet power of comparison.

What Wayfair Built

Asking an expert "is this image modern or traditional?" produces noisy labels. Asking "which of these two images is more modern?" produces dramatically less noise — humans are far better at pairwise judgment than absolute scoring. Wayfair builds entire labeling pipelines around this fact, using Bradley-Terry-Luce models to convert thousands of pairwise comparisons into a continuous score per item.

Translation for AFG

If AFG ever needs to label a subjective property at scale — style, quality, fit, photo-attractiveness — do not ask labelers for a 1–5 score. Build a pairwise-comparison interface and reconstruct the scores via Bradley-Terry. This pattern works for choosing the "top photo" per SKU, prioritizing which products get studio reshoots, A/B-testing generated room scenes, and ranking designer-curated bundles. The labeling cost drops by an order of magnitude because pairwise agreement is high enough that you can pool labels across many cheap reviewers.
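
Worked Example  ·  Illustrative Sketch

A minimal sketch of the reconstruction step, using the classic Bradley-Terry minorization-maximization updates in plain numpy (not Wayfair's production pipeline). The six toy judgments are illustrative; at real scale the same loop runs over thousands of pooled comparisons.

```python
import numpy as np

def bradley_terry(n_items, comparisons, iters=200):
    """comparisons: list of (winner_idx, loser_idx) pairs. Returns one strength score per item."""
    wins = np.zeros(n_items)
    n_between = np.zeros((n_items, n_items))
    for w, l in comparisons:
        wins[w] += 1
        n_between[w, l] += 1
        n_between[l, w] += 1
    p = np.ones(n_items)
    for _ in range(iters):
        # Zermelo / minorization-maximization update: p_i = wins_i / sum_j n_ij / (p_i + p_j)
        denom = (n_between / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = wins / np.maximum(denom, 1e-12)
        p = p / p.sum()                      # normalize to keep the scale fixed
    return p

# Three images, six noisy "which is more modern?" judgments; item 2 wins most often.
judgments = [(2, 0), (2, 1), (2, 0), (1, 0), (2, 1), (0, 1)]
scores = bradley_terry(3, judgments)
print("ranking, most 'modern' first:", np.argsort(scores)[::-1])
```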

Discussion Prompts  ·  For The Team Meeting
  1. Where do we currently ask raters for absolute scores on subjective properties?
  2. Could we run a pairwise-vs-absolute pilot on one labeling task this quarter?
  3. How would we use a Bradley-Terry-derived score to feed our search ranker?
Source Document  ·  Wayfair Explainer Series Ch. 04  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
Part II

Causal Inference & Experimentation.

Knowing whether anything we ship actually works — pricing, marketing channels, regional rollouts, and policy changes when randomization isn't possible.

CH. 05
Causal Inference · Pricing

How do you measure the effect of a single price?

What Wayfair Built

This is the most operationally consequential article in the series. Wayfair walks through four approaches to measuring price elasticity: regression on observational data, instrumental variables, machine learning with partial dependence and Double-ML, and randomized price experiments analyzed via difference-in-differences. Each method makes a different bet about what the confounders are. The honest takeaway: there is no one right method, but there is a wrong one for any given decision.

Translation for AFG

AFG's pricing team likely runs cost-plus or competitor-matching pricing today. The first step toward analytical pricing is not machine learning — it is a clean experimental platform that can randomize prices on a small share of traffic. Once you have that, the four methods stack: use observational ML for first-pass elasticity estimates across the catalog, IV when you have a natural instrument (cost shocks, competitor price moves), propensity-score weighting when you can't randomize but have rich covariates, and randomized experiments to validate the highest-impact decisions. The same toolkit extends naturally to marketing-spend questions — uplift modeling for incremental-treatment effects, MMM for top-down channel attribution. Order of operations: experimental infrastructure first, ML second, advanced causal inference third.
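
Worked Example  ·  Illustrative Sketch

A minimal difference-in-differences sketch with statsmodels, for the randomized-experiment end of the stack. The DataFrame and its column names (units_sold, treated, post) are illustrative, not AFG's schema; the coefficient on the treated:post interaction is the DiD estimate.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per SKU-period observation: treated = SKU got the price change,
# post = observation is after the change date. Toy numbers for illustration only.
df = pd.DataFrame({
    "units_sold": [100, 98, 102, 95, 80, 83, 78, 60],
    "treated":    [0,   0,   1,   1,  0,  0,  1,  1],
    "post":       [0,   0,   0,   0,  1,  1,  1,  1],
})

# DiD regression: the treated:post coefficient is the estimated effect of the price change,
# net of the time trend that also hit the untreated SKUs.
model = smf.ols("units_sold ~ treated + post + treated:post", data=df).fit()
print("DiD estimate (units):", round(model.params["treated:post"], 2))
```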

Discussion Prompts  ·  For The Team Meeting
  1. Do we have any infrastructure today to randomize price on a small percentage of traffic?
  2. Where in our pricing operation are we most likely confounding correlation with causation?
  3. Which 20 SKUs would we run a pilot price experiment on if we could start tomorrow?
Source Document  ·  Wayfair Explainer Series Ch. 05  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 06
Causal Inference · Quasi-Experiments

The experiment you can't run, and how to run it anyway.

What Wayfair Built

The synthetic control method is how you measure the impact of an intervention you can't randomize: a marketing campaign in California, a policy change in one warehouse, a feature rolled out region-by-region. The technique builds a weighted combination of "untreated" units that tracks the treated unit's pre-intervention trajectory, then attributes the post-intervention divergence to the intervention itself. This is the foundation of geo-experiments at Wayfair scale.

Translation for AFG

AFG cannot run user-level A/B tests for many interesting interventions — TV campaigns, regional promotions, supplier changes, policy updates. Build geo-experimentation infrastructure as one of your first compounding bets. The technical lift is small (Google's CausalImpact library is essentially three lines of R; Meta's GeoLift is similar), and the strategic payoff is huge: every marketing-channel decision becomes measurable, and you can finally retire the last-touch attribution model that the marketing team has known is wrong for years. Pair geo-experiments with DMA-level KPI rollups in your warehouse and you have a decade-long advantage over peers still litigating channel credit by spreadsheet.
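
Worked Example  ·  Illustrative Sketch

A minimal synthetic-control sketch in plain numpy/scipy rather than the CausalImpact or GeoLift APIs, on simulated DMA-level data. Non-negative weights are fit on the pre-campaign weeks only; the post-campaign gap between the treated DMA and its synthetic twin is the estimated lift (the 4.0 injected here is, of course, made up).

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
weeks_pre, weeks_post, n_donors = 30, 10, 12

# Simulated weekly KPI for 12 untreated ("donor") DMAs and 1 treated DMA.
base = rng.normal(100, 10, size=n_donors)
donors = base + rng.normal(0, 2, size=(weeks_pre + weeks_post, n_donors))
treated = donors[:, :3].mean(axis=1) + rng.normal(0, 1, weeks_pre + weeks_post)
treated[weeks_pre:] += 4.0                     # the "true" campaign lift we hope to recover

# Fit non-negative weights on the pre-campaign weeks only.
weights, _ = nnls(donors[:weeks_pre], treated[:weeks_pre])
synthetic = donors @ weights                   # the treated DMA's synthetic twin

lift = treated[weeks_pre:] - synthetic[weeks_pre:]
print("estimated weekly lift:", round(float(lift.mean()), 2))   # should land near 4.0
```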

Discussion Prompts  ·  For The Team Meeting
  1. Which of our marketing decisions in the last year would we have made differently with a geo-experiment readout?
  2. What is our smallest unit of geographic randomization (DMA? state? store? zip?)?
  3. Do we have a clean DMA-level rollup of weekly KPIs in the warehouse today?
Source Document  ·  Wayfair Explainer Series Ch. 06  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
Part III

Recommender Systems.

Matching customers to products at scale — from the foundational matrix-factorization pattern to modern stacked architectures and the offline metrics that decide what ships.

CH. 07
Recommenders · The Foundational Pattern

Predicting taste from the company you keep.

What Wayfair Built

Collaborative filtering is the foundational recommender pattern: matrix factorization on a user–item interaction matrix. The radical idea is that the best signal for what you want is what people like you wanted. Wayfair's article walks the reader through the dot-product geometry, the cold-start problem, and why pure CF — despite being decades old — is still the workhorse beneath more glamorous transformer-based recommenders.

Translation for AFG

For AFG, collaborative filtering is a two-week first build, not a research project. The open-source Implicit and LightFM libraries, or even raw scipy.sparse plus ALS (alternating least squares), on three months of clickstream will produce useful recommendations on day one. Don't start with a two-tower model. Don't start with a vector database. Start with CF, measure NDCG against a popularity baseline, ship it to the homepage carousel, then iterate. Most retailers your size never get past the popularity baseline because they aim too high too early.
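
Worked Example  ·  Illustrative Sketch

A minimal ALS sketch in plain numpy (the Implicit and LightFM libraries do this faster and with implicit-feedback confidence weighting, but the alternating ridge-regression loop below is the whole idea). The 4×5 click matrix is a toy; three months of clickstream gives you the real one.

```python
import numpy as np

def als(interactions, factors=8, reg=0.1, iters=20):
    """interactions: dense user x item matrix of implicit signals (e.g., click counts)."""
    n_users, n_items = interactions.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    I = np.eye(factors)
    for _ in range(iters):
        # Fix item vectors, solve the ridge regression for user vectors — then vice versa.
        U = np.linalg.solve(V.T @ V + reg * I, V.T @ interactions.T).T
        V = np.linalg.solve(U.T @ U + reg * I, U.T @ interactions).T
    return U, V

# Toy 4-user x 5-item click matrix; recommend the top unclicked item for user 0.
X = np.array([[3, 0, 0, 1, 0],
              [0, 2, 1, 0, 0],
              [2, 0, 0, 2, 1],
              [0, 1, 3, 0, 0]], dtype=float)
U, V = als(X, factors=3)
scores = U[0] @ V.T
scores[X[0] > 0] = -np.inf                     # mask items the user already clicked
print("recommend item:", int(np.argmax(scores)))
```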

Discussion Prompts  ·  For The Team Meeting
  1. What's our current homepage recommendation logic — is it more sophisticated than "popular + new"?
  2. Could we ship a CF baseline in two weeks if we agreed today?
  3. How would we A/B test a CF recommender against current logic — DMA split or user split?
Source Document  ·  Wayfair Explainer Series Ch. 07  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 08
Recommenders · The Production Stack

Many models, one shelf.

What Wayfair Built

Production recommender systems aren't single models — they're stacks. Content-based filtering handles cold-start, collaborative filtering handles tail items, embeddings handle semantic similarity, and a final ranker picks the order. Wayfair's piece shows the full stack and walks through which techniques solve which blind spots. The pull-quote: each approach has a blind spot; production systems use them in concert.

Translation for AFG

After AFG ships its CF baseline (Chapter 7), the right next move is not a deep transformer — it's adding two layers: a content-based recall channel for new items (using product attributes you already have), and a learning-to-rank (LTR) reranker on top of all retrieval channels. Implement the reranker as a GBDT (LightGBM or XGBoost) over engineered features before reaching for anything neural. This three-layer stack (content + CF + reranker) covers 90% of what Wayfair's MARS transformer does at 5% of the engineering cost. Only consider transformers when your daily session count clears 1M+ and you've exhausted feature engineering on the reranker.
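
Worked Example  ·  Illustrative Sketch

A minimal sketch of the GBDT reranker layer using LightGBM's LGBMRanker with the lambdarank objective. The feature matrix, relevance labels, and session grouping here are synthetic stand-ins; in production each row is one (session, candidate) pair with engineered features from every retrieval channel.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_sessions, candidates_per_session = 200, 20
n = n_sessions * candidates_per_session

# X: one row per (session, candidate) pair — e.g. CF score, content-similarity score,
# price vs. category median, item age, position in the retrieval list.
X = rng.normal(size=(n, 5))
relevance = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 1.0).astype(int)
group_sizes = [candidates_per_session] * n_sessions   # rows per session, in order

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200, learning_rate=0.05)
ranker.fit(X, relevance, group=group_sizes)

# At serving time: score the merged candidate set from all retrieval channels, sort, return top-N.
session_candidates = rng.normal(size=(20, 5))
order = np.argsort(ranker.predict(session_candidates))[::-1]
print("reranked candidate order:", order[:5])
```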

Discussion Prompts  ·  For The Team Meeting
  1. What is our current cold-start strategy for new SKUs — popularity? attribute lookup? nothing?
  2. Where in our stack would a learning-to-rank reranker sit, and what features would it use?
  3. Have we benchmarked the maximum NDCG lift available from our current data before considering deeper architectures?
Source Document  ·  Wayfair Explainer Series Ch. 08  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 09
Recommenders · The Question That Matters

Knowing whether your model is actually better.

What Wayfair Built

Online A/B tests are the gold standard, and they take weeks. So how do recommender teams decide which model is worth shipping in the first place? The answer is offline evaluation, and the metric is NDCG — Normalized Discounted Cumulative Gain. Wayfair's article walks through the math, the common failure modes (popularity bias, position bias, the offline-online gap), and why teams that don't trust their offline metrics end up shipping nothing.

Translation for AFG

If AFG ships even one recommender, you need an offline evaluation harness from day one. NDCG@10 on held-out user sessions is the table-stakes metric. The gotcha: a temporal split is mandatory — split by date so the test set is strictly later than training. Random splits leak future information and produce metrics that don't survive contact with production. Build the harness before the second model. Without it, your team will ship things that don't help and won't know why.
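
Worked Example  ·  Illustrative Sketch

A minimal sketch of the harness: an NDCG@k function in plain numpy plus the temporal split done by date. The interactions frame, its column names, and the five-item model ranking are illustrative — the two things worth copying are the split-by-cutoff-date and the held-out-relevance scoring.

```python
import numpy as np
import pandas as pd

def ndcg_at_k(ranked_relevances, k=10):
    """ranked_relevances: relevance of items in the order the model ranked them."""
    rel = np.asarray(ranked_relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(ranked_relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Temporal split: everything before the cutoff trains, everything after evaluates.
interactions = pd.DataFrame({
    "user": [1, 1, 2, 2, 1, 2],
    "item": [10, 11, 12, 10, 13, 14],
    "date": pd.to_datetime(["2026-01-03", "2026-01-10", "2026-01-12",
                            "2026-02-02", "2026-02-05", "2026-02-09"]),
})
cutoff = pd.Timestamp("2026-02-01")
train = interactions[interactions.date < cutoff]
test = interactions[interactions.date >= cutoff]

# For one test user: score the model's ranked list 1 where the user actually interacted later.
model_ranking = [13, 10, 99, 11, 14]                   # hypothetical model output
held_out = set(test.loc[test.user == 1, "item"])
print("NDCG@10:", round(ndcg_at_k([1 if i in held_out else 0 for i in model_ranking]), 3))
```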

Discussion Prompts  ·  For The Team Meeting
  1. Does our team have a single offline metric we trust enough to make ship-vs-don't-ship decisions on?
  2. Are our train/test splits temporal or random?
  3. What's our offline-to-online correlation for the metric we use — and have we ever measured it?
Source Document  ·  Wayfair Explainer Series Ch. 09  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
Part IV

Computer Vision.

Seeing the catalog — pose, geometry, simulation, and the human review loop that keeps every catalog vision model honest as the world drifts under it.

CH. 10
Computer Vision · Geometry From Pixels

Where the object actually is.

What Wayfair Built

Object pose estimation is the problem of recovering an object's 3D position and orientation from a 2D image. Wayfair's article walks through the classic pipeline: detect 2D keypoints, match them to known 3D model points, and solve the Perspective-n-Point problem with RANSAC. This is the math underneath every AR-furniture-in-your-room demo and the building block of robotic manipulation.
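
Worked Example  ·  Illustrative Sketch

A minimal sketch of the final solve step using OpenCV's built-in cv2.solvePnPRansac. The 3D keypoints, 2D detections, and camera intrinsics below are toy values, not a real catalog asset; in the full pipeline the 2D points come from a keypoint detector and the 3D points from the product's known model.

```python
import numpy as np
import cv2

# 3D keypoints on the known product model (object coordinates), and their matched
# 2D detections in the photo (pixel coordinates). Toy values for illustration.
object_points = np.array([[0, 0, 0], [0.8, 0, 0], [0.8, 0, 0.9], [0, 0, 0.9],
                          [0, 0.4, 0], [0.8, 0.4, 0.9]], dtype=np.float64)
image_points = np.array([[320, 410], [520, 400], [515, 180], [318, 190],
                         [360, 430], [545, 200]], dtype=np.float64)

camera_matrix = np.array([[800, 0, 320],
                          [0, 800, 240],
                          [0,   0,   1]], dtype=np.float64)
dist_coeffs = np.zeros(5)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points,
                                             camera_matrix, dist_coeffs)
if ok:
    R, _ = cv2.Rodrigues(rvec)        # rotation matrix: object orientation in the camera frame
    print("object position (camera frame):", tvec.ravel())
    print("RANSAC inliers:", 0 if inliers is None else len(inliers))
```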

Translation for AFG

Pose estimation is not something AFG should build from scratch in 2026. Off-the-shelf is now extraordinary: Apple's RoomPlan API, Google's ARCore, Meta's Quest pass-through, plus open-source models like FoundationPose. Where AFG should invest is in 3D model availability for the catalog: even imperfect 3D models from suppliers, run through pose-aware rendering, let you ship AR "see it in your room" experiences using vendor SDKs. The technical depth is in the asset pipeline, not the pose math.

Discussion Prompts  ·  For The Team Meeting
  1. What percentage of our top-1000 SKUs have any kind of 3D asset (CAD, photogrammetric, vendor-supplied)?
  2. Have we evaluated Apple RoomPlan, Google ARCore, or vendor 3D-conversion services for the gap?
  3. Is the bottleneck for AR-in-room experiences our 3D coverage or our front-end?
Source Document  ·  Wayfair Explainer Series Ch. 10  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 11
Computer Vision · Reconstruction From Catalog Photos

A bottle, three angles, and the open problem beneath every product page.

What Wayfair Built

Stitching panoramas is well-trodden ground. Reconstructing a real 3D object from a handful of catalog photographs — without anyone ever measuring it — is something else entirely. This article is Wayfair being honest about an unsolved problem: photogrammetry from sparse, uncalibrated, supplier-provided product photos. The interactive walks through why the math is fundamentally underdetermined.

Translation for AFG

This is the "do not attempt at home" chapter for AFG. Multi-view 3D reconstruction from supplier photos is genuinely an open research problem. The right move for a mid-market retailer is to buy or commission 3D assets — vendors like Cylindo, ThreeKit, and Imagine.io operate at SKU-level pricing that's a fraction of internal R&D. Reserve internal CV investment for downstream uses of those assets (room-scene rendering, AR, image generation) where the moat is in the pipeline, not the reconstruction.

Discussion Prompts  ·  For The Team Meeting
  1. What's our current per-SKU cost for high-quality product photography, and could 3D-asset commissioning compete with it?
  2. Have we evaluated Cylindo, ThreeKit, Imagine.io, or other vendor 3D pipelines?
  3. Is there a CV problem we're tempted to solve in-house that we should be buying?
Source Document  ·  Wayfair Explainer Series Ch. 11  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 12
Computer Vision · Teaching By Simulation

To teach a model the world, first simulate it.

What Wayfair Built

When real labeled data is scarce, expensive, or unsafe to collect, you simulate. Wayfair's article walks through synthetic-data generation for catalog vision tasks: rendering 3D models with varied lighting, backgrounds, occlusions, and poses to produce millions of perfectly-labeled training images. The trick is the sim-to-real gap — training on synthetic data and deploying on real photos requires careful domain randomization.

Translation for AFG

Synthetic data is the unlock for any AFG vision model that needs labels at scale. Three concrete applications: (1) damage detection on incoming-package conveyors (label thousands of synthetic damaged-box images cheaply, then fine-tune on a few hundred real ones), (2) shelf compliance if you have stores or showrooms, (3) defect detection in returns processing. For all three, off-the-shelf simulators (NVIDIA Omniverse, Unity Perception, Blender + Python) get you 80% of the way. The expensive part is not the synthesis — it's curating a small, real-world test set to validate transfer.
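
Worked Example  ·  Illustrative Sketch

A minimal domain-randomization sketch. render_damaged_box() is a hypothetical stand-in for whatever renderer you choose (Omniverse Replicator, Unity Perception, or Blender scripting each have their own APIs); the durable part is that every scene parameter the real world varies gets sampled per image, and the label comes free because you chose it.

```python
import json
import random

def sample_scene_params():
    """Randomize everything the real conveyor varies: lighting, camera, background, damage."""
    return {
        "light_intensity": random.uniform(200, 2000),
        "light_angle_deg": random.uniform(0, 360),
        "camera_height_m": random.uniform(0.8, 2.5),
        "background":      random.choice(["conveyor", "warehouse_floor", "pallet", "plain"]),
        "box_texture":     random.choice(["kraft", "white", "printed", "taped"]),
        "damage_type":     random.choice(["crush", "puncture", "water_stain", "none"]),
        "occlusion_frac":  random.uniform(0.0, 0.4),
    }

def render_damaged_box(params):
    """Hypothetical renderer call — replace with your engine's API."""
    return f"img_{hash(json.dumps(params, sort_keys=True)) & 0xffff:04x}.png"

dataset = []
for _ in range(10_000):
    params = sample_scene_params()
    # The label comes free: we chose the damage type, so no human annotation is needed.
    dataset.append({"image": render_damaged_box(params),
                    "label": params["damage_type"],
                    "params": params})
print(dataset[0])
```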

Discussion Prompts  ·  For The Team Meeting
  1. What are our top three vision problems that we've avoided because real-world labels are too expensive?
  2. Have we evaluated open-source simulators (Omniverse, Unity Perception) for any of them?
  3. How many real-world labeled examples do we have for each problem to validate sim-to-real transfer?
Source Document  ·  Wayfair Explainer Series Ch. 12  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
CH. 13
Computer Vision · The Human in the Loop

The model decays. The humans keep it honest.

What Wayfair Built

This is a structural insight more than a technique: every catalog vision model in production decays as the catalog drifts. New product categories appear, photography styles change, supplier image conventions shift. Wayfair's article walks through the operational pattern: a continuous loop of model predictions, sampled human review, retraining, redeployment. The labeling team is not a one-off cost; it is a permanent component of the production model.

Translation for AFG

Any AFG team shipping a vision model must staff a review loop alongside it from day one. The cheapest version: 1–2 catalog-ops people sample 50–100 model predictions per day, label them, and feed corrections back into a retraining queue. Tools: Labelbox, Scale AI, or open-source CVAT for the labeling UI; Snorkel for weak-supervision label generation. The mistake to avoid: treating the initial training set as the model's complete diet — that's how you get a model that's 97% accurate at launch and 84% accurate eight months later.
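
Worked Example  ·  Illustrative Sketch

A minimal sketch of the daily sampler behind that loop, in pandas. Column names are illustrative; the mix of random samples (to catch drift anywhere) and low-confidence samples (where the model is most likely wrong) is the part worth keeping, and the reviewer-vs-model disagreement rate is your decay alarm.

```python
import pandas as pd

def build_review_queue(predictions, n_random=50, n_uncertain=50):
    """predictions: one row per prediction made today, with a 'confidence' column."""
    random_slice = predictions.sample(n=min(n_random, len(predictions)), random_state=0)
    uncertain_slice = predictions.nsmallest(n_uncertain, "confidence")
    queue = pd.concat([random_slice, uncertain_slice]).drop_duplicates(subset="prediction_id")
    return queue[["prediction_id", "sku", "predicted_label", "confidence"]]

# Reviewed labels go back into a retraining table; track the disagreement rate over time.
today = pd.DataFrame({
    "prediction_id": range(1000),
    "sku": [f"SKU-{i:05d}" for i in range(1000)],
    "predicted_label": ["sofa"] * 1000,
    "confidence": pd.Series(range(1000)) / 1000.0,
})
print(build_review_queue(today).head())
```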

Discussion Prompts  ·  For The Team Meeting
  1. Do any of our production ML models have a defined re-labeling cadence today?
  2. Who owns the operational decision to pull a model off-line when it decays?
  3. What's our budget reality for permanent labeling capacity vs. the engineer-hour cost of a model going stale?
Source Document  ·  Wayfair Explainer Series Ch. 13  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
Part V

Catalog Operations.

The unglamorous foundation under search, filters, recommendations, and ad feeds — and the single highest-leverage AFG investment per dollar in 2026.

CH. 14
Catalog Ops · Structure At Scale

How Wayfair teaches a catalog of millions to describe itself.

What Wayfair Built

Product tagging — extracting structured attributes (color, material, style, dimensions) from supplier-provided unstructured text and images — is the unglamorous foundation under search, filters, recommendations, and ads. Wayfair's article walks through the full stack: regex and rules for high-precision attributes, classifiers for medium-confidence ones, weak supervision (Snorkel) for labeling at scale, and now LLMs for free-text extraction across millions of SKUs.

Translation for AFG

This is the highest-leverage AFG investment per dollar in 2026, and it requires almost no proprietary ML. A modern LLM (Gemini, Claude, GPT-4-class) can extract structured attributes from supplier text and images at a few cents per SKU. Build the pipeline once, run it across the whole catalog, validate with a 1–2 person editorial review team, and you have structured data that powers search, filters, recommendations, and ad-feed quality for years. The downstream lift is large enough that this single investment usually pays for several other initiatives.
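
Worked Example  ·  Illustrative Sketch

A minimal sketch of the extraction pipeline. call_llm() is a hypothetical placeholder for whichever provider SDK you use, and the attribute schema is illustrative; the durable parts are the fixed schema, the strict JSON validation with a rejection path, and the small sample routed to the editorial review team.

```python
import json
import random

ATTRIBUTE_SCHEMA = {"color": str, "material": str, "style": str, "width_cm": (int, float, type(None))}

PROMPT_TEMPLATE = """Extract the product attributes below from the supplier description.
Return only JSON with keys: color, material, style, width_cm (null if absent).

Description: {description}"""

def call_llm(prompt):
    """Hypothetical stand-in — replace with your provider's SDK call."""
    return '{"color": "walnut", "material": "wood", "style": "mid-century", "width_cm": 180}'

def extract_attributes(description):
    raw = call_llm(PROMPT_TEMPLATE.format(description=description))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None                               # route to the retry / human queue
    if set(parsed) != set(ATTRIBUTE_SCHEMA):
        return None
    if not all(isinstance(parsed[k], ATTRIBUTE_SCHEMA[k]) for k in parsed):
        return None
    return parsed

attrs = extract_attributes("Mid-century walnut sideboard, solid wood, 180cm wide.")
print(attrs)
# Route ~2% of extractions to the editorial review team to audit accuracy over time.
needs_review = random.random() < 0.02
```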

Discussion Prompts  ·  For The Team Meeting
  1. What percentage of our catalog has structured attributes (color, material, style) today?
  2. Have we run a 100-SKU LLM extraction pilot to benchmark accuracy against human editors?
  3. Which downstream system would benefit most from a 90%-tagged catalog vs. our current state?
Source Document  ·  Wayfair Explainer Series Ch. 14  ·  Full text + interactives
Embedded inline · scroll within frame ↑ Back to TOC
Part VI

Strategic Synthesis.

Pulling it all together — the longitudinal shape of what Wayfair has built, the method-versus-scale comparison, the translation filter, and the fourteen-initiative AFG roadmap.

CH. 15
Strategic Synthesis · The Shape, The Comparison, The Roadmap

A decade of Wayfair output, condensed into fourteen initiatives.

The previous fourteen chapters covered the techniques. This chapter covers the strategy. Embedded below is the full meta-analysis: a longitudinal read of every Wayfair Tech Blog post, conference paper, and cloud case study from 2017 to 2026, organized into four interactive figures and a phased fourteen-initiative roadmap for AFG.

The Four Figures

Figure 01 — Timeline. Stacked-area chart of eight problem domains across nine years. The 2021–22 plateau is the platform consolidation onto Vertex AI; the 2023–26 surge is generative AI absorbing the catalog, the search box, and the shopping experience itself.

Figure 02 — Method × Scale. For each of six problem domains, the comparison between what Wayfair runs and what AFG should run instead. The Wayfair line is struck through and labeled "they spend"; the AFG line is bold and labeled "you need".

Figure 03 — Translation Filter. Drag a slider from "smaller" to "Wayfair-scale" and watch fifteen initiatives migrate between BUILD and SKIP columns. At AFG-class scale (~15 on a 100-point scale), nine are worth building and six are Wayfair-scale moves to deprioritize.

Figure 04 — Phased Gantt. All fourteen initiatives across thirty months, color-coded by horizon. Quick Wins (A–E) self-fund the platform investment that the Compounding Bets (F–J) need; the Compounding Bets generate the data-quality and measurement infrastructure that the Differentiators (K–N) rest on.

How To Use This Chapter

This is the chapter to share with the board and walk a new technical hire through on day one. The four interactives are designed to be projected and discussed live: each one is a manipulable claim, not a static diagram. Open the embedded document below in a separate tab if you want to present it full-screen — the interactives respond to mouse, slider, and button input.

Discussion Prompts  ·  For The Quarterly Strategy Review
  1. If we adopted the fourteen-initiative roadmap as-is, which of our current projects would we have to kill to free up the team capacity?
  2. Where on the Translation Filter (Fig. 03) do we believe AFG actually sits today — boutique, AFG-class, or already crossing into national-retailer scale?
  3. Which Quick Win (A through E) would we ship first — and which Differentiator (K through N) are we most tempted to start prematurely?
  4. What's our honest assessment of which Wayfair-scale moves we are tempted to copy because they're glamorous, even though Figure 02 says they're not the right answer at our scale?
Strategic Document  ·  Meta-Analysis & Roadmap Ch. 15  ·  Four interactive figures + 14-initiative Gantt
Embedded inline · all four figures interactive ↑ Back to TOC
Part VII

The Lindy Audit.

A nine-year retrospective separating the operational patterns that survived from the implementations that didn't — and naming where the bottleneck has moved in the post-2020 frontier era.

CH. 16
The Lindy Audit · A Nine-Year Retrospective

What endured, and what was quietly replaced.

Every chapter so far covered a Wayfair technique on its own terms. This chapter steps back and asks a different question: now that we're four years into the post-2020 frontier — foundation models, diffusion-based synthetic data, agentic systems, MCP, evals-as-CI/CD — which of Wayfair's published patterns held up, which got their bottleneck moved somewhere else, and which were quietly replaced? The embedded long-form below is the full audit: seven interactive figures, drawing on practitioner voices from the AI Engineer World's Fair, Latent Space, Karpathy's Dwarkesh interview, Eugene Yan, Hamel Husain, Shreya Shankar, Simon Willison, and the Shopify Sidekick team's published Cohen's Kappa journey.

The Four-Layer Frame

The audit organizes Wayfair's corpus by four layers: Goals (immortal — relevance, lift, safety), Disciplines (durable — A/B-test hygiene, eval harnesses, human-in-the-loop), Methods (turn over fast — Faster R-CNN, PnP + RANSAC, hand-engineered features), and Artifacts (durable only when they encode local truth — WANDS, customer judgments). The Lindy effect — the older a non-perishable idea is, the longer its remaining life expectancy — applies cleanly to the Goals and Disciplines layers, and brutally to the Methods layer. The thesis: the durability of an operating pattern is inversely proportional to how much of it is "the model" and directly proportional to how much of it is the discipline of measurement.

Why This Chapter Matters For AFG

This is the chapter to read before any technical-strategy meeting in 2026. The single most actionable insight: any team that built its identity around a specific implementation (a CNN attribute-tagger team, a hand-feature-engineered marketing-model team) is structurally exposed; any team that built its identity around a discipline (the eval team, the experimentation-platform team) compounds. Two new architectural concepts the chapter introduces, and that AFG should adopt as planning vocabulary: JIT instructions (the Sidekick framework's just-in-time, platform-injected context that any agent gets at runtime), and Simon Willison's lethal trifecta (private data, untrusted content, external communication — any agent that combines all three is an exfiltration vector). The roadmap chapter that follows (Ch. 17) translates this into specific revisions of the original 14-initiative plan.

Discussion Prompts  ·  For The Strategy Review
  1. Of our currently-running ML projects, which ones are betting on a method that's likely to be replaced within the next 24 months?
  2. Where in our organization is the discipline of measurement actually owned by a named individual or function?
  3. If we adopted the four-layer frame as our planning vocabulary, which currently-funded initiatives would we re-categorize from "method investment" to "discipline investment" — and which the other way?
  4. What's our honest answer to Karpathy's framing: "The decade of agents, not the year"? Are we planning for one or for the other?
Wayfair Explainer Series  ·  Volume 03 Ch. 16  ·  7 interactive figures
CH. 17
The Roadmap, Re-Examined · Through The Lindy Lens

Fourteen initiatives, revised.

The original 14-initiative roadmap (Chapter 15) was opinionated but written before the Lindy audit. With the audit complete, seven of the initiatives need a revised note: three need their method spec updated, two need their priority bumped up, one needs its scope expanded, and one needs to be partially descoped. The principle behind every change: shift weight from method investment toward discipline investment, and from custom training toward foundation-model adaptation.

Revisions to the Original 14

§ B (Search reranker) — bump priority. The LLM-as-judgeLLM-as-judgeUsing a large language model to label or evaluate data that previously required human annotators. Has cut the cost of search-relevance labeling by 10–100×. pattern collapses the labeling cost asymmetry (Etsy 2025; Eugene Yan, April 2025). Move from Phase 1 mid to Phase 1 first.
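A minimal sketch of what the LLM-as-judge labeling loop looks like, assuming the OpenAI Python SDK; the model name, prompt wording, and label scale are placeholders rather than the setup Etsy or Eugene Yan describe.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading search relevance for a furniture retailer.
Query: {query}
Product title: {title}
Answer with exactly one label: Exact, Substitute, Complement, or Irrelevant."""

def judge_relevance(query: str, title: str, model: str = "gpt-4o-mini") -> str:
    """Return a relevance label for one (query, product) pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, title=title)}],
    )
    return response.choices[0].message.content.strip()

# Label a small batch, then spot-check a sample against human judgments
# (that calibration step is what the Kappa gate in § L formalizes).
pairs = [("mid-century walnut desk", "Walnut Writing Desk, Tapered Mid-Century Legs")]
labels = [judge_relevance(query, title) for query, title in pairs]
```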

§ C (LLM catalog enrichment) — already first; expand scope. The Cornell-Nordstrom 2026 paper and Shopify's 40M-VLMVLMVision-Language Model. Class of multimodal foundation models — GPT-5, Gemini 2.5, Claude Vision, Llama 3.2 Vision, Qwen2-VL, Florence-2, Molmo, NVLM. Has functionally replaced bespoke CNN attribute classifiers for most catalog work.-calls-per-day pattern both validate this initiative's centrality. Add an explicit "applicability detection" sub-initiative — that's where humans add the most marginal value and where VLMs are weakest.

§ E (Starter ML platform) — re-spec. Original spec was dbt + Airflow + MLflow. Revised spec adds: an eval harness component from day one (not as a Phase 2 add-on), an MCP-aware tool registry, a feature store for offline-online consistency, and a Ground Truth SetGround Truth Set (GTX)Shopify Sidekick's term for an evaluation set sampled from real production traces, not curated golden examples. The discipline distinction matters: curated examples produce systems that pass tests but fail in the wild. sampling pattern from production traces. Don't ship the platform without these four.
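The Ground Truth Set component is the easiest of the four to under-build, so here is a minimal sketch of the sampling pattern: draw eval cases from real production traces, stratified so rare-but-important behaviors are represented. The JSONL trace format and the "intent" field are assumptions for illustration.

```python
import json
import random
from collections import defaultdict

def sample_ground_truth_set(trace_path: str, per_stratum: int = 25, seed: int = 7) -> list[dict]:
    """Sample an eval set from production traces (one JSON object per line),
    stratified by intent so the set is not dominated by head traffic."""
    rng = random.Random(seed)
    strata: dict[str, list[dict]] = defaultdict(list)
    with open(trace_path) as f:
        for line in f:
            trace = json.loads(line)
            strata[trace.get("intent", "unknown")].append(trace)
    sampled = []
    for intent, traces in strata.items():
        rng.shuffle(traces)
        sampled.extend(traces[:per_stratum])
    return sampled

# gtx = sample_ground_truth_set("traces/assistant-2026-05.jsonl")
```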

§ F (Visual search) — revise method. Original spec said CLIP-class embeddings. Revised: SigLIPSigLIPGoogle's sigmoid-loss CLIP variant (2023). The practitioner default for new builds in 2025–2026 because of its sigmoid loss, better scaling, and retrieval quality. for retrieval embeddings, DINOv2DINOv2Meta's 2023 self-supervised visual foundation model. The default visual backbone in 2025–2026 when text alignment isn't needed (segmentation, pose, fine-grained retrieval). Used as the heart of the December 2024 efficient generative classification pipeline. as the visual backbone where text alignment isn't needed, SAM 2SAM 2Meta's Segment Anything Model 2 (2024). Replaces nearly all task-specific segmentation models for catalog work; SAM 2 added video. for segmentation, vector DB for ANN. The change is small but compounds.
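A minimal sketch of the revised retrieval path, assuming Hugging Face transformers' SigLIP support and FAISS as a stand-in for the vector DB; the checkpoint name is a placeholder, and a managed ANN index would replace the flat FAISS index at catalog scale.

```python
import faiss
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CHECKPOINT = "google/siglip-base-patch16-224"  # placeholder SigLIP checkpoint
processor = AutoProcessor.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)

@torch.no_grad()
def embed_images(paths: list[str]) -> torch.Tensor:
    """Embed images and unit-normalize so inner product equals cosine similarity."""
    images = [Image.open(path).convert("RGB") for path in paths]
    inputs = processor(images=images, return_tensors="pt")
    features = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1)

# Index the catalog once; query it per request.
catalog = embed_images(["sofa_001.jpg", "sofa_002.jpg"]).numpy()
index = faiss.IndexFlatIP(catalog.shape[1])  # exact search; swap for ANN at scale
index.add(catalog)
scores, ids = index.search(embed_images(["query_photo.jpg"]).numpy(), k=2)
```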

§ G (Two-tower recommender) — descope partially. Original spec said two-tower as the next step after CF baseline. With LLM-as-judge offline eval and SigLIP-based item embeddings, the two-tower investment can be deferred. Skip directly from a tuned LambdaMART reranker to a hybrid LLM-RecSys pattern (per Eugene Yan's September 2025 hands-on project) when scale justifies it.
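For reference, the "tuned LambdaMART reranker" step the revision keeps is a small amount of code with LightGBM's ranking objective; the feature matrix, labels, and per-query group sizes below are stand-ins for real training data.

```python
import numpy as np
import lightgbm as lgb

# Stand-in training data: 3 queries, 4 candidate products each, 8 ranking features.
X = np.random.rand(12, 8)
y = np.random.randint(0, 3, size=12)  # graded relevance labels (0-2)
group = [4, 4, 4]                     # candidates per query, in row order

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    n_estimators=200,
    learning_rate=0.05,
    min_child_samples=1,  # only needed because this toy dataset is tiny
)
ranker.fit(X, y, group=group)

# Rank one query's candidates by predicted score.
scores = ranker.predict(X[:4])
order = np.argsort(-scores)
```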

§ I (Decorify-class room generation) — bump priority slightly. Diffusion-augmentation pipelines (Stable Diffusion 3.5, FLUX, Imagen 3) plus ControlNet make this a six-week build in 2026, not a six-month one. Move from late Phase 2 to mid Phase 2.

§ L (Conversational LLM assistant) — replace pattern. Original spec said RAG over catalog. Revised: build the merchant-simulator pattern (per Shopify Sidekick's published architecture) before the assistant ships customer-facing. The simulator is the Phase 2 deliverable; the assistant graduates to customer-facing only after Cohen's Kappa hits 0.5+ against humans.
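The Kappa gate itself is a one-liner once you have paired labels on the same evaluation cases; a minimal sketch with scikit-learn, where the two label lists are placeholders for matched human and simulator judgments.

```python
from sklearn.metrics import cohen_kappa_score

# Paired judgments on the same cases (placeholders).
human_labels = ["good", "bad", "good", "good", "bad", "good"]
judge_labels = ["good", "bad", "good", "bad", "bad", "good"]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's Kappa = {kappa:.2f}")  # 0.67 for this toy example

# The graduation gate from the revision: customer-facing only above 0.5.
assert kappa >= 0.5, "Judge not yet calibrated against humans; keep iterating."
```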

The Three New Initiatives To Add

The Lindy audit also surfaces three initiatives the original roadmap missed entirely.

§ O (new) — Eval harness as a first-class platform component. Treat the eval harness as a separately staffed and budgeted system, not a feature of the ML platform. This is the "infrastructure around the agent" that Hamel Husain says beats model improvements. Phase 1, alongside § E.
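To make "separately staffed and budgeted system" concrete, a minimal sketch of the harness's core loop: it runs whatever system currently sits behind the interface over the Ground Truth Set and emits a report the platform tracks over time. The EvalCase schema is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    inputs: dict
    check: Callable[[str], bool]  # assertion on the system's output

def run_harness(system: Callable[[dict], str], cases: list[EvalCase]) -> dict:
    """Run every case and return a report that is tracked release over release,
    independent of which model happens to sit behind `system` today."""
    failures = [case.case_id for case in cases if not case.check(system(case.inputs))]
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }
```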

§ P (new) — Synthetic-data pipeline on top of CAD assets. Wayfair's 3D library is a strategic asset; AFG should treat any owned 3D / CAD assets the same way. NVIDIA Cosmos and Omniverse Replicator make every CAD file a synthetic-data factory. Phase 2.

§ Q (new) — MCP-aware integration layer. Anthropic's Model Context Protocol was the single biggest practitioner-acknowledged shift at the AI Engineer World's Fair 2025. Build a small, well-governed MCP integration layer that the conversational assistant (§ L) and any future agents will use. Phase 2.
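As a sense of scale for § Q, registering one governed tool behind MCP is roughly this much code. A minimal sketch assuming the official MCP Python SDK's FastMCP helper; the server name and the catalog-lookup tool are hypothetical stubs.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("afg-catalog")  # hypothetical server name

@mcp.tool()
def lookup_product(sku: str) -> dict:
    """Return basic catalog fields for a SKU (stubbed for illustration)."""
    return {"sku": sku, "title": "Walnut Writing Desk", "in_stock": True}

if __name__ == "__main__":
    mcp.run()  # exposes the tool to any MCP-aware client or agent
```

The governance point is that this layer, not each individual agent, decides which tools exist and what each one is allowed to return.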

Discussion Prompts  ·  For Roadmap Re-Approval
  1. Of the seven revisions above, which two does our team agree with most strongly, and which one are we most likely to push back on?
  2. If we add the three new initiatives (§ O, § P, § Q), which existing initiative do we slow down to free up the team capacity?
  3. What's our honest assessment of the lethal-trifecta risk for any agent we ship in the next 12 months — private data, untrusted content, external communication?
  4. How do we measure whether our eval harness investment is actually paying off, separate from any individual model's metrics?
APP. A

Glossary — key terms used throughout this guidebook.

This glossary is for the executive who wants to follow technical conversations with their team without feeling lost. None of these definitions are rigorous; all are operational.

APP. B

Six role-based reading paths through this guidebook, plus an annual strategy refresh.

The full guidebook is seventeen chapters, including the strategic meta-analysis and the Lindy audit. Few people will read it cover-to-cover. These are six suggested orderings by role, plus a seventh for the annual strategy refresh, depending on what you're trying to accomplish.

Path 1 — The New Tech Executive's First Week

Read in this order: Chapter 15 (Strategic Synthesis) for the big picture and the roadmap. Then Chapter 14 (Product Tagging) because it's the highest-ROI investment and a litmus test of catalog data hygiene. Then Chapter 5 (Price Effect) because pricing is where most retailers leave the most money on the table. Finally Chapter 7 (Collaborative Filtering) as the operational primer for the team's first ship.

Path 2 — The Pricing & Merchandising Track

Chapters 1, 2, 3 for the foundational uncertainty primers, then 5 (Price Effect) and 6 (Synthetic Controls) for the methods, then Chapter 15 for where this fits in the broader roadmap. Skip Computer Vision unless you have a specific catalog-imagery problem.

Path 3 — The Personalization & Search Track

Chapters 1 (Information Theory), 4 (Comparisons), then the full Recommenders trilogy 7, 8, 9. Add 14 (Product Tagging) because catalog structure determines search and recommendation ceiling. Optional but recommended: 3 (Imbalanced Data) for click-prediction modeling.

Path 4 — The Computer Vision & AR Track

Chapters 10, 11, 12, 13 in order. Pair with 4 (Comparisons) if you're building any kind of subjective-rating labeling pipeline (style scoring, photo quality). Read 14 (Product Tagging) because most "vision" problems are actually attribute-extraction problems wearing a different hat.

Path 5 — The Marketing & Causal Track

Chapter 6 (Synthetic Controls) first because geo-experimentation infrastructure is the unlock. Then Chapter 5 (Price Effect) for the broader causal toolkit. Then Chapter 2 (Bayesian) for calibrated uncertainty in attribution. Skip the Vision and Recommender chapters unless you're touching them operationally.

Path 6 — The Onboarding Curriculum For A New Senior IC

Read every chapter in order over five weeks: one part per week for Parts I-IV, plus a final week for Parts V through VII. Use the Discussion Prompts at the end of each chapter as the agenda for weekly 1:1s with your manager. By the end you have a coherent operational map of the entire field as it applies to AFG.

Path 7 — The Annual Strategy Refresh

Once a year — most naturally in Q4 budget planning season — read Chapter 16 (the Lindy audit) first, then Chapter 17 (the roadmap re-examined), then revisit Chapter 15 (the strategic synthesis) with the audit findings in mind. Use Chapter 16's seven interactive figures as the visual material for the strategy presentation. The deliverable is a revised next-year roadmap with explicit notes on which of the original 14 initiatives have been re-spec'd, which have been bumped up or down, and which new initiatives (§ O, § P, § Q) are being added. This is the path that keeps the guidebook from becoming an artifact of its time.