Designing data for development
Training a model requires data. In the era of big data, deep learning, and large compute, full production datasets are often unwieldy and cost‑prohibitive to iterate on.
The first step in building any model is to assemble a development dataset that is workable. A workable dataset looks and behaves like the full dataset but is small enough to load quickly, explore interactively, and train on repeatedly. Equally important, that dev dataset must be representative: it should preserve the distributions and relationships that matter so the model you build on it will translate to the real thing.
This is where experimental design belongs in your daily ML workflow.
- Randomization helps you avoid accidental bias.
- Grouping (blocking) keeps tightly related records together to prevent leakage.
- Stratification preserves critical proportions (heavy vs light users, rare vs common items, class balance, etc.).
If you apply those three ideas when you carve out your development data, you’ll end up with something that is both pleasant to model on and predictive of how your final model will behave in production.
A relatable dataset: MovieLens 100K
To make this concrete, we’ll use the MovieLens 100K dataset, a classic recommendation benchmark with 100,000 ratings across users and movies. It’s small, well‑documented, and perfectly suited to demonstrate why grouping and stratification matter when you split data.
- It has natural groups like userId and movieId.
- It has long‑tailed activity and popularity (some users rate a lot, some movies get most of the ratings).
- It has ratings you may want to stratify (e.g., the 4–5 star skew) and user/movie activity you may want to balance.
You can download it directly from the GroupLens site: MovieLens 100K. We’ll treat ratings as a tidy table with columns like userId, movieId, rating, and timestamp.
Why experimental design belongs in data splitting
Think about building dev data like an election poll. You can’t (and shouldn’t) run a full national vote every time you want to see how a message lands. Pollsters run smaller, cheaper samples that still mirror the electorate. They don’t sample at random alone—they stratify by geography, age, party, and more; and they keep households together when needed to avoid weird dependencies. With that care, a tiny fraction of the population gives you a read that’s close to the real outcome.
Your dev dataset is the poll before the production election. You want it smaller (to be fast and cheap) but still representative (so your model’s performance and failure modes on dev reflect reality). Randomization, grouping, and stratification are your polling toolkit for ML.
Walkthrough: from raw data to a modelable dev split
We’ll walk through three increasingly careful steps:
- Start with a small random sample (workability)
- Make the split group‑aware so no group leaks across train/validation (blocking)
- Add stratification so key proportions are preserved (representativeness)
We’ll also point out what you’re testing with each choice (e.g., unseen users versus unseen items) so you can align the split to your modeling objective.
0) Load the ratings
Below is a simple way to load ratings with pandas. Replace PATH_TO_ML_100K with the location where you unpacked the dataset.
import pandas as pd
# The raw u.data file has tab‑separated columns: userId, movieId, rating, timestamp
ratings_path = "PATH_TO_ML_100K/u.data"  # e.g., "/Users/you/data/ml-100k/u.data"
ratings = pd.read_csv(
    ratings_path,
    sep="\t",
    names=["userId", "movieId", "rating", "timestamp"],
    header=None,
)
ratings.head()
At this point you have ~100k rows. That’s already small, but most production datasets will be much larger—pretend this is your giant table.
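If you want to confirm what you just loaded before carving anything out, a quick look at the shape, the cardinalities, and the rating distribution is enough. A minimal sketch, using the column names defined above:
# Quick sanity check: row count, distinct users/movies, and the rating distribution
print(ratings.shape)
print(ratings["userId"].nunique(), "users,", ratings["movieId"].nunique(), "movies")
print(ratings["rating"].value_counts(normalize=True).sort_index())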
1) Make a small random sample
Random sampling gives you a quick, workable slice that keeps a bit of everything. The goal is speed: this is the slice you’ll iterate on while prototyping features and model architectures.
# 10–20% is a good starting point for iterative work
dev_frac = 0.2
dev = ratings.sample(frac=dev_frac, random_state=42)
# Keep the rest around as a holdout or for later expansion of the dev set
rest = ratings.drop(dev.index)
dev.shape, rest.shape
This is fine for exploration, but random sampling alone can leak information between train and validation later. For recommenders, the same user or the same movie appearing on both sides often inflates performance (the model “recognizes” them). We fix that next.
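Before fixing it, you can see the overlap directly. A quick sketch reusing the dev and rest frames from above; with a random row split, nearly every user and movie shows up on both sides:
# With a random row split, most users and movies appear on both sides
shared_users = set(dev["userId"]) & set(rest["userId"])
shared_movies = set(dev["movieId"]) & set(rest["movieId"])
print(len(shared_users), "shared users,", len(shared_movies), "shared movies")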
2) Group‑aware split (blocking) to prevent leakage
Decide what you want to generalize to:
- If you care about recommending to unseen users (a cold‑start‑ish stress test), split by userId so each user appears in only one of train or validation.
- If you care about recommending new or tail items, split by movieId so items don’t leak across splits.
We’ll show a user‑grouped split. The same pattern works for items.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
groups = dev["userId"].values
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(gss.split(dev, groups=groups))
train_dev = dev.iloc[train_idx].reset_index(drop=True)
val_dev = dev.iloc[val_idx].reset_index(drop=True)
# Sanity: no overlapping users
overlap_users = np.intersect1d(train_dev.userId.unique(), val_dev.userId.unique())
len(overlap_users)
Now train and validation are cleanly separated by user. You can confirm there’s no user overlap, which reduces leakage and gives you a truer read on how well the model handles new users.
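One caveat worth knowing: test_size=0.2 here allocates 20% of the users, not 20% of the rows, so the realized row fraction drifts with how active the held‑out users happen to be. A quick check, as a sketch:
# The split fraction applies to groups (users); the row-level fraction can differ
row_frac = len(val_dev) / (len(train_dev) + len(val_dev))
user_frac = val_dev["userId"].nunique() / dev["userId"].nunique()
print(f"validation holds {user_frac:.1%} of users and {row_frac:.1%} of rows")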
3) Stratified split to preserve key proportions
Group‑aware is great, but you can still skew the data accidentally. For example, if heavy raters (power users) end up mostly in train, validation will look harder than it should. The fix is to stratify at the group level.
Scikit‑learn’s shuffle splitters won’t take both a groups argument and a stratification target in one call, so we do it in two steps: build group‑level features, then stratify on those while selecting groups.
We’ll bin users by activity (number of ratings) and stratify on those bins so train/validation have similar proportions of light/medium/heavy users. Then we’ll expand those chosen users back to their ratings.
from sklearn.model_selection import StratifiedShuffleSplit
# Step A: compute user activity on the dev sample
user_counts = (
dev.groupby("userId", as_index=False)["movieId"].count().rename(columns={"movieId": "numRatings"})
)
# Step B: bin users by activity (tune bins to your data)
bins = [0, 10, 50, np.inf]
labels = ["light", "medium", "heavy"]
user_counts["activityBin"] = pd.cut(user_counts["numRatings"], bins=bins, labels=labels, right=False)
# Step C: stratified sample users into train/val
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
user_idx_train, user_idx_val = next(
sss.split(user_counts[["userId"]], user_counts["activityBin"]) # X is ignored, y are bins
)
train_users = set(user_counts.iloc[user_idx_train]["userId"].tolist())
val_users = set(user_counts.iloc[user_idx_val]["userId"].tolist())
# Step D: expand back to ratings (group‑aware by user, stratified on activity)
train_dev = dev[dev["userId"].isin(train_users)].reset_index(drop=True)
val_dev = dev[dev["userId"].isin(val_users)].reset_index(drop=True)
train_dev.shape, val_dev.shape
You’ve now created a split that:
- Keeps users entirely in train or validation (no leakage)
- Preserves the balance of light/medium/heavy raters across splits (representative)
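To confirm the second point, compare the activity‑bin mix of users on each side. A sketch reusing user_counts, train_users, and val_users from above; the two columns should be close:
# Compare the light/medium/heavy user mix across the two splits
bin_mix = pd.DataFrame({
    "train": user_counts[user_counts["userId"].isin(train_users)]["activityBin"].value_counts(normalize=True),
    "val": user_counts[user_counts["userId"].isin(val_users)]["activityBin"].value_counts(normalize=True),
})
print(bin_mix.round(3))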
If item popularity matters more for your use case, repeat the same approach grouped on movieId and stratify on item popularity bins instead.
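For completeness, here is what that item‑side variant could look like, mirroring the user‑side code; the bin edges and variable names are illustrative, not prescriptive:
# Item-side variant: group by movieId, stratify on popularity bins (illustrative edges)
movie_counts = (
    dev.groupby("movieId", as_index=False)["userId"].count().rename(columns={"userId": "numRatings"})
)
movie_counts["popularityBin"] = pd.cut(
    movie_counts["numRatings"], bins=[0, 5, 30, np.inf], labels=["tail", "mid", "head"], right=False
)
sss_items = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
item_train_idx, item_val_idx = next(sss_items.split(movie_counts[["movieId"]], movie_counts["popularityBin"]))
train_movies = set(movie_counts.iloc[item_train_idx]["movieId"])
val_movies = set(movie_counts.iloc[item_val_idx]["movieId"])
train_dev_items = dev[dev["movieId"].isin(train_movies)].reset_index(drop=True)
val_dev_items = dev[dev["movieId"].isin(val_movies)].reset_index(drop=True)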
Optional: quick model sanity check with Surprise
If you want to do a smoke test, you can fit an SVD model using Surprise. This isn’t about chasing the best metric—it’s about verifying your dev split trains quickly and the validation results are plausible.
from surprise import Reader, Dataset, SVD
from surprise.model_selection import train_test_split as surprise_split
from surprise import accuracy
# Surprise expects the columns in (user, item, rating) order and a Reader with the rating scale
reader = Reader(rating_scale=(1, 5))
train_data = Dataset.load_from_df(train_dev[["userId", "movieId", "rating"]], reader)
val_data = Dataset.load_from_df(val_dev[["userId", "movieId", "rating"]], reader)
trainset = train_data.build_full_trainset()
valset = val_data.build_full_trainset().build_testset()
algo = SVD(random_state=42, n_factors=50)
algo.fit(trainset)
preds = algo.test(valset)
rmse = accuracy.rmse(preds, verbose=False)
rmse
Again, the goal here is speed and signal—not leaderboard chasing. If this dev split is healthy, you’ll see stable validation metrics across reruns and only modest drift when you expand the sample size.
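One cheap way to probe that stability is to refit with a few different seeds and watch how much the RMSE moves. A sketch; the seed values are arbitrary:
# Refit with a few seeds; a healthy split gives RMSEs that barely move
for seed in (0, 1, 2):
    algo_check = SVD(random_state=seed, n_factors=50)
    algo_check.fit(trainset)
    print(seed, round(accuracy.rmse(algo_check.test(valset), verbose=False), 4))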
Practical checks and pitfalls
- Check leakage: confirm no overlap in the grouping key (userId or movieId) across train and validation.
- Check distributional similarity: compare rating histograms, user activity, and item popularity between splits.
- Beware cold‑start extremes: a user‑grouped split is strict (all users in validation are unseen). If that’s too harsh for your objective, consider time‑based splits or allow mild overlap but block only the most leakage‑prone entities.
- Keep time in mind: for temporal systems, a forward‑chaining (past→future) split may be more faithful than pure random; see the sketch after this list.
- Iterate the fraction: start at 10–20% for dev. If your training loop is too slow, go smaller; if your metrics are too noisy, go bigger.
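If a time‑aware split fits your objective better, one simple version is a global timestamp cutoff: train on the past, validate on the future. A sketch; the 80% quantile is an arbitrary choice:
# Forward-chaining split: everything before the cutoff trains, everything after validates
cutoff = dev["timestamp"].quantile(0.8)
train_time = dev[dev["timestamp"] <= cutoff].reset_index(drop=True)
val_time = dev[dev["timestamp"] > cutoff].reset_index(drop=True)
print(len(train_time), "past rows,", len(val_time), "future rows")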
The big picture
Like polling before an election, a thoughtfully designed dev dataset lets you move fast and still learn the truth. Randomization gives you breadth, grouping prevents leakage, and stratification keeps the proportions that matter. Put together, you get a small dataset that is both modelable and representative—so the model you build on Tuesday behaves like the model you ship on Friday.