A model trained on standard face recognition benchmarks achieves 99.66% accuracy. Test that same model on a different dataset, faces extracted from films and TV shows, and it drops to 19.42%.
Train on film data instead? Same pattern, reversed: high accuracy on film faces, low accuracy on the standard benchmarks. The problem isn’t that film content is harder. The problem is that these models don’t generalize. They learn features specific to whatever dataset they’re trained on, not features that represent human faces.
FilmFace is a research project exploring whether you can build face recognition models that generalize across domains, not just to held-out samples from the same distribution.
Why I Started
Back in the 2000s, Google Photos had face recognition that worked surprisingly well for its time. Then it got dramatically worse overnight. I assumed either people found it creepy, it was being misused, or someone decided that level of capability shouldn’t be public. No evidence for any of that, but it caught my attention as a research area worth understanding.
When I started working in AI professionally, I wanted a project that would teach me something real. Not MNIST, not a toy problem. Face recognition offered exactly that: large datasets, complex architectures, difficult trade-offs. It has delivered on all counts.
Building the Dataset
FilmFace isn’t a public benchmark. I built it: 525,000 face images from 613 actors extracted from films and TV shows. Creating a usable dataset from video content is complex: frame extraction, face detection, quality filtering, identity labeling, handling edge cases like prosthetics and aging makeup.
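To give a concrete flavor of the quality-filtering step: one common heuristic is rejecting face crops that are too small or too blurry, using the variance of the Laplacian as a blur score. This is an illustrative NumPy sketch, not the project’s actual pipeline; the threshold values are assumptions.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Blur score: variance of the Laplacian response. Low values suggest
    a blurry crop (little high-frequency detail)."""
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    # Valid (unpadded) 3x3 convolution, written as a shifted sum
    for i in range(3):
        for j in range(3):
            out += k[i, j] * gray[i:i + h - 2, j:j + w - 2]
    return float(out.var())

def keep_face(crop: np.ndarray, min_side: int = 64, blur_thresh: float = 100.0) -> bool:
    """Reject face crops too small or too blurry to label reliably.
    Both thresholds are illustrative, not tuned values from FilmFace."""
    if min(crop.shape[:2]) < min_side:
        return False
    return laplacian_variance(crop.astype(np.float64)) >= blur_thresh
```

In a real pipeline this filter would sit between face detection and identity labeling; edge cases like prosthetics and aging makeup still need manual review.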
The public datasets aren’t much easier to work with. GLINT360K has 17 million images across 360,000 identities, but web-scraped data is messy: mislabeled identities, duplicate images, corrupted files, inconsistent quality. Getting these datasets into a state where training results are meaningful takes substantial cleanup work before any experiments can run.
What I Found
The published models excel at their benchmarks. Train on GLINT360K, test on held-out GLINT360K data: 99.66% accuracy. On paper, the problem looks solved.
But test that same model on FilmFace and accuracy craters to 19.42%. Train on FilmFace and test on GLINT? Same story in reverse. The models learn features specific to their training distribution (lighting patterns, image quality characteristics, pose distributions) rather than generalizable representations of faces.
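A cross-domain evaluation of this kind can be sketched with a rank-1 identification protocol: embed a gallery and a set of probes from the target domain with the frozen model, then match each probe to its nearest gallery entry. The function below is a minimal NumPy illustration under assumed inputs (precomputed, L2-normalized embeddings); the exact protocol behind the 19.42% figure isn’t specified here.

```python
import numpy as np

def rank1_accuracy(gallery: np.ndarray, gallery_ids: np.ndarray,
                   probes: np.ndarray, probe_ids: np.ndarray) -> float:
    """Rank-1 identification accuracy: each probe embedding is matched to
    the gallery embedding with the highest cosine similarity.
    Assumes all rows are L2-normalized, so the dot product is cosine."""
    sims = probes @ gallery.T          # (n_probes, n_gallery) similarities
    nearest = sims.argmax(axis=1)      # index of best gallery match
    return float((gallery_ids[nearest] == probe_ids).mean())
```

The cross-domain number is simply this metric computed on embeddings from a dataset the model never trained on; no threshold tuning is involved, which makes the gap hard to explain away as calibration.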
When I first saw the cross-domain numbers, I assumed I’d made a mistake. The drop was too severe. I spent weeks verifying the evaluation pipeline, checking for data leakage, testing different configurations. The numbers held. These models memorize their training distribution rather than learning transferable features.
What I’ve Tested
Hundreds of training runs across architectures, loss functions, hyperparameters, and augmentation strategies. Each hypothesis requires systematic experimentation; a single architecture comparison might involve dozens of runs to control for learning rate, batch size, and regularization effects.
Classifier-based vs metric learning. Most published face recognition uses classifier-based training: teach the model to categorize known identities, then use the learned embeddings for new faces. I tested CircleLoss, ArcFace, and AdaFace, all classifier-based, and all hit the same ceiling of roughly 20% cross-domain accuracy. The current experiments use pure metric learning (supervised contrastive), which trains directly on embedding similarity without a classifier. Early results show improved cross-domain accuracy; hyperparameter tuning is still underway.
Architecture matters, but not how you’d expect. ConvNeXt (a modern CNN) outperformed ResNet by 2x on cross-domain accuracy. But ConvNeXt v2, published as an improvement over v1, performed 27-48% worse at every model size. The v2 improvements (Global Response Normalization, self-supervised pretraining) help ImageNet classification but hurt face recognition. The self-supervised features are optimized for image reconstruction, not identity discrimination.
Scaling is non-monotonic. A 50M parameter model outperformed 89M, 197M, and 350M versions of the same architecture. The larger models have capacity to memorize training identities rather than learn generalizable features. There appears to be a sweet spot relative to the number of training identities.
Hybrid approaches underperform. Adding metric learning regularization (triplet loss) to classifier-based training (ArcFace) improved cross-domain accuracy by only 0.8 percentage points. The classifier objective dominates. If pure metric learning is the answer, it needs to be pure.
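A hybrid objective of this shape can be sketched as a weighted sum of an ArcFace cross-entropy term and a triplet term. Everything here is illustrative: the scale `s`, angular margin `m`, triplet margin, and weight `lam` are assumed values, and the triplets are presumed already mined.

```python
import numpy as np

def arcface_ce(emb, W, labels, s=64.0, m=0.5):
    """ArcFace: add angular margin m to the target-class logit, then take
    scaled softmax cross-entropy. `emb` rows and `W` columns L2-normalized."""
    cos = emb @ W                                   # (n, num_classes) cosines
    theta = np.arccos(np.clip(cos, -1 + 1e-7, 1 - 1e-7))
    n = len(emb)
    logits = s * cos.copy()
    logits[np.arange(n), labels] = s * np.cos(theta[np.arange(n), labels] + m)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(n), labels].mean())

def triplet(anchor, positive, negative, margin=0.3):
    """Triplet regularizer on Euclidean distances between embeddings."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)
    d_an = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())

def hybrid_loss(emb, W, labels, anchor, positive, negative, lam=0.1):
    # With a small lam the classifier term dominates the gradient signal,
    # which is consistent with the tiny improvement observed above.
    return arcface_ce(emb, W, labels) + lam * triplet(anchor, positive, negative)
```

One way to read the +0.8-point result: when the ArcFace term supplies almost all of the loss, the triplet term can only nudge an embedding space already shaped by the classifier, rather than reshape it.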
Why Benchmarks Persist
If these benchmarks don’t predict real-world performance, why does the field use them?
Academic research is detached from practical applications in this area. Researchers focus on the math and getting published, not deployment. Transfer learning has only recently become a research focus in itself, and usually isn’t part of the original hypothesis being tested.
Face recognition is also controversial (the technology can be, and is, abused) and lucrative (meaning the best work happens inside companies and doesn’t get published). The academic literature reflects what’s easy to benchmark, not what matters in production.
Where It Stands
The project is active. I’ve moved beyond classifier-based approaches (which hit a ceiling around 20% on my test set) to pure metric learning methods. I’m also running experiments on larger datasets to see if the patterns hold at scale.
The core question remains open: can you train a model that genuinely generalizes to unseen identities in unseen domains? The published literature suggests yes. My experiments so far suggest “not without significant changes to how these models are trained.”
If You’re Evaluating Face Recognition
If a vendor shows you 99% accuracy on standard benchmarks, the right question is: how similar is their training data to my deployment environment?
If your data looks like their training data (similar lighting, image quality, demographics), those numbers might hold. If your environment differs significantly, ask for a demo on your data. The gap between benchmark performance and cross-domain performance is larger than most people assume.