Frame AI is a predictive intelligence system built by our lab for a leading skincare brand. It analyzes 62 observable traits across visual, auditory, and textual dimensions of short-form video content — and returns a performance forecast before the content goes live.
Built for a skincare startup

The skincare brand was producing dozens of reels per week across Instagram and TikTok. Each one required creative direction, production time, editing, and copywriting — a significant spend of both time and money. But the outcome was always a coin flip. Some reels exploded. Most didn't. And there was no reliable way to know in advance which would be which.
The team had intuition. They had rough guidelines — use trending audio, keep it under 15 seconds, open with a hook. But intuition doesn't scale, and guidelines based on anecdotal observation miss the vast majority of what actually drives performance. The brand was spending heavily on content production with no predictive framework to distinguish what was worth producing from what wasn't.
The core question they brought to our lab was deceptively simple: can we know — with real confidence — whether a piece of content is going to work before we spend the resources to publish and promote it?
Frame AI is a supervised predictive model that ingests a reel — either as a finished asset or as a structured pre-production brief — and evaluates it against 62 distinct observable traits that our lab identified as statistically significant drivers of content performance.
These 62 traits span three dimensions. The visual dimension covers attributes like scene composition, color palette dominance, face presence and positioning, motion pacing, text overlay density, and transition frequency. The auditory dimension covers audio type classification, beat alignment with visual cuts, voiceover presence and tone, and trending sound detection. The textual dimension covers hook structure in the opening seconds, caption sentiment, call-to-action placement, hashtag relevance scoring, and keyword alignment with high-performing historical content.
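To make the trait-based evaluation concrete, here is a minimal sketch of how a per-trait measurement could be represented and rolled up by dimension. The trait names, values, and weights below are illustrative assumptions, not the actual 62 traits Frame AI uses.

```python
from dataclasses import dataclass

@dataclass
class TraitScore:
    name: str        # e.g. "text_overlay_density" (hypothetical name)
    dimension: str   # "visual", "auditory", or "textual"
    value: float     # normalized trait measurement in [0, 1]
    weight: float    # learned contribution to the performance forecast

def dimension_summary(traits: list[TraitScore]) -> dict[str, float]:
    """Aggregate weighted trait values per dimension."""
    summary: dict[str, float] = {}
    for t in traits:
        summary[t.dimension] = summary.get(t.dimension, 0.0) + t.value * t.weight
    return summary

# Illustrative scores for three of the many traits described above.
traits = [
    TraitScore("face_presence", "visual", 0.9, 0.8),
    TraitScore("beat_alignment", "auditory", 0.6, 0.5),
    TraitScore("hook_strength", "textual", 0.7, 1.2),
]
summary = dimension_summary(traits)
```

A structure like this makes the per-dimension rollup and the trait-level breakdown described later fall out of the same representation.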
The model was trained on a proprietary dataset of thousands of the brand's historical reels, each labeled with engagement outcomes — view count, watch-through rate, saves, shares, and follower conversion. It learned not just what "good content" looks like in the abstract, but what good content looks like for this specific brand, this specific audience, and this specific platform behavior.
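As a rough illustration of supervised labeling from engagement outcomes, the sketch below folds the listed metrics into a single training target. The specific weighting and normalization are assumptions for the example; the brand's actual labeling scheme is not public.

```python
def engagement_label(views: int, watch_through: float, saves: int,
                     shares: int, follows: int) -> float:
    """Combine raw engagement outcomes into one training target.

    Count metrics are normalized by views so that reels of different
    reach are comparable; the 0.4/0.2 weights are illustrative only.
    """
    if views == 0:
        return 0.0
    return (0.4 * watch_through
            + 0.2 * saves / views
            + 0.2 * shares / views
            + 0.2 * follows / views)
```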
The output is a performance score accompanied by a trait-level breakdown — showing exactly which of the 62 traits are contributing positively, which are dragging performance down, and what specific adjustments would improve the forecast. The brand's content team now uses Frame AI as the final checkpoint before any reel goes to publish.
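The trait-level breakdown can be pictured as a simple split of signed contributions into drivers and drags, sorted by impact. The trait names and contribution values here are hypothetical.

```python
def breakdown(contributions: dict[str, float]) -> dict[str, list[str]]:
    """Split signed trait contributions into positive drivers and
    negative drags, each ordered by magnitude of impact."""
    drivers = sorted((k for k, v in contributions.items() if v > 0),
                     key=lambda k: -contributions[k])
    drags = sorted((k for k, v in contributions.items() if v < 0),
                   key=lambda k: contributions[k])
    return {"drivers": drivers, "drags": drags}

# Illustrative contributions for four traits.
contrib = {"hook_strength": 0.31, "transition_frequency": -0.12,
           "trending_sound": 0.18, "caption_sentiment": -0.04}
result = breakdown(contrib)
```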
Frame AI is not a single model — it's a multi-stage inference pipeline that orchestrates several specialized components to evaluate content across modalities simultaneously.
The visual analysis layer uses a convolutional neural network fine-tuned on short-form video content to extract frame-level features — scene segmentation, color histogram analysis, object detection for product placement and face positioning, and temporal motion flow mapping. These features are then compressed into a fixed-dimensional visual embedding using a custom autoencoder architecture trained specifically on the brand's visual style.
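The compression of frame-level features into one fixed-dimensional clip embedding can be sketched in miniature with mean pooling. The real pipeline uses a custom autoencoder; pooling is a dependency-free stand-in for the example only.

```python
def pool_frames(frame_features: list[list[float]]) -> list[float]:
    """Mean-pool per-frame feature vectors into a single clip
    embedding of the same dimensionality as one frame's features."""
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / n_frames for i in range(dim)]

# Three frames, each with a hypothetical 3-d feature vector.
frames = [[0.2, 0.8, 0.1], [0.4, 0.6, 0.3], [0.6, 0.4, 0.5]]
embedding = pool_frames(frames)
```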
The auditory layer runs parallel processing through a spectrogram-based audio classifier that identifies audio type — original voice, trending sound, licensed music, ambient — and a beat detection system that measures synchronization between audio peaks and visual transitions. A separate sentiment classifier evaluates the emotional tone of any spoken-word or voiceover components.
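The beat-alignment measurement can be expressed as the share of visual cuts that land within a small window of a detected audio beat. The timestamps and tolerance below are illustrative; the actual detector operates on spectrograms.

```python
def beat_alignment(beats: list[float], cuts: list[float],
                   tol: float = 0.1) -> float:
    """Return the fraction of cut timestamps within `tol` seconds
    of at least one beat timestamp."""
    if not cuts:
        return 0.0
    aligned = sum(1 for c in cuts if any(abs(c - b) <= tol for b in beats))
    return aligned / len(cuts)

beats = [0.5, 1.0, 1.5, 2.0]     # hypothetical beat times (seconds)
cuts = [0.52, 1.3, 2.01]         # hypothetical visual cut times
score = beat_alignment(beats, cuts)
```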
The textual layer applies a natural language processing pipeline to captions and on-screen text. This includes hook classification using a fine-tuned transformer trained on high-performing opening lines, sentiment polarity scoring, keyword extraction with TF-IDF weighting against the brand's historical high-performers, and a call-to-action effectiveness model.
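The TF-IDF keyword weighting can be illustrated in a few lines: words that appear in a caption but rarely in the historical corpus score higher. The corpus and caption are made up; the production pipeline uses a full NLP stack.

```python
import math
from collections import Counter

def tf_idf(caption: str, corpus: list[str]) -> dict[str, float]:
    """Score each word in `caption` by TF-IDF against `corpus`,
    using a smoothed IDF to avoid division by zero."""
    words = caption.lower().split()
    tf = Counter(words)
    n_docs = len(corpus)
    scores: dict[str, float] = {}
    for w, count in tf.items():
        df = sum(1 for doc in corpus if w in doc.lower().split())
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[w] = (count / len(words)) * idf
    return scores

# Hypothetical historical captions and a candidate caption.
corpus = ["glow serum routine", "night routine tips", "serum before moisturizer"]
scores = tf_idf("serum routine glow", corpus)
```

Because "glow" appears in only one historical caption while "serum" appears in two, "glow" receives the higher weight.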
All three layers feed into a late-fusion ensemble model — a gradient-boosted decision-tree ensemble that combines the modality-specific embeddings with metadata features like posting time, day of week, and content category tags. This ensemble was trained using a custom loss function that weights watch-through rate and share rate more heavily than raw view count, reflecting the brand's strategic priority of depth of engagement over surface-level reach.
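The two ideas in this step — late fusion and a loss weighted toward depth metrics — can be sketched as follows. The feature vectors and the 0.5/2.0 weights are assumptions; the production model is a trained gradient-boosted ensemble, which these plain functions only approximate.

```python
def fuse(visual: list[float], audio: list[float], text: list[float],
         metadata: list[float]) -> list[float]:
    """Late fusion: concatenate modality embeddings and metadata
    features into one input vector for the ensemble."""
    return visual + audio + text + metadata

def weighted_loss(pred: dict[str, float], actual: dict[str, float]) -> float:
    """Squared error weighted toward watch-through and share rate,
    mirroring the depth-over-reach priority (weights are illustrative)."""
    weights = {"views": 0.5, "watch_through": 2.0, "shares": 2.0}
    return sum(w * (pred[k] - actual[k]) ** 2 for k, w in weights.items())

fused = fuse([0.1, 0.2], [0.3], [0.4], [0.5])
```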
The model was validated with k-fold cross-validation during development and then evaluated on a held-out test set of reels it had never seen. On that test set, it consistently outperformed the brand's internal editorial judgment at predicting which reels would land in the top performance quartile.
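For readers unfamiliar with the validation scheme, k-fold cross-validation partitions the data into k folds and trains k times, each time holding one fold out for validation. A minimal index-splitting sketch:

```python
def k_fold_indices(n: int, k: int) -> list[tuple[list[int], list[int]]]:
    """Return k (train, validation) index splits over n samples.
    Every sample appears in exactly one validation fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits
```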