
Decoding viewer emotions in video ads

Significance of the results

This study tackles the challenge of predicting viewer emotions from video content, an important problem for effective video advertising. Existing research has shown that videos eliciting strong emotional reactions from viewers, whether uplifting or distressing, are approximately twice as likely to be shared and generate higher engagement than less emotionally impactful ones15,16. Consequently, the ability to determine whether a video advertisement will evoke notable emotional responses in viewers is vital for impactful marketing strategies.

To address this problem, we leverage a unique dataset and analytical approach. Our investigation utilizes an annotated database of 30,751 video advertisements, with approximately 75 viewers annotating eight distinct emotions for each video. Unlike existing research that primarily focuses on physiological signals or explicit behavioral indicators, our work emphasizes viewer-reported emotional responses, providing a direct measure of the emotional impact of the video content. We introduce the concept of "emotional jumps": significant shifts in viewer emotional responses identified within brief time intervals of video content. These jumps mark moments where a strong emotional response was evoked across the entire panel of viewers, which helps address issues of subjectivity by capturing a collective emotional resonance. They are crucial for pinpointing the moments that profoundly impact viewer emotions, a vital aspect of understanding the emotional effects of multimedia content. Our approach leverages these emotional jumps to build a video classifier.
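To make the notion concrete, the sketch below illustrates one way such jumps could be flagged, assuming per-second panel responses are available. The exact statistic and the 0.5% cutoff used in our pipeline are specified in the Methods; the function and variable names here are illustrative only.

```python
import numpy as np

def flag_emotion_jumps(panel_props, clip_len=5, top_fraction=0.005):
    """Illustrative sketch: flag 5-second windows with unusually strong shifts.

    panel_props: array of shape (T, 8), the proportion of the viewer panel
                 reporting each of the eight emotions at every second.
    Returns, per emotion, the start times of windows whose within-window
    swing lies in the top `top_fraction` of all windows.
    """
    T, n_emotions = panel_props.shape
    starts = np.arange(0, T - clip_len + 1)
    # Within-window swing: largest change of the response over the window.
    swings = np.stack([
        panel_props[s:s + clip_len].max(axis=0) - panel_props[s:s + clip_len].min(axis=0)
        for s in starts
    ])  # shape: (num_windows, n_emotions)

    jumps = {}
    for e in range(n_emotions):
        cutoff = np.quantile(swings[:, e], 1.0 - top_fraction)
        jumps[e] = starts[swings[:, e] >= cutoff]
    return jumps
```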

A key contribution of this work is the development and application of a multimodal convolutional neural network, intended to establish a strong baseline for predicting emotional jumps from video and audio data. The model makes joint use of both modalities and achieves a mean balanced accuracy of 43.6% for 5-second clips featuring pronounced emotional jumps, and an average Area Under the Curve (AUC) of 75% for full-advertisement analyses. These results are well above the random-guess baseline of 12.5% (one of eight emotions), demonstrating the model’s promising ability to distinguish nuanced emotional states within complex video content.
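For reference, the snippet below shows how these metrics can be computed and why random guessing over eight emotion classes corresponds to 12.5%. The arrays here are placeholders, not our model outputs.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

n_classes = 8
rng = np.random.default_rng(0)

# Placeholder predictions standing in for model outputs on a test split.
y_true = rng.integers(0, n_classes, size=1000)
y_score = rng.random((1000, n_classes))
y_score /= y_score.sum(axis=1, keepdims=True)   # turn scores into probabilities
y_pred = y_score.argmax(axis=1)

# Balanced accuracy averages per-class recall, so uniform random guessing
# over eight classes gives the 1/8 = 12.5% baseline quoted above.
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Macro-averaged one-vs-rest AUC across the eight emotions.
auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
print(f"balanced accuracy: {bal_acc:.3f}, macro AUC: {auc:.3f}")
```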

In addressing the classification challenge presented by our dataset, the model exhibited promising accuracy and highlighted the important role of audio signals in emotion identification. Adjustments in frame count or the use of pre-trained weights from INET21K [1, 2] had minimal impact on the outcomes, suggesting a nuanced balance between model complexity and dataset subjectivity. Additionally, our use of a sliding window technique for predicting emotion jumps within complete video ads showed promising efficacy, particularly for emotions pivotal to advertising effectiveness: Sadness, Happiness, and Fear.

Furthermore, we provide an important contribution by making available an unprecedented dataset for research purposes. The dataset consists of a processed subset of System1’s Test Your Ad collection, comprising 26,635 labeled 5-second clips that encapsulate the core of viewer emotional responses to the video advertisements. To the best of our knowledge, this is the first large-scale dataset to provide self-reported, sample-based emotional labels associated with video and audio data. Together with our open-source code and trained models, these resources represent a significant step towards fostering innovation in the field of affective video analysis, enabling researchers to build upon our work and advance the state-of-the-art.

Connection to other video understanding tasks

Our contributions lie within the domain of automated video understanding, a significant challenge at the intersection of computer vision and artificial intelligence. Video understanding extends traditional image recognition by introducing a temporal dimension, enabling the exploitation of motion and the intricate patterns that unfold over time. Among the multitude of tasks within this domain, human action recognition stands out, involving the identification of various human actions in video frames9,17. In comparison to our research, human action recognition focuses on explicit pattern recognition. Actions are observable, tangible, and can be directly inferred from visual content. Extensive efforts from the community have led to the annotation of numerous videos showcasing diverse human actions, resulting in the creation of high-quality, large-scale datasets that serve as benchmarks in the field18,19,20,21.

Approaching the core of our study, our trajectory intersects with affective video content analysis, a niche area that aims to unravel how videos evoke emotions in humans and to predict these emotional responses to dynamic visual stimuli1,22. Another tangent within video understanding is facial emotion recognition, which involves detecting human faces in videos and decoding facial expressions to infer underlying emotions23,24. It is essential to note that our study differs in focus: instead of relying on direct visual cues from video viewers, we predict emotions evoked by the video content itself, a challenge compounded by the fact that emotions, unlike actions, are intangible; they are not directly displayed on the video canvas but are elicited within viewers.

Choice of neural network architecture

Over the past decade, convolutional neural networks (CNNs) have found success in video understanding, particularly in action recognition7,10. Mapping a video sequence to an outcome involves processing both spatial and temporal information. While 3D CNN architectures have been used for video feature extraction10,25, high-performance 3D CNNs are hindered by their considerable computational cost11, and various solutions have been proposed to address this issue10,18. More recently, transformer-based video networks have shown superior performance on benchmark datasets compared to CNNs26,27. However, transformer-based architectures still come with substantial computational costs and tend to outperform CNNs only when a large amount of data is available28.

In contrast, 2D CNNs offer computational efficiency. While 2D CNNs have been used to learn spatial features from individual frames29,30, they cannot model temporal information on their own. To overcome this limitation, the Temporal Shift Module (TSM)7 was introduced, significantly improving performance on action recognition datasets, even when compared to 3D CNN architectures. TSM employs a ResNet50 architecture to process input frames in parallel and shifts a portion of the feature channels between temporally neighboring frames before the convolutional layers of each block. This feature exchange across neighboring frames greatly enhances the architecture’s ability to learn temporal features.
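The following is a minimal sketch of the shift operation described above, following the published TSM recipe rather than our exact implementation; the fraction of shifted channels (`fold_div`) is a configurable choice.

```python
import torch

def temporal_shift(x: torch.Tensor, n_segments: int, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels to neighboring frames (TSM-style).

    x: features of shape (batch * n_segments, channels, H, W).
    """
    nt, c, h, w = x.shape
    x = x.view(nt // n_segments, n_segments, c, h, w)
    fold = c // fold_div

    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # one chunk looks ahead in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # one chunk looks back in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels unchanged
    return out.view(nt, c, h, w)
```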

In our work, we leverage recent advancements in this area and extend existing state-of-the-art architectures to suit our problem setting. We chose to adapt TSM for the classification of emotion jumps due to the trade-off between available data, computational efficiency, and classification performance. Furthermore, both visual and auditory stimuli contribute to a video’s affective content31. Therefore, we aimed to jointly account for video and audio content when developing our convolutional neural network (CNN) architecture (see the appendix for more details).
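As an illustration of this joint design (detailed in the appendix and Fig. 3), the sketch below shows one way the shared-backbone fusion could look. `TSAMSketch` is a placeholder name, the mel-spectrogram is assumed to be formatted as a three-channel image, and the temporal shifts inside the backbone are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TSAMSketch(nn.Module):
    """Illustrative joint video-audio classifier with a shared ResNet50 backbone."""

    def __init__(self, n_classes: int = 8):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()                # keep the 2048-d pooled features
        self.backbone = backbone
        self.classifier = nn.Linear(2048, n_classes)

    def forward(self, frames: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) sampled RGB frames; mel: (B, 3, H, W) mel-spectrogram.
        b, t, c, h, w = frames.shape
        video_feats = self.backbone(frames.reshape(b * t, c, h, w)).view(b, t, -1)
        audio_feats = self.backbone(mel).unsqueeze(1)          # audio branch, no shifting
        fused = torch.cat([video_feats, audio_feats], dim=1).mean(dim=1)  # average fusion
        return self.classifier(fused)
```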

Connections to related studies and datasets

Affective video content analysis exists within a diverse landscape of datasets, yet these datasets often lack the depth and breadth found in other domains such as action recognition, which boasts ample benchmarks32. One notable dataset in this realm is the DEAP dataset33, which provides insights into reactions to music videos through both physiological signals and viewer ratings. However, it is limited in scale, containing only 120 one-minute videos. Other datasets, such as the WikiArt Emotions Dataset34 and the Discrete LIRIS-ACCEDE dataset35, take different approaches, focusing respectively on art emotion annotation through crowdsourcing and on affective video annotations using a pairwise comparison protocol.

Some research in affective video content analysis incorporates external mechanisms like facial expression imaging or physiological measurement devices to discern viewer reactions to videos36,37,38. An example of this approach is the EEV dataset39, which utilizes viewers’ facial reactions for automatic emotion annotation in videos, albeit with the challenge of converting facial reactions into distinct emotion labels. Moreover, datasets like the EIM16 dataset, derived from the LIRIS-ACCEDE database, and the Extended COGNIMUSE dataset emphasize different aspects of emotional annotation. The EIM16 dataset, for instance, is designed for both short video excerpt emotion prediction and continuous emotion annotation for longer movie clips, while the COGNIMUSE dataset delves into the distinction between intended and expected emotions.

In comparison to these datasets, our dataset exhibits distinct features. Rooted in manual annotations, it ensures a richer and more personalized emotion spectrum. What sets our collection apart is its remarkable scale, comprising over 30,000 video ads; only the LIRIS-ACCEDE dataset comes close in magnitude. Furthermore, our dataset benefits from the consistency of cultural backgrounds among annotators, offering a unique perspective; this contrasts with datasets like LIRIS-ACCEDE, which source annotations from a global participant pool. The diversity of the emotional spectrum our dataset covers, spanning eight distinct emotions, establishes it as a robust and comprehensive resource for in-depth affective video content analysis.

Limitations of this study

While our dataset represents an important step forward, we must acknowledge the inherent subjectivity in manual emotion annotation based on viewer recall. Labeling sentiments poses more reproducibility challenges compared to labeling relatively objective actions. However, we can reasonably expect more consistent labeling for pronounced emotional reactions, where a large majority of viewers concur within a short span, as opposed to subtle responses.

To help mitigate subjectivity, we introduced "emotion jumps": brief clips that trigger particularly intense responses tied to specific emotions. By focusing analysis on these pronounced spikes, we aimed to capture the most vivid and consistent reactions across viewers. We restructured the dataset into a video classification format by selecting, for each category, the clips that provoked heightened emotional reactions. This process yielded labels that are more consistent and reliable than those for subtle responses, enabling our model to discern emotions evoked by videos from patterns in these pronounced responses.

Our approach analyzes full-length videos by extracting overlapping 5-second segments. This sliding window technique aims to capture even brief, transient emotional peaks that non-overlapping segments would likely overlook; focusing solely on complete videos risks missing these potent yet ephemeral reactions. Overlapping windows do introduce potential drawbacks, such as redundancy and overemphasis on momentary blips rather than overall sentiment. Ultimately, however, the sliding window methodology strikes an effective balance: it combines fine-grained localization of emotional bursts with consolidation to assess overall video sentiment. This twin perspective enables a nuanced evaluation attuned to the dynamics of emotion in videos, whereas alternative methods may neglect critical nuances, either losing transient peaks in holistic views or lacking broader context from non-overlapping segments. Our approach fuses detailed and big-picture analysis to provide the layered insight essential for video emotion understanding.
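A minimal sketch of this inference scheme is shown below, assuming a hypothetical per-clip scoring function `score_clip`. The actual stride and the rule for consolidating window scores into a video-level prediction follow the Methods; the max-over-windows aggregation here is only one reasonable choice.

```python
import numpy as np

def score_video(video_seconds: int, score_clip, clip_len: int = 5, stride: int = 1):
    """Score overlapping clips and consolidate them into a video-level score.

    score_clip(start, end) is assumed to return an array of eight
    emotion-jump probabilities for the clip [start, end) in seconds.
    """
    starts = range(0, max(video_seconds - clip_len, 0) + 1, stride)
    clip_scores = np.stack([score_clip(s, s + clip_len) for s in starts])
    # Taking the per-emotion maximum over windows preserves brief, transient
    # peaks that averaging over non-overlapping segments would dilute.
    return clip_scores.max(axis=0)
```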

While our study employs a CNN-based model complemented by the Temporal Shift Module to establish a straightforward and computationally tractable baseline, we acknowledge the potential benefits of more advanced architectures, such as those incorporating attention mechanisms or visual transformers. These architectures have demonstrated superior performance in various video analysis tasks, and exploring their application to video emotion recognition presents a promising avenue for future research. Building upon the foundation established in this study, we plan to investigate the potential of attention-based models and visual transformers to further refine our understanding of video-induced emotions, while carefully evaluating the trade-offs between model complexity, computational requirements, and performance gains, considering the unique characteristics of our dataset and task.

Conclusions

Our prototype represents an original advancement in emotion recognition, leveraging deep learning to predict emotional reactions directly from video content. By analyzing multimedia signals to identify moments that elicit strong responses, our model enables advertisers to create more impactful campaigns that resonate with target audiences. Marketing professionals can use the evoked-emotion recognition model to rapidly analyze vast video databases and identify the moments that trigger powerful emotional responses. These emotionally charged segments serve as valuable resources for marketers, enabling them to craft compelling and influential advertisements that resonate with viewers. The implications are manifold: not only enhancing marketing effectiveness but also fostering deeper and more meaningful engagement with the intended audience. Beyond advertising, applications extend to luxury brand marketing, entertainment analytics, and empathetic AI systems.

There are numerous avenues for future exploration and improvement. A pressing need exists to elucidate our model’s inner workings and decision-making rationale. As it currently operates as a black box, techniques that highlight the precise emotion-evoking frames, or elements within those frames, would provide valuable insights. Illuminating which visual motifs or ambiance the model associates with certain emotions could allow advertisers to refine strategies for maximal emotional impact. Diversifying the cultural demographics in our dataset is another important area for improvement. While currently spanning the UK and US, expanding to broader global regions could enhance applicability. Emotional triggers and norms differ significantly across cultures; for instance, reactions in Asia or South America may deviate markedly from the current Anglo-centric data. By diversifying cultural representation, our model could learn more universal emotional patterns.

Table 1 System1’s Test Your Ad data: distribution of video ads by duration (in seconds).
Fig. 1

Facial expressions used by System1 Group PLC’s FaceTrace method during the video annotation process. (Source: System1 Group PLC, reproduced with permission).

Fig. 2

System1’s Test Your Ad: At each time point throughout a video clip, we can measure the proportion of viewers in the panel (approximately \(n=75\)) who self-declared experiencing one of the eight emotions. This example illustrates the changes in emotional profiles within a video.

Fig. 3

TSAM model: the multimodal CNN architecture takes as input a predefined number of video frames (video segments) and audio converted into mel-spectrograms (audio segments). The ResNet50 backbone is used to extract features from both the video and audio segments. Features from video segments are shifted between each other at different blocks of ResNet50, while the audio input, represented by the mel-spectrogram, is processed by the same backbone without shifting. The extracted features are fused by averaging and mapped to the output classes using a fully connected layer.

Fig. 4

System1’s Test Your Ad data: Average number of user clicks per emotion for every 30 seconds of video, adjusted to account for different video lengths.

Fig. 5

System1’s Test Your Ad dataset: Distribution of 5-second clips based on the percentage of users expressing various emotions. The x-axis shows the response strength as the percentage of viewers feeling the emotion; the y-axis shows the percentage of clips evoking that response level. Clips in the top 0.5% of the distribution (highlighted in red) were used to define the emotional jumps, which were labeled for classifier training and testing.

Table 2 System1’s Test Your Ad data: Number of labeled 5-second video clips and corresponding number of videos, categorized by emotion.
Fig. 6

System1’s Test Your Ad data: distribution of number of emotion clicks per user per video, not adjusted for video duration.

Table 3 System1’s Test Your Ad data: Average accuracy of the video classifier broken down by input modality, number of video frames and pre-training method.
Table 4 System1’s Test Your Ad data: Balanced accuracy and test set size achieved by the video classifier broken down by emotion.
Fig. 7

ROC curves for predicting the presence of emotion jumps in full-length video ads using our best CNN model (16 frames, RGB + audio input, pretrained on INET21K).