Enhanced Media Experience with AI-Powered Commercial Detection and Replacement
Introduction
Leveraging the capabilities of BeagleBoard’s powerful processing units, the project will focus on creating a real-time, efficient solution that enhances media consumption experiences by seamlessly integrating custom audio streams during commercial breaks.
Summary links
- Contributor: Aryan Nanda
- Mentors: Jason Kridner, Deepak Khatri
- GSoC Repository: TBD
Status
This project is currently just a proposal.
Proposal
- Created accounts across OpenBeagle, Discord and Beagle Forum
- Pull request for cross-compilation: #185
- Created a project proposal using the proposed template
About
- Resume - Find my resume here
- Forum: :fab:`discourse` u/aryan_nanda
- OpenBeagle: :fab:`gitlab` aryan_nanda
- Github: :fab:`github` AryanNanda17
- School: :fas:`school` Veermata Jijabai Technological Institute (VJTI)
- Country: :fas:`flag` India
- Primary language: :fas:`language` English, Hindi
- Typical work hours: :fas:`clock` 9AM-5PM Indian Standard Time
- Previous GSoC participation: :fab:`google` This would be my first time participating in GSoC
About the Project
Project name: Enhanced Media Experience with AI-Powered Commercial Detection and Replacement
Description
I propose developing GStreamer plugins capable of processing video inputs based on their classification. The plugins will identify commercials and either replace them with alternative content or obscure them, while also substituting the audio with predefined streams. This enhancement aims to improve the media consumption experience by eliminating unnecessary interruptions. I intend to explore various video classification models to achieve accurate detection and to use TensorFlow Lite to leverage the native accelerators of the BeagleBone AI-64 for high-performance, real-time inferencing with minimal latency. I believe real-time, high-performance inference is the most critical requirement for this project, and I intend to test a few different approaches to see which one works best.
Goals and Objectives
The goal of this project is to detect and replace commercials in video streams on BeagleBoard hardware using a GStreamer pipeline that includes a model able to detect commercials accurately with minimal latency. I will compare the accuracy of different video classification models through manual analysis and experimentation, and the best-performing option will be integrated into the GStreamer pipeline for inference on real-time video; this will be the result presented at the end of the project timeline. For the Phase 1 evaluation, the goal is to build a training dataset, preprocess it, and fine-tune and train a video classification model to identify commercial segments in a video accurately. For the Phase 2 evaluation, the goal is to take the best model identified in Phase 1 and build a GStreamer pipeline that performs video processing based on the classified commercial segments, using the native accelerators present in the BeagleBone AI-64 for high performance.
To accomplish this project, the following objectives need to be met.

- Phase 1:

  - Develop a dataset of videos with corresponding labels indicating which segments contain commercials.
  - Preprocess the dataset to ensure it is suitable for input into deep learning models, and divide it into training, validation, and test sets.
  - Fine-tune several deep learning models and train them on the prepared dataset to identify the most accurate one for commercial detection in videos.
  - Save all trained models to local disk and perform real-time inference using OpenCV to determine which model yields the best results with high performance.

- Phase 2:

  - Based on the options tried in Phase 1, decide on the final model to be used in the GStreamer pipeline.
  - Compile the model and generate artifacts so that it can be used with the TFLite runtime (see the inference sketch after this list).
  - Build a GStreamer pipeline that takes real-time media input and identifies the commercial segments in it.
  - When a commercial segment is identified, the GStreamer pipeline will either replace it with alternative content or obscure it, while also substituting the audio with predefined streams.
  - I will also try to cut the commercial out completely and splice the ends.
  - Enhance real-time performance using the native hardware accelerators present in the BeagleBone AI-64.
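As a rough illustration of the last two objectives, the sketch below shows one way a compiled model might be loaded in the TFLite runtime with TI's TIDL delegate on the BeagleBone AI-64. The delegate library name, option keys, and file paths are assumptions based on TI's edgeai-tidl-tools and would need to match the actual artifacts generated during compilation.

```python
# Minimal sketch, not a final implementation: run a clip through a model that
# was compiled with TI's edgeai-tidl-tools. MODEL_PATH, ARTIFACTS_DIR, the
# delegate library name, and the option key are placeholders/assumptions.
import numpy as np
import tflite_runtime.interpreter as tflite

MODEL_PATH = "commercial_classifier.tflite"   # hypothetical compiled model
ARTIFACTS_DIR = "./tidl_artifacts"            # hypothetical TIDL artifacts dir

# Offload supported ops to the C7x/MMA accelerators via the TIDL delegate.
delegate = tflite.load_delegate(
    "libtidl_tfl_delegate.so",
    {"artifacts_folder": ARTIFACTS_DIR},
)
interpreter = tflite.Interpreter(
    model_path=MODEL_PATH,
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def classify_clip(clip):
    """Classify one clip shaped like the model input (e.g. [1, T, H, W, 3])."""
    interpreter.set_tensor(input_details[0]["index"], clip.astype(np.float32))
    interpreter.invoke()
    probs = interpreter.get_tensor(output_details[0]["index"])[0]
    return "commercial" if probs[0] > 0.5 else "non-commercial"
```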
Methods
In this section, I describe in greater detail the methods I plan to use: the training dataset, the models, the GStreamer pipeline, and so on.
Building the Training Dataset and Preprocessing
To train the model effectively, we need a dataset with accurate labels. Since a suitable commercial video dataset isn't readily available, I'll create one. This dataset will consist of two classes: commercial and non-commercial. By dividing the dataset into Commercial and Non-Commercial segments, I am focusing more on "Content Categorization". Separating the dataset into commercials and non-commercials allows our model to learn distinct features associated with each category. For commercials, this might include fast-paced editing, product logos, specific jingles, or other visual/audio cues. Non-commercial segments may include slower-paced scenes, dialogue, or narrative content.
To build this dataset, I'll refer to the YouTube-8M dataset, which includes videos categorized as TV advertisements. However, since the YouTube-8M dataset provides encoded feature vectors instead of the actual videos, using it directly would result in significant latency. Therefore, I'll use it as a reference and download the videos it labels as advertisements to build our dataset. I will use a web scraper to automate this process by extracting the URLs of the commercial videos. For the non-commercial part, I will download random videos from other categories of the YouTube-8M dataset.
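As a rough sketch of the download step (not a final implementation), the snippet below assumes the scraper has already written the advertisement and non-advertisement URLs to plain-text files (hypothetical names) and uses the yt-dlp Python API to fetch each class into its own folder, so the dataset ends up laid out as `dataset/<class>/<video>.mp4`.

```python
# Sketch: download labeled videos into per-class folders with yt-dlp.
# commercial_urls.txt / non_commercial_urls.txt are hypothetical outputs of
# the scraping step, one URL per line.
from pathlib import Path
import yt_dlp

URL_LISTS = {
    "commercial": "commercial_urls.txt",
    "non_commercial": "non_commercial_urls.txt",
}

for label, url_file in URL_LISTS.items():
    out_dir = Path("dataset") / label
    out_dir.mkdir(parents=True, exist_ok=True)
    urls = [u.strip() for u in Path(url_file).read_text().splitlines() if u.strip()]
    opts = {
        "format": "mp4",
        "outtmpl": str(out_dir / "%(id)s.%(ext)s"),  # one file per video id
        "ignoreerrors": True,  # skip removed/private videos instead of aborting
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download(urls)
```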
After the dataset is ready, I will preprocess it to ensure it's suitable for input into deep learning models. I'll then divide the dataset into training, validation, and test sets. To address temporal dependencies during training, I intend to shuffle the dataset randomly using `tf.keras.preprocessing.image_dataset_from_directory()` with `shuffle=True`. This approach ensures that videos from different folders are presented to the model in random order, allowing it to learn scene-change detection effectively.
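A minimal sketch of how this shuffled split might look, assuming frames have already been extracted from each video into per-class folders; the directory name, image size, and batch size below are placeholders, and the test set would live in a separate held-out folder.

```python
# Sketch: build shuffled train/validation datasets from a per-class directory
# layout such as dataset_frames/commercial/... and dataset_frames/non_commercial/...
import tensorflow as tf

IMG_SIZE = (172, 172)  # placeholder resolution; adjust to the chosen model
BATCH = 8
SEED = 42

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset_frames",
    validation_split=0.2,
    subset="training",
    seed=SEED,
    shuffle=True,          # present samples from different folders in random order
    image_size=IMG_SIZE,
    batch_size=BATCH,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset_frames",
    validation_split=0.2,
    subset="validation",
    seed=SEED,
    shuffle=True,
    image_size=IMG_SIZE,
    batch_size=BATCH,
)
# A held-out test folder (e.g. dataset_frames_test/) would be loaded the same
# way, without validation_split.
```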
Video Classification models
MoViNets are a good fit for our task because they can operate on streaming video for online inference. The main reason for trying MoViNets first is that they perform quick, continuous analysis of incoming video streams. MoViNet uses NAS (Neural Architecture Search) to balance accuracy and efficiency, incorporates stream buffers for constant memory usage, and improves accuracy via temporal ensembles. The MoViNet architecture uses 3D convolutions that are "causal": a causal convolution ensures that the output at time t is computed using only inputs up to time t, which allows efficient streaming. This makes MoViNets a strong choice for our case.
Since we don't have a big dataset, we will use the pre-trained MoViNet model as a feature extractor and fine-tune it on our dataset. I will remove the classification layers of MoViNet and use its pre-trained weights to extract features from our dataset, then train a smaller classifier (e.g., a few fully connected layers) on top of these features. This way we can reuse the features MoViNet learned on a much larger dataset with minimal risk of overfitting, which can improve the model's performance even with limited data.
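A sketch of this fine-tuning setup, loosely following the shape of the official TensorFlow MoViNet transfer-learning example; it assumes the `tf-models-official` package is installed and a Kinetics-600 MoViNet-A0 checkpoint has been downloaded locally (the checkpoint path, frame count, and resolution are placeholders).

```python
# Sketch: frozen MoViNet backbone as a feature extractor with a small
# 2-class head (commercial vs. non-commercial) trained on our dataset.
import tensorflow as tf
from official.projects.movinet.modeling import movinet, movinet_model

CHECKPOINT_DIR = "movinet_a0_base"   # hypothetical local checkpoint directory
NUM_CLASSES = 2

# Pretrained backbone acts as a frozen feature extractor.
backbone = movinet.Movinet(model_id="a0")
backbone.trainable = False

# Restore Kinetics-600 weights via a temporary 600-class model, then reuse
# the same backbone under a fresh 2-class classifier head.
pretrained = movinet_model.MovinetClassifier(backbone=backbone, num_classes=600)
pretrained.build([None, None, None, None, 3])
tf.train.Checkpoint(model=pretrained).restore(
    tf.train.latest_checkpoint(CHECKPOINT_DIR)).expect_partial()

model = movinet_model.MovinetClassifier(backbone=backbone, num_classes=NUM_CLASSES)
model.build([None, 8, 172, 172, 3])  # batch, frames, height, width, channels

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=...)
```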


If MoViNet does not perform well, then we can use other models such as ResNet-50 + LSTMs. Since a video is just a series of frames, a naive video classification method would be to pass each frame from a video file through a CNN, classify each frame individually and independently of the others, choose the label with the largest corresponding probability, label the frame, and assign the most frequently assigned frame label to the video. This suffers from "prediction flickering", where the label for the video changes rapidly as scenes get labeled differently; I will use rolling prediction averaging to reduce this flickering in the results.
The Conv+LSTM model should perform well because it considers both the spatial and temporal features of videos, just like a Conv3D model. The only reason it is not my first choice is that MoViNets are considered better for real-time performance.
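For illustration, here is a minimal sketch of rolling prediction averaging over per-frame CNN predictions with OpenCV; the model file, input size, queue length, and video filename are placeholders for whichever per-frame classifier ends up being tested.

```python
# Sketch: smooth per-frame predictions with a rolling average to avoid
# label flickering between adjacent frames of the same scene.
from collections import deque
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("frame_classifier.keras")  # hypothetical per-frame model
QUEUE_LEN = 32                          # how many recent predictions to average
preds_queue = deque(maxlen=QUEUE_LEN)

cap = cv2.VideoCapture("sample_broadcast.mp4")  # hypothetical test video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (224, 224)).astype("float32") / 255.0

    probs = model.predict(np.expand_dims(rgb, axis=0), verbose=0)[0]
    preds_queue.append(probs)

    # Average the last N frame-level predictions before picking a label.
    avg = np.mean(preds_queue, axis=0)
    label = "commercial" if avg[0] > 0.5 else "non-commercial"
cap.release()
```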

Optional method with Video Vision Transformers
This is a pure Transformer-based model that extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. I have kept this as an optional method because our problem is binary classification (either commercial or non-commercial), so using such a complex model for this relatively small problem may not be as efficient as the other models.
Choosing the Best Performing model
After training the models, I'll assess their performance using evaluation metrics and conduct real-time inference on a sample video containing both commercial and non-commercial segments. I'll select the model with the highest accuracy and integrate it into the GStreamer pipeline for further processing.
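As a very rough sketch of that integration, the snippet below shows one way decoded frames could be pulled out of a GStreamer pipeline via `appsink` in Python and handed to the classifier. The pipeline string is illustrative only, `classify_clip()` is the hypothetical inference helper sketched earlier, and the real pipeline will need additional branches for replacing/obscuring the video and substituting the audio.

```python
# Sketch: pull decoded RGB frames from a GStreamer pipeline for classification.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib
import numpy as np

Gst.init(None)

PIPELINE = (
    "filesrc location=sample_broadcast.mp4 ! decodebin ! videoconvert ! "
    "video/x-raw,format=RGB ! appsink name=sink emit-signals=true max-buffers=1 drop=true"
)

def on_new_sample(sink):
    sample = sink.emit("pull-sample")
    buf = sample.get_buffer()
    caps = sample.get_caps().get_structure(0)
    w, h = caps.get_value("width"), caps.get_value("height")
    ok, mapinfo = buf.map(Gst.MapFlags.READ)
    if ok:
        frame = np.frombuffer(mapinfo.data, dtype=np.uint8).reshape(h, w, 3)
        # Buffer frames into a clip and call the classifier here, e.g.
        # label = classify_clip(clip); then switch the output branch accordingly.
        buf.unmap(mapinfo)
    return Gst.FlowReturn.OK

pipeline = Gst.parse_launch(PIPELINE)
pipeline.get_by_name("sink").connect("new-sample", on_new_sample)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()  # keep the pipeline running
```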