Conf42 Machine Learning 2025 - Online


Fighting Harmful Content with AI: Scalable Moderation for Real-World Safety


Abstract

Discover how we’re building an AI-powered system to detect and moderate harmful content like CSAM in near real time. Learn how it works, how it can serve as a trust and safety tool for customers, and the vision for deploying scalable, privacy-conscious AI moderation.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. It's great to be here. I am Georgina Tryfou, a machine learning engineer at Gcore, where I focus on building real-time AI systems for video and speech analysis. Today I want to walk you through a project that's really important for everyone on our team: how we are using AI to detect and intercept harmful content online, especially things like child sexual abuse material and other high-risk abuses. This isn't just a technical show and tell; I'll take you through the challenges, the architecture, and what it takes to make this work reliably at scale, across different regions and real customer environments.

So why does this matter? Let's see first why we are even trying to offer this type of solution to our customers. We are living in an era where people upload everything from personal vlogs and live streams to gaming, education, and more. A platform like YouTube gets over 500 hours of video per minute, and that's just one platform. But within that sea of content, even a single instance of abuse material is devastating legally, reputationally, and morally. And it's not just big tech. We work with smaller platforms too: niche streaming sites, educational platforms, even creators using hosting tools such as Uscreen or similar services. They don't have armies of moderators who can keep up with this extreme rate of video uploading. So we asked: how can we detect dangerous content early, automatically, and responsibly, before it reaches the audience?

Let's be honest, this is not easy. There are four big challenges we identified quickly and had to overcome.

First, we have class imbalance. Harmful content is extremely rare. If you train a model naively, it will learn to say everything is fine and still be 99.999% accurate. That's obviously useless. Class imbalance must be taken into account during system design, model training, and monitoring.

Second, video data is huge. A 10-minute video in a high-definition format might be 500 megabytes or even more. Scanning thousands of videos means dealing with terabytes of data, often in real time. Smart triggering is critical; it can significantly reduce the number of requests to the system.

Third, we have privacy. We can't just dump user content onto a central server and scan it all. We are bound by GDPR and, more importantly, by ethics. Users expect platforms and infrastructure providers to respect boundaries and to avoid scanning, storing, or analyzing their private content unnecessarily or invasively.

And finally, adversaries evolve. We have seen harmful content hidden behind innocent-looking cat memes, obscured by translucent overlays, distorted with TikTok-style filters, or even encoded into the audio track using steganography techniques. It's wild, and it is intentional. These are not accidents; they are attempts to bypass any type of detection system. So our models can't just rely on surface patterns or static features. They need to generalize across manipulations, formats, and behaviors we haven't explicitly trained them for.
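To make the class-imbalance point concrete, here is a minimal sketch, not taken from the talk, of one common mitigation: a class-weighted loss in PyTorch. The imbalance ratio, toy model, and batch below are illustrative assumptions, not Gcore's actual training setup.

```python
# Minimal sketch: counteracting extreme class imbalance with a weighted loss.
# The ratio and model below are illustrative, not taken from the talk.
import torch
import torch.nn as nn

# Suppose roughly 1 harmful example per 10,000 benign ones in the training set.
neg_per_pos = 10_000.0

# pos_weight scales the loss contribution of the rare (harmful) class so the
# model cannot reach high accuracy by predicting "benign" for everything.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([neg_per_pos]))

# Toy classifier standing in for a real frame-level model.
model = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

features = torch.randn(32, 512)   # a batch of frame embeddings
labels = torch.zeros(32, 1)       # mostly benign...
labels[0] = 1.0                   # ...with a single positive example

loss = criterion(model(features), labels)
loss.backward()
```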
So let's zoom now into our architecture, because this is where everything we have talked about becomes real. When we set out to design the system, our goal wasn't just to build good models; it was to build a moderation pipeline that is modular, scalable, and resilient. Something that could plug into different environments, serve different types of customers, and grow over time. So here's the full pipeline we use.

It all starts with a trigger, and this is how we decide what to scan. This can be random sampling, useful for stored content. It can be metadata-based triggers, for example a spike in uploads from one account, or a file with a suspicious name or a hash similar to hashes that are available online. And it can also be a custom customer signal: some customers can flag videos for post-publish moderation, and we will ingest these events too. This trigger phase is very important because we don't want to scan everything. We prioritize scanning intelligently, which saves a lot of cost and preserves privacy.

Then we have the orchestration layer. Once content is selected, it enters our task engine. We use Celery here, a distributed task queue system, which allows us to run tasks asynchronously, prioritize jobs based on urgency, retry failed steps, and load-balance across multiple worker nodes. As an example, if a live stream segment needs scanning in five seconds, that task will jump ahead of a batch job processing stored content. We also shard jobs by modality, so the vision, audio, and sequence models can scale independently.

Then we have the pre-processing layer. Before inference, we need to prep the media. This includes splitting video into short segments of a few seconds, extracting and normalizing audio, resizing or re-encoding frames for model compatibility, and de-duplicating content we have already scanned. This step is essential to ensure inference runs consistently across formats. Some of our customers upload 4K while others use some other codec, and this layer standardizes every input so it is ready for the next step.

Right after that, we have the inference layer. This is where the multimodal AI lives. We have multiple models running in parallel: frame-based image classifiers, Whisper-style speech transcription with keyword spotting, and lightweight temporal models such as GRUs or 3D CNNs. Each of these models returns its own confidence score and tags. For example, a video might be marked as having a high probability of not-safe-for-work (NSFW) content, a keyword like "grooming" might be found in the audio, and the temporal model might tag the output as a violent motion sequence. Inference runs in containers that can scale horizontally, and we log model versions for traceability and auditability.

Then we have the post-processing layer, which takes all the signals I described before and tries to combine them, either by normalizing confidence scores and applying some aggregation logic, or by applying another classifier on top of the previous models. Here we attach tags and optionally enrich the result with additional metadata, for example the video title, the language of the speech, and so on. This logic evolves with feedback; for example, we recently added a rule that lowers the NSFW scores if the video is clearly educational, like anatomy lectures.

Then we have the decision layer. This is where we make the final moderation decision. This might mean flagging the content for manual review, auto-blocking if the score is high enough, sending an alert to the customer, or just logging the result silently if it is borderline. We also store a decision trace: which model outputs contributed to the flag, when it happened, and why. And this entire stack is loosely coupled, which means we are able to scale or replace individual components without breaking the whole system.
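As an illustration of the orchestration layer just described, here is a minimal sketch assuming Celery with a Redis broker. The task names, queues, placeholder scores, and fan-out workflow are hypothetical, not the actual Gcore pipeline.

```python
# Minimal orchestration sketch, assuming Celery + Redis; names are illustrative.
from celery import Celery, chord, group

app = Celery("moderation",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

# Shard work by modality so each model family can scale independently.
app.conf.task_routes = {
    "tasks.scan_frames": {"queue": "vision"},
    "tasks.scan_audio": {"queue": "audio"},
    "tasks.scan_temporal": {"queue": "temporal"},
}

@app.task(name="tasks.scan_frames", autoretry_for=(Exception,),
          retry_backoff=True, max_retries=3)
def scan_frames(segment_uri):
    # Placeholder for the frame-level CNN classifier.
    return {"model": "frames", "score": 0.02, "tags": []}

@app.task(name="tasks.scan_audio", autoretry_for=(Exception,),
          retry_backoff=True, max_retries=3)
def scan_audio(segment_uri):
    # Placeholder for Whisper-style transcription + keyword spotting.
    return {"model": "audio", "score": 0.01, "tags": []}

@app.task(name="tasks.scan_temporal", autoretry_for=(Exception,),
          retry_backoff=True, max_retries=3)
def scan_temporal(segment_uri):
    # Placeholder for the lightweight GRU / 3D-CNN temporal model.
    return {"model": "temporal", "score": 0.05, "tags": []}

@app.task(name="tasks.fuse")
def fuse(results):
    # Hand the per-model outputs to the post-processing / decision layer.
    return {"max_score": max(r["score"] for r in results), "results": results}

def scan_segment(segment_uri):
    """Fan one segment out to all modalities, then fuse the results.

    With a priority-aware broker and dedicated queues, urgent live-stream
    segments can jump ahead of batch jobs over stored content.
    """
    workflow = chord(group(scan_frames.s(segment_uri),
                           scan_audio.s(segment_uri),
                           scan_temporal.s(segment_uri)),
                     fuse.s())
    return workflow.apply_async()
```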
So this pipeline is the core of our moderation engine. It is designed not just to analyze content, but to do it in a way that is scalable, auditable, and adaptable to real-world use cases.

Let me give a little more information about the models. As I mentioned, we don't rely on a single AI model; we use a multi-model ensemble. For video frames, we use CNNs. They're similar to what powers NSFW filters, but fine-tuned on stricter datasets. Think of this step as detecting nudity or graphic violence in individual frames. Then we have audio models. We transcribe with a Whisper-based pipeline and then check the transcription for keywords. For example, if the transcription of a video contains a phrase like "this is our little secret", that will raise a relevant tag in the output of this model. For temporal analysis, we use lightweight GRU or 3D CNN models to detect patterns over time. This is critical for spotting actions like slapping or other types of violence, or generally content that unfolds slowly across five or ten seconds of video. All these signals are passed, as we discussed, to a decision layer. This layer fuses the outputs of all the models, either using score logic or some meta-classifier, and tries to make sure that nothing is missed. Even if one model misses something, this layer combines everything so that the overall accuracy is much better than the accuracy of the individual models.

So now let's move from what we built to how it can actually be used in the wild. This system can be integrated with a variety of platforms, each with different needs. For video platforms, moderation happens at upload time. Think of it like a pre-flight check: before the video goes live, it gets scanned to ensure it's safe. Live streaming services use it in real time. The content is processed in short chunks, about five seconds at a time, and we can flag problematic content mid-stream, while it is still happening. And for hosting providers, the system can scan stored content passively. For example, it might sample and review 10% of a video library every hour with no disruption to users. Or a customer can ask us to help them scan a backlog of 10,000 old videos they inherited during a platform migration. The system will flag just a tiny percentage of them, and even that small percentage may contain a handful of serious violations that would otherwise have been a legal and reputational risk if they had gone unnoticed. Everything plugs in via simple APIs or webhooks, no deep integration required. This way, even lean teams can add industrial-grade moderation without building it from scratch.

Now, I know some of you may wonder, after all this discussion, that since we're scanning video uploads, analyzing live streams, and scanning whole collections of video, how do we make sure that we are not just building some surveillance infrastructure? And this is a fair point. When you're dealing with sensitive content, trust is not just technical, it is ethical. So let me show you what we have done to build privacy-first, responsible moderation into the system from the start.

First of all, we never store or persist user content unless it is absolutely necessary. Once a video or audio segment is scanned, the original media is discarded. We only keep the metadata we need, like tags, scores, or timestamps, and only when a flag is raised.
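As a concrete illustration of how the decision layer and the metadata-only retention just mentioned might fit together, here is a minimal sketch. The weights, thresholds, field names, and the educational-content rule are illustrative assumptions, not the production logic.

```python
# Illustrative fusion + decision sketch; weights and thresholds are made up.
import os
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    """Metadata-only record kept after the media itself is discarded."""
    content_id: str
    action: str                    # "auto_block" | "manual_review" | "log_only"
    fused_score: float
    tags: list = field(default_factory=list)
    trace: dict = field(default_factory=dict)
    decided_at: str = ""

def decide(content_id, media_path, model_outputs, metadata):
    # 1) Simple score fusion: weighted max over per-model confidence scores.
    weights = {"frames": 1.0, "audio": 0.8, "temporal": 0.9}
    fused = max(weights.get(r["model"], 1.0) * r["score"] for r in model_outputs)
    tags = sorted({t for r in model_outputs for t in r["tags"]})

    # 2) Context-aware adjustment, e.g. damp NSFW scores for clearly
    #    educational material such as anatomy lectures.
    if metadata.get("category") == "educational" and "nsfw" in tags:
        fused *= 0.5

    # 3) Threshold-based decision.
    if fused >= 0.9:
        action = "auto_block"
    elif fused >= 0.6:
        action = "manual_review"
    else:
        action = "log_only"

    decision = Decision(
        content_id=content_id,
        action=action,
        fused_score=round(fused, 3),
        tags=tags,
        trace={r["model"]: r["score"] for r in model_outputs},
        decided_at=datetime.now(timezone.utc).isoformat(),
    )

    # 4) Privacy: discard the scanned media; keep only the metadata record.
    if os.path.exists(media_path):
        os.remove(media_path)
    return decision
```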
Then we rely on randomized sampling, as I mentioned, or event-based triggers. That means we're not just scanning everything all the time. Instead, we designed our pipeline to mimic how a human trust and safety team would triage: it targets high-risk content while still preserving fairness and unpredictability in its work.

Third, our system is compliance-aware by design. We respect regional data laws like GDPR and others, meaning we never scan without consent and don't retain content outside allowed timeframes. And we deploy different logic depending on jurisdiction; for example, we avoid certain types of scanning in countries with stricter laws around personal data.

Finally, we regularly audit our models for bias and fairness. We uncovered early on that some of our models were disproportionately flagging videos with darker lighting or skin tones, and that wasn't acceptable. So we retrained the models with more balanced datasets and added bias-detection hooks to our validation process. In short, we are trying to build a system that is not just technically strong, but also ethically responsible.

Now, looking ahead, while we have built a strong foundation, we know that the real power of this system lies in how we scale it. Our goal is to make AI-powered moderation feel native to the internet infrastructure: not a bolt-on service, but a core part of how content flows across the web. So here is what we are working on.

First, we're exploring edge-based moderation, meaning our detection models run directly on CDN edge nodes, close to where content is uploaded or streamed. That reduces latency, saves bandwidth, and keeps data local.

We are also building region-aware moderation pipelines. Norms and laws vary widely around the world: what's acceptable in one country might be illegal in another. So we're designing the system to dynamically adjust thresholds or swap in jurisdiction-specific model variants depending on the geography of the user or the customer.

Another priority is lightweight deployment. We are optimizing our models to run under 200 megabytes so they can live comfortably on edge servers without taking up valuable compute. In practice, this means stripping down dependencies, simplifying architectures, and compressing inference graphs as much as possible, without sacrificing accuracy, of course.

And finally, we're developing a set of customer-facing tools: APIs and dashboards that allow our customers to review moderation events, customize their thresholds, and receive real-time alerts. Imagine a streaming platform getting notified within seconds if abusive content appears mid-broadcast, and being able to act on it instantly. All of this is in active development and early-stage rollout. So when we talk about the future of trust and safety online, we don't just imagine it sitting in the cloud. We see it distributed, embedded at the edge, tuned for local context, and accessible to platforms of all sizes.
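To give a feel for what region-aware moderation could look like in practice, here is a minimal, hypothetical sketch of a jurisdiction-keyed policy lookup. The region codes, thresholds, and retention windows are placeholders, not legal guidance or Gcore's actual configuration.

```python
# Illustrative region-aware policy lookup; all values are placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class RegionPolicy:
    nsfw_block_threshold: float     # auto-block above this fused score
    allow_audio_scanning: bool      # some jurisdictions restrict speech analysis
    metadata_retention_days: int    # how long flag metadata may be kept

POLICIES = {
    "eu": RegionPolicy(nsfw_block_threshold=0.85, allow_audio_scanning=False,
                       metadata_retention_days=30),
    "us": RegionPolicy(nsfw_block_threshold=0.90, allow_audio_scanning=True,
                       metadata_retention_days=90),
}
DEFAULT = RegionPolicy(nsfw_block_threshold=0.80, allow_audio_scanning=False,
                       metadata_retention_days=30)

def policy_for(region_code: str) -> RegionPolicy:
    """Pick the policy for the uploader's or customer's region."""
    return POLICIES.get(region_code.lower(), DEFAULT)

# Example: the decision layer consults the policy before acting.
policy = policy_for("EU")
if not policy.allow_audio_scanning:
    pass  # skip the audio/transcription models for this region
```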
So let me wrap up with some lessons we have learned along the way. First, orchestration is key. Our models are great, but without task queues, retries, and monitoring, it would all break down under load. Then, class imbalance isn't just a modeling problem; it affects how our customers perceive the system. False positives erode trust, so careful product thinking is essential. Infrastructure really matters too. We have had bugs where a queue failed silently and nothing was scanned for hours; logging and fallback logic are really important and can save such systems.

And finally, trust isn't earned by being perfect. It's earned by being transparent. Logs, scores, explanations, any context we can offer helps our customers build trust. So if you are building an AI model for safety, try to think beyond the model and consider the whole system and product in scope.

Thank you everyone for giving me the opportunity to talk about this system. We are still early in this journey, but I truly believe that AI moderation is becoming not just practical but essential, and whether you're building platforms, delivering content, or just care about a safer internet, this is something that we need to focus on building together. Thanks a lot for your time, and I would love to chat if you are working on anything in this space.
...

Georgina Tryfou

ML Engineer @ Gcore

Georgina Tryfou's LinkedIn account


