Conf42 Machine Learning 2022 - Online

Greenfield vs. Brownfield Data Labeling to improve AI performance


Abstract

In this talk, we will focus on the data perspective when building machine learning pipelines. Using two examples, I will show how greenfield and brownfield data labeling differ, what you should focus on in each, and how to best leverage new technologies, frameworks, and products to build high-performing models.

The goal is to give you a better understanding of what data options you have for building machine learning pipelines (whether for classification or extraction). The ideas and concepts are based on research results from the Hasso Plattner Institute and three years of experience consulting on AI projects.

Summary

  • To build a classifier, you don't only need raw data, but labeled data. In greenfield labeling, we basically want to start from scratch. In brownfield labeling, we already have existing training data and can improve on it continuously.
  • Weak supervision is a machine learning perspective on integrating information. The basic concept is quite easy, but you can build really cool applications using weak supervision. You can combine several heuristics to create sophisticated weakly supervised labels.
  • The main idea is that you label to build, whereas a classifier is built for real-time inference. In labeling, you can have access to data that potentially is not available at runtime. There is a trade-off, and in data labeling you decide for confidence.
  • Even in greenfield labeling, manual labeling still matters a lot. You not only want to automate, you also want to explore your data. Automatic labeling also needs some reference data so that you know how good your automation actually is.
  • First up is neural search. You can compute embeddings for your data using, for instance, pretrained transformer models. Neural search can also be used to find very representative data. Greenfield labeling is a really good use case for automating your labeling.
  • In the real world, we mostly have messy data, and that is why in brownfield labeling we want to improve the data quality. This is where technologies like confident learning come into play: they essentially help to estimate how large the error rate is. Training data is an integral part of machine learning applications.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
So before I jump directly into the two terms, greenfield and brownfield, I want to give you a very short intro into data labeling. Most likely every one of you has already heard of it, but just the very basic concepts. Of course, if you want to build some classifier or information extractor, you don't only need raw data, you need labeled data, right? So for instance, if we have some email, we want to mark or categorize the different pieces of information that we basically want to predict using an AI model. And this is typically done by manually labeling the data, right? So you have some person that labels the data, and you can then use this labeled data to build your model. Now, I want to draw an analogy for this kind of data labeling. We talk a lot about greenfield and brownfield in IT projects, right? Greenfield is typically when you can start from scratch and build in a completely new environment. On the contrary, in brownfield you basically build on an existing system. You have to work with legacy code, with integrations and things like that. So it's not about designing something new, but improving something continuously. And we can transfer those concepts to data labeling, because we have something quite similar there. In greenfield labeling, we basically want to start from scratch. We only have the raw data, but we want to build a proof of concept really fast and go from basically zero to 90 or so. On the contrary, in brownfield labeling we already have existing training data, but we are fairly sure that there are potential quality improvements, so we can continuously work on it and, by doing so, improve the performance of our model. That's basically the idea behind greenfield and brownfield data labeling. From another perspective, you mostly focus on training data quantity in greenfield: you really create a lot of training data, knowing that the quality won't be perfect, and you can then improve on it continuously during brownfield labeling. So I'm going to talk first about greenfield, give you some ideas about that, and talk about really cool technologies. Then we're going to brownfield, and then we draw our conclusion. When we talk about greenfield labeling, we first talk about how you label from scratch, what kind of options you have, and what the real problems are. The first two options you have if you want to label data are that you either go via crowd labeling, which is globally scalable, so you can have lots of people working on it. But typically you run into issues when it comes to very difficult tasks where you need a lot of domain knowledge, for instance in insurance companies. The contrary to crowd labeling is in-house labeling, where you let your in-house experts label the data, but then of course you don't have the global scale, right? So it becomes a lot more expensive and oftentimes a bottleneck in your projects. So it really is difficult to create a large training set that you can start with and use to prove your concept. That's why a lot of people think about how to automate labeling so that you can create large training sets easily. And one of those ideas is weak supervision, which is basically a machine learning perspective on integrating information, right?
The basic concept is quite easy, but you can build really cool applications using weak supervision. The idea is that you come up with heuristics. A heuristic, and we go into this in a bit more detail, can be something like a labeling function that is not perfect, not predicting the right label 100% of the time, because then you would already have the classifier, but something that gives you the right label in 80% or 70% of the cases, and not for all records, but just for some subset. You want to come up with several of those heuristics and compute them for each of your records, so that you then have a matrix of noisy labels created by all of your heuristics. And the task of weak supervision is basically to combine them, right? Again, weak supervision is not one algorithm, but a family of algorithms that you can use to look at your noisy label matrix and come up with the potentially best synthesized labels for your data. Most of the time you also don't get a discrete label, but a probabilistic label. One algorithm could be, for instance, majority vote, where you just look at the counts of the heuristics that vote for each record. But you can also go for more sophisticated algorithms that analyze precision, coverage, conflicts, things like that, so you can really create sophisticated weakly supervised labels. If we now talk about heuristics, I just want to showcase what types of heuristics there can be. One of the simplest is labeling functions, which can be a very simple Python function, just a few lines of code that takes a record as input and returns the label that you want to predict. Other heuristics can be, for instance, distant supervision, which is basically looking up values that are strongly associated with some label, or active learning modules that continuously learn on the data that you already labeled manually. It can be zero-shot classifiers, for example from Hugging Face, which are very similar to labeling functions, but instead of writing the code of the labeling function you just provide the label names, which is really cool. It can be something like inexperienced labelers, so still manual labeling, like crowd labeling or interns. And it can of course be anything that you can integrate: third-party systems, legacy systems. So in general you really have a generic interface, and the idea is that you collect noisy labels, the relevance of each heuristic is determined, and you can combine them into weakly supervised labels. If you can now automate the labeling, we have to think about why we even want to train a classifier, and for that we can make a short comparison. The main idea is that you label to build, whereas a classifier is built for real, real-time inference. For labeling, the runtime doesn't really matter; you can just run a program overnight, whereas for inference you oftentimes only have milliseconds. Also, in labeling you can have access to data that potentially is not available at runtime. And you don't aim for 100% coverage, but instead you want really high confidence, so there's a trade-off, and in data labeling you decide for the confidence. Also, in labeling you produce an artifact.
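To make the labeling-function and majority-vote idea a bit more tangible, here is a minimal sketch in Python. The record structure, label names, and heuristics are illustrative assumptions, and the plain majority vote stands in for the more sophisticated weak supervision algorithms that analyze precision, coverage, and conflicts.

```python
# Minimal weak supervision sketch: a few noisy labeling functions plus a majority vote.
# Record structure and label names are made up for illustration.

CLICKBAIT, NOT_CLICKBAIT, ABSTAIN = "clickbait", "not clickbait", None

def lf_starts_with_digit(record):
    # Heuristic: headlines starting with a digit ("7 reasons why ...") are often clickbait.
    return CLICKBAIT if record["headline"][0].isdigit() else ABSTAIN

def lf_is_question(record):
    # Heuristic: question headlines tend to be clickbait; abstain otherwise.
    return CLICKBAIT if record["headline"].strip().endswith("?") else ABSTAIN

def lf_long_headline(record):
    # Heuristic: long, factual headlines are usually not clickbait.
    return NOT_CLICKBAIT if len(record["headline"].split()) > 12 else ABSTAIN

def majority_vote(record, labeling_functions):
    """Combine noisy heuristic votes into one weakly supervised label (None if all abstain)."""
    votes = [lf(record) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    return max(set(votes), key=votes.count) if votes else None

records = [
    {"headline": "7 reasons why you should label your data"},
    {"headline": "Central bank raises interest rates by a quarter point amid ongoing inflation concerns"},
]
lfs = [lf_starts_with_digit, lf_is_question, lf_long_headline]
for r in records:
    print(r["headline"], "->", majority_vote(r, lfs))
```

Each labeling function only covers a subset of records and is allowed to be wrong; the value comes from combining many such weak signals into one synthesized label per record.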
So, something that you can build software with, whereas on the other side, again, that software is what you want to use for inference. So there's this comparison. And what's oftentimes seen, and there are great studies about it, is that if you take training data labeled by something like weak supervision, the classifier that you train on it oftentimes generalizes very well, so you still gain a lot of value from just training your model on that data. But as we talk about automating labeling, we also have to talk about the fact that even in greenfield labeling, manual labeling still matters a lot, because there are different problems you want to tackle here too. You not only want to automate, you also want to explore your data: you want to see what kind of patterns there are and get familiar with them. For automatic labeling, you also oftentimes need some reference data so that you know how good your automation actually is. You want to measure how good human performance is: is there any subjectivity in my data, so that two people who label the same data might disagree every now and then? And of course, manually labeling data helps a lot when you want to come up with techniques for automation, right? There are also different strategies that you can follow if you want to manually label your data. For exploring, for instance, you can make great use of neural search, which uses so-called embeddings, vector representations of, for instance, your texts or emails. You can then use that information, or metadata, to navigate through your data. I'm going to give you an example in a second. For reference data, you can just use random sampling. If you want to understand the performance of people on your data, you can filter for subsets that are already labeled by other people, so that you can easily calculate something like an inter-annotator agreement; I'm also going to show you this in a second. And of course you also want to come up with new ideas and validate your heuristics, and you can use filters for that as well. Also going to show this in a second. I'm first jumping into neural search. As I just mentioned, you can compute embeddings for your data using, for instance, pretrained transformer models that put your textual data into a numeric vector representation. And if you have that data, you can make very good use of it. For instance, if you want to find outliers in your data, one very sophisticated approach is diversity sampling. It basically means that you start by grabbing one random sample, like, for instance, this one. Once you have it labeled, you calculate the record that is most distant to this reference point, which will be, for instance, this green record. And if you do this continuously and always compute the vector that is the most distant to your current pool of labeled records, you will find the outliers in your records. That is super interesting, because you can then really analyze what kind of obstacles you will face once your model is deployed and running and you want to infer new predictions with it. So it's always super interesting to understand what kind of outliers there are in texts or images or whatever kind of data you have. On a different note, neural search can also be used to find very representative data, rather than staying in one cluster.
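As a rough illustration of the diversity sampling loop described above, here is a small sketch over precomputed embeddings. The random embeddings are just a stand-in for vectors from a pretrained transformer model, and the greedy farthest-point selection is one simple way to implement "always pick the record most distant from the already labeled pool".

```python
# Diversity sampling sketch over precomputed embeddings (random vectors as placeholders).
import numpy as np

def diversity_sample(embeddings, n_samples, seed=42):
    """Greedily pick records that are maximally distant from everything picked so far."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(embeddings)))]      # start with one random record
    for _ in range(n_samples - 1):
        # distance of every record to its nearest already-selected record
        dists = np.linalg.norm(
            embeddings[:, None, :] - embeddings[selected][None, :, :], axis=-1
        ).min(axis=1)
        selected.append(int(dists.argmax()))              # most distant record = likely outlier
    return selected

embeddings = np.random.rand(1000, 384)                    # e.g. 1000 records, 384-dim vectors
print(diversity_sample(embeddings, n_samples=5))          # indices of records to label next
```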
For instance, if you have your embeddings clustered, you find a certain number of data points per cluster that are really representative. If you do so, you will very easily explore the data that really helps your model to learn, right? So if you have, say, three clusters, you can decide that you want two or three data points per cluster. It really depends on how complicated your text or image data is, but it can really help you progress and label efficiently. So now we've talked about greenfield labeling, and we saw that it is a really good use case for automating your labeling and that you can use great technologies. I'm going to show some more in a second that are very tightly bound to brownfield labeling. But you really see that you can automate labeling to some extent using technologies like weak supervision, so that you can achieve the goal of greenfield labeling much faster, which is to prove concepts, to show that some kind of task might be automated using machine learning models. That is really helpful for that case. And now we are jumping into the second case: we now see that our use case actually works, but we have not achieved the performance that we want. We need to continuously improve the performance, and only on the rarest occasions do we really have a super clean dataset. In the real world, we mostly have messy data, and that is why in brownfield labeling we want to improve the data quality, right? For instance, there are those examples from very well-known datasets where you can see that they have been labeled incorrectly. By correcting them, you would improve the data quality and thus also improve the performance of your model. The problem is that you don't know which records are the ones that have been mislabeled, and of course you don't want to label everything again. So how do you make the best use of your time and money? This is where technologies like confident learning come into play. Confident learning basically uses already trained models, for instance your production model or the model that you use for inference, and it takes the outcomes of the predictions as probabilities, so that you basically have not just one discrete prediction, but one probability per label. Using that, you can compare the predictions to your noisy labels, where a noisy label doesn't necessarily have to be some weakly supervised label, but can be any kind of training data in which there might be flaws. In confident learning, you try to estimate the joint distribution of your true labels and your noisy labels, so that you can then compute, or at least estimate, the number of errors. For instance, if we sum up the true cases, we have a probability of 75%, and if we sum up the false cases, we have a probability of 25%. So in general, we could say that we most likely have roughly 25% potential errors in our data, which is quite a large number, right? Of course, this is not a perfect computation, but it is an estimation. What you can now do is compute confidence scores for each prediction using the model outcomes, and then sort so that you have the lowest scores first. If you know that you have an error rate of 25%, chances are high that in that first quarter of your data there are a lot of label errors.
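Here is a simplified sketch of that workflow: compare the model's predicted probabilities against the given labels, estimate an error rate, and surface the lowest-confidence records for manual review. The arrays are illustrative, and this deliberately skips the full joint-distribution estimation that proper confident learning (as implemented, for example, in libraries like cleanlab) performs.

```python
# Simplified confident-learning-style check: find records whose assigned label
# the model trusts the least, using an estimated error rate to size the review set.
import numpy as np

def likely_label_issues(pred_probs, noisy_labels):
    """Return indices to review (lowest self-confidence first) and an estimated error rate."""
    # self-confidence: predicted probability of the label each record was actually given
    self_confidence = pred_probs[np.arange(len(noisy_labels)), noisy_labels]
    # rough error-rate estimate: share of records where the model disagrees with the label
    estimated_error_rate = float(np.mean(pred_probs.argmax(axis=1) != noisy_labels))
    n_suspects = int(round(estimated_error_rate * len(noisy_labels)))
    return np.argsort(self_confidence)[:n_suspects], estimated_error_rate

pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])  # model outputs
noisy_labels = np.array([0, 1, 1, 1])                                     # current training labels
suspects, err = likely_label_issues(pred_probs, noisy_labels)
print(f"estimated error rate: {err:.0%}, records to review first: {suspects}")
```

With these toy numbers the estimate comes out at 25%, matching the example above: you would review roughly the lowest-confidence quarter of your data first.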
By looping over them, looking at them, and investigating them, you will most likely find some labeling issues. So, confident learning essentially helps to estimate how large the error rate is, which ultimately helps you determine the quality and find those records which potentially are still mislabeled. Then again, if we use technologies like weak supervision, which makes use of lots of heuristics, you can in many cases also debug those heuristics. And this is what I meant by saying that manual labeling is also very important for brownfield labeling. For instance, say you have set up a labeling function that looks at whether a text starts with a digit and then predicts that this is clickbait. What you could do in brownfield labeling is filter for those records that are hit by this labeling function and then investigate in which cases this heuristic is wrong. For instance, we might see that clickbait is only the case if the text starts with a digit and the sentiment is also rather positive, so we can narrow down our heuristics, they become better over time, and we can work on them and basically debug them. Of course, this works when you use weak supervision. Then again, if you are also labeling with multiple users, and you use strategies to label data that has also been labeled by other people, you can calculate the inter-annotator agreement to see where potential disagreements are and how subjective labeling might be in your task. You can also, again if you use something like weak supervision, use your existing heuristics to estimate where there might be some explicit bias, right? For instance, if my coworker Simon and I label for the same heuristic and he has a very different position than I have, then it might be that we need to talk about this heuristic, because we have a different understanding of it. So it really helps us understand the bias that we potentially have. You see, for both greenfield and brownfield, we have a lot of very interesting technologies that we can use to really help us create the training data that we need, both for creating prototypes quickly and for continuously improving our models. If we now think about how data labeling can change over time, we can also think about this: if training data is an integral part of machine learning applications, what will that look like with respect to maintenance and documentation? Because if it is an integral part of the software, it arguably should also be documented. And it is essentially different from what a software artifact looks like, right? You don't have a docstring, so what could it look like? Again, this is just something I want to offer as food for thought, not a definitive answer, but something to think about because it's quite interesting. Also, if we talk about labeling, we see that technologies like neural search or weak supervision now create lots of metadata. Will this potentially shift our idea of labeling, which is where we've focused so far, towards a rather holistic idea of enrichment? For instance, if we're able to quickly automate data labeling, this might lead to us creating more classifiers in a shorter time, and ultimately help us iterate on product reviews, product feedback, things like this. So what will the future of this look like? And what part will data labeling play in building machine learning applications?
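For the inter-annotator agreement mentioned above, one common metric is Cohen's kappa. Here is a minimal sketch with made-up labels from two annotators on the same subset of records, using scikit-learn.

```python
# Inter-annotator agreement sketch: Cohen's kappa between two annotators' labels.
from sklearn.metrics import cohen_kappa_score

# labels two annotators assigned to the same subset of records (illustrative values)
annotator_a = ["clickbait", "clickbait", "not clickbait", "clickbait", "not clickbait"]
annotator_b = ["clickbait", "not clickbait", "not clickbait", "clickbait", "not clickbait"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # near 1 = strong agreement, near 0 = chance level
```

A low kappa on such overlapping subsets is a signal that the task is subjective or that the labeling guidelines (or heuristics) need to be discussed and sharpened.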
So, as promised, if you're interested in those techniques or want to research them a bit, I have some great resources that you can look into. And if that is not enough, you can also just reach out to me and I will show you some further cool resources, because these are very interesting, state-of-the-art technologies. We also have the technologies I just described integrated into an open source application that you can try out if you want to. We're going to publish it very soon, and if you're interested, you can register for our newsletter on our website, and we will reach out to you once we have published it. Thank you so much for your attention. I'm super happy to be able to talk here at Conf42. If you have any further questions, please don't hesitate at all to contact me. And thank you so much for your attention.

Johannes Hotter

Co-Founder & CEO @ Kern



