Conf42 Machine Learning 2024 - Online

Why Most Data Projects Fail and How to Avoid It

Abstract

Unfortunately, the majority of data projects fail. Yet, they fail for the same reasons. This talk will explore the most common reasons data projects fail and how to avoid them. We will do this by introducing the who, what, when, where, and how of data projects.

Summary

  • 85% of data projects fail, according to Gartner. Technology is just one small piece of your success with data. Why can't more companies create more value with data?
  • We need data scientists, data engineers, and operations engineers. Who are the right people at the right ratios? All three of these teams are necessary for success.
  • Oftentimes a company's data strategy is just AI. What we have to do is say, what is the business value? Projects that take too long to generate value are going to get canceled. Be careful not to set your team up for failure.
  • It's always important to have clear numbers on the value of what you're doing. Not all gaps are technology. Sometimes gaps are people and technology. These gaps are part of what is going to make you fail in your projects.
  • Sometimes what I like to do is turn a situation upside down. What happens if all of our plans go wrong? What would have happened? How good is a data engineer at a data science task? Don't overengineer and don't copy.
  • My book is called Data Teams: A Unified Management Model for Successful Data-Focused Teams. What we need is all three teams working together to create value from data. Only once we have all three can we truly be successful with our data teams.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, and welcome to this session, Why Most Data Projects Fail and, I would say more importantly, how to avoid it. So let's start out by talking about what you see when you go out on social media. When people talk about why data projects fail, they think they fail because you chose Snowflake or Databricks, or it was a data lake versus something else, a data warehouse, what have you, or it was the programming language that you chose. You should have chosen Python, should have chosen Java, should have chosen Rust. But in reality, is that why projects fail? Well, let's talk about why.
To help back this up, we have this number from Gartner, which also corresponds with my experience of how many data projects fail: 85% of data projects fail. And it brings up the question, why is this so bad? Because that's a really, really terrible number, 85%. If you invert it, then only 15% of data projects actually succeed, which is still really bad. So why can't more companies create more value with data? That's a really good question, and it's one I've spent a lot of time researching and trying to figure out.
Part of this is that technology is just one small piece. It's definitely important. Not going to lie, it is important, but it is one small piece of your success with data. It is not the entire thing. Normally, when people talk with data teams, they just focus on the technology. What technologies did you use? What programming language? But that's just one small piece. What we're instead going to focus on are the right questions that you should be asking. We need to know who, what, when, where, and how. These are really important questions that we need to answer before we start these data projects.
Who are the right people at the right ratios?
This is incredibly important because oftentimes companies will say, I'm just going to hire a bunch of data scientists, and those data scientists can go do this project. The reality is that those projects don't really work well because we don't have all the people we need. We need data scientists, data engineers, and operations engineers. And more importantly, we need them at the correct ratios, where we need more data engineers than we need data scientists. This is a really important part. Oftentimes teams will have no data engineers, or just a few data engineers relative to the data scientists.
This brings up a question that I often get, especially from management: which team is most important? If I can only do one, which one should I do? The reality is that all three of these teams are necessary for success. We need data engineering, we need data science, and we need operations. Each one of these three is an important part of this ecosystem. I have another talk that you can watch where I talk about what happens when each one of these is missing. But suffice it to say, if we're missing data engineering, we're missing key software engineering and distributed systems skills that data scientists do not have; this is not their core competency. Likewise with operations: if we're missing operations, we're missing somebody who is making sure that things keep running well. And likewise with data science: if we don't have data science, I go so far as to say, why are we doing all this effort? We're not getting that advanced analytics part, which is a key part of the optimal value creation.
Next we go on to what. What is the business value? This is incredibly important. Doing data in and of itself, just for data's sake or because everybody else is doing it, is not enough. What we have to do is say, what is the business value?
If you're just saying, especially now with AI, we're just going to do some AI, that isn't enough. AI is not a business value. Applying data and AI to something to create business value, yes, that's certainly applicable. But there should be a clear and attainable path to value creation. It is not just, we're going to go do AI now. That doesn't work. That does not create any value. It just creates a lot of strife and a lot of extra work for the teams.
Having a data strategy isn't enough either. There needs to be a plan and there needs to be execution. Oftentimes a company's data strategy is just AI, frankly, or it says, we're going to use our customer data to understand our customers better. That's not a data strategy either. It isn't clear. This is part of why teams fail: the executive management, perhaps the board or the C-level, says, here's our data strategy, we're going to use our data to do something nebulous, something where you don't really know what's happening, and there's no real way to measure it and no direction to it. So what we need is a data strategy that has a clear plan and clear execution. If we don't do this, we won't be able to actually execute on it.
When are you going to generate value? Oftentimes there are two opposite ends of this when. Oftentimes the when is as soon as possible, or some unattainable deadline. These unattainable deadlines just aren't feasible. They set the team back from the very beginning and make it so that they are never going to be successful, because they've been set up for failure with a timeline that is not feasible at all. It usually starts with a misunderstanding of the difficulty of data projects and how much time it takes to do a good data project. They have bought into a vendor, or perhaps read something.
And that thing that they read said, it's easy. Data projects are easy. You just need to use my software, and off they go. But that isn't feasible. And then on the opposite end, we have this "when it's ready." The issue with "when it's ready" is that those teams often never get something out the door. It's always, well, there's another thing we need to do, and another thing, and another thing. So we have this real issue with this dichotomy, where neither kind of team generates value. The teams with unattainable deadlines are always behind; they're never going to live up to those expectations. For the when-it's-ready people, the biggest problem, and I would say especially now with the economy being what it is, is that projects that take too long to generate value are going to get canceled. This is really important because oftentimes boards and C-level people will say, oh, you have enough time, don't worry. Well, they also have short memories, and they also have pressure from various stakeholders, shareholders for example, or money, or something has changed in the market. And where they said you have all the time in the world, now you don't actually have that time. You actually need to get something done yesterday. So what I really strongly advise team leads and managers of data teams is to not take what they say at face value, but to actually have a more feasible timeline in mind. Don't take too long to generate value. Those projects do get canceled. Otherwise what will happen is that they'll promise too much and just never deliver on those expectations. So be really careful there. This is how you set your team up for failure.
Data teams need to have a very clear plan and architecture of where each piece is going to be done. Where are you going to put this piece? Where are you going to run things? Very, very important.
You need to have this plan, or create it if you don't know where things go. One of the most important parts that I would say here for companies is that the data teams are often novice. Frankly, they don't know what they're doing. They've never done this before, and it's difficult. So what they'll do is reach out to a vendor. Sometimes that's a cloud vendor, sometimes that's another vendor. They'll look at the vendor's website, and the website gives them a plan. Oh, wow, here's this plan. You just do this, this, and this. That sounds good. I can do that. And so they'll go follow that vendor plan. And guess what? That vendor plan does not have any nuance to it. It says, use our technology for every single thing. It's that way because it's written by a marketing person who's paid to say, use our technology for everything. That technology may not be the right tool, and probably isn't the right tool. So as you look at those vendor diagrams, those vendor white papers, those sorts of vendor things, yes, they're going to be wrong, because they are not going to recommend that you use the right tool. They recommend that you use their tool. And new data teams especially don't understand that nuance. A team that is new will make these sorts of mistakes. So it's really important that you get somebody who knows what they're doing, an architecture review, something along those lines, to make sure that you are doing the right things and using the right things.
Next is how. How will the plan be executed? Data teams need a clear plan that they're executing. Oftentimes when a team is tasked, or perhaps it's a brand new team, they're going to say, here's our plan. Except that plan will often change. Or they'll go and say, we want to do ten things all at once.
That's not going to work. I've seen it way too many times. Those sorts of teams that are brand new, or very excited about this, try to go do ten things, and guess what happens? They don't do any of those ten things. So what I always talk about when I mentor a team is this: get a singular focus, one to three things, ideally one or two things, not three. But sometimes executives get mad when there are only three things or, in their perception, not enough. What we need is that focus on one to three things. By doing those one to three things, you actually get them done instead of trying to do ten things at once. What happens with ten things at once is that you make very little progress. There's a lot of switching in between them. As a direct result, instead of it taking maybe one x the amount of time for each, so one times ten, it actually takes two or three times that ten x, where if we had just focused on them, we would have gotten them done at a one x amount of time. So by focusing, we actually get more done faster, rather than trying to do ten things and never getting anything done. Very, very important to get that focus. If we don't do this, the teams will get bogged down in too many different directions.
Now, I have a bonus here for you, and that is why. Why is our data valuable? Eventually, you're going to be forced to come to an accountant or some accounting meeting or somebody who's looking at numbers, and they're going to say, you need to justify why you're spending this amount on people, on cloud costs, on licenses, whatever. You need to have very clear justification for what you're doing. Otherwise, that bean counter, that accountant, is going to say, oh, well, it doesn't matter what you're doing. We should just cancel that, or we should reduce your staff from 20 down to two people. You're just not generating enough value.
So it's always important to have clear information, clear numbers, on the value of what you're doing. Why is that data valuable? A very common question here is, where do you get that number from? The number comes from the business, from the customer. And how do you get that number? You ask them. You actually work with your customer? Oddly enough, yes. This may be surprising, but we actually work with our business customer not just to deliver on what they're asking, but to have good numbers about the value that we're delivering. Because eventually somebody is going to say, what does that mean? What does that do for us?
I look for a ten x ROI on this investment. What that means is, if we're spending €100,000 on a team or a technology or a project, we would look for a ten x ROI. So spending that €100,000 should net €1,000,000. I usually target that for a few reasons. One is that as we talk to the business people about the value that we're creating, they will actually perk up and listen. Oh, well, we'll make a million, for example. They'll listen to that. They'll actually put an investment into that. But it also gives you some cushion. Let's say you don't hit that ten x for whatever reason and you only hit five x. Well, you're now at €500,000. So it gives you some cushion, and it gives you the ability to talk to the business in the numbers, in the way that they want you to talk to them.
What we've done here is go through those questions. What you have to do, you as the audience, is start by answering those questions, then move on to the execution. If you're in the middle of a data project right now and you heard one of those questions and you don't have an answer for it, you need to answer that now, because you will eventually need to answer that question in some fashion.
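The ten x target and the five x cushion described above can be worked through as simple arithmetic. This is only a minimal sketch: the function name `projected_return` is made up for illustration, and the only figures used are the €100,000 investment and the ten x and five x multiples mentioned in the talk.

```python
def projected_return(investment_eur: int, roi_multiple: int) -> int:
    """Value a project is expected to return for a given ROI multiple."""
    return investment_eur * roi_multiple


investment_eur = 100_000  # example investment figure from the talk

target = projected_return(investment_eur, 10)   # the ten x target
fallback = projected_return(investment_eur, 5)  # falling short at five x

print(f"10x target: EUR {target:,}")
print(f" 5x result: EUR {fallback:,}")
```

Even in the five x case the project returns several times what was spent, which is the cushion the ten x target is meant to provide.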
Another thing that's important here is that not all gaps are technology. Sometimes gaps are people and technology. We need to check for those gaps in the people and the technology. AI projects are rarely just, hey, we need to put this new technology in place. Let's put some Spark in place. Now, with GenAI, it's let's put ChatGPT in place, and that will solve all the problems. It may solve a problem. It may add ten problems. It's just one piece of the puzzle. Oftentimes, technical teams especially think it's a Lego piece. We're missing that Lego piece. Stick it in there. All good. What it actually is, is often an organizational change, a people and a skill change. Even if we make that change in data and start putting this data in place, what will happen is that the organization isn't ready to use it, or the people aren't ready to use it, or the people in the data team don't have the skill. Programming especially. Sometimes they've taken their data warehouse team and said, oh, you're data engineers now, and they have no programming skills. These are big problems. So it is really important that we check for these gaps. These gaps are part of what is going to make you fail in your projects.
In my experience, and I have a pretty significant amount of experience, not just consulting in this but talking to others, interviewing others, and doing surveys, this comes down to when you get help. It's really important to do this because these problems don't fix themselves without specific and concerted effort. Put a different way, if you see a problem and you don't put the direct effort into fixing it, it will not change for you. There's the saying, time heals all wounds. That isn't going to happen. What is going to happen is that the same problems are going to keep on repeating.
So there are a lot of different types of help out there. Outsourcing, technical consulting, management consulting, lots of different possibilities. For example, let's say you're a company that has no software engineering prowess. Maybe outsourcing is the way to go rather than building a team. Or if you do have a team that's floundering, you probably need some technical consulting or management consulting. There are a lot of possibilities out there, but unless you avail yourselves of them, you won't get your help and your fix.
Most of you who are watching this probably have a project that's about to start. Sometimes what I like to do is actually invert it, turn a situation upside down. Rather than saying, where do you want to go and how do you get there, maybe we just say, what happens if all of our plans go wrong? What would have happened? Now think about that for a second. Sometimes you come into these projects with rose-colored glasses that say, oh, yes, everything's going to work out. But if we invert that and say, okay, well, what if it goes wrong? What would have gone wrong? Well, it would have gone wrong because the CEO will change his mind every millisecond, or the marketing people will do this, or we won't get the resources for that. By looking at the possible failures, you can start to change your plan and make sure that your plan is more resilient to change.
Another question I like to ask is, how good is a data engineer at a data science task? Think about that for a second. If you're a data scientist or a data science manager, would you say, oh, we need an advanced model, let's have a data engineer do that? Then ask, okay, why are we having our data scientists do data engineering tasks? Because they're really not good at it. Quantifiably not very good. And I do have a lot of data on this as well. So why are we doing this?
Another inverted question that we would ask is, what would the business say if our project was canceled, or the cluster was turned off, or whatever is providing that data just went away? What value? What cluster? I don't care. Those are really bad signs that the project is generating no or low value. That means that the business doesn't care about it. So when somebody's looking to cut costs, those are the projects that they're going to look at. But on the other side, if the business says, oh, you can't do that, I look at that report every morning, and I base all these decisions on that, those are two very different sorts of views of data and value creation. And if you aren't on that side where people are saying, I need that, I use that every day, that's a big problem.
So why shouldn't you copy someone else's architecture? That's a pretty big problem. You heard me talk about it a little bit. If you go to a conference, and maybe even at this conference, people will say, here's our architecture. More than likely, you shouldn't copy it, because they're doing something different. They have a different goal. In fact, I've worked and consulted at companies who are competitors to each other, so same exact space, same exact industry, and their architectures were different because they were trying to do two different things. What often happens in these architecture talks is that they don't talk about the nuances. They don't talk about the reason we chose this or we chose that. There are some problems there. Sometimes companies will just show you the diagram from all the way at the end, here's the most complicated thing that's possible, but what they don't show is how they got there. I think an interesting example of this was Uber.
You can see the link down there to their diagram. Their first generation started very simply. The second generation got a little bit more complicated. You see a few more things. Their third generation got even more complicated, with even more things. So there is a progression. What I really want you to think through is, don't overengineer and don't copy. Just because you saw that diagram from Uber doesn't mean that you should be copying Uber. Uber did something for Uber. They didn't do something for you. Just doing that could mean overengineering. It could mean underengineering. It could mean doing something that is completely unnecessary, complete overkill for you. Only move when it's necessary. As you saw, each one of those generations was in response to a limitation that they started hitting. Maybe you won't hit that limitation. Maybe you don't have that limitation.
Another thing to think about: I ran a survey for data teams in 2023 and received 81 responses from companies of all different sizes and maturity levels. You can see the URL there if you want to go read it. What I did is I tried to look at the worst and the best. I tried to look at the value generation that they were doing and say, who is the worst, and then looked at what their responses were. For example, here we see the lowest and highest value creation, red being the high value creation and blue being the low value creation. And these are their challenges. What's interesting here is that they both faced basically the same challenges. You can see them pretty even across. However, the low value creation teams faced some early mistakes, while the high value teams faced more advanced things. But when you looked at the methodologies, the high value teams were using more methodologies. They had put effort into saying, this is how we're going to organize ourselves.
This is how we're going to structure ourselves. Not everything is about technology. What we see right here in this diagram is that methodology and organization actually mattered quite a bit. We also talk about the number of teams. Sometimes teams would say, oh, we can get by with just data engineering or just data science. We can see in this diagram that the highest value creation was with two to three teams. It's really, really important: if you are trying to get by with, let's say, just data science, you're going to hit that low value creation. The same goes for best practices. The high value creation teams did the most best practices, whereas the low value creation teams did the fewest. It's very important to consistently do those best practices.
And that brings us to data teams. My book is called Data Teams: A Unified Management Model for Successful Data-Focused Teams. What we need is all three teams working together to create value from data. We have our data science. These are the consumers. They consume those data products, they create that advanced analytics, and they work with data engineering, where data engineering is creating these data products. Data engineering makes sure that the software engineering is sound and that the distributed systems are sound. And then we have operations. Operations makes sure that everything is running smoothly. It allows the customers and the business to depend on a working system. It's only once we have all three of these things that we can truly be successful with our data teams.
I had to go pretty high level on some of this. I have a book, it's called Data Teams, if you'd like to read it. It goes much deeper into the teams and how the teams work together, as well as case studies. Now, those case studies are really important because I put a lot of effort into not just my research and my own experience, but I wanted to hear from others.
So there's full case studies of companies with ten plus years of experience doing this and what that looked like. So data teams is a great resource for understanding this. So with that, I'd like to thank you for attending this session and I wish you the best of luck in your data projects.

Jesse Anderson

Managing Director @ Big Data Institute



