Conf42 Machine Learning 2024 - Online

From Code to Insight: Using NLP and Pattern Analysis in Git History

Abstract

Discover how NLP and pattern analysis can transform raw Git history into valuable insights. Explore innovative ways to leverage Apache Tika, Git archives, and data aggregation to reveal trends, behaviors, and project dynamics.

Summary

  • Pavel will talk about NLP techniques for getting more insights from git commit messages. We will use an open source project for the analysis. The use cases described here are theoretical, but hopefully the examples are close enough to the real processes in software development companies.
  • Pavel Perfilov: In this video we'll be using NLP to analyze git commit messages. NLP techniques are used for sentiment analysis and categorization of text. The examples that I will be showing you are theoretical and the projects we will use are open source.
  • Let's look at the deep learning models and try to get some emotions out of our git commit messages. Let's take just 2000 and try to enrich the messages with emotions using a pre-trained model. And we can see the dynamics of these emotions.
  • The next thing that I wanted to show you is summarization. The idea here is to reduce the amount of text that we would need to read. Let's try to use the ChatGPT API. It's quite fun. NLP programming is very iterative, so be ready.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Pavel, and today we're going to talk about NLP techniques for getting more insights from git commit messages. I'll show you what we can do with git commit message history to learn more about our projects, team members, project maturity stage, or even a portfolio of projects. I hope this video will be interesting for team leads, managers, and HR people who are interested in getting more context about their projects and organization. The use cases I will be describing here are theoretical. We will use an open source project for the analysis, but I hope the examples are close enough to the real processes in software development companies. Before we start, let me introduce myself. My name is Pavel Perfilov. I have 15-plus years of experience in fintech, and during my career I have worked as a developer, engineer, project manager, and product manager. I have a master's degree in finance and a master's degree in computer science. I'm very enthusiastic about data engineering and the practical usage of ML. A small disclaimer here: I'm not representing any of my employers and I'm speaking for myself. Again, the examples that I will be showing you are theoretical, and the projects we will use for the analysis are open source. Feel free to reach out to me on LinkedIn and download the notebook from my GitHub. Let's begin with the theory. Here are the four building blocks of classical management: planning, organizing, leading, and controlling; and the four building blocks of people management: recruiting, training, evaluating, and motivating. Is it enough to start managing people and projects? It's just a theory, which is missing the information about the culture and environment, and missing the emotions and sentiments of the individuals. I'll give you a practical example of the problem. Imagine that a software development company is hiring a new project manager, and he gets five projects which have been running for quite some time already. 
He needs to read and process a huge amount of information to get up to speed. Most likely the main sources of information would be the requirements, the project plan, and the documentation, and he would need to talk to many people to get the overview. But it might not be colorful enough to get a sense of what is going on in reality. From a time perspective, it might take a few months or even a year to get some understanding of people's behavior, get their feelings, and get the knowledge about the individual profiles and communication styles needed to be efficient in the team. But how can we get these insights faster? I'll try to answer this question in this video, and we'll be using NLP. We'll be using one non-obvious data source, which is git commit messages. Let's look at git messages from the angle of different roles in the team. Most of the roles would not use them as a data source. It's too noisy, it's too low-level, it's a lot of text, and most people would not be able to extract meaningful information. But NLP can help with that. From my personal experience, I can tell you that commit messages might produce enough insights for all of the managerial roles in the company. I'll try to show you some examples to prove it. Okay, now we understand the problem, and there are a lot of questions and inspiration, but how can we turn data into insights? Let's talk about NLP. What is NLP? NLP stands for natural language processing, which helps to turn words, sentences, or any text into numbers. We'll skip the theory, as I want to focus on the practical usage. NLP techniques are used for sentiment analysis and categorization of text. They can tag the data, classify the data, and provide some emotional levels. Here are some Python libraries which I will be showing you. There are many more libraries which are not in the scope of this video. Let's begin with coding. Here are the libraries that you would need to install to run the notebook. 
Select Python, explore NLTK. I will download the repo from GitHub: pandas. This repo, the most popular data science library, takes some time to download. Okay, let's run the second cell. Let's grab the messages, the commit time, and the emails. All right, we have a result in the resulting dataframe. We have three columns, as I expect. The shape of the dataframe is 35 thousand commits. Let's preprocess the messages. Let's delete the git keywords, CI/CD keywords, some emails, some HTTP links, and some merge pull request messages. We need to make sure that the message text is looking good before we start doing the sentiment analysis. And also we see a lot of abbreviations here. Look: doc, es, zero, one, and something else. So it might make sense to clean this up as well. So here is the cleaned version of the message. We just apply the regex to delete the words that we don't want. We also extracted the abbreviations. Here are the longest abbreviations. It seems like the developer was a little bit annoyed here. Let's start with descriptive statistics. Here's a chart which is showing you the amount of contributions per year and the number of unique contributors, unique developers, per year. This reminds me very much of the classical product life cycle. So it does look like there was a peak and the project has reached maturity. Let's look at the seasonality, whether there are any patterns. Indeed there are. In the summertime there are fewer contributions. And let's look at the top contributors. It seems there are about seven main contributors who are contributing constantly. We could extract the word frequencies as well. But what do we see here? We see not a lot of meaningful words. There are some words like "to", "in", "for". There is a concept of stop words in NLP. Stop words are the words which have to be deleted because they don't add any additional information to the sentence. Let's check the stop words. 
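A minimal sketch of the extraction, cleaning, and stop-word steps described above, using only the standard library. The notebook's actual code is not shown in the talk, so the `git log` layout, the noise patterns, and the tiny stop-word set below are illustrative assumptions (NLTK ships a much fuller stop-word list).

```python
import re
from collections import Counter

# Assumed input layout: author email, ISO date and subject separated by tabs,
# as produced by: git log --pretty=format:%ae%x09%aI%x09%s
def parse_git_log(text):
    """Turn tab-separated `git log` output into a list of commit dicts."""
    commits = []
    for line in text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) == 3:
            email, when, message = parts
            commits.append({"email": email, "time": when, "message": message})
    return commits

# Hypothetical noise patterns, applied case-insensitively.
NOISE = [
    r"https?://\S+",                       # HTTP links
    r"\S+@\S+",                            # e-mail addresses
    r"merge pull request #\d+ from \S+",   # merge pull request boilerplate
    r"\[(?:ci skip|skip ci)\]",            # CI/CD keywords
]
ABBREVIATION = r"\b[A-Z]{2,}\b"            # e.g. BUG, DOC

# Tiny illustrative stop-word set (an assumption, not the NLTK list).
STOP_WORDS = {"to", "in", "for", "the", "a", "of", "and"}

def clean_message(message):
    """Return (meaningful lowercase words, extracted abbreviations)."""
    for pattern in NOISE:
        message = re.sub(pattern, " ", message, flags=re.IGNORECASE)
    abbreviations = re.findall(ABBREVIATION, message)
    message = re.sub(ABBREVIATION, " ", message)
    words = re.findall(r"[a-z]+", message.lower())
    return [w for w in words if w not in STOP_WORDS], abbreviations

def word_frequencies(commits):
    """Count cleaned words across all commit messages."""
    counts = Counter()
    for commit in commits:
        words, _ = clean_message(commit["message"])
        counts.update(words)
    return counts
```

Extracting the abbreviations before lowercasing keeps them available as a separate signal while leaving the remaining text free of them for the frequency count.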
Yes, indeed, quite a lot of the words are marked as true, as stop words. After we delete the stop words, the vocabulary looks as we would expect. Okay, let's start with tokenization and lemmatization. This concept basically standardizes the form of the words in the message. The NLTK library has a built-in WordNet lemmatizer. You can look at the lexical database from Princeton University, and you can search for some words, and that will give you the part of speech and basically the explanation of the word as it appears in the dictionary. So let's apply the tokenizing functions and tag the words by part of speech. And let's try to count the words again, because this count would be more appropriate and better filtered. Here's how the message looks after lemmatizing. So it's very standardized. There is pretty much no noise at all. And here are the most frequent words in our messages. It does look like a developer's vocabulary. Just to compare, here is the original message versus the lemmatized message. By using some lemmas, we can classify messages as features or as bugs. Okay. And we can build the vocabulary for the bugs and features. And let's see the statistics for the features and the bugs over time. Here we clearly see that the project kick-started in 2012. There was a stable period of development up to 2020, and in 2022 there was rapid growth of features. As we are trying to look at the sentiments, the best way of finding the negative sentiments is to search for the bad words. Let's try to find them. Oh yeah, indeed. There are quite a few commits with bad words, and there are a few developers who are using bad words more frequently than others. Yeah, we can analyze this. I hope in your organization you have a policy around that. But definitely the commit messages with bad words look negative, and they would give you a negative sentiment and negative emotions. 
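The bug/feature classification by lemmas mentioned above can be sketched as follows. This is a minimal illustration that assumes the messages have already been lemmatized (for example with NLTK's WordNet lemmatizer); the two lemma vocabularies are hypothetical, since the talk does not list the ones it uses.

```python
from collections import Counter, defaultdict

# Hypothetical lemma vocabularies (assumptions for illustration only).
BUG_LEMMAS = {"bug", "fix", "error", "crash", "regression"}
FEATURE_LEMMAS = {"add", "feature", "implement", "support", "enhancement"}

def classify(lemmas):
    """Label a lemmatized message as 'bug', 'feature', or 'other'.

    Ties between bug and feature hits are resolved in favor of 'bug'.
    """
    tokens = set(lemmas)
    bug_hits = len(tokens & BUG_LEMMAS)
    feature_hits = len(tokens & FEATURE_LEMMAS)
    if bug_hits == 0 and feature_hits == 0:
        return "other"
    return "bug" if bug_hits >= feature_hits else "feature"

def counts_per_year(commits):
    """commits: iterable of (year, lemmas) pairs -> {year: Counter of labels}."""
    result = defaultdict(Counter)
    for year, lemmas in commits:
        result[year][classify(lemmas)] += 1
    return dict(result)
```

Plotting the per-year counters for "bug" and "feature" side by side gives the kind of timeline described in the talk.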
After that, we run the sentiment analysis. For sentiments, we will use the same WordNet dictionary. It has some additional information on top of the words and parts of speech, so we can get scores, negative scores and positive scores, for every single word. As you can see, the negative words include "error", "still", "problem", "difference", and the positive words are "well", "improving", "refinement", and so on. We can calculate the total score and the average score per period. As we can see, in 2014 there was a noticeably positive sentiment. At this time AP running. And we can calculate these scores per person per period, to see if there is any dynamic. Let's plot the charts. Okay. We see that the green developer was improving his negative score. The orange developer was also improving his score. And we can get some context about what people were doing and, say, talk to them, maybe get some more feedback in the organization. Let's look at the sentiments. There is one nice library, which is called TextBlob, which provides quite nice features. You don't need to write a lot of code to extract polarity and subjectivity. Let's add the polarity and subjectivity fields to our dataset. Here's how it looks. There's a polarity column over here, and the polarity can be positive or negative. And here is the polarity over the years. It does look like a sinusoid, a very interesting pattern. After 2013, the negative polarity goes down and the positive polarity goes up; likely at this time developers were very satisfied with the project. And we can calculate the dynamics of the changes in polarity. It's red and green. When the features are delivered, the bugs are being fixed, and we can look at the polarity of all three individual contributors. We can calculate the ratio and the ratio of the subjectivity, so you can make a judgment. We can look at the polarity of the overall project per year. 
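The per-word scoring and the "total score and average score per period" calculation described above can be sketched like this. The lexicon values are hypothetical placeholders; in the talk the positive and negative scores come from a WordNet-based dictionary, and the exact lookup code is not shown.

```python
from collections import defaultdict

# Hypothetical (positive, negative) scores per word, for illustration only.
LEXICON = {
    "error": (0.0, 0.5),
    "problem": (0.0, 0.75),
    "improve": (0.5, 0.0),
    "refinement": (0.25, 0.0),
}

def message_score(words):
    """Total positive score minus total negative score; unknown words count as 0."""
    positive = sum(LEXICON.get(w, (0.0, 0.0))[0] for w in words)
    negative = sum(LEXICON.get(w, (0.0, 0.0))[1] for w in words)
    return positive - negative

def average_score(rows):
    """rows: iterable of (author, year, words) -> {(author, year): mean score}."""
    totals = defaultdict(lambda: [0.0, 0])
    for author, year, words in rows:
        entry = totals[(author, year)]
        entry[0] += message_score(words)
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Grouping by `(author, year)` is what allows the per-person, per-period comparison; dropping the author from the key gives the project-wide trend.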
And it's interesting to see that the polarity of the bugs and the polarity of the features are different. The features have a more positive polarity. It's biased towards the right-hand side. Let's look at the deep learning models and let's try to get some emotions out of our git commit messages. The easiest way is to run existing models with the transformers library. You can get the models from the Hugging Face website. There are a lot of models. They're available for everyone, and you can download any of them and run it. Let's search for the model. There is a description over here. There's a 1.5 billion downloads. And we can try the API as well. So we have this model; we just downloaded it. Sometimes the models are very big. This one might be like one gig or something like that. As you can see, it provides us with the attributes of the sentence; it provides emotions like love, annoyance, and anger. Let's make a sample of our dataframe because it's too big. It's 35K commits. Let's take just 2000 and try to enrich the messages with emotions using a pre-trained model. It might take some time to run. On my laptop, I usually get the results within five minutes. It has been running for about five minutes. Okay, we got the emotions. Here are the emotions we get from our 2000 messages, and we can do some further analysis, group the data, and look at the dynamics of these emotions. Let's look at particular examples. Here's the confusion. I think the confusion is caused by the word. Yeah, it looks so, at least in the second sentence. Okay, let's look at some others. Yeah, we can select any. Let's look at the anger. The anger is probably caused by this line and the capital letters. The model has to be fine-tuned, because git commit messages are very specific. Let's look at disgust. It's not very clear why this emotion popped up, but let's look at the dynamics of our emotions and let's see how they look per developer. 
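A minimal sketch of the sampling and emotion-enrichment step described above. The classifier is passed in as a plain callable so the flow can be tried without downloading a model; `make_hf_classifier` shows one assumed way to wrap a Hugging Face text-classification pipeline, and the model name is deliberately left as a parameter because the talk does not name the model it uses.

```python
import random
from collections import Counter, defaultdict

def sample_commits(commits, size, seed=0):
    """Random sample without replacement; returns everything if the input is small."""
    commits = list(commits)
    if len(commits) <= size:
        return commits
    return random.Random(seed).sample(commits, size)

def top_emotion(scores):
    """scores: list of {'label': str, 'score': float} -> label with the highest score."""
    return max(scores, key=lambda item: item["score"])["label"]

def enrich(commits, classify):
    """Return copies of the commit dicts with an added 'emotion' field."""
    return [dict(c, emotion=top_emotion(classify(c["message"]))) for c in commits]

def emotions_per_year(commits):
    """Count the dominant emotions per year to follow their dynamics."""
    result = defaultdict(Counter)
    for c in commits:
        result[c["year"]][c["emotion"]] += 1
    return dict(result)

def make_hf_classifier(model_name):
    """Assumption: wrap a Hugging Face text-classification pipeline.

    Needs the `transformers` package and downloads the model on first use.
    """
    from transformers import pipeline
    clf = pipeline("text-classification", model=model_name, top_k=None)
    # With top_k=None the pipeline returns every label with its score.
    return lambda text: clf([text])[0]
```

A typical run would be `enrich(sample_commits(commits, 2000), make_hf_classifier(name))` followed by `emotions_per_year(...)`, where `name` is whichever emotion model is chosen.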
The top emotions per developer, of course, are neutral and approval. Let's drop these first two columns and look at the remaining emotions. The remaining ones are annoyance and disapproval. Let's look at the dynamics of every single emotion over time. You can see that annoyance correlates a lot with the dynamics of the project, and disapproval as well. There are not a lot of positive emotions, by the way. Let's look at the last cycle again. In 2020, annoyance was at the top, and that was among the periods with the highest amount of contributions. So yeah, developers probably don't much like the periods when a lot of features and a lot of bugs are being submitted, which creates pressure on them. And let's look at the heat map graph. Oh yeah, annoyance is at the top. And disapproval as well. Disappointment, a little bit of surprise. In 2014, there was a lot of surprise, sadness, anger. The positive emotions are not very present, and we can look at the dynamics. That's just a different chart, just to see how the scores are growing or falling. Yeah. The next thing that I wanted to show you is summarization. Again, we will be using a Hugging Face model. There are a bunch of models, and we will use one of the most popular, the Facebook BART large model trained on CNN/Daily Mail news, and let's see what we get with this summarization. The idea here is to reduce the amount of text that we would need to read. So we run the summarization function over the text. If you need to summarize a huge text made of small pieces with very different contexts, I would recommend you run it two or three times. So basically, in the first layer you run it on the original messages, then you combine all the summaries that you got, and then you run the summarization again as a second layer. That would improve the quality of the output that you get. 
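The layered summarization described above (summarize each message, join the summaries, summarize again) can be sketched as follows. The summarizer is any `str -> str` callable; `make_hf_summarizer` is an assumed wrapper around a Hugging Face summarization pipeline, with the model name and the length limits left as parameters rather than taken from the notebook.

```python
def summarize_in_layers(messages, summarize, layers=2):
    """Summarize each message, then re-summarize the joined summaries.

    `layers` counts the passes: 2 means one per-message pass plus one pass
    over the combined text; 3 repeats the combined pass once more.
    Returns (individual summaries, final combined summary).
    """
    if layers < 1:
        raise ValueError("layers must be at least 1")
    individual = [summarize(message) for message in messages]
    combined = " ".join(individual)
    for _ in range(layers - 1):
        combined = summarize(combined)
    return individual, combined

def make_hf_summarizer(model_name, max_length=60, min_length=10):
    """Assumption: wrap a Hugging Face summarization pipeline.

    Needs the `transformers` package and downloads the model on first use.
    """
    from transformers import pipeline
    summarizer = pipeline("summarization", model=model_name)

    def summarize(text):
        result = summarizer(text, max_length=max_length, min_length=min_length)
        return result[0]["summary_text"]

    return summarize
```

Keeping the individual summaries alongside the combined one makes it possible to check which messages fed into a surprising final summary.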
Otherwise the output might be very messy. Let's run it. Let's take a sample: we'll take one top contributor and his last ten messages, and let's get the summary of every individual message and then a summary of the joint text. It might take some time to process. Okay, we got the results. These are the individual summaries for every single message. Look at them: the messages are a little bit cleaner and clearer. Here is the summary of the last ten messages combined, so we can see what the person was busy with, and we can specify the length of the output message. Here are the results. Yeah, the text looks better than it used to, and it's a nice summary. But again, the model that we were using is a model trained on news. Let's try to use the ChatGPT API. It's quite fun. It does provide nice-quality summaries. We can also specify the number of tokens we need in the output, and we specify the content. The content is basically the prompt, the request as we would write it to the chatbot, together with the text to summarize, which is the joint text of the past messages. Then I want to change the prompt a little, and we see the output. We can play with the prompt a little. If I want to have an emotional response, I can ask for it, and I can ask ChatGPT to make it shorter. And I can ask it to summarize in a binary way: what is bad and what is good. We get the result. The result is very structured. I highly recommend you try this out: copy my notebook and run it over your repository to get some insights that you have never seen before. Most of the words in dev slang have negative sentiment, so don't be surprised if you get horrible scores. Check the original message and check the data that you get. NLP programming is very iterative, so be ready. I hope my video was interesting. Thanks, Conf42, for hosting me.

Pavel Perfilov

Pavel Perfilov's LinkedIn account


