Conf42 Python 2021 - Online

String Comparison In Real Life

Video size:

Abstract

Text analysis in real life can often yield unsatisfactory results due to typos, alternate phrasing, abbreviations, and more.

In this talk, we’ll cover practical and efficient string comparison methods using Python libraries & functions, as well as tackle some commonly faced issues.


A common problem faced by data analysts, data scientists, and many developers who need to analyze and compare data, is that texts are often similar, but not quite identical to one another. This can result from the existence of multiple ways to say the same thing, typos and abbreviations, common yet unindicative words (such as “the”) and punctuation, that can all skew the results.

During this talk, I will walk you through several methods to compare inexact texts, using a few different libraries, cover the usages as well as advantages & disadvantages of each method, and tackle some commonly faced issues.

By the end of the talk, you should have a good basis to start comparing texts efficiently and elegantly in your code.

Summary

  • Naomi is a software developer with previous experience in risk and data analysis. This lecture is inspired by various projects that I've conducted at my workplace. One of my main goals today is to provide you tools and ideas that you can apply at your day to day work.
  • There is no single definition for similarity or difference of strengths. A more flexible string comparison algorithm could be useful for fraud detection. By the end of this lecture, you will have a wide enough set of tools to make your next data oriented project efficient, elegant and useful.
  • Jellyfish is a python library that has two main sections of functions. First is of Eddie distance functions. Second section is of phonetic encoding. Under editistance, we will learn about Levinchen distance and the Marvel Levin distance.
  • String metric receives two strings and produces a distance score by applying some sort of magic. A step is considered as addition of a character or a replacement of one character in another. The higher the score is, the bigger the difference. Both Levenstein distance and Marl Levente distance are different types of string metrics.
  • The moral Levin distance counts the swap of two adjacent characters as a single step instead of two steps. When is it useful? When can we make use of that? Today we will learn more about the Faziwazi Python library.
  • Fuzzywazi is a library that has a few different functions. Each such function receives two strings and produces a similarity score. On a scale of zero to 1000, no match indicates no match, and 100 indicates either an exact match or something very close to that.
  • Now let's move on to our next function. Fuzz token sort ratio. Words the tokens practically substrings before making the comparison. It is not case sensitive and it sorts the tokens before making a comparison. Also has the added value of a set, which means it does not count duplicates.
  • Fuzzywazi produces similarity scores on scale of zero to 100 for pairs of strings. It also has tolerance for what we would call as typos, which is minor changes in the strings. Another thing about fuzzywazi is that it simplifies the data preprocessing step for us.
  • Today we learn about two python libraries. The first is called jellyfish and the second is called fuzzywazi. If you have any questions, comments, or simply want to keep in touch, I invite you to reach out to me using my LinkedIn profile. That's all for today.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I'm really happy to be here today, and thank you all for joining. I'll introduce myself. My name is Naomi. I'm a software developer with previous experience in risk and data analysis. And this lecture today, strain comparison in real life, is inspired by various projects that I've conducted at my workplace. One of my main goals today is to provide you tools and ideas that you can apply at your day to day work. Now, before we dive in and start talking about all the different methods that we will discuss today, when they're useful and why, I want to start by sharing with you a story. This is a story about me in the world of string comparison. But most of all, this is a story about why what we will discuss today is really relevant for you. So my story starts about a year and a half ago. I was at work, and I noticed that one of my colleagues was quite struggling with comparing two large sets of data, or of strings, to be precise. So this colleague of mine, he knew that the two sets of data he was trying to compare were very similar to one another, but not quite identical. And he wanted, for each pair of strings that he was comparing to be able to determine the similarity level, or difference level, if you'd like, by producing some sort of a similarity score. So he decided to come up with his own set of logics, aiming to tackle different scenarios that we face when we compare strings. Now, of course, trying to come up with your own set of logics, it's difficult, it's complicated, it is not very intuitive. But looking back at what he did, I have to say that he had a point. Because when we talk about strain comparison, actually many things would go wrong. I mean, we have typos and abbreviations. Maybe the words have been reordered in the sentence or repeated with no added value to the meaning of the sentence. Or even if we talk about punctuation, we have dot and comma and a dash, and many things that could affect the result of the comparison. But most of all, bring us to one important conclusion. There is no single definition for similarity or difference of strengths. So my goal is that by the end of this lecture, you will not only be familiar with the different obstacles that we face when you go to work, but you will also have a wide enough set of tools to make your next data oriented project efficient, elegant and useful. Because at the end of the day, when we talk about string comparison, we talk about something that has a lot of real world applications if we only know how to use that correctly. For example, let's talk about fraud detection. Fraudsters many times steal other people's identities in order to obtain credit line and make financial purchases. Now, if we are a company working in financial services, we would probably like to know if the person on the other side of the screen is a fraudster or suspected to be one. This way we would be able to prevent a purchase from being made or take additional security measures in order to verify the identity of the person on the other side of the screen. So let's take this a bit more to practice. Let's say that we are indeed a company working in financial services, and we have access to a list of identities that we know are commonly used by fraudsters, either stolen identities or made up identities. And in this list of identities we have access to the people's full names, addresses and other relevant such identifiers. So let's look at George Forge. George is a legitimate person whose identity has been stolen and is now being used by a fraudster. The fraudster wants to open an account in our service by using the info of George. Now as you can see, the fraudster decided to tweak a few minor details when opening the account. If we were trying to compare the info that was provided to us with the info that we already have by using only exact match, the comparison would fail. But if we would be able to use a more flexible string comparison algorithm, we would also be able to say this looks at usage in the info of George. Let's dig in a bit deeper. Another usage that I want to discuss is flexibility for typos. So let's say that you are developing a guessing game, and in our game the user needs to guess a word or a sentence, which is practically a string, by typing in free text. So for long up strings we would like to have some flexibility because let's say that the user missed out a character, maybe added one, removed one, or replaced one character with another one. So in case a human being would look at the pair of strings and would say, this practically looks like the same thing for me. We, as the developers of the game would like to say, let's not reject the input. Another usage that I want to discuss is in the Meditech industry. Did you know comparing sequences of DNA is done by comparing sets of strings? Now we will look more in depth to this DNA sequencing example later on in this lecture. And of course the list could go on and on. There are many different usages for string comparison. We just need to be familiar with the tools, and Python gives us some built in operations for string comparison. But I want to briefly discuss them and see why they're not exactly what we're looking for today. First, but in operation is exact match. When using exact match, even a single change of a character, replacing it with another one, or changing it to a capital letter is already enough to have the exact match returning the value false. Now, this is great for many applications, but if we're talking about edge basis typos, a boolean answer is not good enough for us. Another built in operation is the ability to check if one string is contained in the other. So if one string is fully contained in the other one, this is great. But if we're talking about partial containment typos, such different edge cases. Again, this is not flexible enough for our needs. And that is when the comparison methods that we will discuss today will get into action. So today we will learn about two python libraries. The first is called jellyfish and the second is called fuzzywazi. Yeah, don't worry about the bear, we will get to it later. So, jellyfish, let's begin. Jellyfish is a python library that has two main sections of functions. The first is of Eddie distance functions. Here, each such function receives two strings and produces a distance score by counting the amount of steps required in order to convert string a to string b. The second section is of phonetic encoding. This is not about comparison, and we will not discuss this today. But I will briefly mention that here each such function receives a string and produces a code which indicates about the phonetic pronunciation of the string. So under editistance, we will learn about Levinchen distance and the Marvel Levin distance. But before doing so, I want to introduce a term, a definition of string metric. String metric receives two strings and produces a distance score by applying some sort of magic. As you can see now, both Levenstein distance and the Marl Levente distance are different types of string metrics, each of them with a slightly different logic to calculate the distance score. Levente distance is a very fundamental and important string metric in the string comparison world. Many other string metrics and string comparison algorithms are based on that one, including all the ones that we will discuss today. So leverage and distance counts the minimal amount of steps required in order to convert string a to string b. A step is considered as addition of a character, a deletion of a character, or a replacement of one character in another one. The higher the score is, the bigger the difference. So let's look at a few examples to make this much more clear. In our first example, we can see a comparison between exit and exist here in order to convert string a to string b, all that we would have to do would be adding the letter s. Now going the other way around. If you would like to convert string b to string a, all that we would have to do would be removing the letter s. So in this case, no matter which direction we choose to take, the minimal amount of steps required in order to make the conversion would be one. Now in our second example we can see a comparison between great and great. In this case, in order to convert string a to string b, all that we would have to do would be removing the letter e from the first string and then adding the letter e again at the end of that string. In this case, the minimal amount of steps required in order to convert string a to string b would be two. And on our third example we can see a comparison between look and lock. In this case, technically, in order to convert string a to string b, we could have removed the second o from the word look and then added the letter c instead. And this would result in two steps. But we also have a step of replacement. So we can replace the second o with c and then the minimal amount of steps required in order to make the conversion would be one. We will note that replacement is when we are looking at a single index and replacing for that same index one character with another one. That is why replacement only takes place on the third example and not on the second example. Because in the second example e, the letter e was relocated from one index to another one. Now we are moving on to the moral Levinchen distance. The moral lensian distance is very similar to levente and distance, but it also has an added value. The moral Levin distance counts the swap of two adjacent characters as a single step instead of two steps. So let's look at our next example to make this much more clear. Here we can see a comparison between swap and sub. Now the Marl Levin distance recognizes that two adjacent characters a and w have been swapped and therefore it counts this as a single step. But for levention distance we can see that this would be two steps because for levention distance, in order to make the conversion we would have to remove the letter w and then add it again, or remove the letter a and then add it again. So this would be two steps. Now we are getting to our most important question. When is it useful? When can we make use of that? So let's look at our next example. Here we can see dna sequencing. Dna sequencing is the act of taking two sequences of dna and comparing them in order to determine the distance level between them. DNA sequencing is commonly done by using Levinstein distance, which is something that I, by the way, find as pretty cool. And now, moving on to our next example, we can see two comparisons where the first is Mr. Bean versus Mr. Bean, and the second is Johnny Depp vs. Johnny Depp. In the first example, we can see that the only difference is a dot, which we, by the way, know indicates about an abbreviations. And on the second example, the only difference is a swap of two adjacent characters. Now, we could say that if we know our expected input is a person's full name, it is very probable that for small enough changes, maybe one step, two steps of difference would be small enough for us to say this is practically the same thing for me. Let's not reject the input. Now we're getting to our next question, when is it not useful? So I want to show you the next example. Here we can see I love comparing strings versus comparing strings. I love. Now, if we check out the distance score, we can see that it is 14, which is quite high. Without any additional information, we would not be able to determine if the distance score is so high because the strings are indeed unrelated from one another, or if this is the case, that a person, a human being, will look at a pair of strings and would say, this is practically the same thing. For me, let's not reject the input. And that is when fuzzy wazi is getting into action. Now, before we talk about fuzzy wazi, there's one thing that we have to talk about. I mean, did you know that fuzzywazi was the bear that had no hair? So fuzzywazi wasn't very fuzzy, was he? Now apparently fuzzywazi is a rhyme or a tongue twister, and I was not familiar to that at all until I started working with Faziwazi Python library. So I guess there's always something new to learn, and today we will learn more about this library. So let's begin. Fuzzywazi is a library that has a few different functions. Each such function receives two strings and produces a similarity score. On a scale of zero to 1000 indicates no match, the strings are completely unrelated from one another, and 100 indicates either an exact match or something very close to that. Now, today, under Fuzzy wazzi, we will learn about three functions, fuzz ratio, fuzz token sort ratio, and fuzz token set ratio. So let's start with our first function, fuzz ratio. Fuzz ratio is a function that receives two strings and produces a similarity score based on the following equation. Given two strings where t is the total number of elements, which is practically characters, and m is the total number of matches where matches are defined as characters that appear in both sequences and in the same order. So the similarity score is given by the equation twice m divided by t all multiplied by 100. Now to give you a bit more sense, a bit more intuition about this equation, I want to show you what happens when we are comparing two identical strings. So here when we are comparing same and same, we can see that m number of matches is four, t total length is eight. And then using our equation, we would see that the similarity score is 100. And if we would be comparing two strings that are completely unrelated to one another, we would be able to see that m number of matches is zero, which is also our numerator. So in this case the similarity score would also be zero. Now moving on to a more realistic example, something in between. We can see that the comparison between great and green is a comparison where m number of matches is three because gr and e appear in both sequences and in the same order. T total length is ten. Therefore, the equation gives us the twice three divided by ten, all multiplied by 100 is 60. Now I want to discuss this example because I think it's a pretty interesting one. Here we can see two comparisons where the first is for two other cells that are very similar to one another, let alone a few minor changes. And the second comparison is a versus ABCDE. Now if we'd be checking the distance score, we would see that the distance score for both comparisons is the same, it is four. But if we would check the similarity code, we would be able to see that using the funds ratio function, the similarity score of the first comparison is 96, which is on a scale of zero to 100, relatively high. And the similarity code for the second comparison is 33. And here on a scale of zero to 100, it is relatively low. So this is interesting because now we can see the added value that fuzz ratio gives us where the jellyfish functions that we saw before could not give us. Another thing that we want to take into account when we're using fuzz ratio function is that it is case sensitive. As you can see here, capital a and non capital a are considered as two different strings. So if we do not care about capitalization, we would want to apply lower on our strings before making the comparison. The case sensitivity is true for fuzz ratio function, but it is not true for all fuzzywazi functions. Now when we are talking about fuzzywazi that we know, all of the fuzzywazi functions produce similarity scores on scale of zero to 100. We also want to be familiar with an important concept which is called a threshold score. So, let's say that we have a project, and in our project we made three comparisons. The first comparison resulted a score of 90. The second comparison resulted a score of 95. And the third comparison resulted a score of 96 or 97 or 98. So we would have to ask ourselves for which score are two given strings considered as identical or similar enough to one another? And for which score are two given strings considered as different. Now the answer for that is a threshold score. So let's say that in our project we decided that all comparison with scores of 80 and above would be considered as similar, and all comparisons with scores of 30 and below would be considered as different. And all comparisons and resulted scores in the range in between would be considered as undecisive, inconclusive. So our next question now would be how do we choose the threshold code? Now the full answer for that is a bit more lengthy than the scope of this lecture, but I will briefly mention that it depends on the requirements and the needs of our project. For example, it is possible that in different projects we would have a higher or a lower tolerance level for what we would call as errors, maybe false positives, true negatives. So we would want to adjust of torture scores accordingly. Also, another thing to take into account is the characteristics of the data that we are working with. For example, let's say that we have two different projects. In both projects we would compare people's full names, but in one project these would be people coming from the US and in the second project these would be people coming from Europe. So it is possible that in different continents, people's names would be longer, shorter or more similar to one another. So it is also possible that same scores in different projects could indicate completely different things. Therefore, we will need to take these considerations into account when we are choosing our first scores. Now let's move on to our next function. Fuzz token sort ratio. Fuzz token sort ratio words the tokens practically substrings before making the comparison. So here we can see sort and compare versus compare and sort would produce a similarity code of 100. Now moving on to a more realistic example, we can see Emma walked home with Lisa versus Lisa walked with Emma home. In this case, that similarity score would be again 100. It also teaches us that fuzz token sort ratio is not case sensitive. Now moving on to our last function for today, fuzz token set ratio. Fuzz token set ratio is very similar to the previous function that we saw. Now, Fuzz token sort ratio in the meaning that it is not case sensitive and it sorts the tokens before making a comparison. But it also has the added value of a set, which means it does not count duplicates. So as you can see in our next example, sort lower and nor Pete versus lower nor peaks. Sort and sort produces a similarity score of 100. Now if we would take the same comparison but use the previous function, we would see that the similarity code is lower 91 in this case because the previous function, as we know, does not remove duplicates before making a comparison. Now moving on to a more realistic example, we can see I love chocolate and ice cream compared to I love ice cream and I love chocolate. True story by the way. So in this case we can see that the similarity score is 100. Now I want to discuss the usages and advantages of fuzzywazi. Fuzzywazi produces similarity scores on scale of zero to 100 for pairs of strings. It also has tolerance for what we would call as typos, which is minor changes in the strings. So it is possible for two given strings that are very similar to one another and not quite identical to produce similarity score of 100. Because eventually fuzzy wazi is here to answer the question, would a human being who looks at a pair of strings say, this practically looks like the same thing for me? And if not so, to which extent? Fuzzywazi also has tolerance for changes in order of substrings depending on the function that we choose to use, and it also has tolerance for repetitions, again, depending on the functions that we choose to use. Another thing that we want to know about fuzzywazi is that it simplifies the data preprocessing step for us, because when we approach a data oriented project, we would want to go through the data preprocessing step, which means cleaning the data before making a comparison, and therefore we would be able to increase the accuracy of our results. So if our data preprocessing step goes through removing duplicates, sorting substrings and applying lower, we know that fuzzywazi can do all of these things for us. Another thing that we want to know about Fuzzywazi, which is also true to the jellyfish function that we saw before, is that it is really easy to use. So in case we have a feature or maybe a plant feature that is really nice to have, it is not top priority, not very urgent. We would want to be familiar with fuzzywazi and jellyfish because in this case it would not waste expensive time of our software developers, data scientists, data analysts, because it is really easy to learn, easy to use, easy to change whenever we need. Now, if you want to know more about Faziwazi, I invite you to check out the articles that I wrote about it and posted on medium. The first article covers the functions that we learned today and also covers another function that we did not discuss today. And the second article talks more about the process. It gets to the before and after of making a comparison, gets more in depth to the data preprocessing step, as well as explains what to do after we already made a comparison and have scores with us. And now let's see what we learned today. Today we learned about two python libraries. The first is called jellyfish and the second is called fuzzywazi. Under jellyfish we learn about Levinstein distance and the moral Levinson distance. And under Fuzzywazi we learn about fuzz ratio, fuzz token sort ratio, and fuzz token set ratio. One last thing before we finish. If you have any questions, comments, or simply want to keep in touch, I invite you to reach out to me using my LinkedIn profile. Also, if you want to get more of the contents that I write, I invite you to follow me on medium. That's it. That's all for today. Thank you all for listening and I hope you enjoyed the lecture. I was Naomi Kriger explain about strain comparison in real life.
...

Naomi Kriger

Software Developer @ Behalf

Naomi Kriger's LinkedIn account



Awesome tech events for

Priority access to all content

Video hallway track

Community chat

Exclusive promotions and giveaways