Conf42 JavaScript 2022 - Online

Lyra: Disrupting the full-text search industry with JavaScript

Abstract

How can a JavaScript-based search engine retrieve millions of records in a matter of microseconds? Why is JavaScript the right language to implement a true isomorphic application to be deployed everywhere, from mobile applications to edge networks? In this talk, we will see how Lyra, a full-text search engine written in JavaScript, is challenging the search industry with an incredible combination of performance and developer experience.

Summary

  • Michele Riva presents Lyra: disrupting the full-text search industry with JavaScript. He built it because he wanted to learn more about how full-text search engines work. In his opinion, Elasticsearch has no competition right now, but it does have some problems.
  • Lyra is a full-text search engine written in TypeScript. It targets every single JavaScript runtime: you can run Lyra on Cloudflare Workers or Netlify Functions, and it is also capable of running on edge networks.
  • Lyra also supports multiple languages, using stemming algorithms that currently cover 23 different languages. If you want to have fun, you can create your own stemming algorithm.
  • The plugin-data-persistence plugin allows you to export the index from one Lyra instance and re-import it somewhere else. This only works on Bun and Node.js for now; anyone wanting to implement it on Deno is welcome to help.
  • Before ending, Michele thanks NearForm, a professional services company specializing in Node.js, DevOps, and React Native. They maintain a lot of open source software and are hiring worldwide for remote positions. If you are interested, please feel free to reach out.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to my talk: Lyra, disrupting the full-text search industry with JavaScript. Before we start, I always like to introduce myself very briefly. I am Michele, I work as a staff engineer at NearForm, and I'm a Google Developer Expert and a Microsoft MVP. I want to start by saying that I love Elasticsearch. When I think of full-text search engines, I always think of my favorite open source project ever, which is Elasticsearch. In the past I had the opportunity to work a lot on Elasticsearch, or with Elasticsearch, on projects and products. And I've got to say, I also tried working with Solr, and I tried working directly with Lucene, the library underneath Elasticsearch itself, but also underneath Solr. But every single time I come back to Elasticsearch, because it's just an amazing open source project and I truly, truly love it.

In the past I've been contributing to open source a lot, and I've been working a bit on Unomi, which is a CDP, a customer data platform, from Apache that uses Elasticsearch as the leader database in the overall architecture. And I was amazed by the fact that every day we could throw millions of records at Unomi and Elasticsearch would work just fine; search performance was barely affected by the amount of data we kept inserting into the database. And that is truly amazing. One question I always wondered about was: how is it possible that Lucene, and then of course Solr or Elasticsearch, can be that fast? When we think about performance, when we think about how Elasticsearch and Lucene work, we have to make a distinction. We already anticipated that a couple of seconds ago, but Lucene is the actual full-text search library, which is written in Java, while Elasticsearch is not just a search engine (it uses Lucene for that); it's also a RESTful server. It is a distributed system. It adds sharding on top of the overall distributed system, so that if you have a lot of data, you can shard it among multiple nodes. It takes care of data consistency, monitoring, cluster management, a lot of stuff.

As I already said, I love Elasticsearch, and I wanted to start recreating it from scratch. Not because I didn't like it, but because I wanted to learn more. Quoting one of the greatest of all time, Richard Feynman: "What I cannot create, I do not understand." That's my life motto; it truly explains the way I learn stuff. So yes, I wanted to create something in order to understand how it works. Again, I love Elasticsearch, but I also had some problems with it. It's not the easiest software to deploy and set up, let's be honest. It's quite hard to upgrade, it has a big memory footprint, CPU consumption is not great, and it's really, really costly. If you go with the cloud version, it has a cost; if you want to maintain it on your own, on your cloud provider of choice, it also has a cost, of course. So at the end of the day, I find it to be a very good product that has no competition, in my opinion, right now, but it has some problems. Let's be honest about that.
Before I continue, I want to say that maybe all the problems I found were my fault; maybe I was too inexperienced and Elasticsearch was a bit too much for me. We all know that making simple software is hard. But I want this to be clear from the start: I love Elasticsearch, and I started recreating something similar because I wanted to learn more, not because I wanted to replace the whole system in the first place.

So I set myself a goal: I wanted to give a talk on how full-text search engines work. That's for another nice reason, too: if I don't have a goal, I'm not able to understand how stuff works and actually study. I need a motivation, so that was my motivation. And I started learning more about full-text search algorithms and data structures, going down the rabbit hole. That was me in the first few hours, reading the theory behind full-text search. It's not easy, let's be honest about that. The hard truth is that I needed to study a lot of algorithms and data structures. I have no degree, so there was no place in my mind where I could say: okay, I remember discussing this during a university lecture, I can reach out to that professor, for example, to learn something and ask questions. I started literally from scratch, and that can be kind of a problem for self-taught developers, as I am, but it was very interesting anyway.

Of course, once you've learned how something works, you need to implement it, and you have to choose a programming language for your algorithms and data structures. And of course I wanted to be a cool guy, and cool guys use cool programming languages. So I started working with Rust. I've been working with Haskell for quite a long time, so I thought that Rust could be a nice, possibly easy option. I was terribly wrong. It's not an easy programming language. It's super cool, and cool guys use Rust all the time; it just wasn't for me. So I decided to implement everything from scratch in Golang. And Golang is not super easy either, in my opinion, when compared to other programming languages such as TypeScript or Ruby; it's another kind of programming language. So I started feeling a bit frustrated, because I wanted to get stuff done, but I didn't have enough knowledge of those languages to do it.

Then I remembered a quote from Jeff Atwood, the co-founder of Stack Overflow, also known as Atwood's Law: "Any application that can be written in JavaScript will eventually be written in JavaScript." And yeah, why not? JavaScript is the king of programming languages, right? So I decided to start implementing everything in JavaScript, and the result was kind of surprising to me. I had started implementing all the data structures required for a search engine in Rust, then reimplemented everything in Golang, and that code was actually quite optimized, because I had spent a lot of time on Stack Overflow asking for code reviews from more expert people. So I was pretty confident in its performance. But I've got to say: even though JavaScript couldn't outperform those languages, it was very close. And we will see how close in the next slides. That brings me to the next question.
There are no slow programming languages, just bad algorithm and data structure design. Which basically means that maybe my algorithms, even though I asked for code reviews and a lot of help, weren't that well optimized. Rust cannot make your shitty code better. But I have more experience in JavaScript, so my JavaScript code is written better than my Rust code, and it performs almost as well. We are pretty close, and that's a good point to understand, in my opinion; that's the lesson I had to bring home.

So, after spending a couple of months working on that search engine, I gave my talk at WeAreDevelopers in Berlin in August, I guess. And this is how Lyra was born. Lyra nowadays is an open source project: a full-text search engine written in TypeScript. One nice thing about Lyra that I'd like to highlight is that it targets every single JavaScript runtime. It's not a problem if you want to run it on, let's say, Node.js, Deno, Bun, Cloudflare Workers, or React Native: we don't have any kind of dependency. We test everything on every single runtime, and we implemented everything from scratch with backward compatibility on every runtime in mind, as well as performance. We implemented prefix tries, inverted indexes, B-trees, and tree maps; we implemented stemming algorithms, stop-words with support for custom stop-words, and support for multiple languages, all from scratch, so that you can use Lyra wherever you want, on your favorite runtime. And when I say runtimes, I mean that you can run Lyra on Cloudflare Workers or Netlify Functions, so you can target edge computing; you can run it in browsers, on AWS Lambda functions, on Lambda@Edge, on Deno, React Native, Node.js. You can literally run it wherever you want.

Talking specifically about edge computing, it wasn't super easy for us to get there. I remember I was at a conference in Berlin a couple of months ago with a colleague of mine, and I said: you know what, it would be cool to run Lyra on the edge, right? He told me: okay, yeah, hold my beer. That's all he said. And the day after, we were able to ship a very basic version of the very first full-text search engine capable of running on edge networks. Let me show you how we did it.

So, talking about how Lyra works: we basically have a collection of functions, for example create, insert, remove, and search. create creates a new Lyra instance, insert inserts new data into an existing instance, remove removes data, and search, of course, performs a search operation. But let's start from the beginning. Lyra is not schemaless, so it's not really like Elasticsearch in that sense, which is totally schemaless: you have to define the types for the data you are going to put into the database itself. In this example, we create a schema containing author, which is a string, and quote, which is another string. Then we want to insert data: as you can see, we pass the db, so we mutate the original db instance by inserting new documents. This is a synchronous operation, so we also provide an insertBatch method, which is asynchronous and prevents the event loop from blocking. That's pretty important to know. And once we insert data, we can start searching for it. In this example we pass db, so inside that database (because we might have multiple databases, why not), we search for the term "if" on all the properties, so on both quote and author. But you can also choose to search on quote only, or on author only. And you will see that elapsed is 99 (we'll get back to that later), count is two, and there are two results. This is the API for searching data.
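Putting the pieces together, here is a minimal sketch of the flow just described, assuming the @lyrasearch/lyra package of the era; exact option names may differ slightly between versions.

```ts
// A rough sketch of the create/insert/search flow described in the talk.
import { create, insert, insertBatch, search } from "@lyrasearch/lyra";

// Lyra is not schemaless: every instance is created with a typed schema.
const db = create({
  schema: {
    author: "string",
    quote: "string",
  },
});

// insert() is synchronous and mutates the db instance in place.
insert(db, {
  author: "Mother Teresa",
  quote: "If you judge people, you have no time to love them",
});

// insertBatch() is the asynchronous variant: it yields back to the
// event loop between chunks, so large imports don't block it.
await insertBatch(db, [
  { author: "Rumi", quote: "Patience is the key to joy" },
]);

// search() looks for a term in one, several, or all indexed properties.
const results = search(db, {
  term: "if",
  properties: "*", // or e.g. ["quote"] to search a single field
});

console.log(results.elapsed); // elapsed time, in microseconds
console.log(results.count);   // number of matching documents
console.log(results.hits);    // the matching documents themselves
```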
But now you may be wondering: what is that 99? Is it seconds? Milliseconds? No, it's actually microseconds. And you might be thinking: wow, okay, you only inserted four documents, of course it's fast. Yes, that's true. But we also ran some benchmarks. We took 1 million entries, actually 1 million movie titles from the Internet Movie Database, inserted everything into Lyra, and performed different searches. For example, we searched for "believe" across all the indexes, and on average it took 41 microseconds, so millionths of a second, to get all the results. If you search for "criminal minds", for example, it takes 187 microseconds. As you can see, "criminal minds" contains two different terms, so Lyra performs two different searches, then intersects all the documents containing both terms. There is a lot more to do in that case, but it's still so damn fast. And again: there are no slow programming languages out there, just bad algorithm and data structure design. That's something to keep in mind.

After we got there, we also wanted to add support for multiple languages, because English is the default, but I'm Italian, so I might want to index Italian data. And when indexing data, we also want to make our search engine smart. Let me give you an example: we perform stemming operations and stop-word removal. If we have sentences containing commonly used words that carry little meaning, such as articles like "a", "the", et cetera, we just remove them. And if we have a word like "lucky", we stem it to "luck", so that if you search for "luck" you will find the documents containing exactly that word, but also the ones containing "lucky" or "luckiest". Of course, you can also ask for exact results only, so "luckiest" and not "luck" or "lucky": you can filter out results that you don't want. But we also give you an interface for smart searches, and we do that using stemming algorithms that right now support 23 different languages. Here's an example of how that works: we expose a tree-shakeable Italian stemmer. In the example you can see on the screen, you choose the language and the tokenizer; the stemmer is part of the tokenization process, and in this case it's the Italian stemmer. Of course, writing stemmers is not easy, because every single language has different rules: you can't stem Italian words like English words, for example. We will see more examples as we go. "Lucky" gets stemmed to "luck", but the same rules can't apply to Italian, or German, or Russian, or Swedish, Turkish, et cetera.
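As a rough idea of what selecting a non-default language looks like, here is a sketch assuming the defaultLanguage option; the tree-shakeable per-language stemmer import shown on the slide is omitted here, since its exact path varies by version.

```ts
// A sketch of indexing Italian data: defaultLanguage tells Lyra which
// stemmer and stop-word list to apply during tokenization.
import { create, insert, search } from "@lyrasearch/lyra";

const italianDb = create({
  schema: { titolo: "string" },
  defaultLanguage: "italian",
});

// The Italian stemmer normalizes inflected forms at index time,
// so related forms of a word can match each other at search time.
insert(italianDb, { titolo: "il gatto fortunato" });
const r = search(italianDb, { term: "fortuna" });
```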
So we relied on Snowball. Snowball is a project created by Martin Porter, the author of the Porter stemmer, which is possibly one of the most beautiful stemming algorithms out there. Super brilliant, and it's totally open source. And not only can the stemming algorithm itself, which is written in C, be compiled down to JavaScript or imported into Golang, Rust, wherever, but it also gives you an idea of how to create your own stemmer. In this example, as you can see, we have step zero: search for the longest among the suffixes; if we find that suffix, we just remove it. Then, in the next step, we search for the longest among the following suffixes, and whenever we find one of them, we perform a given operation, following the algorithm description. So it's really easy and convenient to follow these instructions to create more and more stemmers. They are battle-tested, accurate, and also widely used inside other projects. I'm not sure whether Elasticsearch, for example, uses this exact stemming algorithm, but I wouldn't be surprised: I know that a lot of search engines out there use the exact same algorithm, so the results are as accurate as in every other, let's say, competitor we can find for Lyra. Just to give an example, this is how the stemming algorithm works in English: we have "consign", which stays "consign", but then the past form "consigned" gets stemmed to "consign" again, and "consigning" gets stemmed to "consign" as well. This is how it works. And since I'm Italian, I could write an Italian stemmer, and these are the tests, for example, that I had to run. I know nothing about German (I'd love to, but I don't know how to speak or write it), yet following Porter's stemming algorithm descriptions, it was possible to write one anyway.

And of course, if you want to have fun, you can create your own stemming algorithm. A stemming function is nothing more than a function that receives a word and returns a new word. In this example we are just appending "ish" to the word. So if you really want to have fun, or you don't like the default stemming algorithm, you can bring your own, which is really convenient, because there are many out there and you can just import a library and use it as you prefer. We did the same for custom stop-words. Common stop-words are, I don't know, "a", "the", "me", "you": words that don't carry a lot of meaning for the search, if we think of the overall meaning of the search query and results. So, given the language that you use (in this case we don't specify one, so it's English by default), we give you the list of English stop-words, and you can filter them out, append new entries, or bring your own stop-words. It's all up to you: it's highly customizable.
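Here is a sketch of those two customization hooks together. The exact option names (components, tokenizer, stemmingFn, customStopWords) are my assumption from the talk's description and may differ in your Lyra version.

```ts
// A sketch of bringing your own stemmer and stop-words.
import { create } from "@lyrasearch/lyra";

const customDb = create({
  schema: { quote: "string" },
  components: {
    tokenizer: {
      // A stemming function is just (word) => stemmedWord.
      // Here we append "ish", as in the talk's toy example.
      stemmingFn: (word: string): string => `${word}ish`,
      // Filter out, extend, or replace the default English stop-words.
      customStopWords: (defaultStopWords: string[]): string[] => [
        ...defaultStopWords,
        "foo",
        "bar",
      ],
    },
  },
});
```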
Lyra had some project goals, of course: we wanted it to run on any JavaScript runtime, to be as small as possible, as fast as possible, and easy to maintain and deploy. And I've got to say, we've had some great achievements. It works on every single JavaScript runtime. It has a small, modular core, and you can always interact with that core: for example, we have a hook system, so you can hook into all the internal processes and customize the experience as you wish. It's pretty fast, it can be deployed literally everywhere, it serializes data in different formats (we will see what that means in just a second), and it has a powerful plugin system.

Speaking of which: at a certain point you have the data, and you may want to persist it somewhere, so that you don't have to index everything from scratch every single time. So we created a plugin called plugin-data-persistence. It's an official plugin, and it basically allows you to export the index from one Lyra instance and re-import it somewhere else. This is pretty important; let me show you why. Let's say we have this Lyra instance, we have a schema, and we insert data into the original instance. Then we import the persistToFile function, which is runtime-specific: as of now it only works on Bun and Node.js, and if anyone wants to implement it on Deno, I'd be super happy to help, of course. persistToFile will return the file path where you persisted the data, as an absolute path. You pass the original instance we just created (this one, for reference), you choose the format (binary by default, so MessagePack, but you can also choose DPack or JSON serialization), and an output file, in this case quotes.msp, which is MessagePack, of course. So we are basically taking the Lyra index, serializing it, and saving it to disk in a binary format. Then, from a totally different machine or service, we can use restoreFromFile and read it back into memory: the restored instance, in that case, will be an in-memory Lyra index. As you can see, we choose the file quotes.msp, the one we just created, and we can now immediately perform searches on the restored database.
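Here is a sketch of that persistence round trip, assuming the @lyrasearch/plugin-data-persistence package and reusing the db instance from the earlier sketch; exact signatures may differ between versions.

```ts
// A sketch of serializing a Lyra index to disk and restoring it
// elsewhere (Node.js/Bun only at the time of the talk).
import { search } from "@lyrasearch/lyra";
import {
  persistToFile,
  restoreFromFile,
} from "@lyrasearch/plugin-data-persistence";

// Serialize the index. "binary" means MessagePack here; "dpack" and
// "json" are the other supported formats.
const path = persistToFile(db, "binary", "quotes.msp");
console.log(path); // the absolute path of the serialized index

// On another machine or service: load the serialized index back into
// memory and search it immediately, with no re-indexing step.
const restored = restoreFromFile("binary", "quotes.msp");
const found = search(restored, { term: "joy" });
```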
That brings us to a lot of possible target architectures; let me give you a couple of examples. There is the offline search architecture, for example. Let's say we have a mobile app built with React Native, which is fully supported by Lyra. We normally perform searches against the server, but we also keep a local backup of the data: a local database that, let's say, asks the server for new data every five minutes. If there's new data, we serialize it and send it to the local database in our application. So whenever the Internet connection fails, we can always fall back to the in-memory database, backed by SQLite or whatever database you like to use in your mobile application.

But that's not all. You may want a kind of CI process for Lyra, where you build your database every five minutes, or every minute, or every three minutes (you choose), and deposit your serialized index on S3, for example. That triggers the Simple Notification Service (SNS), which redeploys Lambdas containing the in-memory index, so that you can query the data directly on AWS Lambda. Every time you put a new index into S3, the Lambdas get redeployed, and you can perform search operations on the new data. With that target architecture, you can forget about cluster management, deployments, and data consistency, because it's all managed by AWS in that case; you only have to take care of performing search operations.

Or, if you're lazy like me, you can deploy everywhere using Nebula. I couldn't prepare a real live demo, but let me show you what that means. In this example, we install Nebula, which is the official build system for Lyra. It's still in beta, but it's working pretty well; as you can see, we basically install it from npm. When we look inside our folder in this demo, we have two files: data.json and lira.yml. Let's see what's inside those files. If we cat lira.yml, we will see that we have a schema, which is a normal schema definition for Lyra; a sharding setting (automatic sharding, or no sharding at all, that's up to you); and an output file, in this case bundle.js, so Nebula will generate a Lyra application containing the data inside the bundle.js file. The data that we have in this case is of type JSON, and it comes from the data.json file. You could also use type JavaScript and, as a source, use, let's say, foo.js, which exports by default an asynchronous function, so that you can call a database and get the data from there; you can interact with a database, but that's up to you. Let's use JSON, which is easier. The target, in this case, is Cloudflare Workers, and we can configure, for example, the Cloudflare worker name, in this case pokedex, because we are going to deploy a Pokédex, and whether we want to run tests, true or false; in this case, we want to run them. If we look at the data inside data.json, as you can see, it follows the schema definition: we have a lot of Pokémon, and that's really it. We can now run nebula bundle, or nebula d, which stands for deploy, so it will bundle and deploy. And as you can see, in just five seconds we've been able to deploy everything to Cloudflare Workers. So if we make a search with cURL for "pika", for example, we will get a response, and we are running on an edge network, in this case Cloudflare Workers. Congratulations: in about five seconds, we just deployed the very first full-text search engine capable of running on an edge network.

Lyra is free and open source, and I will be there if any one of you needs help setting it up or designing a target architecture; this is something I can help you with, so if you need anything, I'd like to hear from you. I'd also like to hear your feedback on Lyra. If you have any questions, please feel free to reach out to me directly at @MicheleRivaCode on Twitter. We also have a Slack channel, where you can find help from the community, from me, and from my colleagues working on Lyra: please join lyrasearch.slack.com. This is where we make Lyra happen.

Before I end my talk, I'd like to thank NearForm. We are a professional services company specializing in Node.js, DevOps, and React Native. We maintain a lot of open source software: we are responsible for the maintenance of almost 8% of all the npm modules used globally, which get downloaded around 1 billion times per month, which is totally crazy. And we are hiring, worldwide, fully remote, so if you are interested, please feel free to reach out. I'd like to thank NearForm for letting me work on Lyra and present it to you today. Thank you so much, thank you all for following this talk. This is where you can find me, mainly on Twitter, because that's where I live most. Thanks again, it's been a pleasure, and I hope to see you all very, very soon.
...

Michele Riva

Staff Engineer @ NearForm



