Conf42 Python 2021 - Online

How to be Pythonic? Design a Query Language in Python

Abstract

We created Python API calls that let you make queries and manipulate data in our graph database. We thought about what would be best for Pythonistas. What would be the most Pythonic way to do it? (Is that even a thing?) Here’s our journey in making WOQLpy, and we want to make it useful to you.


A query language is an important part of a database system: it is how people manage their data and make it useful to them. Since the 1970s, the world has been full of relational databases, and SQL has been the way to make queries. However, SQL is vulnerable to injection attacks. A lot of effort goes into stopping those attacks, and that makes workflows less efficient.

We don’t want to make the same mistake. That’s why using a Python query language is good. With the Python community in mind, we created WOQLpy, an open-source query language that lets users build queries in Python instead of JSON-LD, which is the native query format for our TerminusDB database. Now users can store data in a knowledge graph and make graph data visualizations with Python.

In the first part of the talk, we will cover the challenges we faced when creating a query language in Python, the method we used, the idea and theory behind it, and how WOQLpy works. This part will include a quick live demo of WOQLpy so the audience can get an impression of how to make a query and get the task done, that is, getting a meaningful graph visualization from the source CSVs. The whole process of creating a database and schema, loading the data from many CSVs, and making a query and a visualization will be demonstrated using just one Python script.

In the second part of the talk, we want to stimulate a discussion of what is a good design in Python and what is not. This part will be more interactive, as we want to hear from you all what would be best for Pythonistas. We will first suggest some possible designs and then use a live voting system to gather opinions. This part of the talk will extend into the Q&A session to allow further discussion.

This talk is for Pythonistas at all levels who are interested in starting to design a package in Python, whether or not they have published a Python library before. By attending this talk, the audience will learn how to design a Python package that will be useful to Pythonistas, and we hope it encourages more people to publish open-source packages online.

Summary

  • Cheuk talks about her journey of designing a query language in Python. She works full time for TerminusDB, which is an open source graph database, and also streams online on Twitch.
  • Pythonic means that the code is correct, it runs, but usually there's more than one way of doing it right. If it's done in a way that is accepted by the Python community, with easy-to-understand code, then it is Pythonic. Whether something is beautiful or better is subjective.
  • WOQLpy is the Python client of TerminusDB, a graph database. It uses a query language called WOQL. Sometimes the data itself is more natural to present in a graph format. The team agreed it was a good idea to create a client for Python.
  • For example, the method "and": some of you may know that because "and" is a keyword in Python, it can't be used as a method name. Now you have to add the "woql_" prefix to it. But we may change the design in the future. You never know.
  • This is about the extra things that we threw in for the Python client: the integration with Jupyter notebook and the schema checker. There are things we still want to do but are not quite there yet. We just want to make life much easier for pandas users.
  • The TerminusDB Python client: there's more to explore, of course, if you're interested in how we created the Python client. We are planning to organize more workshops for people to learn how to do the model building. Join our Discord server.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Cheuk, and I'm going to talk about how to be Pythonic and my journey of designing a query language in Python. So these are my contact details. Feel free to follow me on social media. So I'm Cheuk, who am I? I actually love open source projects. I have been involved in different open source projects in my life before, mainly in Python. Currently I work full time for TerminusDB, which is an open source graph database. I also love organizing community events, from conferences to, well, meetups before the pandemic, and also sprints where people just contribute to open source together right now, because we are going nowhere. I also do streaming online on Twitch. Yeah, so if you follow me on Twitch, sometimes you will catch me online doing some Python stuff. So one question that I always ask is, what is Pythonic? Because you hear this many, many times, people talking about it. So what is Pythonic, what does Pythonic mean? So I found this answer on Stack Overflow. Obviously someone's asking the same question. So it says Pythonic means that it's not just that the syntax is right. Well, it means that the code is correct, it runs, but usually there's more than one way of doing it right. So if it's done in a way that is accepted by the Python community, that is easy-to-understand code and uses the language the way it is intended to be used, then it is Pythonic. So in my mind I think that, okay, so it's kind of like an artistic thing, right? So first, it's subjective. Whether something is beautiful or something is better is subjective. How do I know what is Pythonic and what is not? So I think a lot of it comes down to looking at how other people approach it, or sometimes it's just whether there are fewer lines of code, or checking whether you follow the Zen of Python. So in case you don't know what the Zen of Python is, try doing "import this" in a Python terminal. Then you will see the Zen of Python.
Yeah, so it's something that, it's kind of like you learn by doing it, you learn by maybe contributing to open source, or you learn by just reading other people's code. One example is pandas. Pandas basically is like its own ecosystem. So a lot of times when I browse Stack Overflow, people ask questions about pandas, how to do this, how to do that. A lot of times when I first learned how to use pandas, I was like, why? I can just use a for loop, loop over the data frame row by row and find the answers. Instead, I have to do aggregation, joining and all this stuff. Why do I do it that way? Well, the answer is that pandas is built on NumPy, and NumPy is a library where, if you use the built-in functions, it's optimized, it's a lot faster because it uses C extensions. So it's a lot faster than doing a for loop. So that's a very practical advantage of doing it. It's not just the style, it's not just that it's beautiful and people love reading your code, it's also for performance. So if you use Python in the way it is intended to be used, then there's a benefit to it. So for example, for loops, again, we all love for loops. Well, this is not Python code. Obviously. I hope it's obvious to you. So it's JavaScript code, or Java, something like that. I forgot where I copied this from. So it's actually something that I learned when I was in school: to use the index and then increment the index, so you access all the items in an array one by one. But in Python things are much simpler. You just care about what's inside a list, for example. So you just go through all the items inside the list and then you just do whatever, or sometimes it's even more compact, right, like list comprehensions and all this stuff. So usually the Pythonic way is actually a simpler way. So less code, simpler code, is usually a good indicator that you are doing things the Pythonic way.
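The contrast between index-based looping and iterating over the items themselves can be sketched as follows. This is only an illustration: the slide shown in the talk is not part of the transcript, and the station names are made-up sample data.

```python
# Hypothetical sample data, used only for illustration.
stations = ["Central", "Harbour", "Airport"]

# Index-based loop, in the style carried over from Java or JavaScript:
# keep a counter, increment it, and use it to reach each item.
upper_by_index = []
i = 0
while i < len(stations):
    upper_by_index.append(stations[i].upper())
    i += 1

# Iterating over the items directly: no index bookkeeping at all.
upper_by_item = []
for name in stations:
    upper_by_item.append(name.upper())

# List comprehension: the same result as a single expression.
upper_by_comprehension = [name.upper() for name in stations]
```

All three produce the same list; the later versions simply have less machinery that can go wrong.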
A lot of times, especially when I was developing the TerminusDB Python client, a lot of the work was translating the code from JavaScript to Python. Then I have to think about, okay, so how should I do it in Python? It should not be like a for loop where I'm incrementing the index; instead, I should care about the items in it, because that's Pythonic. Yes, I mentioned a little bit about working on the Python client of TerminusDB. So yeah, my journey of all this thinking about how to design a Python client started when I became a developer advocate at TerminusDB. But before I tell you my journey, I have to maybe give you some idea of what TerminusDB is, right? It's a graph database. So some of you may already have experience with graph databases. For example, Neo4j is a very popular one that has much more of a history and is more well known. We are also a graph database, but of course different. But to make things short, imagine that you're not storing things in a tabular format. So this is how I used to store data when I worked as a data scientist: lots of CSVs, lots of SQL databases, things stored in tables with an index, a key. On the left-hand side there you see person ID is the key, the primary key, and you have some information about this person, obviously: name, date of birth. And okay, so mother and father, there are two columns there, some of them are null and some of them have some number in them. What are they? I hope at this point you have already figured it out. It takes a while to make sense of it, and then you can imagine that it's actually a family tree. So if you put the information not in a table format but in a graph format, it kind of looks like that. And it's very obvious that it's a family tree because, well, we named the edges mother and father. So you see mother and father, those are of course your parents, so it is a family tree.
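To make the table-versus-graph point concrete, here is a minimal sketch that turns rows with nullable mother and father key columns into labelled edges. The rows, the names other than Joan, and the function name are assumptions made for this illustration; the actual table on the slide is not in the transcript.

```python
# Hypothetical rows mirroring the person table described in the talk:
# a primary key, a name, and two nullable columns pointing back at the key.
people = [
    {"person_id": 1, "name": "Joan", "mother": 2, "father": 3},
    {"person_id": 2, "name": "Mary", "mother": 4, "father": None},
    {"person_id": 3, "name": "Peter", "mother": None, "father": None},
    {"person_id": 4, "name": "Anne", "mother": None, "father": None},
]


def table_to_edges(rows):
    """Turn every non-null mother/father column into a (child, relation, parent) edge."""
    names = {row["person_id"]: row["name"] for row in rows}
    edges = []
    for row in rows:
        for relation in ("mother", "father"):
            parent_id = row[relation]
            if parent_id is not None:
                edges.append((row["name"], relation, names[parent_id]))
    return edges


edges = table_to_edges(people)
for child, relation, parent in edges:
    print(f"{child} --{relation}--> {parent}")
```

Once the foreign keys are read as named edges, the family tree is visible without mentally joining the table to itself.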
And also you can obviously see who the grandparents are, and which people are in the same generation. So all of this is very obvious if we put it in the graph format. Instead, with the table, you have to think about, okay, mother and father: actually it's a key to join back to the person. Then who is the mother, who is the father? Then you have to make some joins. We do a lot of joins when working with SQL. Sometimes the data itself is actually more natural to present in a graph format. So that's why a graph database is kind of useful. Again, this is how we find the maternal grandmother of Joan. With SQL, you have to select from a table and then do some joins, maybe join it back together. I'm not a big fan of doing that, like awkward aggregation and then joining tables together. Yeah, I just find it quite difficult to think it through in my mind. Instead, with TerminusDB we have a query language called WOQL. So WOQL is very similar to Prolog, if you know what Prolog is. So it's kind of like you're making statements, logical statements, and then you just find which variables satisfy those logical statements. So you can see here that we have four triples, which are those relations: two nodes and an edge is a triple. And then we have a person. So say you want to find, for example, the grandmother of Joan. Then you put Joan here and then you find, oh, this person should satisfy this relationship, which is Joan's mother. And then the other relationship here will be the mother of the mother, who will be the grandmother, the maternal grandmother. So here you can find variables that satisfy these relations and give the names of those people. So you have the name of the mother and the maternal grandmother in this query. So everything is just done in one place instead of making multiple select and join statements. So yeah, here you can see that with SQL we need one and two, two statements, to find the mother and the grandmother, right.
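The idea of stating four triples and letting the engine find variables that satisfy all of them can be sketched in plain Python. This is not WOQL and not the TerminusDB client; it is a toy pattern matcher over made-up data, and the "v:" prefix is used here only as a marker for variables.

```python
# Hypothetical triples: (subject, predicate, object). Not real TerminusDB data.
triples = [
    ("joan", "mother", "mary"),
    ("joan", "father", "peter"),
    ("mary", "mother", "anne"),
    ("joan", "name", "Joan"),
    ("mary", "name", "Mary"),
    ("anne", "name", "Anne"),
]


def is_var(term):
    """Terms starting with 'v:' are treated as variables in this sketch."""
    return isinstance(term, str) and term.startswith("v:")


def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings, or None on conflict."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if p in new and new[p] != t:
                return None
            new[p] = t
        elif p != t:
            return None
    return new


def solve(patterns, bindings=None):
    """Yield every set of bindings that satisfies all patterns at once."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match(first, triple, bindings)
        if extended is not None:
            yield from solve(rest, extended)


# Four statements, one query: mother, mother of the mother, and both names.
query = [
    ("joan", "mother", "v:Mother"),
    ("v:Mother", "mother", "v:Grandmother"),
    ("v:Mother", "name", "v:MotherName"),
    ("v:Grandmother", "name", "v:GrandmotherName"),
]
results = list(solve(query))
```

The query reads as a list of facts that must all hold, which is the Prolog-like style the talk describes, rather than a sequence of selects and joins.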
And then here we can just find all of them in one go. It takes some time to get your head around how to think in the query language, but actually once it works it's much more efficient. So this is WOQL, right? This is the query language of TerminusDB. And when I first joined there wasn't any Python client. WOQL was natively JSON-LD. This JSON format is how the front end talks to the back end, how the clients talk to the server itself. So the native format of WOQL is a JSON-LD format. And at the point when I joined we had a JavaScript client. So I thought that, okay, if we want to help Pythonistas and data scientists use this awesome graph database, we need a Python client. So we have WOQLpy, a query language for Python users. Great. So I got support from my team and everybody thought it was a good idea. So we started working on WOQLpy. So what is WOQLpy? Well, it comes with the Python client obviously, which you can pip install. It's released on PyPI like all other Python libraries. It includes multiple modules. So it includes the Python client itself, which is kind of like a wrapper for the API with which you can carry out different manipulations on the server database. There is also WOQLpy, the query part, WOQLQuery. So that's how you can build your schema, how you can query your data, how you can insert data into the database. So that's the second bit. The third bit is a visualization tool that can give you an interactive graph visualization of your result data. I'll talk about that later. There's also the data frame module, which is optional. If you install it, then you can use some of the functions inside to convert your result from a JSON format into a pandas data frame, which is quite cool. So this is an example of how to use TerminusDB here. In this example we are building a schema for this bike data, so you can see that we have three objects that we created in the schema graph.
So there's a station, which is a document type. There is also a label and a description for it. You can also add properties; for example, for the journey document type, you can add properties to it. So you can imagine what all of these actually describe. For example, if I have a station object, then it will have a name, it will have a label and a description. If it's a journey object, on top of those it will also have these properties, right? It will also have the end time, start time, the journey bicycle, and all these properties. So this is a schema, and you can also add them all together. So I have created three different types of object, and the schema consists of these three types of objects. So I just add them together and then execute it with the Python client. It's still not the most optimal way of doing it. You still have to make a lot of method calls on the WOQLQuery object, but I'll show you a better schema building design that we have just come up with that is still under development. I show you this first because this is what is being used currently. But originally, like I said before, WOQL is actually a JSON format natively. So yeah, you can use the Python client, but if you don't use the WOQL API, you have to write the query like this, right? So obviously people don't want to write a query like this; they would rather do this with all the Python code. Or even better, when I show you the newest Python schema building regime, I would say, I don't know how to describe that, the scheme for building a schema. Yeah. So also we have some flexibility in our query objects: you can design your document type by chaining the extras like label and description, or you can just put them in as optional arguments. You can do that because these two are optional. So you can put them in like this or just chain them up, and both will work. So there's some flexibility in the design, but there were also challenges when I tried to translate the JavaScript one into Python.
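The "chain it or pass it as an optional argument" flexibility described above can be sketched with a small fluent builder. The class `DocType`, its methods, and the dictionary it produces are all hypothetical names invented for this sketch; they are not the TerminusDB client API, whose exact signatures are not shown in the transcript.

```python
class DocType:
    """Hypothetical stand-in for a document-type builder (not the real client API)."""

    def __init__(self, name, label=None, description=None):
        self.name = name
        self._label = label
        self._description = description
        self._properties = {}

    def label(self, text):
        self._label = text
        return self  # returning self is what makes chaining possible

    def description(self, text):
        self._description = text
        return self

    def add_property(self, name, datatype):
        self._properties[name] = datatype
        return self

    def to_dict(self):
        return {
            "id": self.name,
            "label": self._label,
            "description": self._description,
            "properties": dict(self._properties),
        }


# Style 1: chain the optional extras.
chained = DocType("Station").label("Bike Station").description("A station where bikes are kept")

# Style 2: pass the same extras as optional keyword arguments.
direct = DocType("Station", label="Bike Station", description="A station where bikes are kept")

# A document type with properties on top of the label.
journey = (
    DocType("Journey", label="Journey")
    .add_property("start_station", "Station")
    .add_property("duration", "integer")
)
```

Both styles end up with the same description of the document type, so users can pick whichever reads better in their code.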
For example, the method "and": in Python, some of you may know that because "and" is a keyword, it can't be used as a method name, so it can't be translated directly to be the same as the JavaScript one. So you have to use "woql_and", which is not a very good name. So that's why we have now overloaded it with an operator. You can see the plus sign that I showed you before. Also, right now for "or" we have the pipe operator to be used as "or". "not" is also a keyword, so you can't use it as is, and "from", you can't just use it as it is, because again, it's a keyword. So now you have to add the "woql_" prefix to it. But we may change the design in the future. You never know. This is about the extra things that we threw in for the Python client. So this one is the integration with Jupyter notebook. So we have a few things that give data scientists who use Jupyter notebook a better experience. For example, like I mentioned before, there's the data frame module that lets you convert the result into a pandas data frame. And also this one is the interactive graph visualization that I mentioned. So you can see that all of this is customizable. You can change the color of the nodes, you can change the size of the graph. So yeah, there are different ways of visualizing your result in Jupyter notebook, which is quite nice. And it's not just Jupyter notebook; you can actually output that as an HTML file as well, because it just uses D3 to generate it. I have another talk about how I made this part of the Python client happen, but I don't have time for it in this talk. We're mainly focusing on the query language. But I think you can find my previous talk on YouTube or something. It's recorded. This is something new. So this is the schema builder that I talked about. So remember, before, if we had to build a schema, we had to use the doc type thing here. Like in this example here, we have to use the doc type and then add the properties to the doc type, right?
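The keyword clash and the operator overloading mentioned earlier in this part can be sketched as below. The `Query` class is a hypothetical toy, not the client's real query class; only the prefixed method names and the choice of `+` for "and" and `|` for "or" are taken from the talk.

```python
import keyword


class Query:
    """Hypothetical query node; it only illustrates the naming and operator ideas."""

    def __init__(self, op, *parts):
        self.op = op
        self.parts = parts

    # `def and(self, other)` would be a SyntaxError in Python,
    # so the method needs a prefixed name instead.
    def woql_and(self, other):
        return Query("and", self, other)

    def woql_or(self, other):
        return Query("or", self, other)

    # Operator overloading gives a shorter spelling for the same thing.
    def __add__(self, other):
        return self.woql_and(other)

    def __or__(self, other):
        return self.woql_or(other)

    def to_tuple(self):
        """Nested-tuple view of the query tree, handy for comparing and printing."""
        return (self.op,) + tuple(
            p.to_tuple() if isinstance(p, Query) else p for p in self.parts
        )


# The four method names from the talk that collide with Python keywords.
reserved = [w for w in ("and", "or", "not", "from") if keyword.iskeyword(w)]

a = Query("triple", "v:Person", "mother", "v:Mother")
b = Query("triple", "v:Mother", "mother", "v:Grandmother")
both = a + b    # same as a.woql_and(b)
either = a | b  # same as a.woql_or(b)
```

The operators do not remove the prefixed methods; they just offer a spelling that avoids the awkward names in the common cases.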
And we just think that it's not intuitive enough for Pythonistas. So actually we use more of a data modeling approach where you can use classes. So everything would be a class, of object or document or enum. Actually I'm still working on the enum thing. So you can have all the properties as its attributes and then you can use annotations to fix the type of the property. So whether it's a data type property like float or string, or it could be an object property. Like for example, in country, I can have a property where I'm using coordinate, which is an object type that I just created. So this is what we are moving towards. Right now it's under development. Also, instead of setting the label and the description, it will just use the name as the label, and the description will be the docstring if there is one. So yeah, this is what we are working on right now. So I hope that it is a good design. I hope you like it. So let me know whether you find this a better way to build a schema. Looking into the future: what we still want to do but are not quite there yet. We just want to make life much easier for pandas users. Now we can output the results back in the data frame format, but the other way around is not there yet. Hopefully in the future, if you have some CSVs, things will be more automated: if it's simple enough, we could look at your CSV or data frame and then create a schema automatically and then import all your data from your CSVs or data frames automatically. That's something that we are aiming for after the schema builder is finished. Also network graph analysis: I know that Neo4j has it, we don't have it yet, but it's something that I really want to do. But we don't know when we're going to have this. If you want to work on this, please let me know. I'm happy to collaborate with you.
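The class-based schema idea described above (attributes with type annotations, class name as label, docstring as description) can be sketched with the standard library alone. The classes, the `describe` helper, and the output layout are assumptions for illustration; the schema builder was still under development at the time of the talk and its real interface is not shown in the transcript.

```python
from typing import get_type_hints


class Coordinate:
    """A point on the map."""

    x: float
    y: float


class Country:
    """A country with a name and a location."""

    name: str             # a data type property
    location: Coordinate  # an object property pointing at another class


def describe(cls):
    """Derive a schema entry from a class: name as label, docstring as description."""
    hints = get_type_hints(cls)
    return {
        "id": cls.__name__,
        "label": cls.__name__,
        "description": (cls.__doc__ or "").strip(),
        "properties": {field: kind.__name__ for field, kind in hints.items()},
    }


schema = [describe(Coordinate), describe(Country)]
```

Everything the older builder needed as separate method calls is read off the class definition, which is the part that should feel familiar to anyone who has written a dataclass.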
Also, the schema checker. Since we have the schema builder like this, when we add objects in, all the data being added will be objects of these classes, right? So we can actually efficiently check whether the data follows the schema correctly. Right now the challenge is that when people get an error when they insert the data, they don't know what's going on, they don't know what went wrong such that they can't insert the data. But the schema checker is kind of like a linter for the schema, so you can see which part makes it wrong. Maybe the type of your data doesn't match the type that is described in the schema. It should raise a flag to tell you exactly which point is wrong so that you can fix it rather than just guessing. So yeah, your suggestions are always welcome. So if you have any questions or suggestions, feel free to leave an issue in our repo. So it's under TerminusDB, the TerminusDB Python client. I will show you the link later. Yeah, so that's basically everything that I want to talk about in this talk. There's more to explore, of course, if you're interested in how we created the Python client, how we created WOQLpy, the query language, or if you want to learn more about how to do graph data modeling. We have the TerminusDB Academy. We are planning to organize more workshops for people to learn how to do the model building, use the new model builder tools and all this. Also follow us on Twitter. You will get all the news. Check out our website or, what is better, join the Discord server. We have office hours every week in which you can talk with the tech team directly, ask questions, give suggestions, feedback, whatever you like. Just hang out with us. Yeah, we want to hear from you. So our GitHub repo is here, at GitHub, and then terminusdb-client-python. So you can see it here. This is our repo, so feel free to open an issue, just suggest anything.
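The schema checker idea from the start of this part, a linter that says exactly which field is wrong, can be sketched as a check of plain data against a class's annotations. The `Station` class, the `check` function, and the message format are hypothetical; the checker was described as planned work, so this is only one way it could behave, and it handles only plain classes as types.

```python
from typing import get_type_hints


class Station:
    """A bike station (hypothetical schema class for this sketch)."""

    name: str
    capacity: int


def check(cls, data):
    """Return a list of readable problems; an empty list means the data fits the class."""
    hints = get_type_hints(cls)
    problems = []
    for field, expected in hints.items():
        if field not in data:
            problems.append(f"{cls.__name__}.{field}: missing")
        elif not isinstance(data[field], expected):
            found = type(data[field]).__name__
            problems.append(
                f"{cls.__name__}.{field}: expected {expected.__name__}, got {found}"
            )
    for field in data:
        if field not in hints:
            problems.append(f"{cls.__name__}.{field}: not in schema")
    return problems


good = check(Station, {"name": "Central", "capacity": 20})
bad = check(Station, {"name": "Central", "capacity": "twenty", "colour": "red"})
```

Here `good` is empty, while `bad` names the offending fields, which is the "tell you exactly which point is wrong" behaviour rather than a bare insert error.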
So yeah, that's it for my talk, and thank you so much for listening, and feel free to ask any questions. Join our Discord server. I will see you there.

Cheuk Ting Ho

Developer Relations Lead @ TerminusDB
