Conf42 Golang 2021 - Online

Deserializing Python objects in Go with GoPickle

Abstract

The Python Standard Library provides the “pickle” module for serializing and deserializing object structures. Almost every Pythonista makes use of it, since it can easily and efficiently serialize even very complex objects… but what if you are a Gopher and want to read those objects back into Go?

In this talk I will illustrate the main peculiarities of pickle serialization and how data can be deserialized in Go with GoPickle, a lightweight and customizable library (https://github.com/nlpodyssey/gopickle).

I’ll also show you some examples of pickle serialization in the wild, and a practical usage of GoPickle with the spaGO machine learning library.

Summary

  • Marco Nicola has been making software for more than 20 years, with a main focus on machine learning and specifically natural language processing applications. He is currently employed at EXOP, a German company whose main business is mobility risk management.
  • This presentation shows how to effectively deserialize Python objects with GoPickle: how to easily read pickle-formatted data from Go without even needing to run Python in the first place.
  • The pickle module comes with different protocol versions. Each protocol version identifies a set of instructions that the underlying virtual machine can handle. But what if you are a Go developer with some files containing data serialized with Python's pickle module, and you want to load that data from Go?
  • GoPickle is a library for loading Python data serialized with the pickle module. Mapping basic data types from Python to Go turned out to be a fairly easy process. The main goal was to quickly have a working implementation of the whole unpickling machine.
  • Custom classes need a little help: the plan is to emulate the Python Greeter class and its objects in Go. A fairly natural way to port the class is to define a Greeter struct. Go doesn't have the object-oriented concept of classes, so that had to be emulated as well.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you very much for joining me. My name is Marco Nicola. I'm yet another software developer. I've been making software for more than 20 years by now. My main focus has mostly been on machine learning and specifically natural language processing applications, and in more recent years I've also tried to expand my skill set, working on full-stack web applications and also a bit of software as a service and cloud applications. I'm currently employed at EXOP; it's a German company and our main business is mobility risk management. If you want to be in touch with me, you can find me on GitHub, or Twitter if you prefer, or LinkedIn. The references are there on screen.

In this presentation, I'm going to show you how you can effectively deserialize Python objects in Go with the help of a little library called GoPickle. As you can see, this is going to be a sort of cross-language talk. We will start by analyzing the Python pickle serialization module. We'll see exactly what it is, how it works, and why it is interesting. We'll have a quick look at the pickle serialization format, and finally we'll reach our beloved Go programming language and see how we can effectively and easily read pickle-formatted data from Go without even the need to run Python in the first place.

First of all, pickle is a Python built-in module. In the Python programming language, modules are something similar to Go packages. The pickle module in particular implements binary protocols for serializing and deserializing Python objects. With pickle, it is all about data serialization and persistency. You can imagine having a Python script which builds some data structures. Maybe you have an object, and with the pickle module you can serialize it to a file, for example with the pickle.dump function. This process is also called data pickling.
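
As a minimal sketch of the pickling step just described, together with the matching pickle.load call that reads the data back, the snippet below round-trips a small dict through a file. The dict content and the temporary directory are illustrative assumptions; only pickle.dump and pickle.load come from the talk.

```python
import os
import pickle
import tempfile

# A plain data structure built from built-in types only (illustrative values).
data = {"name": "gopher", "answer": 42, "tags": ["go", 3.14]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "object.pickle")

    # Pickling: write the binary representation to a file.
    with open(path, "wb") as f:
        pickle.dump(data, f)
    size = os.path.getsize(path)

    # Unpickling: read the file back and rebuild the object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(size > 0)           # True: the file holds the binary pickle data
print(restored == data)   # True: same content
print(restored is data)   # False: it is a newly built object
```
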
You'll then have a binary representation of your original data, and later on you can read the data back from this file with a function called pickle.load. This deserialization process is also called data unpickling.

In this context, I think it's interesting to talk about the pickle module especially because, at least according to my own Python programming experience, it seems to be a very popular choice for data serialization in Python, particularly when the aspect or format of the actual serialized data is not a big concern. The popularity of this choice also seems to be reflected by the high number of prominent Python projects and libraries that rely on it. Just to name a few of them: perhaps you've already heard about NumPy, a Python library for scientific computation; maybe you've heard about PyTorch, a machine learning framework for Python; or pandas, a library for data analysis and statistics. These libraries, and many others as well, provide high-level functions for saving and loading your custom data, and under the hood, either by default or through some option, they seem to make use of the Python pickle module to actually achieve data persistency.

Now you might be wondering why, in the first place, it is interesting for Python programmers to use this weird and exotic pickle module over more popular and traditional data representation formats such as JSON, YAML, or XML. Let's see this with a couple of simple examples. Let's start with a very straightforward Python data structure. In this case, we have a dict. Dicts in Python are similar to Go maps: there are a bunch of keys and values. Values have many different data types: there are strings, there are numbers, and there is an array which also contains a mix of data types, a number and a string. It turns out to be straightforward to serialize this data to JSON format, and it works out of the box.
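
The JSON case for built-in types can be sketched as follows. The keys and values are made up for illustration (the actual dict shown on the slide is not in the transcript); the point is only that a dict of strings, numbers, and a mixed list serializes without any extra work.

```python
import json

# Illustrative dict mixing strings, numbers, and a list of mixed types.
data = {
    "name": "gopher",
    "age": 11,
    "height": 1.5,
    "items": [42, "towel"],
}

text = json.dumps(data)
print(text)

# Parsing the JSON text gives back an equal structure.
restored = json.loads(text)
print(restored == data)  # True
```
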
The JSON representation even looks almost identical to the original Python code. But then let's see what happens if, instead of using simple built-in data types, we define our own types and classes. Let's take a moment to familiarize ourselves with this Greeter class, since it will appear again in later examples. Let's define in Python this class called Greeter. It has a constructor, the __init__ function, which accepts a name string argument, and this name value is saved by the constructor in an internal instance variable, _name. Then let's add a simple greet method to this class. All it does is print to standard output the string "Hi" followed by the name, interpolated from the _name instance variable. That should be simple enough even for non-experienced Python programmers, I hope. And of course you can instantiate an object: you can create a new instance with Greeter and parentheses, passing our name, let's say "gopher". Sure enough, if you call the greet method on the object, you will get on your console the message "Hi gopher".

So far everything is still fairly simple. But now, if we want to represent our little Greeter object instance in a format like JSON, we don't get this feature for free anymore. You might try that and you might get an error just like this one. This might be a very well expected behavior. You might think yourself about super easy solutions for representing the humble Greeter object in JSON, and then loading it back again. But the whole point here is that in real-world applications, the complexity might escalate very quickly. For example, when we talk about custom objects, we should think as well about external libraries. Maybe your project is using third-party libraries which don't provide out of the box the ability to export to your preferred data representation format, and in that case you might have to implement that by yourself. Also, think about object identity and shared objects.
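
Going back to the Greeter class and the JSON error mentioned above, here is a sketch reconstructed from the spoken description. The exact slide code is not in the transcript, so the formatting of the greeting is an assumption; the class shape (a _name instance variable and a greet method) follows the talk.

```python
import json


class Greeter:
    def __init__(self, name):
        # The constructor stores the name in an instance variable.
        self._name = name

    def greet(self):
        print(f"Hi {self._name}")


obj = Greeter("gopher")
obj.greet()  # prints "Hi gopher"

# json has no idea how to represent a custom class instance.
error_message = None
try:
    json.dumps(obj)
except TypeError as err:
    error_message = str(err)
    print("json.dumps failed:", error_message)
```
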
Maybe you have an object instance which is referenced twice from an array. When you serialize and later deserialize this array, you might expect as well to have a single object instance which is again referenced twice from the array, and not, for example, two different copies of the original object. Also, think about recursive objects. Consider having a list, an array, and then appending the very same array to the array itself. This might be very hard to represent in formats like JSON or YAML.

In order to elegantly solve these and other interesting situations, the pickle module adopts a fairly interesting and original approach. Instead of more traditionally mapping your original data almost one-to-one to a certain data representation format, and later on requiring a parsing step for reading the format and rebuilding your objects, the pickle module implements a full-fledged virtual machine. So when you are serializing data with the pickle module, it will create for you a binary pickle program that you can store somewhere, perhaps in a file. Later on, this program can be given to a so-called unpickling machine, which is in charge of running the pickle program and rebuilding the original objects.

This approach is highly flexible. Pickle programs can instruct the unpickling machine to reconstruct arbitrarily complex data structures. Moreover, the virtual machine itself doesn't need to know anything really specific about custom classes, so custom classes and data types just work out of the box without further intervention. The only downside is that the implementation of this virtual machine is highly tied to Python-specific functions, methods, and types.

We can also have a quick, high-level look at the virtual machine implementation. We saw that serializing data with pickle produces pickle programs, and a pickle program is really just a sequence of instructions where each instruction is identified by a one-byte opcode.
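
The shared-object and recursive-list situations described above can be checked with a short sketch. The variable names are illustrative; the behaviour shown (pickle preserving identity, json rejecting the cycle) is standard for the two stdlib modules.

```python
import json
import pickle

shared = {"id": 1}
pair = [shared, shared]        # the same object referenced twice

recursive = []
recursive.append(recursive)    # a list that contains itself

pair_copy = pickle.loads(pickle.dumps(pair))
recursive_copy = pickle.loads(pickle.dumps(recursive))

# Still one object referenced twice, not two independent copies.
print(pair_copy[0] is pair_copy[1])          # True
# The rebuilt list contains itself again.
print(recursive_copy[0] is recursive_copy)   # True

# JSON, by contrast, refuses the self-referencing list.
json_error = None
try:
    json.dumps(recursive)
except ValueError as err:
    json_error = str(err)
    print("json.dumps failed:", json_error)
```
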
Certain opcodes might be followed by one or more additional byte values, and these values correspond to instruction-specific operands. They are just like instruction arguments. The pickle module actually implements a stack-based virtual machine, so there is a traditional stack structure, and the virtual machine can push elements onto the stack and pop them off. There's also an additional data area called the memo, which is just something that makes the virtual machine implementation fairly simple. At the end of the program interpretation, the stack will contain just a single object, which will be the fully deserialized object.

The virtual machine instructions are not too many, and not particularly complex either. In no way can you perform any sort of looping or testing: there are no conditionals, no arithmetic instructions, and no function calls. The structure of pickle programs is really simple, and the virtual machine just reads the pickle program once, from start to end, to deserialize the data.

Let's now see a practical use case and example. Here we are in Python. We are defining again the Greeter class. We already saw it; nothing has changed here. We can instantiate an object, and then let's say that we want to serialize it. So let's import the pickle module, let's open a file, object.pickle, in writing mode, and finally let's simply invoke pickle.dump, passing to it the object and the file. This code will effectively write some content to the object.pickle file, and in fact this file is now supposed to contain the pickle program, which can later be used to deserialize our object. We can try to have a look at the content of the file, for example with a hexadecimal editor, and all we see is just a bunch of bytes. Here and there you can see some human-readable sequences, but it's still hard to get a good idea about what's going on.
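
To make the "sequence of one-byte opcodes with optional operands" idea concrete, the sketch below walks over a small pickle program with the standard pickletools.genops iterator. The pickled dict and the choice of protocol 2 are arbitrary assumptions made to keep the listing short.

```python
import pickle
import pickletools

# A small pickle program; protocol 2 keeps the instruction list compact.
program = pickle.dumps({"name": "gopher", "lucky": [7, 42]}, protocol=2)

names = []
codes = []
for opcode, arg, pos in pickletools.genops(program):
    # opcode.code is the single-character opcode, arg its decoded operand
    # (None when the instruction takes no operand), pos the byte offset.
    names.append(opcode.name)
    codes.append(opcode.code)
    print(pos, repr(opcode.code), opcode.name, arg)

print(names[0], names[-1])  # PROTO STOP
```

The listing starts with the PROTO instruction and ends with STOP; everything in between pushes values onto the stack or combines what is already there.
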
However, if you are curious enough, you might go on with your exploration, perhaps making use of another built-in Python module called pickletools. For example, from the command line you might want to run a command just like this one, which allows you to get the annotated representation of your pickle program, that is, of the content of your file. It's very likely that you'll get a highly dense output just like this one. Don't worry, we are not going to explore every detail of this screen. Just to name a few things: on the very left you can see the byte positions. Then, in yellow, I highlighted for you the opcodes. They're just single bytes. They are followed by the names of the instructions, which are in turn sometimes followed by the values of certain operands, and on the right you can see short annotations describing what each instruction is supposed to do.

But now let's go back to some simpler Python code, especially to see how to deserialize our data and objects. First, let's make sure that our custom classes, functions, and data types are defined in our current scope. Here's the Greeter class again, just as a reference. After that, let's simply import the pickle module again, open our object.pickle file for reading, and give the file to the pickle.load function. This will actually run the unpickling machine, which will execute our pickle program, and we'll get back our object, which is almost identical to the original Greeter instance. And of course we can try to invoke the greet method on this object, and we get, as expected, our "Hi gopher" message.

Yet another important thing to say about the pickle module is that it comes with different protocol versions. At present, there are six different versions, numbered from zero to five. Simply put, each protocol version identifies a set of instructions that the underlying virtual machine can handle.
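
A quick way to see the protocol versions at work is to serialize the same value once per protocol, as in this sketch. The sample dict is an assumption; pickle.HIGHEST_PROTOCOL is 5 on the Python versions that match the "zero to five" range mentioned in the talk (3.8 and later).

```python
import pickle

data = {"name": "gopher", "lucky": [7, 42]}

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    program = pickle.dumps(data, protocol=protocol)
    # loads() needs no hint: it works out the protocol from the program.
    assert pickle.loads(program) == data
    sizes[protocol] = len(program)

# Program size in bytes for each protocol version.
print(sizes)

# From protocol 2 on, a program starts with the PROTO opcode (0x80)
# followed by one byte holding the protocol number.
print(pickle.dumps(data, protocol=2)[:2])  # b'\x80\x02'
```

For an annotated dump like the one described above, the standard library also offers a command-line entry point: `python -m pickletools` followed by the path of the pickle file.
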
So from time to time, new protocol versions were introduced for reasons such as providing better efficiency in the virtual machine implementation. Perhaps new instructions were added for better handling of specific Python types coming with newer Python releases. An important thing to know in general is that each protocol version is backward compatible with all previous versions.

So that's enough Python stuff for now. If you are curious about further details, you can visit the official Python documentation for the pickle module, and also have a look at the pickletools module, which provides even more extended documentation and details about the implementation of the unpickling virtual machine, as well as analysis tools.

Okay, so everything was particularly cool and simple enough in the world of Python. But what if I'm a Go developer, and maybe I have around some files containing data serialized with the Python pickle module, and I want to load that data from the Go language? Some time ago I found myself in that exact situation. I was working on a machine learning library for the Go language. It's called spaGO; I recommend checking it out. We wanted to load from Go pretrained neural network models exported from Python, from the popular machine learning framework PyTorch. In doing that, we discovered that, apart from other technicalities under the hood, the PyTorch serialization process heavily involves the pickle module. And so the problem was: how do we load pickle data from Go?

A possible solution might have been to simply write a Python script that would read the initial data and then transform it to a data representation format more suitable for being read from Go. But instead of doing that, I decided to write a little wish list, and with this I was wishing for the existence of an easy-to-use Go library that would allow me to unpickle data in Go, possibly supporting all pickle protocols.
It should handle out of the box basic, simple data types such as numbers (integers and floating points), strings, booleans, et cetera. It should still be easy to extend with custom data types or types coming from external libraries, and it would be cool to do that without having to run Python at any step of the deserialization process. It would also be cool if such a library had minimal dependencies, or maybe none at all, and possibly did not make use of unsafe data types or cgo. I tried to look around a little bit for existing projects, but I couldn't really find exactly what I was looking for, and so I just decided to try to do that by myself.

And here, finally, I introduce you to the GoPickle library, a library for loading Python data serialized with the pickle module. Here's the link to the project. This library is focused on deserializing only, at least for now, and it's actually a port of the Python Unpickler class that you can find in the CPython reference implementation source code. It turned out that mapping the basic data types from Python to Go was a fairly easy process. I'm talking again about boolean values, numbers (floating points and integers), and strings. Even the Python None type was easily mapped to the Go nil value, and everything else that was otherwise especially tied to the Python programming language has been emulated in this library by using structures and interfaces.

Also, when I was starting this little project, I was especially reassured by the fact that the pickle library itself is not particularly big. For example, in CPython version 3.9, you can find the Lib/pickle.py file, which includes both the serialization and the deserialization code, and in total it's less than 2000 lines of code. So that was especially reassuring. But without further ado, let's jump right in with a basic usage example. It all starts once again with some Python code.
Let's start by defining an object using just simple built-in data types. We already saw this very data structure previously. It's a dict containing a bunch of keys and values of different data types, some strings and numbers. We also already know by now how to serialize data with the pickle module, so again, nothing new. Once this code is executed, we'll get an object.pickle file containing our pickle program.

So now, of course, we can deserialize our data back from this file. We already know how to do that in Python by using the pickle module itself. But here's something new: we can try to do that from Go, by first installing the GoPickle library. Here's the typical go get command to install the library, and then you can import the gopickle pickle package and make use of the pickle.Load function, which simply accepts the name of the file containing the pickle program and gives back to you the deserialized object, and also an error, which in the positive case will simply be nil.

If everything goes as expected, the object variable will eventually contain something just like this. Here on the left I reported the original Python data structure, just for reference and comparison, and you can see very well how the GoPickle library transformed some of the original Python data types into specific Go types. For example, the original Python dict is transformed to a Go type which is also called Dict, of course, and it's implemented as a series of DictEntry elements, each DictEntry being just a simple key-value pair. You can see how the various dict keys and values are mapped in Go. There's also the nested dict here, and you can also see the additional List value, which contains both the number and the string. These custom types come from the GoPickle types subpackage.
You can have a look at it: it just provides a limited number of structs and interfaces to represent and handle a limited set of Python structures and data types, which are particularly useful for the implementation of the unpickling machine. So, for example, you have ways to represent and handle lists, dicts, tuples, and so on and so forth. Please keep in mind that the implementation of some of these types is not particularly clever, and especially not particularly optimized. When these types were created, the main goal was to quickly have a working implementation of the whole unpickling machine, and some of these types still have a pretty unpolished aspect. Now that the whole unpickling machine seems to work fairly well, there's plenty of room for further improvements here.

Let's now do something a little bit more advanced, and let's see how the GoPickle library behaves with foreign custom classes. So here in Python we have once again the Greeter class. We instantiate an object, and we serialize it with the pickle module to our object.pickle file. If we now try to deserialize our object from Go just like we did before, alas, this time we'll get an error back from the pickle.Load function. The message of this error might not be particularly easy to understand. In fact, you might be required to have a little bit more familiarity with the GoPickle project, and perhaps with the Python pickle module as well. So for this time, let me clarify what's going on here.

The first thing you have to know is that when the GoPickle unpickling machine encounters unknown data types or classes, for example the Greeter class in this case, it makes use of a couple of structures available from the GoPickle types subpackage, which are the GenericObject type and the GenericClass type. And of course, Go is not, strictly speaking, an object-oriented language. That's why we have this clear distinction between objects and classes.
Sometimes letting GoPickle create those generic objects and classes is absolutely enough in order to deserialize certain data structures. However, here you can clearly see how even the humble Greeter class apparently already has something too much to be handled out of the box by the GoPickle library. So we can give our library a little help, in order for it to better understand the data that it is going to deserialize. Even the Python pickle module would need to have the Greeter class defined in the context in order to properly deserialize it. And so here the plan is to somehow emulate the Greeter class and objects in Go.

A fairly natural way to port the original Python Greeter class to Go is to define a Greeter struct, also giving it a Name string field, which is the parallel of the original Python class's _name instance variable. Later on, we can expect the unpickling machine to handle Greeter struct values, and it will eventually require them to satisfy the PyDictSettable interface. This interface is there in order to emulate the Python-specific behavior of setting a key-value pair on a particular property that almost every Python object has, which is called __dict__. With this assignment in Python, assuming that the object is actually an instance of a certain class, you are effectively assigning a value to a specific instance variable inside that object, and the name of the instance variable is identified by the value of the key.

We can easily emulate this behavior in Go as well. Let's then define this PyDictSet method for the Greeter struct. It will be automatically invoked by the unpickling machine, which will provide a key and a value. They can both be of almost any type, so they are both just generic empty interfaces.
We know that the original Python class had an instance variable called _name, so we might expect that this method will be invoked with a key equal to "_name". When we encounter this, we can just expect the value to be a string, so we can convert the value to a string and assign it to the Name field of the struct. And of course we can also provide a little bit of error handling here and there.

Of course, in Go we don't even have the object-oriented concept of classes and of being able to create object instances from them. Yet this is an important feature in the context of the unpickling machine, so we somehow had to emulate that as well. In Go, the Greeter struct that we just defined already seems to be well suited for representing Python Greeter object instances. But in Go we have to do another step and also define a higher-level Greeter class. The original Greeter Python class was fairly simple. There were no class-level variables or methods, and so we can keep it simple here as well. We can define a Greeter class type, implemented as an empty struct with no fields.

Again, sooner or later the unpickling machine will have to handle a Greeter class value, and it will require it to satisfy the interface called PyNewable. This time the interface is there to simulate the creation of new object instances. In particular, it represents the Python-specific invocation of a special method which almost any class has, which is called __new__. In Go we can define a PyNew method for the Greeter class struct. It should accept a variable number of arguments and return a value representing an object instance generated from this kind of class, and also an error if something goes wrong. In our case, emulating the creation of a Greeter object instance is as simple as returning a new Greeter struct value. Having done this preparation, we are now almost ready to deserialize our data.
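
The two Python-specific behaviours that PyNewable and PyDictSettable emulate can be observed directly in Python. The sketch below performs by hand, on the Greeter class reconstructed from the talk, the same two steps the unpickling machine goes through: creating a bare instance with __new__, then filling its __dict__. It illustrates the Python side only and is not GoPickle code.

```python
class Greeter:
    def __init__(self, name):
        self._name = name

    def greet(self):
        return f"Hi {self._name}"


# Step 1 (what PyNew stands for): create a bare instance.
# __init__ is NOT called, so no instance variable exists yet.
obj = Greeter.__new__(Greeter)
print(obj.__dict__)  # {}

# Step 2 (what PyDictSet stands for): restore the saved state by
# setting a key-value pair on the object's __dict__.
obj.__dict__["_name"] = "gopher"

print(obj.greet())  # Hi gopher
```
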
We can import the pickle package again, and this time, instead of using the high-level pickle.Load function, we can open by ourselves, for reading, our object.pickle file containing the pickle program, and we can give this program to the pickle.NewUnpickler function. With this we'll get a customizable Unpickler object, and after having provided our desired customization, we can eventually call its Load method, and this will try to load the pickle program.

In our case, we can customize the Unpickler object by providing a FindClass callback function. With this function, we can finally tell the virtual machine what this foreign Greeter type is in the first place. So this function will be invoked with the module value equal to __main__ and the name equal to Greeter, which is the location of the original Python data type. And we can finally provide our Go implementation of the Greeter class, which happens to be just a Greeter class struct value. Without this function, the unpickling machine would still fall back to the GenericObject and GenericClass types that we saw earlier.

We are finally ready to deserialize our object. For doing that, let's call the Unpickler's Load function. Let's see if there is an error; otherwise, let's just print to the console the representation of this object. And lo and behold, there are no errors this time, and we get as a result a Greeter struct value whose Name field was populated with the value "gopher", which is exactly the value that we were passing to the constructor from Python.

Having reached this point, there's really just one more missing thing, and for that you might want to go the extra mile and implement the greet method on the Greeter struct. Everything should already be in place, so the implementation itself is super simple. Once you have your deserialized object, you can convert it, with a type assertion, to a pointer to the Greeter struct, and finally you can call the Greet function on it.
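
The callback described above has a direct counterpart in Python's own pickle.Unpickler, whose find_class(module, name) method can be overridden in a subclass. The following sketch shows that Python-side hook only. PortedGreeter and MappingUnpickler are hypothetical names, and the pickle program is written out by hand (an assumption made so the example does not depend on where a Greeter class is defined); it is equivalent to what protocol 2 emits for the Greeter instance, minus the memo instructions.

```python
import io
import pickle


class PortedGreeter:
    """Hypothetical stand-in used in place of __main__.Greeter."""

    def greet(self):
        return f"Hi {self._name}"


# Hand-written protocol 2 pickle program for a Greeter named "gopher".
program = (
    b"\x80\x02"                 # PROTO 2
    b"c__main__\nGreeter\n"     # GLOBAL: look up class __main__.Greeter
    b")"                        # EMPTY_TUPLE: no arguments for __new__
    b"\x81"                     # NEWOBJ: call cls.__new__(cls)
    b"}"                        # EMPTY_DICT: start the state dict
    b"X\x05\x00\x00\x00_name"   # BINUNICODE "_name"
    b"X\x06\x00\x00\x00gopher"  # BINUNICODE "gopher"
    b"s"                        # SETITEM: state["_name"] = "gopher"
    b"b"                        # BUILD: merge state into obj.__dict__
    b"."                        # STOP
)

seen = []


class MappingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called by the unpickling machine for every class reference.
        seen.append((module, name))
        if (module, name) == ("__main__", "Greeter"):
            return PortedGreeter
        raise pickle.UnpicklingError(f"unknown class {module}.{name}")


obj = MappingUnpickler(io.BytesIO(program)).load()
print(seen)                            # [('__main__', 'Greeter')]
print(type(obj).__name__, obj.greet())  # PortedGreeter Hi gopher
```
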
And there you go, you have your message: "Hi gopher".

As a final reference, here is the full list of interface types from the GoPickle types package which replace or emulate Python-specific behaviors or functions. They are especially vital for the correct functioning of the whole unpickling machine. If you are curious, you can have a look at the GoPickle types documentation, and also at the corresponding Python functions documentation. Here's also a quick overview of the Unpickler object's callbacks that you might want or need to customize in order to provide some guidance for the unpickling process. We already saw the FindClass callback in action. There are other callbacks you can define as well, for example for resolving objects by a persistent ID, handling custom pickle extensions, or handling particular data types or specific instructions. Also, keep in mind that some of these topics might be considered particularly advanced and might require some learning and time to get used to, and sometimes a certain intimate level of knowledge about the pickle module might be required as well. However, don't worry too much. Most of the time, even in real-world and more complex scenarios, the required level of customization doesn't differ much from what we just saw with our simple Greeter class.

As a bonus, once the whole unpickling machine was in place, implemented in Go, it turned out that the original intent of deserializing neural network models exported from the Python PyTorch machine learning framework was a fairly simple job. The Go code for doing that turned out to be particularly compact in size, and for that reason we decided to release it directly in the GoPickle library. So there is a pytorch subpackage which exposes types mapped from the original PyTorch Python implementation, and there's also a high-level pytorch.Load function to effectively load at least a subset of PyTorch models, also called modules.

This package is effectively used by the spaGO project, which I already mentioned before. spaGO is a machine learning framework for Go. Here is the link to the project. Especially if you're not a machine learning expert, spaGO comes with built-in tools and configurations to help you solve traditional machine learning problems. In particular, in the field of natural language processing, you can easily make use of state-of-the-art techniques to perform, for example, text classification, question answering, automatic machine translation, named entity recognition, and a lot of other cool things. spaGO implements all these functionalities, and you can also easily obtain ready-to-use pretrained neural network models, for example from the Hugging Face website. Hugging Face is a fantastic company, which most prominently started creating this sort of community where people can freely share their own pretrained models, and many of these models are actually generated by using PyTorch. Indeed, a subset of these models is compatible with spaGO, which provides high-level functions and also command line tools that can automatically download compatible models, load them thanks to the GoPickle library, and convert them to a spaGO-specific format. Finally, your application can perform a lot of wonderful things, and you don't even have to leave the terminal.

GoPickle is still a very young project. There's plenty of room for improvement, and a lot of tasks are still left to do. Among others, it's definitely desirable to have more tests and better test coverage, and more and better documentation. Maybe it would be cool to implement better error messages and clearer ways to inspect what's going on in the pickle programs. We should try to support more and more Python standard classes as well as PyTorch-specific classes, and performance might also be an interesting point to work on. In conclusion, here is my call to action for you.
Please go visit the GoPickle GitHub repository page. The easiest way to contribute is to simply share the link and, if you like, also give us a star. If you use the GoPickle library in your own projects and experiments, let us know how it goes. Feel free to come up with suggestions for fixes or improvements, and please also go on with your own pull requests. They are very, very welcome. Get in touch with us for any reason you want, even just to say hi. And finally, you can also support us via our fiscal sponsor at opencollective.com/nlpodyssey. So that's it. It has been a long journey, but I hope you enjoyed it. Thank you very much for your attention, and until next time.
Marco Nicola

Software Developer @ EXOP



