Conf42 Enterprise Software 2021 - Online

The technology behind "The Best Country for observing the Northern Lights"


Abstract

Having a background in data analysis and a bit of machine learning, I decided to approach this problem as a software engineer: define the problem, find consumable data sources, and write some code for the sake of "The Best Country for observing the Northern Lights".

Summary

  • Denys Makogon is a software engineer responsible for Java development at Oracle. In his spare time, he does landscape photography as a hobby. This talk is about how these two skill sets aligned to create the northern lights hunt.
  • The northern lights phenomenon is something that can be described as a set of measures. The presence of the northern lights in the sky is a coincidence of multiple factors. Visibility of the northern lights depends heavily on the season, the location, and the weather conditions.
  • The winning technology for me is Java, Java for data science. One of the greatest things that happened to the Java collections framework was the introduction of the Streams API. As long as the data can be effectively distributed with a parallel stream, the workload can be divided across a configurable number of threads.
  • The new HTTP client API provides synchronous and asynchronous request mechanics. Project Loom will help to reduce the overhead of threads for I/O-bound operations. Java has a lot to offer to data science and data analysis, and its ecosystem is still growing.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Let me first introduce myself. My name is Denys Makogon. I'm a principal member of technical staff in the Java Platform Group, responsible for Java development at Oracle. So, like many of you, I'm a software engineer, meaning I write code for a living. Alongside my work, in my spare time I do landscape photography as a hobby. People on the Internet call me some kind of a pro-level landscaper, maybe because one of my photos won a very prestigious award, the latest one being the best photo of 2019, showing the bloody annual pilot whale massacre at the Faroe Islands. Maybe that's why.

So what's this all about? You're here to listen to a quite interesting story of how my two skill sets, software development and professional landscape photography, aligned together, and how such a great collision turned into a great story of the northern lights hunt. It all started with a very simple question from a friend of mine: which country is the best to visit to witness the northern lights? I was rather uncertain what to tell him, so I told him the following: pick the one you like the most. Go to Alaska, Svalbard, Iceland, Norway, Finland, Canada, Sweden or Greenland, or the north of Scotland, the Faroe Islands or the Shetland Islands. If you haven't been to any of those, pick one that fits your budget.

But it made me think about the northern lights as a phenomenon that can be described by numbers from the point of view of physics. The presence of the northern lights in the sky is a coincidence of multiple factors. The northern lights phenomenon, those lovely dancing color waves in the sky, is something that can be described as a set of measures. I asked myself whether I could answer my friend's question in the manner of a software engineer with a certain background in data analysis. From that moment, the answer to the question of which country to visit was nothing but the ideal state of a system defined by a set of measures and expectations that can be combined into a set of math equations, and the result would be the exact place to go. So I decided to write a theory, though I hadn't done that since university, so sorry about that. As you'll see, the study is all about matching the data by certain criteria, doing math, and a bit of assumption based on years of northern lights observation.

In 2019, the idea of the northern lights case study was born. It's an attempt to give an answer in the simplest form consumable by non-tech people, because the whole theory of the northern lights is quite complex and you need to know a lot about our sun, the earth's magnetic field and a bit of electromagnetics in general. I don't want to waste your time on how all of this works and what the exact algorithm is, you can read it by yourself, but I'd like to highlight some key aspects of it. Before we do the deep dive into the technical aspects, I'd like to acknowledge two organizations. First is the Dark Sky weather service, which provides well-structured historical weather forecasts. I'd also like to acknowledge the Helmholtz Centre Potsdam, which provides yearly data sets of solar activity and magnetic field measurements.

The visibility of the northern lights is very dependent on when (the season matters), where (how far north or south you are), and what the weather conditions are. These are three things we're not in control of, but as engineers we do make decisions about which technology to use to accomplish our tasks in the most comfortable way.
The term comfortable is kind of tricky here, as it actually declares a set of demands on the technology, and the technology has to comply with them 100% rather than just go above some abstract desired threshold. So let's speak about those requirements.

The technology must be comfortable enough to work with sequences. In the case study, each country is represented by a set of GPS locations, nine or more. Each GPS location is described by eight daily measures, and each daily measure is described by an hourly weather forecast during the autumn, winter and spring seasons, around 226 days. That means we have 3000 measures at minimum per single location, so the amount of data, even for a single country, is quite big.

Alongside sequences we have pipelines. The ability to build reusable data pipelines is a quite critical feature, as the whole problem-solving algorithm is a set of sequenced perceptrons that process multiple interconnected data sets from Dark Sky and the Helmholtz Centre into compressed, joint, country-based statistics. Since the case study is some sort of a yearly challenge, the technology must be capable of running those perceptrons in parallel. The good thing is that the data is quite effectively splittable, because most of the time the data points are not dependent on each other and can be processed in parallel. Technically speaking, each country can have its own pipeline, which must be able to be parallelized to run perceptrons for each GPS location, for each weather forecast at the location, within nightly hours through the autumn, winter and spring seasons. So as you can see, all of this can be performed in parallel, and this is very critical for the case study to make it go really fast.

Both the Helmholtz Centre and Dark Sky provide their data through API endpoints, HTTP REST in particular. So in the context of parallel pipelines it's very important to reduce the time spent on blocking HTTP calls, ideally with some kind of non-blocking technique. For a software engineer, the type system of a programming language is quite important. In the case study the type system is not very critical, but still quite important, as there are some data types that require additional manipulation. All I wanted was consistent APIs to work with, because I don't really want to pull the hair out of my head because something is not supported or requires me to write a ton of code to get a simple value I can work with.

And the winning technology for me is, yeah, Java. Java for data science. Don't say you're surprised, aren't you? There are multiple whys, and I'm going to list some of them that I found critical for such tasks. One of the greatest things that happened to the Java collections framework was the introduction of the Streams API. The first iteration of streams was introduced in the JDK 8 release. As per the docs, streams are essentially classes that support functional-style operations on streams of elements, such as map-reduce transformations on collections. It may sound a bit complex, true, but in practice it's fairly easy to understand. The JDK docs speak about intermediate and terminal operations, which are combined into a pipeline that contains a source followed by intermediate operations and a terminal operation. It's worth saying that all intermediate operations on streams are lazy, meaning they won't execute unless a terminal operation is invoked. Here are a few examples of intermediate and terminal operations in the case study.
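As a rough illustration of those intermediate and terminal operations, here is a minimal Java sketch. The HourlyForecast record, its fields, and the nightly-hours filter are hypothetical stand-ins for the real case-study data model, not the actual code behind the study.

    // A minimal sketch of intermediate vs. terminal stream operations.
    // The HourlyForecast record and its fields are hypothetical placeholders.
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ForecastPipeline {

        // Hypothetical hourly measure: cloud cover in [0..1] for a location and hour.
        record HourlyForecast(String country, String location, int hourOfDay, double cloudCover) {}

        // Average nightly cloud cover per country: filter is a lazy intermediate
        // operation; collect is the terminal operation that triggers execution.
        static Map<String, Double> nightlyCloudCoverByCountry(List<HourlyForecast> forecasts) {
            return forecasts.stream()
                    .filter(f -> f.hourOfDay() >= 21 || f.hourOfDay() <= 3)  // nightly hours only
                    .collect(Collectors.groupingBy(
                            HourlyForecast::country,
                            Collectors.averagingDouble(HourlyForecast::cloudCover)));
        }
    }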
In the case study, streams are used as pipelines of intermediate operations, for example mapping three-hour frames to the highest magnetic field disturbance, but essentially they are used for effective and fast filtering and data processing, so basically both I/O- and CPU-bound operations.

No doubt streams are really cool, I totally agree with that, but that's not all. As the JVM can run multiple threads, streams can do that too. The Streams API utilizes the common fork/join pool to run operations against a number of threads in the pool. So no matter what you do, as long as the data in your collection can be effectively distributed with a parallel stream, the workload against it can be divided across a configurable number of threads, which can potentially give you a boost in performance. To understand how the runtime executes parallel streams, I highly recommend you look at the source code of the JDK. There are two possible places the fork/join pool for a stream can come from: the common pool, the one provisioned by the runtime with a customizable parallelism level that defaults to one less than the number of available CPUs on the machine, or a custom ForkJoinPool, a developer-defined pool with a configurable parallelism value that basically defines the number of threads in the pool (a short sketch of both options follows below).

Running streams in parallel is totally fun. However, when a stream executes in parallel, the runtime partitions the stream into multiple substreams; aggregate operations iterate over and process these substreams in parallel and then combine the results, and these operations can actually be quite expensive. The trick here is that parallel streams may not always serve their purpose most effectively, and there is a reason for that. While aggregate operations enable you to more easily implement parallelism, it's still the developer's responsibility to determine whether the application is suitable for parallelism. We all know that the runtime is fast; the only slow thing is your code. The effectiveness of parallel execution can be expressed in a pretty simple way: as long as the computational cost of processing a single item across the whole data set is higher than the overall infrastructure provisioning overhead, and the data set is effectively splittable, then processing the data set in parallel should be considered. Let me explain this. If the operation is very simple and its execution cost is lower than the overhead of thread provisioning, then the data set must be very big in order to make parallel processing worthwhile. It works the other way as well: if the computational cost of processing a single item is high and the data set is effectively splittable, then the size of the data set may not matter much for the parallel execution to be worthwhile. As we all know, the most effectively splittable collections include ArrayList and HashMap; data structures tend to be effectively splittable if they internally support random access, efficient search, or both.
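To make the two pool options concrete, here is a small hedged sketch. The list of 3000 placeholder measures and the cost() function are made up for illustration, and the custom-pool trick relies on behavior that is well known but not a documented contract.

    // A minimal sketch of the two ways a parallel stream can get its fork/join pool.
    import java.util.List;
    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class ParallelStreams {
        public static void main(String[] args) throws Exception {
            // Placeholder data set: 3000 "measures" per location, as in the case study.
            List<Integer> measures = IntStream.rangeClosed(1, 3000)
                    .boxed()
                    .collect(Collectors.toList());

            // 1. Common pool: parallelism defaults to availableProcessors() - 1,
            //    tunable via -Djava.util.concurrent.ForkJoinPool.common.parallelism=N.
            long commonPoolSum = measures.parallelStream()
                    .mapToLong(ParallelStreams::cost)
                    .sum();

            // 2. Custom pool: a parallel stream started from inside a ForkJoinPool task
            //    runs on that pool's threads (well known, though not a documented contract).
            ForkJoinPool pool = new ForkJoinPool(4);
            try {
                long customPoolSum = pool.submit(() -> measures.parallelStream()
                        .mapToLong(ParallelStreams::cost)
                        .sum()).get();
                System.out.println(commonPoolSum + " " + customPoolSum);
            } finally {
                pool.shutdown();
            }
        }

        // Stand-in for a non-trivial per-item computation.
        static long cost(int measure) {
            return (long) measure * measure;
        }
    }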
On the other side, if it takes longer to partition the data than to process it, the effort is kind of wasted. So as you can see, streams can be very efficient in terms of threading. But what about the memory footprint? A stream itself just stores its own metadata; it doesn't store anything from the collection. The only moment your application starts growing in memory because of a stream is when the stream is turned into a collection through a terminal operation.

So what have we covered so far? Streams are efficient in terms of memory and they can do things in parallel, but they still rely on Java threads, which are pretty heavy, and we all know about that. So is there a way to improve threading in Java? And the answer is yes. OpenJDK Project Loom intends to eliminate the frustrating trade-off between efficiently running concurrent programs and efficiently writing, maintaining and observing them. It leans into the strengths of the platform rather than fighting them, and also into the strengths of the efficient components of asynchronous programming. Project Loom lets you write programs in a familiar style, using familiar APIs, in harmony with the platform and its tools, but also with the hardware, to reach a balance of write-time and runtime costs that we hope will be widely appealing. It does so without changing the language and with only minor changes to the core libraries' APIs. A simple synchronous web server will be able to handle many more requests without requiring more hardware.

In terms of data processing, the new concept of virtual threads will make us, the people who work with data, rethink the stages at which we retrieve the data and process it, as Loom will help us increase the throughput of any I/O-bound task by executing more operations at the same time, because a virtual thread itself, and blocking on it, is pretty cheap. However, with Loom, data extraction should come before any CPU-bound operations, as the number of actual OS threads used by the fork/join pool is way smaller than the number of virtual threads. It means that if you have any I/O operations within a stream you'd like to process in parallel, it's highly recommended to extract the data first, retrieving it through Loom rather than through plain Java threads, which will still remain in Java and serve their purposes within Loom internally.

Going back to the other features the case study required, what I found very important in Java is the new HTTP API framework. A new HTTP client was incubated in JDK 9 and fully released in JDK 11. The new client provides synchronous and asynchronous request mechanics. It also supports HTTP/1.1 and HTTP/2, both synchronous and asynchronous programming models, handles request and response bodies as reactive streams, and follows the familiar builder pattern. So why is it so important? This is the optimized number of HTTP requests required to obtain the weather forecast for the case study. These HTTP calls are performed within the data processing pipeline, which technically means nearly 55k possible blocking calls while waiting for responses. From the resource utilization standpoint, it is unacceptable to have all that time wasted on just idling. With the JDK HTTP framework we would no longer need to write asynchronous code, just simple synchronous code, no more async anywhere, and the virtual thread scheduler will handle all the blocking.
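For reference, here is a minimal sketch of what the synchronous and asynchronous styles of the JDK 11 java.net.http client look like. The forecast URL is a placeholder, not the real Dark Sky or Helmholtz Centre endpoint.

    // A minimal sketch of the JDK 11 HTTP client: blocking send() vs. sendAsync().
    // The URL is a placeholder for illustration only.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.CompletableFuture;

    public class ForecastClient {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_2)   // falls back to HTTP/1.1 if needed
                    .build();

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/forecast?lat=64.1&lon=-21.9"))
                    .GET()
                    .build();

            // Synchronous: blocks the calling thread until the response arrives.
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());

            // Asynchronous: returns immediately with a CompletableFuture.
            CompletableFuture<String> body =
                    client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                          .thenApply(HttpResponse::body);
            System.out.println(body.join().length());
        }
    }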
That is way cheaper than blocking on OS threads. Still, the choice of the threading back end, whether an app will run on virtual or OS threads, is the responsibility of the developer. You write your code, and then somebody chooses whether to run that code on a virtual thread or on an OS thread. Virtual threads are still threads; they are the same construct, the runtime just allows you to make many more of them. More threads mean more requests executing in parallel. If there is a need to run multiple requests in parallel, it's highly recommended to use the new virtual-thread executor from Project Loom (a sketch follows at the end of this section).

Let's get to the point. When people start talking about data analysis and data science, they mostly talk about the tools they are used to working with, and Java usually isn't one of them, except in those cases where apps written in Java are involved, like Hadoop, Spark, etc. However, Java is one of the most powerful tools for concurrent data processing. So no matter whether the processing is CPU-bound or I/O-bound, if there is a need to process data of the same nature, it's highly recommended to use parallel streams, because it's the most comfortable and the most efficient way to process data of the same nature. When it's necessary to retrieve some data through I/O channels, the new HTTP framework will do its job at its best by providing the necessary features for both synchronous and asynchronous requests, and Project Loom, which will be released sometime in the future, will also improve threading by default within the Java runtime. Moreover, as I said, things will get better when we apply Project Loom to threaded apps, not just because of the new concept of a virtual thread, but because of the way we treat threads. Project Loom will help reduce the overhead of threads for I/O-bound operations, which will allow developers to execute more parallel operations, HTTP requests in particular, retrieving more data within a reasonable time while consuming fewer resources.

For me, as a hobbyist, Java did a great job with very little effort. It was more complicated to write the theory and define the data processing pipeline on paper than to code it. Java streams played well for data processing; the sync and async HTTP APIs helped to reduce the time and resources consumed while working with the data coming from the Dark Sky APIs and the Helmholtz Centre; and Project Loom just skyrocketed the threading performance and made the whole execution even more efficient than it is right now.

At this point you may have a question: why didn't I pick something like Python, which is pitched as the best tool for data science and machine learning? Well, the answer is the nature of Python. It's technically slow, with a single-threaded runtime, and most of the data science and machine learning code that exists right now is nothing but C or C++ code with a very thin Python wrapper, plus the interpreter as a dependency. But Java has a lot to offer to data science and data analysis, the Java ecosystem is still growing, and more tools are being developed and introduced to the community, like Tribuo, which will occupy its market share as a dedicated tool for machine learning and data science in Java. But there is no point in sitting around waiting for that moment.
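Circling back to the virtual-thread executor recommendation above, here is a small hedged fan-out sketch. The endpoints are placeholders, and Executors.newVirtualThreadPerTaskExecutor() is the API as it was eventually finalized in JDK 21; at the time of this talk Project Loom was still a preview.

    // A minimal sketch of fanning out many blocking HTTP calls over virtual threads.
    // Requires a JDK with Project Loom's virtual threads (finalized in JDK 21).
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class VirtualThreadFanOut {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            List<URI> endpoints = List.of(
                    URI.create("https://example.com/forecast?loc=1"),
                    URI.create("https://example.com/forecast?loc=2"),
                    URI.create("https://example.com/forecast?loc=3"));

            // One cheap virtual thread per request; blocking in send() parks the
            // virtual thread instead of tying up an OS thread.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                List<Future<String>> results = endpoints.stream()
                        .map(uri -> executor.submit(() -> client.send(
                                HttpRequest.newBuilder(uri).GET().build(),
                                HttpResponse.BodyHandlers.ofString()).body()))
                        .toList();
                for (Future<String> result : results) {
                    System.out.println(result.get().length());
                }
            }
        }
    }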
As you can see, even with the JDK alone it is possible to do a lot of cool things. Just look at what I did just for fun. Going back to the very beginning, to where it all started, to the simple question my friend asked, what is the best country to visit to witness the northern lights? The answer is on your screen. But as I said before, pick the one you like the most and just go there, because traveling right now is very important for well-being. And with that, we're kind of done. I hope you found this talk interesting, even though it wasn't packed with too much technical stuff. Anyway, thank you for listening, and if you have any questions you can @ me on Twitter. Also subscribe to our Twitter handles and to our Inside Java blog, which is kind of cool and is one of the most-read Java blogs on the Internet right now. Thank you.
...

Denys Makogon

Principal Software Development @ Oracle



