Transcript
This transcript was autogenerated. To make changes, submit a PR.
Let me first introduce myself. My name is Denys Makogon.
I'm a Principal Member of Technical Staff, a member of the Java Platform
Group, which is responsible for Java development at Oracle.
Like many of you, I'm a software engineer, meaning I
write code for a living. In my spare time,
I do landscape photography as a hobby. People on the Internet call me
some kind of pro-level landscaper, maybe because one
of my photos won very prestigious awards, and the latest one is
the best photo of 2019, of the bloody annual pilot whale
massacre at the Faroe Islands. Maybe that's why. So what's
this all about? You're here to listen to a quite interesting story
of how both my skill sets, software development
and professional landscape photography, aligned together, and how
such a great collision turned into a great story of the northern
lights hunt. It all started with a very simple question from
a friend of mine: which country is the best to visit to witness
the northern lights? I was kind of uncertain what to tell him, so I
told him the following: pick the one you like the most. Go to Alaska,
Svalbard, Iceland, Norway, Finland, Canada,
Sweden or Greenland, or the north of Scotland, the Faroe Islands or
the Shetland Islands. If you haven't been to any of those, pick
one based on your budget.
But it made me think about the northern lights as a phenomenon that can
be described by numbers, from the point of view of physics.
The fact of the northern lights' presence in
the sky is a coincidence of multiple factors. The northern lights
phenomenon, lovely dancing color waves in the sky,
is something that can be described as a set of measures.
I asked myself whether I could answer my friend's question in
the manner of a software engineer with a certain background in data
analysis. From that moment, the answer to this
question, which country to visit, was nothing but the
ideal state of a system defined by a set of
measures and expectations that can be combined
into a set of math equations, and the
result would be the exact place to go. So I
decided to write a theory, though I hadn't done that since
university, so sorry about that. As you'll see, the study is all
about matching the data by certain criteria,
doing the math, and a bit of assumption based on years of northern
lights observation.
In 2019, the idea of the northern lights case study was born.
It's an attempt to give an answer in
the simplest form, consumable by non-tech people, because the
whole theory of the northern lights is quite complex and
you need to know a lot about our sun, the earth,
magnetic fields, and a bit of electromagnetics in general.
I don't want to waste your time on how all of this works
and what the exact algorithm is; you can read it by yourself,
but I'd like to highlight some key aspects of it. But before we go
and do the deep dive into the technical aspects, I'd like to
acknowledge two organizations. The first is the Dark Sky
weather service, which provides very well structured historical
weather forecasts. I'd also like to acknowledge the
Helmholtz Centre Potsdam, which provided the yearly data
set with solar activity and magnetic field measurements.
The visibility of the northern lights is very dependent on when (the season matters),
where (far north or far south), and what
the weather conditions are. These are three things that we're
not in control of, but as engineers we do make decisions
about which technology to use to accomplish our tasks in
the most comfortable way. The term "comfortable" is
kind of tricky here, as it actually declares a set of demands on the technology,
and it has to comply with them 100% rather than just go above
some abstract desired threshold. So let's speak about those requirements.
The technology must be comfortable enough to work with sequences.
In the case study, each country is represented by a set of GPS
locations, nine or more. Each GPS location is
described by eight daily measures. Each daily measure is described
by an hourly weather forecast during the autumn, winter and
spring seasons, roughly 226 days.
That means we have 3,000 measures at minimum
per single location, so the amount of data, even for a
single country, is quite big. Alongside
sequences we have pipelines. The ability to build reusable
data pipelines is quite a critical feature, as the
whole problem-solving algorithm is a set of sequenced
perceptrons that process multiple interconnected
data sets from Dark Sky and the Helmholtz Centre into
compressed, joint, country-based statistics.
Since the case study is some sort of a yearly challenge,
the technology must be capable of running the perceptrons
in parallel. The good thing is that the data is quite effectively
splittable, because most of the time the data points
are not dependent on each other and can be processed in parallel.
Technically speaking, each country can have its own pipeline,
which must be able to be parallelized to run perceptrons for
each GPS location, for each weather forecast at that
location, within the nightly hours through the autumn, winter and spring
seasons. So as you can see, all of this can be
performed in parallel, and that is very critical for making the case study
run really fast. Both the Helmholtz Centre
and Dark Sky provide their data through API
endpoints, HTTP REST in particular. So in
the context of parallel pipelines, it's very important to reduce
the time spent on blocking HTTP calls, ideally with some
kind of non-blocking technique. For a software engineer,
the type system of a programming language is quite important. In
the case study the type system is not very critical, but it's still
quite important, as there are some data types that require
additional manipulation. So all I wanted were
APIs that are comfortable to work with, because I don't really want to pull the
hair out of my head because something is not supported or requires me
to write a ton of code to get a simple value that I
can work with. And the winning technology for me is,
yes, Java. Java for data science.
Don't say you're surprised, are you? There are multiple whys, and
I'm going to list some of them that I found critical for
such tasks. One of the greatest things that happened to the Java
collections framework was the introduction of the Streams API.
The first iteration of streams was introduced in the JDK
8 release. As per the docs, streams are essentially
classes that support functional-style operations on streams of elements,
such as map-reduce transformations on collections.
It may sound a bit complex, true, but in practice it's
fairly easy to understand.
The JDK docs talk about intermediate and terminal operations, which
are combined into the form of a pipeline that contains a source
followed by intermediate steps and a terminal operation.
It's worth saying that all intermediate operations on streams are lazy,
meaning they won't execute unless a terminal operation is declared.
Here are a few examples of intermediate and terminal operations
in the case study. Streams are used as a
pipeline of intermediate operations, like
mapping three-hour frames to the highest
magnetic field disturbance, but essentially they are used to do effective
and fast filtering and data processing, so basically
both I/O and CPU bound operations.
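To make that a bit more concrete, here is a minimal sketch of such a pipeline. The HourlyMeasure record, its fields and the three-hour bucketing are hypothetical stand-ins for the case study's real types, not the actual project code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical hourly measure: hour of day, cloud cover (0..1), magnetic disturbance (Kp).
record HourlyMeasure(int hourOfDay, double cloudCover, double kpIndex) {}

class PipelineSketch {
    // For one night at one location: keep the clear-enough hours (intermediate filter),
    // group them into three-hour frames, and take the highest disturbance per frame
    // (terminal collect). Nothing runs until collect() is called.
    static Map<Integer, Double> worstDisturbancePerFrame(List<HourlyMeasure> night) {
        return night.stream()
                .filter(m -> m.cloudCover() < 0.3)
                .collect(Collectors.groupingBy(
                        m -> m.hourOfDay() / 3,
                        Collectors.collectingAndThen(
                                Collectors.maxBy(Comparator.comparingDouble(HourlyMeasure::kpIndex)),
                                max -> max.map(HourlyMeasure::kpIndex).orElse(0.0))));
    }
}
```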
No doubt, streams are really cool, I totally agree with that,
but that's not all. As the JVM can run multiple threads,
streams can do that too. The Streams API utilizes the common
fork-join pool to run operations against a number of threads
in the pool. So no matter what you do, as long as the data in
your collection can be effectively distributed with a parallel stream,
the workload against it can be divided among a configurable number
of threads, which can potentially give you a boost in performance.
To understand how the runtime executes parallel streams, I highly recommend
you look at the source code of the JDK. There are two ways
to define the fork-join pool for the Streams API,
two possible scenarios for where the pool can come from: the common pool,
the one that is provisioned by the runtime with a customizable parallelism
level, by default one less than the number of available
CPUs on the machine; or a custom ForkJoinPool,
a developer-defined pool with a customizable parallelism
value that basically defines the number of threads in the pool.
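As a minimal sketch of the second scenario, here is a parallel stream submitted to a developer-defined ForkJoinPool. The locations and the per-location computation are placeholders, and wrapping the stream in pool.submit() is a commonly used idiom that relies on fork-join work staying in the submitting pool rather than an official Streams API contract.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;

class ParallelPoolSketch {
    public static void main(String[] args) throws Exception {
        List<String> locations = List.of("Tromso", "Abisko", "Rovaniemi", "Reykjavik");

        // Without this, a parallel stream runs on ForkJoinPool.commonPool(),
        // whose default parallelism is availableProcessors() - 1.
        ForkJoinPool customPool = new ForkJoinPool(4); // 4 worker threads
        try {
            long clearNights = customPool.submit(() ->
                    locations.parallelStream()
                             .mapToLong(ParallelPoolSketch::countClearNights) // CPU-bound work per item
                             .sum())
                    .get(); // block until the pool has finished the whole pipeline
            System.out.println("Clear nights overall: " + clearNights);
        } finally {
            customPool.shutdown();
        }
    }

    // Placeholder for the real per-location computation.
    static long countClearNights(String location) {
        return location.length();
    }
}
```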
Running streams in parallel is totally fun. However, when a
stream executes in parallel, the runtime partitions it
into multiple substreams. Aggregate operations
iterate over and process these substreams in
parallel and then combine the results, so these operations can
actually be quite expensive. The trick here is that
a parallel stream may not always serve its purpose most effectively,
and there is a reason for that. While aggregate operations enable
you to implement parallelism more easily, it is still the
developer's responsibility to determine whether the application is suitable
for parallelism. We all know that the runtime is fast.
The only slow thing is your code, and the effectiveness of
parallel execution can be expressed in a pretty simple
way: as long as the computational cost of processing a single item,
taken across the whole data set, is higher than the overall
infrastructure provisioning overhead, and the data set is effectively splittable,
then processing the data set in parallel should be considered.
Let me explain this. If the operation is very simple and its
execution cost is lower than the overhead of thread provisioning,
then the data set must be very big in order
to make parallel processing worthwhile. It works the
other way as well: if the computational cost of processing a single item
is higher, and the data set is effectively splittable,
then the size of the data set may not matter in order to make
parallel execution worthwhile. As we all know,
the most effectively splittable collections include
ArrayLists and HashMaps. Data structures tend to be effectively splittable
if they internally support random access, efficient search,
or both. On the other hand, if it takes longer to partition the
data than to process it, the effort is kind of wasted.
So as you can see, streams can be very efficient in terms of
threading. But what about the memory footprint? A stream itself
just stores its own metadata; it doesn't store anything about
the collection. The only moment when your application starts gaining
in memory because of a stream is when the stream turns into a
collection through a terminal operation. So what have we already covered
here? We have that streams are efficient in terms of memory.
They can do things in parallel, but they still rely on Java threads,
which are pretty heavy, and we all know about that. So is
there a way to improve threading in Java? The answer is yes.
The OpenJDK Project Loom intends to eliminate the
frustrating trade-off between efficiently running concurrent programs and
efficiently writing, maintaining and observing them. It leans
into the strengths of the platform rather than fighting them,
and also into the strengths of the efficient components
of asynchronous programming. Project Loom lets you write programs
in a familiar style, using familiar APIs, in
harmony with the platform and its tools, but also with the hardware, to
reach a balance of write-time and runtime costs that we hope
will be widely appealing. It does so without changing
the language, and with only minor changes to the core library
APIs. A simple synchronous web server will be able to
handle many more requests without requiring more hardware.
In terms of data processing, the new concept of virtual threads
will make us, the people who work with data, rethink the
stages at which we retrieve the data and process it,
as Loom will help us increase the throughput of
any I/O-bound tasks by executing more operations
at the same time, because a virtual thread itself, and blocking on
it, is pretty cheap. However, with Loom, data extraction
should come before any CPU-bound operations, as the
number of actual OS threads used by the fork-join
pool is way smaller than the number of virtual threads.
It means that if you have any I/O operations within
a stream that you'd like to process in parallel,
it's highly recommended to extract the data first and
retrieve it through Loom, not through plain Java threads,
which will still remain in Java and will serve their purposes
within Loom internally.
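A minimal sketch of that ordering, assuming a Loom-enabled JDK where Executors.newVirtualThreadPerTaskExecutor() is available (the factory name has changed across preview builds); fetchForecast() and score() are placeholders, not the case-study code.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExtractThenProcessSketch {
    public static void main(String[] args) throws Exception {
        List<String> locations = List.of("tromso", "abisko", "rovaniemi");

        // Step 1: I/O-bound extraction on virtual threads; blocking them is cheap.
        List<String> rawForecasts;
        try (ExecutorService io = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Callable<String>> calls = locations.stream()
                    .map(loc -> (Callable<String>) () -> fetchForecast(loc))
                    .toList();
            List<Future<String>> done = io.invokeAll(calls); // waits for every call
            rawForecasts = done.stream().map(f -> {
                try {
                    return f.get();
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            }).toList();
        }

        // Step 2: CPU-bound scoring afterwards, on the common fork-join pool.
        double best = rawForecasts.parallelStream()
                .mapToDouble(ExtractThenProcessSketch::score)
                .max()
                .orElse(0.0);
        System.out.println("Best nightly score: " + best);
    }

    static String fetchForecast(String location) throws InterruptedException {
        Thread.sleep(100); // stands in for a blocking HTTP call
        return location + ":clear-sky";
    }

    static double score(String forecast) {
        return forecast.length(); // pretend CPU-bound work
    }
}
```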
Going back to the other features that the case study required:
what I found very important in Java is the new HTTP client API.
The new HTTP client was incubated in JDK 9 and
fully released in JDK 11. The new client provides
synchronous and asynchronous request mechanics. It also supports
HTTP/1.1 and HTTP/2, both in
synchronous and asynchronous programming models, handles request and response
bodies as reactive streams, and follows the familiar
builder pattern. So why is this very important? Because
of the number of HTTP requests required
to obtain the weather forecasts for
the case study. These HTTP calls are performed within the data
processing pipeline, which technically means nearly 55k
possible blockings while waiting for responses. From the
resource utilization standpoint, it is unacceptable to have all that
possible time wasted on just idling. In the
case of the JDK HTTP framework, we would no longer
need to write asynchronous code, but just simple synchronous code.
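Here is a minimal sketch of that client in both modes; the endpoint URL is a placeholder rather than the actual Dark Sky or Helmholtz Centre API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

class HttpClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2) // falls back to HTTP/1.1 when needed
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.org/forecast"))
                .GET()
                .build();

        // Synchronous: blocks the calling thread until the response arrives.
        HttpResponse<String> sync = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("sync status: " + sync.statusCode());

        // Asynchronous: returns a CompletableFuture immediately.
        CompletableFuture<HttpResponse<String>> async =
                client.sendAsync(request, HttpResponse.BodyHandlers.ofString());
        async.thenApply(HttpResponse::body)
             .thenAccept(body -> System.out.println("async body length: " + body.length()))
             .join(); // wait here only so the demo prints before exiting
    }
}
```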
No more async anywhere: the virtual thread scheduler will
handle all the blocking, and that is way cheaper than blocking
on OS threads. Still, the choice of
the thread backend is the developer's responsibility:
will an app run on virtual or OS
threads? You write your code, then somebody chooses
whether to run that code on a virtual thread or on an OS
thread. Virtual threads are still threads; they are the same construct,
just one the runtime allows you to make many more of. More threads
mean more requests executing, and running in parallel becomes
a real thing. If there is a need to run multiple requests in parallel,
it's highly recommended to use the new virtual thread
executor to do multiple parallel requests with Project
Loom, as in the extract-then-process sketch earlier; the sketch below shows the per-thread choice itself.
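A minimal sketch of that choice, assuming a Loom-enabled JDK with the Thread.ofPlatform()/Thread.ofVirtual() builders (earlier preview builds used different names): the task stays the same, only the backing of the thread changes.

```java
class ThreadChoiceSketch {
    public static void main(String[] args) throws InterruptedException {
        // The same code you would write either way; it has no idea what backs it.
        Runnable task = () -> System.out.println(
                "fetching a forecast on " + Thread.currentThread());

        // Whoever starts the task chooses the backend: an OS thread...
        Thread platform = Thread.ofPlatform().name("os-thread").start(task);
        // ...or a virtual thread, cheap enough to create by the thousands.
        Thread virtual = Thread.ofVirtual().name("virtual-thread").start(task);

        platform.join();
        virtual.join();
    }
}
```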
Let's get to the point. When people start talking about data analysis and
data science, they mostly talk about the tools they are used
to working with, and Java wasn't one of them,
except in those cases where some of the apps written in Java are involved,
like Hadoop, Spark, and so on. However, Java is
one of the most powerful tools for concurrent data processing.
So no matter whether the processing is CPU bound or
I/O bound, if there is a need to process data of
the same nature at the same time, it's highly recommended to use parallel streams,
because they are the most comfortable and the most efficient way to process
data of the same nature. When it's necessary to retrieve some
data through I/O channels, the new HTTP framework
will do its job at its best by providing the necessary features for
both synchronous and asynchronous requests. Also take into
account Project Loom, which will be released sometime in the future
and will further improve threading by default within the Java
runtime. Moreover, as I said, things will get
better when we apply Project Loom to threaded apps,
not just because of the new concept of a virtual thread, but because
of the way we treat threads. Project Loom will help to
reduce the overhead of threads for I/O-bound operations,
which will allow developers to execute more parallel operations, HTTP requests
in particular, retrieving more data in a reasonable time while
consuming fewer resources. For me, as a hobbyist, Java did
a great job with very little effort. It was more complicated to write the
theory and define the data processing pipeline on paper than to code
it. Java streams played well for data processing. The sync and
async HTTP APIs helped to reduce the time and the resources consumed
while working with the data coming from the Dark Sky APIs and the Helmholtz
Centre. And Project Loom just skyrocketed the thread performance
and made the whole execution even more efficient than it
is right now. At this point, you may have a question: why
haven't I picked something like Python, which is pitched as the
best tool for data science and machine learning? Well, the answer is
the nature of Python: it's technically slow,
with a single-threaded runtime, and most of the data science and
machine learning code that exists right now is nothing but
C or C++ code with a very thin Python wrapper,
plus the interpreter as a dependency. But Java has
a lot to offer to data science and data analysis, and the
Java ecosystem is still growing, with more tools being developed
and introduced to the community, like Tribuo, which will
occupy its market share as a specific tool for
machine learning and data science in Java. But there
is no point in sitting and waiting for that moment. As you can see,
even with the JDK alone it is possible to do
plenty of cool things.
Just look at what I did, just for fun. Going back to the very beginning of where it all
started, to the simple question that my friend asked, what is
the best country to visit to witness the northern lights? The
answer is on your screen. But as I said
before, pick the one you like the most and just go there, because
traveling right now is very important for
well-being. And with that, we're kind of done. I hope you find this
talk interesting, even though it wasn't packed with that much technical
stuff. Anyway, thank you for listening, and if you have any questions you can
@ me on Twitter. Also subscribe to our Twitter
handles, and subscribe to our Inside Java blog, which is kind
of cool and is one of the most read Java blogs on the Internet
right now. Thank you.