Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm Gar Hel, senior software engineer with Northshore Technology, and I'm excited to talk to you about how AI is transforming the way organizations manage and transform their data.
So today we'll talk a little bit about what ETL is and what the problems with traditional ETL systems are. We'll look at how AI-powered ETL is the new way to solve some of these problems. We'll take a deep dive into the various parts of the ETL process and how AI is improving them. We'll also look at how ETL pipelines are maintained, how the infrastructure for ETL pipelines is managed, and how AI can improve some of those things. We'll look at some of the popular tools and platforms that are used for this, and we'll end with some practical examples and future trends.
Alright, so what is ETL?
ETL is really the backbone of data warehousing and business intelligence. Organizations are accumulating vast amounts of data, and they need a way to report on it and turn it into actionable insights. ETL helps with that by streamlining the process: all this raw information is consolidated and brought into a form that makes it usable for analysis and decision making.
So what are the problems with traditional ETL systems?
One is, of course, complexity. The extract part, for example, involves getting data out of a variety of different sources. These sources could be scattered, and they could have data in a form that is not readily usable. Data extraction involves someone writing scripts. Of course, there are tools and platforms that can be used for this, but ultimately it involves getting a sense of what shape the data is currently in and getting it out. A lot of times the data could be completely unstructured, and there is no non-AI way of doing this efficiently; typically it would involve a lot of human input in terms of tagging the information and getting data points out of it.
The quality of the data, as we get it out of these different systems and consolidate it, could be low, because the shape of the data keeps evolving over time. So the scripts that were originally written to extract the data might need to be tweaked on an ongoing basis, and might develop problems on an ongoing basis.
And of course, there could be scaling challenges, because it's hard to predict ahead of time what kind of infrastructure one needs to build the ETL pipeline. As the data volume grows or shrinks, your need for a specific level of infrastructure also changes, so there is ongoing input and tweaking needed for the hardware that runs your pipeline. All of this requires ongoing manual work.
So, hello AI.
AI helps in two broad ways. Using AI, we can build a better pipeline by improving the way the extract, transform, and load steps of the pipeline work. Using AI, we can also improve the way the pipeline runs, meaning we can provision our infrastructure in a better way, we can monitor it in a better way, and we can make sure it stays performant.
Alright, so let's look at the extract part of the process.
We know that the extract step involves getting data from all the different sources. This data may be in a variety of different formats and could have different schemas that now need to be joined together to create a consolidated view. Traditionally, this is a manual process. Yes, you have tools for this, so it's not always scripting, although it could be; there are visual tools that you can use. But ultimately, someone with techno-functional knowledge needs to look at the different data sources and say what element is what, and what the best way to extract it is. All this means that the processing speed for getting data out of those systems isn't as high as it could be, because of all the manual input and because of all the different types of data that need to be extracted.
So AI is a natural fit for doing this job, because AI can adapt not just to the shape the data may be in at a given point, but to how it's evolving.
Alright, there are two high-level ways in which AI helps with this.
One: AI is a natural fit for processing unstructured information. If the data that we're getting is not structured, or is semi-structured, AI can make sense of it and help extract the different data elements or data entities from it. AI is also good at looking at a data source whose structure you may not know. For example, there may be a webpage, a PDF document, or a document of another kind, which has some kind of structure that may change with time. AI is a natural fit for figuring out that structure so that data can be extracted from it in an optimal way.
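As a minimal sketch of what such an extraction step could look like, here the AI call is stubbed out with a simple heuristic so the example is runnable; `extract_entities` and the invoice text are purely hypothetical, and in a real pipeline the stub body would be replaced by a call to whatever extraction model you use.

```python
import json

def extract_entities(document: str) -> dict:
    """Stub for an AI extraction step: given loosely structured text,
    return the structured fields the pipeline expects.
    A real implementation would call an LLM or NER model here."""
    fields = {}
    for line in document.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields

# A semi-structured document whose layout could change over time.
doc = """Invoice Number: INV-1042
Customer: Acme Corp
Total: 1,499.00"""

record = extract_entities(doc)
print(json.dumps(record, indent=2))
```

The point of the sketch is the shape of the step, not the heuristic: downstream transform and load stages only ever see the consolidated `record`, regardless of how the source document's layout drifts.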
So, in terms of transformation: a typical pipeline has a bunch of transformation steps where we are looking for problems with the data. We're looking for data that might be missing, or in an incorrect format, or an unexpected format. AI can handle all those things. For example, using AI we can build a model for what the data should look like, and if the data is in a different shape, or has values that are outside the ranges they typically fall in, then AI can flag them. AI can also notice that some data that should be present is not present.
AI is also good at handling data that may be available but in a non-standard format: it can map from the current format to the expected format so the data can be extracted. AI can also identify that certain data values are outside their expected range, outliers, and of course AI can put a human in the loop if needed. If something is so far outside the range that AI doesn't know how to handle it, it's possible to build the pipeline in a way that pulls in a human being to say, hey, look at this, this is something I don't know how to handle.
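As a minimal sketch of these validation ideas, assuming a simple hard-coded expected-range model rather than a learned one (the field names and ranges are invented), records with missing fields or out-of-range values are flagged for human review:

```python
# Expected shape of the data: required fields and typical value ranges.
# In practice a model would learn these rather than have them hard-coded.
EXPECTED = {
    "temperature": (-40.0, 60.0),   # plausible sensor range
    "humidity": (0.0, 100.0),
}

def validate(record: dict) -> list[str]:
    """Return a list of issues; an empty list means the record is clean."""
    issues = []
    for field, (lo, hi) in EXPECTED.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not lo <= record[field] <= hi:
            issues.append(f"outlier in {field}: {record[field]}")
    return issues

records = [
    {"temperature": 21.5, "humidity": 40.0},
    {"temperature": 900.0},  # out-of-range value and a missing field
]
for r in records:
    problems = validate(r)
    if problems:
        print("needs human review:", problems)
```

Clean records pass straight through; only records with issues get routed to a human, which is the human-in-the-loop pattern described above.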
Alright. Also, while loading the data, AI can figure out the right infrastructure that is needed to perform the load operation. For example, if there is a data spike, or there is less data than is typically present in the pipeline, then AI can provision infrastructure, or scale it up and down, in such a way that the load step happens optimally. If there are any errors during the load process, AI is typically able to handle things like an incorrect data type, say a number stored as a string, and figure out how to handle it so that it is loaded correctly. And of course, even in the load process, AI can smartly bring a human into the loop if the load is not going as planned, or if things are happening that AI is unable to handle.
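A tiny sketch of that load-time coercion idea, with a made-up target schema; the numbers-stored-as-strings case is handled by coercing each field to the type the destination table expects before loading:

```python
# Hypothetical target schema for the load step.
SCHEMA = {"order_id": int, "amount": float, "customer": str}

def coerce(record: dict) -> dict:
    """Coerce each field to its target type, e.g. a number stored
    as a string becomes a real number before loading."""
    clean = {}
    for field, target_type in SCHEMA.items():
        value = record[field]
        clean[field] = value if isinstance(value, target_type) else target_type(value)
    return clean

row = {"order_id": "1042", "amount": "19.99", "customer": "Acme"}
print(coerce(row))  # numeric fields arrive as strings, leave as numbers
```

A real system would wrap the conversion in error handling and send unconvertible records to a human, per the human-in-the-loop point above.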
Okay, now let's look at some of the ways in which AI can help us run our pipeline in a better way.
Alright. Using AI, we can build a model for the kinds of failures one can see. For example, AI can pull in logs and time-series data from a variety of different sources and build a model for the kinds of failures one can expect in a certain situation. Then it can proactively flag them, so that we know a failure might happen, or that there is a need for a human being to monitor certain parts of the process. So it can figure out these patterns by looking at information from a variety of sources, from logs, from sensor data, and it can identify patterns and raise these kinds of flags.
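As a very simplified stand-in for such a model (a real one would learn thresholds from historical logs and time-series data; the window, threshold, and log values here are invented), a rolling error rate over recent pipeline events can raise a flag before a hard failure:

```python
from collections import deque

def error_rate_monitor(events, window=5, threshold=0.4):
    """Yield a warning whenever the fraction of errors among the
    last `window` events exceeds `threshold`."""
    recent = deque(maxlen=window)
    for i, is_error in enumerate(events):
        recent.append(is_error)
        if len(recent) == window and sum(recent) / window > threshold:
            yield f"event {i}: error rate rising, flag for human review"

# 1 = error, 0 = success, as derived from pipeline logs.
log = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
for warning in error_rate_monitor(log):
    print(warning)
```

The warnings fire while the pipeline is still mostly succeeding, which is exactly the proactive, rather than retroactive, behavior described here.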
Using AI, we can do proactively some of the things that, in a traditional pipeline, would typically happen retroactively, when something goes wrong. And then, using all these signals, AI can figure out what the right level of infrastructure is for running a pipeline. It can, for example, suggest that some compute needs to be scaled up or down, or that storage needs to be scaled up or down, because the data coming through the pipeline is more, or has a shape that is different from what it typically is. In this way, AI can help identify bottlenecks and help our pipeline run smoothly. Of course, this also helps with cost optimization, because with AI we can right-size our ETL pipeline infrastructure, so we are not over-provisioning or under-provisioning.
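A toy sketch of that right-sizing decision; the utilization bands are invented, and a real system would derive them (and richer signals than a single number) from observed load:

```python
def scaling_decision(utilization: float, low=0.3, high=0.8) -> str:
    """Recommend a scaling action from current resource utilization
    (0.0-1.0). Staying inside the [low, high] band avoids both
    under-provisioning and paying for idle capacity."""
    if utilization > high:
        return "scale up"
    if utilization < low:
        return "scale down"
    return "hold"

for u in (0.95, 0.55, 0.10):
    print(f"utilization {u:.0%}: {scaling_decision(u)}")
```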
Okay, so there are a number of different tools and platforms available today, so this doesn't need to be built by hand. Here's a small snapshot of some of the popular tools that exist out there, and depending on what we are trying to achieve, or rather what we are trying to focus on, we can pick different tools. For example, there are tools like AWS Glue, which of course works with AWS; if that's where our solution is, then AWS Glue gives us a variety of different capabilities. We can create our models and do mapping, and a number of steps of our pipeline can be hosted and automated using AWS Glue. If you want to go the open-source route, Airbyte is an excellent tool that also lets us do data processing. We can create models, and we can process unstructured information, using Airbyte.
Alright.
So how can we get started with this? One obvious low-hanging fruit that we can incorporate in our ETL pipelines is the use of sentiment analysis. If we are dealing with unstructured information that doesn't have sentiment data, there are a variety of tools and algorithms that we can use to tag the incoming data. Even if it is semi-structured, we can tag it with sentiment data that subsequent steps in our pipeline can then use.
We can also look at using AI to plug holes that may be in our data. For example, some information that is missing can be filled in using AI.
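One simple form of that hole-plugging is imputation: filling a missing numeric value from the values that are present. A sketch, assuming mean imputation is acceptable for the field in question (an AI-driven pipeline could pick a smarter fill strategy per field):

```python
def impute_mean(records: list[dict], field: str) -> list[dict]:
    """Fill missing values of `field` with the mean of the present ones."""
    present = [r[field] for r in records if r.get(field) is not None]
    mean = sum(present) / len(present)
    return [{**r, field: mean} if r.get(field) is None else r
            for r in records]

rows = [{"price": 10.0}, {"price": None}, {"price": 30.0}]
print(impute_mean(rows, "price"))  # the missing price becomes the mean, 20.0
```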
An excellent use case for AI in pipelines is fraud detection. Most of us have faced a situation where our credit card provider suddenly says, hey, this transaction can't go through because it looks fraudulent. So how do they know it looks fraudulent? It's because they have so much information that they're able to see that, hey, for this user, this is typically where transactions are made, or this is the range the transaction amount typically lies in. And if something falls outside that, they can flag it. So that's another example of using anomaly detection, here to detect if fraud is happening.
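A toy version of that amount-range check; real systems model many signals, not just the amount, and the figures here are made up. A transaction is flagged when its amount is a statistical outlier relative to the user's own history:

```python
import statistics

def is_suspicious(history: list[float], amount: float, max_z: float = 3.0) -> bool:
    """Flag a transaction whose amount is more than `max_z` standard
    deviations away from this user's historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(amount - mean) / stdev > max_z

# A user's typical transaction amounts.
past = [12.50, 30.00, 18.75, 25.00, 22.10, 15.30]
print(is_suspicious(past, 24.00))   # within the usual range
print(is_suspicious(past, 950.00))  # far outside it, flag for review
```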
Alright, so we're coming to the end. AI is a natural fit for making our ETL pipelines better and for running them in a better way. Of course, some things need to be kept in mind. Security: dealing with AI in a way that the data we store in our AI model is only accessible to certain users is something that needs to be paid attention to. We shouldn't just throw our data at AI and say, hey, handle it. Security safeguards need to be built in so that the right users are able to access the right data.
Building these systems, of course, requires newer skill sets, so everybody needs to understand AI and learn how to incorporate it in their day-to-day work. AI-powered automation is super useful for all kinds of organizations, not just for ETL; AI-powered workflows enable so many different use cases. For example, it's possible to create workflows that have some human steps, some AI steps, and some imperative-programming steps, and most workflows of the future are going to be some hybrid of human, AI, and imperative steps.
Of course, AI is also a natural fit for real-time analytics. Rather than the traditional approach of accumulating data in a warehouse, pre-processing it, and generating reports from it, AI is able to extract information from large amounts of data with much less pre-processing than a traditional ETL system would need.
All right, so we've come to the end. I hope you enjoyed this talk, and I hope you find it useful as you build out your next ETL system.
Thank you.