Abstract
Harness Golang to create fast, scalable multi-hop data pipelines for real-time analytics! Learn to process large datasets with low latency, optimize performance, and integrate machine learning for dynamic decision-making, boosting business success in finance, healthcare, and e-commerce!
Transcript
Let's talk about multi-hop big data pipelines today.
In this day and age of Gen AI and machine learning, it is inevitable for businesses not only to maintain their own big data ecosystems, but also to ensure that they are scalable and performant.
Hi, I'm Hardik Patel.
I have around 20 years of experience in the tech industry, working for both small startups and large enterprises.
I hold a Master's degree in software engineering from BITS Pilani, and I have held various tech leadership roles at high-tech companies like Yahoo, Symantec, Groupon, and PayPal, to name a few.
I have had the opportunity to lead teams ranging from five-plus people to thirty-plus-person organizations, comprising engineers, tech leads, and managers.
I have over a decade of experience in designing and developing distributed
systems and big data architecture.
So before we dive deep into big data pipelines, let us first understand the typical hops involved in such pipelines.
It typically starts with data extraction. As the name suggests, this is usually towards the intake side of the data pipeline. This stage is mostly involved in gathering data from a variety of sources, ranging from structured to semi-structured to completely unstructured data. The most essential aspect of this stage is to ensure that connectors to disparate input data sources are resilient and that there are sufficient data quality checks, so that no invalid data enters the pipeline from the get-go.
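As a rough illustration of that intake gate, here is a minimal Go sketch; the Record type and the validation rule are hypothetical stand-ins for real source schemas and quality checks.

```go
// Minimal sketch of an extraction hop: validate records at the intake
// boundary and forward only clean ones to the next hop.
package main

import (
	"errors"
	"fmt"
)

// Record is a hypothetical unit of ingested data.
type Record struct {
	ID      string
	Payload string
}

// validate rejects obviously invalid records before they enter the pipeline.
func validate(r Record) error {
	if r.ID == "" || r.Payload == "" {
		return errors.New("missing required fields")
	}
	return nil
}

// extract drops invalid records and streams the rest downstream.
func extract(source <-chan Record) <-chan Record {
	out := make(chan Record)
	go func() {
		defer close(out)
		for r := range source {
			if err := validate(r); err != nil {
				fmt.Printf("rejected %q at intake: %v\n", r.ID, err)
				continue
			}
			out <- r
		}
	}()
	return out
}

func main() {
	source := make(chan Record, 2)
	source <- Record{ID: "1", Payload: "ok"}
	source <- Record{ID: "", Payload: "bad"}
	close(source)
	for r := range extract(source) {
		fmt.Println("accepted:", r.ID)
	}
}
```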
The second hop in such a data pipeline is data transformation. Besides cleansing the data, this stage is designed to format, normalize, and enrich data for faster processing, thereby reducing overall data availability lag through the subsequent stages of the pipeline.
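A transformation hop can be sketched in Go along the same lines; the normalization and enrichment fields below are hypothetical placeholders for real cleansing logic.

```go
// Minimal sketch of a transformation hop: normalize and enrich records
// in flight so later stages do less work.
package main

import (
	"fmt"
	"strings"
	"time"
)

type Record struct {
	ID         string
	Payload    string
	Normalized string
	EnrichedAt time.Time
}

// transform cleans and enriches each record as it streams through.
func transform(in <-chan Record) <-chan Record {
	out := make(chan Record)
	go func() {
		defer close(out)
		for r := range in {
			r.Normalized = strings.ToLower(strings.TrimSpace(r.Payload))
			r.EnrichedAt = time.Now().UTC()
			out <- r
		}
	}()
	return out
}

func main() {
	in := make(chan Record, 1)
	in <- Record{ID: "1", Payload: "  Hello World  "}
	close(in)
	for r := range transform(in) {
		fmt.Printf("%s -> %q at %s\n", r.ID, r.Normalized, r.EnrichedAt.Format(time.RFC3339))
	}
}
```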
Data analysis is another important stage within the data pipeline. Once the data is cleansed and prepped for faster processing, subsequent stages take up the responsibility of applying complex event processing and relevant machine learning algorithms to extract meaningful insights in a near-real-time manner.
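As a toy stand-in for the complex event processing and ML stages described here, the following Go sketch counts events per key over one-second tumbling windows.

```go
// Minimal sketch of a near-real-time analysis step: per-key counts
// over one-second tumbling windows.
package main

import (
	"fmt"
	"time"
)

func main() {
	events := make(chan string, 100)
	go func() {
		keys := []string{"checkout", "login", "checkout", "search", "checkout"}
		for _, k := range keys {
			events <- k
			time.Sleep(100 * time.Millisecond)
		}
		close(events)
	}()

	window := time.NewTicker(time.Second)
	defer window.Stop()
	counts := map[string]int{}

	for {
		select {
		case k, ok := <-events:
			if !ok {
				fmt.Println("final window:", counts)
				return
			}
			counts[k]++
		case <-window.C:
			fmt.Println("window closed:", counts)
			counts = map[string]int{} // start the next tumbling window
		}
	}
}
```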
The last aspect of a typical data pipeline is action and storage. Such data pipelines are essential in orchestrating automated responses to key events while also persisting processed data with an optimized storage footprint.
Now that we have looked at the usual hops in a data pipeline, let us dive into the various core design principles employed in building such pipelines.
We will start with the fundamental principle of parallelism. One of the key tenets of such pipelines, parallelism enables concurrent execution by effectively partitioning workloads across available computational nodes, thereby achieving maximum throughput.
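In Go, this partitioning of work across computational resources is commonly expressed as a worker pool; the sketch below fans a workload out over one goroutine per CPU core.

```go
// Minimal sketch of the parallelism principle: a fixed pool of workers
// draining a shared work channel, one goroutine per CPU core.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	items := make(chan int)
	results := make(chan int)
	workers := runtime.NumCPU()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range items {
				results <- n * n // stand-in for real per-record work
			}
		}()
	}

	// Close results once every worker has drained the input.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed the workload.
	go func() {
		for n := 1; n <= 100; n++ {
			items <- n
		}
		close(items)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println("processed on", workers, "workers, sum =", sum)
}
```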
Scalability is another design goal: the ability to elastically expand both horizontally and vertically in response to fluctuating and ever-evolving data volume and velocity.
Fault tolerance is also inevitable in such a complex environment of distributed systems and parallel processing. It helps us ensure we are meaningfully handling errors and, in some cases, have automated recovery and self-healing mechanisms to ensure the overall continuity of such data pipelines.
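One small building block of such self-healing is retrying transient failures with exponential backoff; here is a minimal Go sketch, with doWork standing in for a real downstream call.

```go
// Minimal sketch of retry with exponential backoff as a fault-tolerance
// building block.
package main

import (
	"errors"
	"fmt"
	"time"
)

// withRetry runs op up to attempts times, doubling the delay each time.
func withRetry(attempts int, initial time.Duration, op func() error) error {
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v; retrying in %s\n", i+1, err, delay)
		time.Sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	calls := 0
	doWork := func() error { // hypothetical flaky downstream call
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	}
	if err := withRetry(5, 100*time.Millisecond, doWork); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("succeeded after", calls, "calls")
}
```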
Throughput optimization implies we are able to craft efficient data paths with minimal serialization overhead, optimal memory utilization, and effective data partitioning to balance the overall volume and velocity of incoming data.
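Micro-batching is one common way to craft such an efficient data path; the Go sketch below groups records into fixed-size batches and flushes on a timeout, so downstream writes pay their per-call overhead once per batch rather than per record.

```go
// Minimal sketch of micro-batching as a throughput optimization.
package main

import (
	"fmt"
	"time"
)

// batch groups records into slices of at most size, flushing early on a timer.
func batch(in <-chan string, size int, flushEvery time.Duration) <-chan []string {
	out := make(chan []string)
	go func() {
		defer close(out)
		buf := make([]string, 0, size)
		ticker := time.NewTicker(flushEvery)
		defer ticker.Stop()
		flush := func() {
			if len(buf) > 0 {
				out <- buf
				buf = make([]string, 0, size)
			}
		}
		for {
			select {
			case r, ok := <-in:
				if !ok {
					flush()
					return
				}
				buf = append(buf, r)
				if len(buf) == size {
					flush()
				}
			case <-ticker.C:
				flush()
			}
		}
	}()
	return out
}

func main() {
	in := make(chan string, 10)
	for i := 0; i < 7; i++ {
		in <- fmt.Sprintf("rec-%d", i)
	}
	close(in)
	for b := range batch(in, 3, time.Second) {
		fmt.Println("flushing batch of", len(b))
	}
}
```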
So now that we have looked at the typical design principles involved in building these data pipelines, let us look at some of the very popular industry technology stacks, tools, and frameworks used to build them.
Apache Kafka is an industry-leading messaging system. It is calibrated to handle throughput of more than a hundred thousand messages per second, with built-in capabilities for partitioning, replication, and fault tolerance.
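A consumer for such a Kafka topic might look roughly like the following in Go, assuming the third-party github.com/segmentio/kafka-go client (one of several available clients); the broker address, topic, and consumer group are placeholders.

```go
// Hedged sketch of a Kafka consumer using github.com/segmentio/kafka-go.
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // placeholder broker
		GroupID: "pipeline-consumers",       // consumer group shares partitions
		Topic:   "events",                   // placeholder topic
	})
	defer r.Close()

	for {
		// ReadMessage blocks until a message arrives or the context ends.
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Printf("reader stopped: %v", err)
			return
		}
		log.Printf("partition=%d offset=%d key=%s", msg.Partition, msg.Offset, msg.Key)
	}
}
```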
Apache Spark is another very popular distributed data processing framework. With in-memory processing, it can be up to a hundred times faster than its Hadoop MapReduce counterpart. Besides that, it natively supports SQL, machine learning model deployment, and graph data processing workloads.
AWS Lambda happens to be a perfect technology that exhibits elasticity and is adaptable to ever-evolving and fluctuating data volume, velocity, and variety. It automatically scales from a few requests per day to thousands per second, and has a very attractive pay-only-for-use pricing model. It is very popular amongst companies and businesses operating in Amazon's AWS cloud.
Apache Flink is yet another distributed data processing framework that supports constructs like exactly-once semantics, event-time processing, and sophisticated windowing operations with sub-second latency.
Now that we have looked at the various technologies in use for designing and developing big data pipelines, let us touch upon some of the usual performance optimization techniques for ensuring that these pipelines stay performant.
So the first strategy is caching. The idea in the case of a caching strategy is to store frequently accessed data elements to minimize redundant computation and disk and database I/O calls.
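A minimal in-process version of this idea in Go is a map guarded by a read-write lock with a time-to-live on each entry, as sketched below; production pipelines would typically reach for a distributed cache instead.

```go
// Minimal sketch of an in-process cache with time-to-live expiry.
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

type TTLCache struct {
	mu    sync.RWMutex
	items map[string]entry
	ttl   time.Duration
}

func NewTTLCache(ttl time.Duration) *TTLCache {
	return &TTLCache{items: make(map[string]entry), ttl: ttl}
}

func (c *TTLCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

func (c *TTLCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expiresAt) {
		return "", false
	}
	return e.value, true
}

func main() {
	cache := NewTTLCache(500 * time.Millisecond)
	cache.Set("user:42", "enriched-profile") // hypothetical cached lookup
	if v, ok := cache.Get("user:42"); ok {
		fmt.Println("cache hit:", v)
	}
	time.Sleep(time.Second)
	if _, ok := cache.Get("user:42"); !ok {
		fmt.Println("expired, fall back to the database")
	}
}
```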
Data partitioning is another performance improvement technique, which is used to strategically segment data sets to enable parallel processing and reduce overall processing time.
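Key-based hash partitioning is the simplest form of this; the Go sketch below routes each record key to a stable partition so partitions can be processed in parallel.

```go
// Minimal sketch of key-based hash partitioning.
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a key to a stable partition index.
func partitionFor(key string, partitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitions))
}

func main() {
	const partitions = 4
	keys := []string{"user-1", "user-2", "user-3", "user-1"}
	for _, k := range keys {
		fmt.Printf("%s -> partition %d\n", k, partitionFor(k, partitions))
	}
}
```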
Load balancing is a means to dynamically distribute computational workloads to prevent resource bottlenecks and, in turn, optimize overall throughput.
Resource allocation is a way to precisely allocate resources like CPU, memory, and network based on workload characteristics and business priorities.
The last one is optimized data formats: it is a very popular technique to choose columnar and binary formats, which further helps us reduce I/O operations and minimize data transfer overheads.
So with that, let's now dive deeper into various scaling challenges and potential solutions to them.
The first popular scaling challenge is data skewness. What is data skew? It refers to an uneven distribution of data, leading to performance and scale bottlenecks and resource utilization inefficiencies across clusters of nodes. A typical solution that we employ to tackle data skewness is to implement adaptive partitioning with real-time workload monitoring and dynamic rebalancing to ensure optimal processing across compute nodes.
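Full adaptive partitioning needs live workload metrics and a rebalancer; a simpler, commonly used mitigation, sketched below in Go, is key salting, which spreads a known hot key across several sub-keys (a later aggregation step merges them back).

```go
// Minimal sketch of key salting as a data-skew mitigation.
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

func partitionFor(key string, partitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitions))
}

// saltedKey appends a random salt so a hot key's load spreads over
// `salts` sub-keys instead of a single partition.
func saltedKey(key string, salts int) string {
	return fmt.Sprintf("%s#%d", key, rand.Intn(salts))
}

func main() {
	const partitions = 8
	hot := "tenant-big" // hypothetical key that dominates traffic
	for i := 0; i < 5; i++ {
		k := saltedKey(hot, 4)
		fmt.Printf("%s -> partition %d\n", k, partitionFor(k, partitions))
	}
}
```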
State management is another scaling challenge. What happens in the case of state management is that maintaining stateful operations significantly limits our ability to horizontally scale a distributed system. The typical solution that we employ in this case is to deploy fault-tolerant distributed state stores with a tiered caching architecture and configurable eviction policies. This helps us preserve consistency while maximizing overall throughput at scale across the various stages of the data pipeline.
Back pressure handling is another scaling challenge. It occurs when high-velocity data producers end up overwhelming relatively slower consumer systems. This leads to system instability and, in some cases, to potential data loss. A typical solution that can be employed here is to implement intelligent rate-limiting algorithms, priority-based buffering queues, and adaptive throttling mechanisms. This will certainly help us maintain system equilibrium at times of varying data loads.
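Two of those pieces, a bounded buffer and a token-bucket rate limiter, can be sketched in Go as follows, here using the golang.org/x/time/rate package; the rates and buffer size are placeholders.

```go
// Hedged sketch of back-pressure handling: a bounded buffer between
// producer and consumer plus a token-bucket rate limiter on the producer.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	ctx := context.Background()
	buffer := make(chan int, 10)                  // bounded queue: producer blocks when full
	limiter := rate.NewLimiter(rate.Limit(50), 5) // ~50 events/sec, burst of 5

	go func() {
		defer close(buffer)
		for i := 0; i < 20; i++ {
			if err := limiter.Wait(ctx); err != nil { // throttling point
				return
			}
			buffer <- i
		}
	}()

	for ev := range buffer {
		time.Sleep(10 * time.Millisecond) // simulate a slower consumer
		fmt.Println("consumed", ev)
	}
}
```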
Now, whenever we are talking about big data pipelines, you cannot avoid talking about observability. So let's touch upon monitoring and observability a little bit. How do we ensure that this complex ecosystem of distributed compute nodes and storage nodes is always observable, and that we always stay on top of anything happening under the hood? What are the various ways to achieve effective observability?
One of the popular mechanisms is metrics collection. Metrics collection is inevitable for tracking mission-critical performance indicators like end-to-end latency, throughput at each and every processing stage, and resource consumption trends. It not only helps us track the relevant metrics in a meaningful way, but it also helps us stay proactive in case something goes south.
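One common way to expose such metrics from a Go pipeline is the Prometheus client library (github.com/prometheus/client_golang); the sketch below records a per-stage latency histogram and serves it on a /metrics endpoint, with the stage name and port as placeholders.

```go
// Hedged sketch of metrics collection with the Prometheus Go client.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var stageLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "pipeline_stage_duration_seconds",
		Help:    "Processing time per pipeline stage.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"stage"},
)

func main() {
	prometheus.MustRegister(stageLatency)

	go func() {
		for {
			start := time.Now()
			time.Sleep(20 * time.Millisecond) // stand-in for real transform work
			stageLatency.WithLabelValues("transform").Observe(time.Since(start).Seconds())
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```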
The second aspect of observability is log analysis. We capture detailed event logs as another way of keeping systems observable. It helps us identify error trends and improves the overall time required to conduct root cause analysis in the event of various performance issues.
Distributed tracing helps engineering teams correlate activity across distributed compute nodes and take preemptive action before the performance of these big data pipelines degrades.
So with that, let us take a look at some of the key takeaways from this session.
So, architecture matters. Employing architectures that facilitate these performance criteria is inevitable in the case of distributed systems and big data pipelines.
Technology selection is another important aspect, which helps us ensure that we are leveraging the right framework or tool for the business problem at hand.
Continuous optimization is also very important; it helps us regularly calibrate and benchmark key performance indicators to stay proactive with ever-evolving data volume, velocity, and variety.
Business alignment goes without saying: whatever we build must be in alignment with impactful business outcomes. There is no technology solution without a meaningful business need associated with it.
I hope you enjoyed this talk as much as I enjoyed preparing for it.
Thank you for your time today.