Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome.
Today I'm here to present how AI is transforming enterprise data pipelines, with performance efficiency gains of up to 70 to 85%, and how AI innovation is reshaping the data engineering workflow.
In today's world, artificial intelligence is having a major impact on enterprise data pipelines, and data engineering transformations are happening across the entire data lifecycle, from ingestion through processing.
A little about myself: I have 15 years of experience in data engineering, with a focus on cloud technologies and automation. I have consistently pioneered innovative solutions that bridge technology gaps and deliver measurable business impact, including implementations of data solutions that changed how large-scale data is managed.
So back to the topic of enterprise data pipelines. AI can detect data patterns and automate the connections between different systems such as cloud, on-prem, APIs, et cetera, drastically reducing the manual effort. It also helps clean the data, and we can implement faster and smarter transformations with enhanced data governance and security, thereby giving real-world, measurable gains of up to 85% in terms of efficiency.
What is the data challenge we have currently, right? On a daily basis we are generating roughly 2.5 quintillion bytes of data, and with this amount of data there are different challenges in handling the large volume efficiently and with minimal human intervention. We'll discuss more on this topic.
So if we delve into what kind of challenges we're facing in terms of data, the first one is data silos and fragmentation. Data is often spread across multiple systems, making it difficult to create a unified view and to integrate and process the different data sets efficiently. Why is it a problem?
Data engineers spend huge amounts of time just locating, extracting,
and normalizing the data.
The second one I would like to highlight is data quality, right? Most of the time it seems to be poor. Organizations struggle with missing or incomplete data, duplicates, inconsistent formatting, and also schema issues. This directly impacts data-driven decisions; we cannot make the business more efficient with bad data.
And also there is complex orchestration of transformations when we run ETL or ELT jobs across dependencies, schedules, and environments. Coordination most of the time seems to be tricky, so failures can cascade.
And the next one would be performance bottlenecks. As data volumes grow, pipelines need to scale horizontally. Many legacy systems are poorly optimized, and the pipelines can't keep up.
And the last one could be data governance and compliance, right? Keeping up with data policies such as GDPR, HIPAA, and CCPA is hard. The challenges include tracking lineage, RBAC (role-based access control), and classifying the data in terms of sensitivity.
So these are the different challenges we're facing, right?
Now, how can we solve them? As this slide shows, the first solution is automated feature engineering. What is it about, and how can it address the data challenge? It is an AI-driven game changer in the data science and machine learning pipeline because it uses machine learning, heuristics, and domain knowledge to automatically generate, select, and optimize features from raw data without much of the traditional manual effort.
Feature engineering, for context, is the process of transforming raw data into meaningful inputs. When I say inputs, I mean features for a machine learning model. Traditionally, this is manual, time consuming, and highly dependent on domain expertise.
So now comes AI-driven automated feature engineering, which flips that by using algorithms to analyze data types and their relationships, generate new features such as ratios, time aggregates, and encodings, rank and select the best features, and continuously optimize features for model performance.
It's often powered by well-known tools like Featuretools by Alteryx, SageMaker Autopilot, Google AutoML, and DataRobot, to name a few. Among the key techniques employed to automate this engineering, number one is Deep Feature Synthesis: DFS automatically builds features from relational datasets. The next is recursive transformation, which creates layered, higher-level features. Then there is embedding and encoding, a technique used to convert categorical or textual data into vectors. Finally, for feature selection, algorithms use correlation methods, mutual information, or importance scores to drop low-value features. A minimal sketch of DFS is shown below.
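For illustration, here is a small sketch of Deep Feature Synthesis using the Featuretools 1.x-style API. The customer and transaction dataframes, column names, and chosen primitives are my own assumptions, not from the talk.

```python
# A minimal Deep Feature Synthesis sketch with Featuretools.
import pandas as pd
import featuretools as ft

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-15"]),
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 35.5, 210.0, 15.0],
    "timestamp": pd.to_datetime(["2023-04-01", "2023-04-03", "2023-04-02", "2023-04-05"]),
})

# Describe the relational structure in an EntitySet.
es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id", time_index="signup_date")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="timestamp")
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# DFS stacks aggregation and transform primitives to generate candidate features
# such as SUM(transactions.amount) or MEAN(transactions.amount) per customer.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
    trans_primitives=["month"],
    max_depth=2,
)
print(feature_matrix.head())
```

From here, low-value columns could be pruned with correlation or mutual-information filters before model training.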
Moving on to the next slide, which talks about data cleansing, right, and how it can be employed in an intelligent way compared to traditional methods. That gives us improvements in detecting issues in the data, implementing corrections in a more automated fashion, and recognizing patterns.
Intelligent data cleansing is the use of AI, machine learning, and advanced automation to clean, validate, and enrich or enhance the data more precisely and efficiently than traditional methods. Rather than relying on static rules or manual review, intelligent cleansing adapts to patterns, context, and domain-specific logic to ensure high-quality, usable data for different purposes, such as analytics, reporting, and obviously machine learning.
So now comes the question: when we say data cleansing, what makes it intelligent? Traditional data cleansing involves manual scripts with fixed logic, like removing duplicates, filling in null values, and standardizing formats, to name a few. On the other hand, intelligent data cleansing uses algorithms that learn and adapt.
The most predominant issue we come across in data cleansing is missing values. AI can help here by predicting the missing values using ML models such as KNN or regression models. The next one would be outlier detection: using statistical or AI models to flag anomalies, not just fixed thresholds (a small sketch of these two steps follows after this list). Then data normalization comes in, where we intelligently cleanse and normalize the data by learning its patterns, for example in dates, addresses, et cetera.
And entity resolution uses fuzzy matching or NLP to merge duplicates, for example IBM Corp versus International Business Machines. Data enrichment also plays an important role: pulling attributes from external sources like business databases or geolocation services to enrich the data in an efficient way.
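As promised, here is a minimal sketch of ML-assisted cleansing with scikit-learn: KNN-based imputation of missing values plus IsolationForest outlier flagging. The column names, sample data, and contamination setting are illustrative assumptions.

```python
# Intelligent cleansing sketch: learn-from-data imputation and anomaly flagging.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [34, 41, np.nan, 29, 52, 38],
    "income": [72000, 85000, 64000, np.nan, 930000, 70000],  # 930000 looks anomalous
})

# Predict missing values from nearest neighbours instead of a fixed fill value.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Flag statistical anomalies rather than relying on hard-coded thresholds.
iso = IsolationForest(contamination=0.2, random_state=42)
imputed["is_outlier"] = iso.fit_predict(imputed) == -1
print(imputed)
```

The same idea extends to entity resolution with fuzzy matching libraries and to enrichment calls against external reference data.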
So what would be the next step once we have successfully cleansed the data? It's going to be an interesting topic for all of us, right? How do we integrate the entire lifecycle? Using MLOps.
MLOps is the practice of combining machine learning systems with DevOps principles to streamline the lifecycle, right: development, deployment, monitoring, and governance of models in production. It acts as the glue between data science and engineering, turning experimental models into robust, scalable, and continuously improving business systems.
So what does it integrate? The lifecycle. MLOps connects multiple workflows such as data ingestion, versioning, model training and experimentation, and validation. The next would be deployment through CI/CD, then model monitoring for performance and drift, retraining, and the feedback loops. So these are some of the workflows we can integrate, right?
The first component would be CI/CD pipelines for machine learning. What is the purpose? They automate model testing, validation, and deployments. The second component would be the model registry, which tracks versions, metadata, and lineage of the different trained models. The third component would be the feature store, a central place for consistent, reusable features across multiple trainings and inference. Then there is automated retraining, triggered by data drift or model decay, and monitoring tools that check accuracy, latency, and drift. These are the different components of integrating machine learning models; a minimal sketch of the tracking and registry pieces follows below.
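Here is a small MLflow sketch of the experiment tracking and model registry ideas above. The experiment and model names are hypothetical, and registering a model assumes a tracking backend with the model registry enabled.

```python
# Track a training run and register the resulting model version with MLflow.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("churn-demo")  # hypothetical experiment name
with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    # Log parameters, metrics, and the model artifact so every run is
    # reproducible, traceable, and promotable through the registry.
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")
```

A CI/CD job would run something like this on every change, and the registered version it produces becomes the unit that gets validated and deployed.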
Moving on to predictive quality monitoring, which is critical for the business. It uses AI, statistical modeling, and sensor data to detect patterns, forecast deviations, and identify quality risks before they lead to defects or failures in the product or the process.
So what would be some of the benefits? Reduced defects is one: we can catch issues before the final product is made. A lower cost of quality: we can prevent scrap, rework, and warranty claims. And real-time decision making, which is really helpful to fix issues in-process rather than post-production, which in turn will maximize the output while keeping the quality standards high.
And if there is an issue even in production, the benefit of having predictive monitoring would be the root cause insights: we can identify the key variables affecting quality and thereby improve upon them in production. There is also regulatory compliance. For any organization, whether it's retail, manufacturing, finance, or healthcare, the data is sensitive, right? So having such a monitoring method helps maintain consistent quality for audits and certifications.
We can take an example, right? Predicting a medical condition, sepsis, in real time. A hospital uses an AI model trained on vitals plus labs data, for example heart rate, WBC count, and lactate levels, so the system predicts sepsis risk hours before the symptoms actually show in the body. It sends alerts to the care teams, who can provide early antibiotics and fluids, which in turn can reduce the mortality rate by 20 to 30%. That's cool, huh?
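To make the idea concrete, here is a hypothetical sketch of that kind of risk model: a classifier trained on synthetic vitals and labs (heart rate, WBC count, lactate) scoring a new observation so an alert can fire early. The data, features, and alert threshold are all assumptions for illustration only.

```python
# Predictive monitoring sketch: score incoming observations against a trained model.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
n = 1000
vitals = pd.DataFrame({
    "heart_rate": rng.normal(85, 15, n),
    "wbc_count": rng.normal(9, 3, n),
    "lactate": rng.normal(1.5, 0.8, n),
})
# Synthetic label: higher values of all three raise the simulated risk.
risk = 0.02 * vitals["heart_rate"] + 0.2 * vitals["wbc_count"] + 0.8 * vitals["lactate"]
labels = (risk + rng.normal(0, 0.5, n) > risk.quantile(0.85)).astype(int)

model = GradientBoostingClassifier().fit(vitals, labels)

# Score a new observation; a care-team alert fires above a chosen threshold.
new_patient = pd.DataFrame([{"heart_rate": 118, "wbc_count": 16.0, "lactate": 3.4}])
prob = model.predict_proba(new_patient)[0, 1]
if prob > 0.7:  # alert threshold is an assumption
    print(f"ALERT: predicted risk {prob:.0%}, notify the care team")
```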
Moving on to data enrichment: how can AI drive it? It uses intelligent algorithms, right, to analyze existing data, detect patterns, and augment it with additional attributes, either from external sources, derived insights, or automated predictions, to increase its quality and value.
We can quickly see the different types of data enrichment, right? It starts with raw data from legacy systems and unstructured content gathered from disparate enterprise sources, and turns them into enhanced insights that can deliver 56% greater value.
So these are the different types, right? One popular one would be text enrichment using natural language processing (NLP), which basically extracts structured data from unstructured text, right? An example would be pulling job titles and skills out of free text (see the sketch after this list). Then there is image or video enrichment, another typical type, where the AI uses computer vision to label and classify the media; the best example would be tagging product photos with categories, colors, and objects.
And the next one would be geospatial enrichment, which basically adds location-based data from GPS or address fields; the typical example would be attaching weather, region risk, or store proximity data.
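As a small illustration of the text enrichment type, here is a sketch using spaCy named entity recognition to pull structured attributes out of unstructured text. It assumes the `en_core_web_sm` model has been downloaded (`python -m spacy download en_core_web_sm`); extracting job titles specifically would need a custom or fine-tuned model.

```python
# Text enrichment sketch: extract structured entities from unstructured text.
import spacy

nlp = spacy.load("en_core_web_sm")
snippet = (
    "Senior Data Engineer at Acme Analytics, previously with "
    "International Business Machines in New York."
)

doc = nlp(snippet)
enriched = [(ent.text, ent.label_) for ent in doc.ents]
print(enriched)  # organisations and locations become queryable attributes
```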
So this is how AI is powering enrichment. Next is intelligent version control: a smart system that manages the history and the evolution of digital artifacts such as code, data, models, and documents, because everything requires version control, right? But how do we apply AI here? To automatically detect changes, suggest actions, and optimize collaboration across multiple teams and pipelines.
So what are a few benefits of having intelligent version control? Smoother collaboration and faster conflict resolution: it can detect semantic conflicts and suggest resolutions intelligently. It also improves auditability and traceability, automatically logging model, data, and code lineage, which is vital for regulated industries such as finance and healthcare. Intelligent version control also improves model and data management, for example tracking quickly and efficiently how changes in data sets or features impact model accuracy, and it effectively supports rollback to a known good configuration if any issue occurs. It enables reproducibility: full experiment tracking and rerunning pipelines with the exact inputs and parameters. And with all these benefits, right, it can save time and automate work in an efficient way, with intelligent merging, auto-tagging of versions, change summaries, et cetera.
So what are the different popular tools available now for enabling intelligent version control? DVC for data and model versioning, MLflow for experiment and model tracking, and GitHub Copilot on the AI-assisted coding side. Yep, this is all about version control; a quick DVC sketch follows below.
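Here is a minimal sketch of pulling a specific version of a dataset through the DVC Python API. The repository URL, file path, and tag are hypothetical; it assumes the file is DVC-tracked in that Git repo.

```python
# Read a pinned version of a DVC-tracked dataset by Git revision.
import io

import dvc.api
import pandas as pd

csv_text = dvc.api.read(
    path="data/train.csv",                             # hypothetical tracked file
    repo="https://github.com/example/pipeline-repo",   # hypothetical repository
    rev="v1.2.0",                                      # Git tag pinning the data version
)
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```

Swapping `rev` is all it takes to roll back to a known good data version or to reproduce an old experiment.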
Now we can look at some case studies, right, of organizations implementing these mechanisms. One financial services company boosted data analyst productivity by 37%, enabling deeper market insights, by tailoring the intelligence automatically delivered to each team member based on their specific role and their historical pattern of interactions. One healthcare provider successfully automated 73% of critical classifications while maintaining strict compliance standards; intelligent classification and masking technologies helped them protect sensitive patient information and PHI. And another firm used these algorithms and natural language processing to present high-value market data 2.8 times faster, creating a significant advantage over the competitors. These time-sensitive, revenue-driving insights were delivered ahead of their industry competitors, directly impacting their quarterly results.
So these industry leaders across different segments have transformed their data operations through the strategic implementation of the different AI techniques we talked about. The transformative results included dramatic acceleration and reduced operational costs, and particularly for the finance and healthcare industries, enhanced regulatory compliance across the board.
Next, conversational AI assistants in day-to-day life. Everyone comes across an AI assistant: it's a virtual agent powered by AI that interacts with users like us using natural language, basically text or voice, which automates tasks, provides information, and enables self-service across digital channels. It mimics human conversation patterns and often supports multi-turn conversations, context retention, and personalization. Yeah.
And what are the core technologies involved in these agents? Natural language understanding (NLU), which basically understands the user intent and entities. Natural language generation (NLG), which crafts meaningful, human-like responses. Dialog management, which manages the flow of the conversation. Machine learning and LLMs, a good example being GPT, cover adaptability and reasoning, right? Speech-to-text and text-to-speech convert spoken input and output for voice assistants. And APIs and integrations are quite popular: they connect with the backend systems such as CRMs and databases. A minimal sketch of this loop follows below.
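For illustration, here is a deliberately simple, rule-based sketch of the NLU, dialog management, and NLG loop just described. A real assistant would use trained models or an LLM for these steps; the intents and canned responses are hypothetical.

```python
# Toy conversational loop: keyword NLU -> intent -> canned NLG response.
from typing import Optional

INTENT_KEYWORDS = {
    "check_order": ["order", "tracking", "shipment"],
    "reset_password": ["password", "reset", "locked out"],
}

def detect_intent(utterance: str) -> Optional[str]:
    """Very simple NLU: keyword matching stands in for a trained classifier."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(word in text for word in keywords):
            return intent
    return None

def respond(intent: Optional[str]) -> str:
    """NLG step: map the resolved intent to a human-like reply."""
    responses = {
        "check_order": "Sure, I can look up your order. What's the order number?",
        "reset_password": "No problem, I'll send a password reset link to your email.",
    }
    return responses.get(intent, "Sorry, could you rephrase that?")

print(respond(detect_intent("I was locked out and need to reset my password")))
```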
So this is about AI assistants and how we can interact with them efficiently.
Yeah.
With that, we are coming to the end of this presentation.
Thank you all for joining this session.
Yeah.
Before we wind off, let me conclude. Data engineering transformation is not just about adopting new tools; it's about modernizing the workflows, embracing automation, and aligning with business goals, right, to create a scalable, agile, and intelligent data foundation. So I can provide a quick roadmap-style summary to transform these data engineering practices effectively.
First, adopt the modern data stack for ingestion, storage, processing, and ELT (extract, load, transform), and integrate ML for intelligent workflows. This will help any business auto-detect schema drift, predict pipeline failures, and enrich data intelligently. Next, implement data observability: treat data pipelines like production systems, monitor freshness, accuracy, volume, and lineage, and alert on anomalies or downtime. Use popular tools like Monte Carlo and Databand for this. Then version everything, which is a critical one, right? Data testing, data versioning, and model tracking. And wherever automation is possible, implement it effectively in data ingestion pipelines and even data quality checks; a small sketch of such checks follows below.
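Here is a minimal sketch of two of those checks: detecting schema drift against an expected schema and a simple freshness/quality assertion with pandas. The column names, expected types, and thresholds are assumptions; observability platforms automate this at scale.

```python
# Lightweight schema drift and data quality checks for a pipeline step.
import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "created_at": "datetime64[ns]"}

def check_schema_drift(df: pd.DataFrame) -> list:
    """Compare incoming columns and dtypes against the expected schema."""
    issues = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"type drift on {col}: expected {dtype}, got {df[col].dtype}")
    return issues

def check_quality(df: pd.DataFrame) -> list:
    """Basic completeness and freshness assertions."""
    issues = []
    if df["order_id"].isna().any():
        issues.append("null order_id values found")
    if (pd.Timestamp.now() - df["created_at"].max()) > pd.Timedelta(days=1):
        issues.append("data is stale: no rows newer than one day")
    return issues

orders = pd.DataFrame({
    "order_id": [1, 2],
    "amount": [19.99, 5.00],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
print(check_schema_drift(orders) + check_quality(orders))
```

Checks like these can run as pipeline gates, with failures raising alerts instead of silently propagating bad data downstream.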
The goal is to reduce manual ops and increase pipeline reliability. And finally, try to foster a DevOps culture: treat data as a product, not as a byproduct.
Yeah.
With this, the session concludes.