Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Kisha Kumar.
I'm a CPQ Solution Architect with more than 15 years of experience building CPQ applications for various enterprise orgs.
I've worked with several Fortune companies across manufacturing, semiconductor, and energy and power utility services, where I built out their CPQ ecosystems and helped with proposal generation, quotations, and complex pricing solutions.
So coming to today's topic: what I want to present to the audience is how we can enhance CPQ applications, or CPQ systems, by using AI-powered platforms.
The reason is that nowadays we keep hearing about AI in every technology.
So we're asking: how can we leverage AI tools for CPQ?
What benefits can an AI platform with a microservices architecture bring to CPQ applications?
We'll go into detail on how it works, what it looks like, the deployments, the configurations, and the ROI we get from these features.
And what are the current challenges?
In today's implementations, sales reps have issues generating quotations, getting dynamic pricing, and handling complex configurations, and building a quote takes them a long time.
All of that is encapsulated as a set of microservices operating on an AI base platform.
Okay, so coming to this slide.
As I mentioned, from this architecture standpoint the system processes a hundred thousand requests monthly with 99.9% uptime, meaning there are essentially no outages.
The complex AI workloads run at enterprise scale.
We have engineered the AI services in such a way that even if you hit them with thousands of requests at once, the trained models are able to generate the right configuration with accurate pricing and accurate quoting.
That's what this architecture and tooling look like.
Going to the next slide.
Okay.
As I mentioned, there's a lot of demand on CPQ, and CPQ means configuration, pricing, and quoting.
So how do we handle complex configurations with real-time data, configure products using that data, and make accurate pricing decisions?
And when a rep tries to give a bigger discount, the approval process should be dynamic: if most customers are asking for a certain discount level, the system can learn that and approve a quote at the same discount.
That's the kind of trained model we use, along with dynamic forecasting: given the list of quotes within a quarter, how do we forecast the expected revenue?
And when a user is configuring a model, we do upselling and guided selling: "Hey, if you buy this particular product, you get this other product at a lower price; if you combine them into a bundle, you get a much bigger discount."
The AI trains on the backend data and suggests to the user: "For the configuration you're building, most customers also buy this product, and you'll save this much."
These are the AI microservices behind the scenes that help you get to the right configuration with the right pricing.
And coming to the response time: by building these microservices it came down to 200 milliseconds, which is just 0.2 seconds.
By using this platform technology we also address the challenge of what abstractions and tooling are necessary to deploy and manage AI workloads at scale.
This talk presents an approach for building a microservices-based CPQ platform that integrates seamlessly with enterprise-grade reliability and developer productivity: each microservice can be plugged in alongside the AI tools, trained, and then used by the CPQ platform for a better quoting process.
So here is the high-level architecture.
First there's the user request: the user captures the configuration inputs for whatever they want to buy and selects all the required options.
After selecting the configuration, the request goes through an API, a backend API interface, which calls the microservices.
The microservices are built so that they take the machine-learning models and AI tools and generate the product rules and data.
For example, for a given configuration the system generates the bill of materials, or the set of products that could be offered, and also the optimal price, so users are able to generate a quotation.
It's all included in the microservices: taking the request from the user, orchestrating it, and giving the flexibility needed for enterprise workloads, making sure we handle queuing, concurrent requests, and multithreading.
These interfaces interact with the AI microservices and the supporting infrastructure components.
We have both synchronous API calls and asynchronous event streaming, on a case-by-case basis: real-time pricing calculations need to avoid latency, while other cases need batch processing and model training.
Real-time model training is also taken care of as part of this overall architecture.
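To make that concrete, here is a minimal sketch of what one such pricing microservice endpoint could look like, assuming FastAPI; this is not the platform's actual code, and the endpoint path, request fields, and the stubbed model call are all illustrative.

```python
# Minimal sketch of a pricing microservice endpoint (illustrative, not the
# production code). Assumes FastAPI; score_price() is a stand-in for the
# trained pricing model described in the talk.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ConfigurationRequest(BaseModel):
    product_id: str
    options: dict[str, str]   # the configuration inputs the user selected
    quantity: int = 1

class PriceResponse(BaseModel):
    unit_price: float
    suggested_discount: float

def score_price(req: ConfigurationRequest) -> tuple[float, float]:
    """Stand-in for the ML pricing model; returns (price, discount)."""
    base = 100.0 * req.quantity   # hypothetical base-price logic
    return base, 0.05

@app.post("/v1/pricing", response_model=PriceResponse)
def price_configuration(req: ConfigurationRequest) -> PriceResponse:
    unit_price, discount = score_price(req)
    return PriceResponse(unit_price=unit_price, suggested_discount=discount)
```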
Moving on to the AI and machine-learning service integration.
We have three integration patterns.
The first is gradient boosting engines, which handle pricing optimization behind standard API interfaces.
The second is neural networks, which power configuration recommendations through dedicated microservices.
The third is real time: dynamic price adjustments based on market conditions.
For example, when the trained model suggests a price, it simultaneously talks to several microservices on demand about current market conditions and adjusts the price dynamically.
And this infrastructure isn't a static, single-purpose thing.
It can be leveraged with industry-standard frameworks, including TensorFlow Serving and MLflow, as well as custom inference services.
So when we offer this architecture, the data science team can choose the appropriate serving solution for their specific model types and performance requirements.
The platform also provides unified monitoring and management capabilities across all serving frameworks, ensuring consistent operational practices regardless of the underlying technology.
It really comes as a package, as a service, so these microservices can be used with any industry framework.
That's the high-level integration story we are trying to build.
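As a sketch of the gradient boosting pattern mentioned above, here's what a pricing model behind such a service could look like using scikit-learn; the library choice, features, and training data are assumptions for illustration, not the talk's actual model.

```python
# Sketch of a "gradient boosting engine" for pricing optimization.
# scikit-learn and the toy features [quantity, num_options, base_cost]
# are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: configuration features -> quoted price
X = np.array([[1, 2, 80.0], [5, 3, 75.0], [10, 1, 60.0], [2, 4, 90.0]])
y = np.array([120.0, 110.0, 95.0, 140.0])

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X, y)

# Serve a price suggestion for a new configuration
predicted_price = model.predict([[3, 2, 85.0]])[0]
print(f"suggested price: {predicted_price:.2f}")
```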
That was the microservices service integration; now, event-driven communication.
Beyond the integration patterns for the ML services, there is an event-driven side: when triggers fire or changes happen, the event-driven architecture loosely couples the services while maintaining data consistency and enabling real-time processing.
Kafka serves as the central event backbone and handles millions of pricing events.
When requests come in, the pricing events flow directly into the Kafka services, and standardized schemas enable backward compatibility as the platform evolves, so the events evolve along with the microservices, and the same streams feed the data used to train the models.
Event sourcing captures the complete history of pricing decisions and configuration changes and provides audit trails: what prices were quoted over the past few days, what the overall pricing history looks like, what products users configured over a given period.
All of those audit trails are available for analytics; the events maintain immutable records for debugging, compliance, and analytical purposes.
That's the high-level view of the event-driven communication patterns.
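Here's a minimal sketch of publishing one such pricing event, including the timestamp and correlation ID the talk mentions for traceability; it assumes the kafka-python client, and the topic name and event schema are invented for illustration.

```python
# Sketch of publishing a pricing event to the Kafka backbone.
# kafka-python, the topic name, and the event fields are assumptions.
import json
import time
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_type": "price.calculated",      # hypothetical event type
    "correlation_id": str(uuid.uuid4()),   # enables end-to-end traceability
    "timestamp": time.time(),
    "quote_id": "Q-1001",
    "unit_price": 118.50,
}

# Topics are organized by business domain, e.g. "pricing.events" (assumed name)
producer.send("pricing.events", value=event)
producer.flush()
```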
Okay.
Next: what does the infrastructure and deployment cycle look like for these microservices?
One part is multi-cloud infrastructure management: Terraform, infrastructure as code.
The other part is cross-cloud networking.
With Terraform, pretty much the entire infrastructure is defined as modular configurations across different cloud providers, including AWS, Azure, and Google Cloud Platform.
There are standardized patterns for the common components, with built-in security controls and cost-optimization settings as well.
For cross-cloud networking we use a service mesh, which provides secure, encrypted communication between the services, plus advanced traffic management: canary deployments, circuit breaking, and automatic retries.
We leverage both of these to avoid vendor lock-in.
This infrastructure code undergoes the same rigorous review and testing process as application code, ensuring reliability and security at the infrastructure layer.
That's the high-level view of multi-cloud infrastructure management.
Coming to scalability and performance optimization.
For any tool, performance is key, and on top of that, scalability.
It's not a one-time purchase or a one-time install: the world keeps evolving and changing quite often, so we can't have something static that we can't change or whose performance we can't improve.
We make sure the services are scalable, can keep evolving, and can be changed in line with industry standards.
The auto-scaling policies respond to both traditional metrics and AI-specific indicators: the Horizontal Pod Autoscaler configurations for the machine-learning services include custom metrics exported from the model-serving frameworks, ensuring scaling decisions reflect actual model performance.
And coming to performance optimization, one key characteristic is the intelligence we added to the implementation: an intelligent node-selection algorithm that considers both cost and performance characteristics when adding capacity.
GPU-enabled nodes are provisioned only when neural-network workloads require them, optimizing infrastructure cost while maintaining performance.
So at a high level, the platform is both scalable and performance-optimized.
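As a rough illustration of that node-selection idea, here's a small sketch; the pool names, thresholds, and decision logic are assumptions, not the platform's actual algorithm.

```python
# Illustrative sketch of "intelligent node selection": provision GPU nodes
# only when neural-network workloads need them, otherwise prefer cheaper
# CPU nodes. Pool names and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    needs_gpu: bool        # e.g. neural-network inference or training
    p95_latency_ms: float  # model-serving metric exported to the autoscaler

def select_node_pool(w: Workload) -> str:
    """Pick a node pool balancing cost and performance."""
    if w.needs_gpu:
        return "gpu-pool"             # provisioned on demand, more expensive
    if w.p95_latency_ms > 200.0:      # scale onto faster CPU nodes if slow
        return "cpu-high-perf-pool"
    return "cpu-standard-pool"        # cheapest option by default

print(select_node_pool(Workload("config-recommender", needs_gpu=True,
                                p95_latency_ms=120.0)))
```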
Now, coming to deployments.
Everyone knows we have a Git repository, and the microservices are managed through it too: the configuration and deployment manifests are stored in Git, and we use blue-green deployments.
A new model version is fully brought up alongside the old one, and Argo CD automatically syncs the desired state from Git with the actual cluster state.
Automated testing runs in the deployment pipelines, and deployments have zero downtime.
For example, if you're adding new features to the ML services, you don't need a maintenance window: the pipeline automatically drains traffic from one model version and connects it to another.
The platform is connected in a sophisticated way and also manages traffic shifting from one version to another; if it detects any degradation in model performance, it alerts: "Hey, there's an issue, take a look."
So GitOps and continuous deployment are tightly integrated.
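Here's a sketch of the kind of degradation check that could gate that traffic shifting; the metric names and thresholds are illustrative assumptions.

```python
# Sketch of a degradation check gating a blue-green/canary rollout:
# compare the new model version's metrics against the current one.
# Thresholds are illustrative assumptions.
def is_degraded(baseline: dict, candidate: dict,
                max_latency_increase: float = 0.10,
                max_accuracy_drop: float = 0.02) -> bool:
    latency_up = ((candidate["p95_latency_ms"] - baseline["p95_latency_ms"])
                  / baseline["p95_latency_ms"])
    accuracy_down = baseline["accuracy"] - candidate["accuracy"]
    return (latency_up > max_latency_increase
            or accuracy_down > max_accuracy_drop)

baseline = {"p95_latency_ms": 180.0, "accuracy": 0.94}
candidate = {"p95_latency_ms": 210.0, "accuracy": 0.93}
if is_degraded(baseline, candidate):
    print("alert: shift traffic back to the previous model version")
```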
Okay.
The second thing is developer experience and tooling.
We discussed deployments; now, what should the developer experience be?
What APIs and SDKs are supported for the microservices?
There are two types we typically look at: standardized APIs and custom APIs.
The self-service, standardized APIs provide standard interfaces for common operations like model deployment, feature engineering, and performance monitoring.
These APIs abstract the complexity of the underlying infrastructure, allowing developers to focus on business logic rather than operational concerns.
The other piece is the software development kits: Python, Java, and Go SDKs provide idiomatic interfaces for interacting with the platform services.
The SDKs can also be used as data-scientist tooling: those environments provide helper libraries for common tasks like feature extraction, model evaluation, and deployment.
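Purely to illustrate the ergonomics, here's a hypothetical sketch of what such a Python SDK could feel like; the class and method names are invented stubs, not a real package.

```python
# Hypothetical SDK ergonomics. PlatformClient and its methods are invented
# stubs for illustration only; they are not a real library.
class PlatformClient:
    """Stub standing in for a hypothetical platform SDK."""

    def deploy_model(self, name: str, version: str, framework: str) -> str:
        # A real SDK would call the standardized deployment API here
        return f"deployment-{name}-{version}"

    def get_metrics(self, deployment_id: str, window: str) -> dict:
        # A real SDK would query the performance-monitoring API here
        return {"p95_latency_ms": 180.0, "requests": 12_000}

client = PlatformClient()
dep = client.deploy_model("pricing-gbm", "1.4.0", "tensorflow-serving")
print(client.get_metrics(dep, window="1h")["p95_latency_ms"])
```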
That's the developer experience and the tools we use.
Once we develop, the next question is where that development happens.
We've established that the platform is scalable, with SDKs and APIs for data science, and we support local development: developers can replicate the production platform architecture with lightweight alternatives like Docker Compose configurations, including mock ML models for common scenarios.
This approach enables developers to test complex workflows entirely on their local machine.
Rather than taking a copy of production, it's a simple, lightweight mechanism, and we can reproduce issues or challenges in the complex workflows, increasing development velocity.
The platform also includes sophisticated testing frameworks for ML workloads, addressing the unique challenges of testing probabilistic systems.
And on the testing side: we have continuous integration pipelines with automated testing, and we can build a full test lifecycle suite, so that once code is pulled from the production code base you can run the test cases and check the performance benchmarks and integration tests.
The CI system maintains historical performance data and automatically flags regressions in model accuracy or inference speed.
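Here's a minimal sketch of what such a regression-flagging CI test could look like with pytest; the baseline numbers, thresholds, and mock model are assumptions.

```python
# Sketch of CI tests that flag regressions in model accuracy or inference
# speed against stored benchmarks. Baselines and thresholds are assumptions.
import time

HISTORICAL_BASELINE = {"accuracy": 0.93, "p95_latency_ms": 200.0}

def mock_model_predict(x):
    time.sleep(0.01)   # stand-in for real inference work
    return 1

def test_accuracy_no_regression():
    # In CI this would evaluate the candidate model on a held-out set
    candidate_accuracy = 0.94
    assert candidate_accuracy >= HISTORICAL_BASELINE["accuracy"] - 0.01

def test_inference_latency_within_budget():
    start = time.perf_counter()
    mock_model_predict([1, 2, 3])
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < HISTORICAL_BASELINE["p95_latency_ms"]
```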
Okay.
Now, integrated development environments.
We have the local environment for testing; how do we connect it with the production environment, and what tools are necessary?
Cloud-based environments include all the necessary tools and configurations for production deployment, and GitHub Codespaces provides pre-configured workspaces accessible from anywhere.
Integration with popular IDEs through remote development extensions means developers keep their familiar tooling while benefiting from cloud-based compute.
At a high level, this is an integrated development environment that flows easily from dev, test, and stage through to production via the CI/CD platform.
Now, coming to observability: deployment is done, the system has gone live.
How do we calibrate the metrics after go-live?
What does metrics collection look like?
How is the application being used, and if an error happens, how do we get notified?
Metrics collection, distributed tracing, and visualization in dashboards provide real-time visibility into system behavior and overall system performance, including model behavior.
There's also intelligent alerting: ML algorithms analyze historical metric data to establish dynamic baselines and detect anomalies.
And metrics collection captures both traditional application metrics and model-level instrumentation.
So once we go live, these microservices provide observability and reliability out of the box: we visualize the metrics in dashboards, we have distributed tracing, and intelligent alerting fires if there are any issues.
Coming to the SLI/SLO framework: we target 99.9% availability, which works out to approximately 43 minutes of allowed downtime per month at most.
Then there's response time: it's not just whether the services are up, but how fast they respond, which is what we discussed on the earlier performance and scalability slides.
On performance, the target is 200 milliseconds: accuracy is important, and so is speed.
95% of pricing calculations complete within 200 milliseconds using these microservices, maintaining responsiveness so the user doesn't feel any slowness.
And for ML inference, the average latency is 50 milliseconds, enabling real-time pricing without perceptible delay.
These service level indicators go beyond traditional availability and latency metrics to include AI-specific measures.
The key SLIs include model inference accuracy, prediction consistency across replicas, and feature freshness for real-time predictions.
So at a high level: downtime is minimal, response time is fast, suggestions to customers come back almost immediately, and ML inference is just 50 milliseconds, which is negligible.
The user won't see any slowness, and the system is always available.
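As a quick sanity check on those availability numbers, here's the error-budget arithmetic for a 99.9% monthly SLO:

```python
# Back-of-the-envelope: a 99.9% monthly availability SLO leaves roughly
# 43 minutes of allowed downtime in a 30-day month.
slo = 0.999
minutes_per_month = 30 * 24 * 60              # 43,200 minutes
error_budget_minutes = minutes_per_month * (1 - slo)
print(f"allowed downtime: {error_budget_minutes:.0f} minutes/month")  # ~43
```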
That's how the framework looks; now, coming to incident response.
We talked about the infrastructure first, then development: the APIs that are used and the pipeline integration to production.
Then we looked at observability: the alerts and what downtime looks like.
Now, in the worst case, if an incident happens, what is the response, what is the recovery, and how do we ensure it?
When an issue occurs, circuit breakers prevent cascading failures by automatically cutting off calls to an unhealthy service.
There are automated runbooks to guide troubleshooting, created so that incident response procedures are codified: for a given failure, the tooling points the team to the fix or the responsible component.
We also have a ChatGPT integration with the incident-management communication platforms: the moment users hit an error, it's translated in the backend and the user is guided, "Hey, there was an issue with X, Y, Z; please wait, we're working on it; an incident has been created automatically."
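Here's a minimal sketch of the circuit-breaker idea mentioned above; the thresholds and cool-down are illustrative assumptions.

```python
# Minimal circuit-breaker sketch: after repeated failures the breaker
# opens and calls fail fast until a cool-down passes, preventing
# cascading failures. Thresholds are assumptions.
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0       # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
```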
And then disaster recovery.
Nowadays, if something suddenly happens to our critical data, what then?
Everything runs on ML: we gather the data, train the models behind the microservices, and serve the results, so what is the disaster recovery process?
We always take regular backups and run regular restoration tests, so if anything happens, we have a restore mechanism from our backup services, plus automated DNS updates for traffic redirection.
And coming to another important thing: the data platform integration.
One part of the story is what the microservices architecture looks like from the front; the backbone of everything is data.
Without the data, even if you build the microservices, the output won't be worth keeping.
That's the major point.
The Kafka backbone is there: what sources are we consuming, who are the consumers, and how do we stream the data based on consumer needs?
We process millions of pricing events daily, and the topics are organized by business domain, with standardized naming conventions and retention policies.
Events include timestamps and correlation IDs, enabling detailed traceability and debugging.
That's one of the important pieces of the data platform architecture: the event streaming system.
Coming to the other piece: the feature store and ML data management.
The feature store serves as a central repository for ML features, ensuring consistency between the training and serving environments; it's built on top of Feast.
Batch features support the model-training workflows, while online features enable low-latency retrieval during inference.
We use Apache Spark for large-scale data processing, with optimized algorithms for the common transformations.
And then there's data lineage: lineage tracking provides complete visibility into the feature generation process, from raw data sources through transformation steps to final feature values.
This transparency proves invaluable for debugging model issues and ensuring compliance with data governance requirements.
Coming to the next picture: security and compliance.
This is a very important aspect, because we are dealing with the organization's data, and nowadays AI tries to capture all that data in order to guide us; at the same time, we need to protect it.
What are the security measures and compliance controls?
When you submit a request, it has to stay within our architecture and within our compliance guidelines; it should not leak outside our world.
That's why all service-to-service communication goes over mutual TLS, with certificate rotation automated through cert-manager.
On top of the mutual TLS certificates there are network policies: the moment a request is sent, strict segmentation between services limits communication to explicitly authorized paths.
The other piece is identity management: OpenID Connect and OAuth 2.0.
It's token- and role-based; there's nothing like plain username-and-password access.
Everything goes through secure identity management with OAuth 2.0 and OpenID Connect.
And then secrets management: we store everything in a vault.
The keys can't be easily captured; they live in the secured vault, and access rotates frequently, with credentials changing on the order of every 15 minutes.
You connect over VPN and authenticate to the vault with short-lived credentials to get the key, and the same kind of rotation also happens for the database.
These are the key aspects of security and compliance, ensuring zero-trust principles at every layer of the platform, assuming no implicit trust between components, because at the end of the day access to these backend services is backed by single sign-on with multi-factor authentication.
So nobody can tamper with the data, and the mechanisms on the security side are very strong.
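As a sketch of that vault-based secrets flow, here's what fetching a short-lived credential could look like with the hvac client for HashiCorp Vault; the Vault address, AppRole login, and secret path are illustrative assumptions.

```python
# Sketch of fetching a rotated secret from HashiCorp Vault via hvac.
# The address, AppRole credentials, and secret path are assumptions.
import hvac

client = hvac.Client(url="https://vault.internal:8200")  # assumed address
client.auth.approle.login(role_id="...", secret_id="...")

# Read the current version of a rotated database credential (assumed path)
secret = client.secrets.kv.v2.read_secret_version(path="cpq/db-credentials")
db_password = secret["data"]["data"]["password"]  # rotated automatically
```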
Yeah.
Coming to data privacy: we talked about security and those mechanisms, TLS, OpenID Connect, OAuth, and so on; now, privacy.
Only certain people should be able to see certain data, even within the organization, and there can be industry-specific requirements.
One person can see a given price; a user in a different role should not see that price.
Data gets classified accordingly, and the rules are embedded: for example, a particular user is dedicated to configuring the model and doesn't see pricing, because they belong to the engineering department.
There are also privacy-preserving techniques: adding controlled noise to protect the data (differential privacy), and federated learning across distributed data sets.
There's a right-to-be-forgotten implementation, and data retention and deletion policies automatically remove expired data, in accordance with regulatory requirements and business policies, so you don't have to locate and purge it manually.
The platform also maintains detailed audit logs of all data access and modifications, supporting compliance reporting and forensic analysis.
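To illustrate that role-based visibility idea, here's a small sketch of field-level masking; the roles, field names, and policy are assumptions, not the platform's actual access-control implementation.

```python
# Illustrative sketch of role-based field masking: users whose role isn't
# allowed to see pricing get the quote with price fields redacted.
# Roles and field names are assumptions.
PRICE_FIELDS = {"unit_price", "discount", "total_price"}
ROLES_WITH_PRICING = {"sales", "pricing_analyst"}

def mask_quote_for_role(quote: dict, role: str) -> dict:
    if role in ROLES_WITH_PRICING:
        return quote
    return {k: ("***" if k in PRICE_FIELDS else v) for k, v in quote.items()}

quote = {"quote_id": "Q-1001", "config": "model-X", "unit_price": 118.50}
print(mask_quote_for_role(quote, "engineering"))  # price comes back redacted
```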
Coming to the other important topic, performance optimization, which we touched on lightly in an earlier slide.
First, multi-layer caching: if you keep hitting the same request, it gets cached and the response comes back very quickly and very intelligently; even if you make a small parameter change, you still get a fast answer.
The other one is database optimization.
PostgreSQL instances use advanced features, including parallel query execution and just-in-time compilation, while NoSQL databases, including MongoDB and Cassandra, handle specific workloads.
Then, resource utilization: spot instances handle batch-processing workloads, with automated failover to on-demand instances when spot capacity isn't available.
ML training jobs run during off-peak hours, when compute resources are less expensive and less contended, so the training jobs get done at lower cost.
And content-delivery-network integration accelerates delivery of static assets and API responses to global users: edge locations cache pricing catalogs and configuration data, reducing latency for international customers.
That's mainly how we approach the performance-optimization strategy.
Coming to the case studies.
So, we're all good: we have the microservices we talked about, how to use them in CPQ, the deployment strategies, the APIs behind the scenes, the performance optimization, and the scalability.
We've said a lot of good things in the slides; how do we know it all works, and what are the results?
We did a high-level analysis, taking a system built from these small microservices and measuring its capacity.
Capacity increased tenfold, from 10,000 to over a hundred thousand monthly requests, through horizontal scaling and the performance optimizations, and even at ten times the load, response time improved by 75%, from around 800 milliseconds down to under 200 milliseconds.
We also saw a 65% decrease in time-to-market for new features, measured to production, so new product rollouts are much quicker.
Building a scalable AI-powered platform requires careful consideration of architecture, operations, and developer experience; the microservices-based CPQ platform demonstrates that enterprise-grade reliability and AI innovation can coexist when supported by appropriate platform engineering practices.
All in one go, what I'm trying to say is that we're using AI everywhere, and it's giving very positive results; we take that as an advantage and use AI platforms as the foundation, plugging them into CPQ technologies and evolving how we do configuration, pricing, and quoting: very efficiently, very responsively, in a very optimized way, and very accurately.
It's event-driven, with trained models and robust platform engineering that will keep growing by using AI to drive business decisions.
To conclude: by focusing on scalability, reliability, and a great developer experience, organizations can create platforms that not only meet current needs but also adapt to future challenges, because the industry landscape keeps evolving and changing day by day, by utilizing these AI-powered enterprise systems.
That's the main story and the overall presentation I wanted to share with you.
Hope you guys have enjoyed the presentation.
And thank you everyone.
Have a great day.
Good night.
See you.
Bye.