Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, good morning everyone.
Welcome to Platform Engineering Conference.
My name is,
I have over 14 years of experience as a platform architect in the IT industry.
I'm here to give a presentation on building production-ready RAG systems: platform engineering strategies for enterprise knowledge infrastructure.
So before going into that, I would like to discuss RAG.
What is RAG?
RAG in AI stands for retrieval-augmented generation: a technique where a generative model, such as a large language model, first retrieves relevant information from external sources before generating responses, improving factual accuracy and relevance.
The core concept behind RAG is that it combines traditional information retrieval systems, which search databases, with the generative capabilities of LLMs. Instead of relying solely on static pretrained datasets, RAG allows the AI to dynamically access up-to-date information or domain-specific content and incorporate it into its output.
How does RAG work?
So a user submits a prompt or query. The system then retrieves relevant information snippets from the designated knowledge base or datasets, which might include internal documents, PDFs, emails, or web data. These retrieved snippets are passed as context alongside the prompt to the LLM. The LLM generates its response, grounded in both its model knowledge and the most relevant retrieved facts.
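To make that retrieve-then-generate flow concrete, here is a minimal sketch in Python. The embed, vector_search, and generate functions are toy stand-ins for whatever embedding model, vector database client, and LLM endpoint your platform actually uses, and the sample document is invented for illustration.

```python
from typing import List

def embed(text: str) -> List[float]:
    # Stand-in for your embedding model (sentence-transformer, hosted API, ...).
    return [float(len(text))]

def vector_search(query_vector: List[float], top_k: int = 5) -> List[str]:
    # Stand-in for a similarity search against your vector database.
    return ["HR policy doc: employees accrue 20 PTO days per year."][:top_k]

def generate(prompt: str) -> str:
    # Stand-in for a call to your LLM inference endpoint.
    return "Answer grounded in the supplied context."

def answer(query: str, top_k: int = 5) -> str:
    # 1. Retrieve the most relevant snippets for the user's query.
    snippets = vector_search(embed(query), top_k=top_k)
    # 2. Pass the snippets as context alongside the original prompt.
    context = "\n\n".join(snippets)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    # 3. The LLM generates a response grounded in the retrieved facts.
    return generate(prompt)

print(answer("How many PTO days do employees get?"))
```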
So as enterprises scale their use of LLMs, they encounter two critical challenges: hallucinations, with rates reported as high as 27%, and outdated training data.
So retrieval-augmented generation has emerged as a powerful architectural solution, bridging the gap between static model knowledge and real-time information. By coupling generative AI with enterprise-scale knowledge repositories, RAG systems enable reliable, accurate, and contextual responses at scale.
The enterprise RAG challenge: deploying production-ready RAG systems in enterprise environments is far from trivial, because platform engineers must design architectures that can handle repositories of 10-plus terabytes, deliver subsecond query response times, and maintain 95% uptime while supporting 10,000-plus concurrent users. This presentation explores the platform engineering strategies needed to operationalize RAG pipelines at scale, focusing on infrastructure, performance, reliability, security, and cost optimization.
When we come to the infrastructure architecture, scaling RAG for the enterprise is a big challenge. The backbone of any RAG system is its vector database, which stores embeddings of enterprise knowledge for retrieval. To achieve enterprise-scale reliability and throughput, platform engineers must design based on these four concepts: first, distributed vector databases; next, load balancing strategies; third, hybrid storage models; and fourth, elastic scaling. These are the four pillars when we are building a RAG system for the enterprise.
When we come to distributed vector databases, sharding embeddings across nodes ensures horizontal scalability. Systems like Weaviate, Pinecone, and FAISS provide different tradeoffs between scalability, query latency, and cost. So these are the various vector databases we can use when we are building RAG for the enterprise.
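As a rough illustration of how sharding keeps retrieval horizontally scalable, the sketch below fans a query out to every shard and merges the partial top-k results. The two in-memory shards, the dot-product scoring, and the search_shard helper are hypothetical; a real deployment would call vector database nodes over the network.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical: each shard holds a slice of the embedding space.
Shard = Dict[str, List[float]]  # doc_id -> embedding

def score(a: List[float], b: List[float]) -> float:
    # Dot product as a simple similarity score.
    return sum(x * y for x, y in zip(a, b))

def search_shard(shard: Shard, query: List[float], k: int) -> List[Tuple[float, str]]:
    # Local top-k within one shard.
    return heapq.nlargest(k, ((score(vec, query), doc_id) for doc_id, vec in shard.items()))

def search_all(shards: List[Shard], query: List[float], k: int = 3) -> List[str]:
    # Fan out to every shard (in parallel in practice), then merge the partial top-k lists.
    candidates = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, candidates)]

shards = [
    {"doc-a": [0.9, 0.1], "doc-b": [0.2, 0.8]},
    {"doc-c": [0.7, 0.3], "doc-d": [0.1, 0.9]},
]
print(search_all(shards, query=[1.0, 0.0]))  # -> ['doc-a', 'doc-c', 'doc-b']
```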
Now, when it comes to load balancing, it is very important for handling requests at scale. The load balancers must intelligently route queries across database clusters, retrieval pipelines, and LLM instances to minimize bottlenecks. So load balancers play an important role in retrieving your results.
The third one is hybrid storage. When storage comes into the picture, we combine SSD-based hot storage for frequently accessed embeddings with cold storage for the rest, which gives both performance and cost efficiency.
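A minimal sketch of that hot/cold idea, assuming an in-memory LRU dictionary stands in for SSD-backed hot storage and a slower fetch function stands in for cold object storage; the capacity and the fetch stub are illustrative.

```python
from collections import OrderedDict
from typing import List, Optional

class TieredEmbeddingStore:
    """Keep recently used embeddings on the hot tier; serve the rest from cold storage."""

    def __init__(self, hot_capacity: int = 10_000):
        self.hot: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hot_capacity = hot_capacity

    def _fetch_cold(self, doc_id: str) -> Optional[List[float]]:
        # Stand-in for reading from object storage; slow but cheap.
        return None

    def get(self, doc_id: str) -> Optional[List[float]]:
        if doc_id in self.hot:
            self.hot.move_to_end(doc_id)       # mark as recently used
            return self.hot[doc_id]
        vec = self._fetch_cold(doc_id)         # cache miss: go to the cold tier
        if vec is not None:
            self.put(doc_id, vec)              # promote to the hot tier
        return vec

    def put(self, doc_id: str, vec: List[float]) -> None:
        self.hot[doc_id] = vec
        self.hot.move_to_end(doc_id)
        if len(self.hot) > self.hot_capacity:  # evict least recently used entry
            self.hot.popitem(last=False)
```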
And the last one is elastic scaling. Kubernetes deployments allow RAG components, vector stores, retrievers, rankers, and LLMs, to scale independently based on demand.
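For elastic scaling, the Kubernetes Horizontal Pod Autoscaler essentially scales replicas in proportion to observed versus target load. The sketch below reproduces that proportional formula so each RAG component can be sized independently; the component metrics and targets are illustrative numbers, not recommendations.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Proportional scaling rule used by the Kubernetes HPA:
    desired = ceil(current_replicas * current_metric / target_metric)."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))

# Illustrative demand snapshot: each component scales on its own metric.
print(desired_replicas(current_replicas=4, current_metric=850, target_metric=500))   # retriever QPS  -> 7
print(desired_replicas(current_replicas=2, current_metric=0.4, target_metric=0.6))   # ranker CPU     -> 2
print(desired_replicas(current_replicas=8, current_metric=0.95, target_metric=0.7))  # LLM GPU util   -> 11
```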
Now the second part is performance optimization. We need to optimize our queries so that we get results in subseconds. The query latency needs to be below two seconds, because in the enterprise we expect near real-time responses from RAG systems. Reducing query times from eight to twelve seconds down to subsecond or under-two-second levels requires a multi-layer optimization strategy. To give this sub-two-second response time for any RAG system, we have to focus on these four pillars.
One is caching strategies. Your queries must be cached, keyed by a hash value. Query-level caching for repeated requests and embedding-level caching for frequently accessed documents significantly reduce redundant computation.
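A minimal sketch of query-level caching keyed by a hash of the normalized query; the in-process dictionary and the TTL are assumptions, and in production this would typically live in Redis or a similar shared cache.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple

CACHE_TTL_SECONDS = 300
_query_cache: Dict[str, Tuple[float, str]] = {}  # hash -> (stored_at, answer)

def _cache_key(query: str) -> str:
    # Normalize before hashing so trivially different phrasings of the same query hit the cache.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def cached_answer(query: str) -> Optional[str]:
    key = _cache_key(query)
    entry = _query_cache.get(key)
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]   # repeated request: skip retrieval and generation entirely
    return None

def store_answer(query: str, answer: str) -> None:
    _query_cache[_cache_key(query)] = (time.time(), answer)
```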
And the next one is hierarchical retrieval: using a two-step retrieval process, coarse filtering followed by fine-grained ranking, minimizes unnecessary vector comparisons.
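A sketch of that two-step idea, assuming a cheap metadata filter narrows the candidate set before the more expensive vector ranking runs; the document schema, the department filter, and the dot-product scoring are placeholders.

```python
from typing import Dict, List

Document = Dict[str, object]  # e.g. {"id": ..., "department": ..., "embedding": [...]}

def coarse_filter(docs: List[Document], department: str) -> List[Document]:
    # Step 1: cheap filtering (metadata, keywords, inverted index) shrinks the search space.
    return [d for d in docs if d["department"] == department]

def fine_rank(candidates: List[Document], query_vec: List[float], top_k: int = 5) -> List[Document]:
    # Step 2: expensive vector similarity only runs on the surviving candidates.
    def similarity(doc: Document) -> float:
        return sum(a * b for a, b in zip(doc["embedding"], query_vec))
    return sorted(candidates, key=similarity, reverse=True)[:top_k]

def hierarchical_retrieve(docs: List[Document], department: str, query_vec: List[float]) -> List[Document]:
    return fine_rank(coarse_filter(docs, department), query_vec)
```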
The third one is resource allocation, which also plays a vital role: GPU-accelerated retrieval and dynamic model routing. For example, small LLMs for lightweight queries and large LLMs for complex queries balance speed with cost, as in the sketch below.
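Here is a sketch of dynamic model routing under a deliberately naive heuristic: short, simple-looking queries go to a small model and everything else goes to a large one. The model names and thresholds are made up for illustration; a real router might use a classifier or token budget instead.

```python
def choose_model(query: str) -> str:
    """Route lightweight queries to a small, cheap model and complex ones to a large model."""
    words = query.split()
    looks_complex = (
        len(words) > 25                       # long, multi-part questions
        or "compare" in query.lower()
        or "explain" in query.lower()
    )
    # Hypothetical model names; substitute whatever endpoints your platform exposes.
    return "llm-large-70b" if looks_complex else "llm-small-7b"

print(choose_model("What is our PTO policy?"))                      # -> llm-small-7b
print(choose_model("Compare the Q3 churn drivers across regions"))  # -> llm-large-70b
```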
And the fourth is hybrid cloud deployment. The deployment we can choose is a hybrid cloud: locating the retrieval and inference pipelines closer to data sources via edge deployments or regional clusters reduces the latency.
And when we go to the next slide, reliability engineering: 99% availability is a must for RAG systems. The systems must be resilient to failures across multiple layers: databases, inference pipelines, and orchestration. The key practices include monitoring and observability, fault tolerance, chaos engineering, and CI/CD automation.
When it comes to monitoring and observability, we need to collect metrics across embedding throughput, vector search latency, LLM inference time, and system resource utilization. Tools we should use are Prometheus, Grafana, et cetera, and OpenTelemetry is an essential tool too for monitoring and observability.
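As a small example of the kind of instrumentation meant here, the sketch below uses the Python prometheus_client library to expose a vector-search latency histogram and an embedding throughput counter. The metric names, port, and the wrapped stub function are assumptions to adapt to your own pipeline.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; align them with your own naming conventions.
VECTOR_SEARCH_LATENCY = Histogram("rag_vector_search_seconds", "Vector search latency in seconds")
EMBEDDINGS_PROCESSED = Counter("rag_embeddings_processed_total", "Embeddings ingested")

@VECTOR_SEARCH_LATENCY.time()              # records the duration of each call
def vector_search(query: str):
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for the real similarity search
    return ["doc-1", "doc-2"]

if __name__ == "__main__":
    start_http_server(9100)                # metrics scraped by Prometheus at :9100/metrics
    while True:
        vector_search("example query")
        EMBEDDINGS_PROCESSED.inc(32)       # e.g. one ingestion batch
        time.sleep(1)
```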
Next, when it comes to fault tolerance, implement replicated clusters so that any time a failure happens, traffic automatically fails over to another region or another endpoint to keep giving you results. Also, checkpoint embeddings to recover from outages with minimal downtime. This is the main motto of fault tolerance.
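A minimal sketch of embedding checkpointing, assuming embeddings are periodically flushed to durable storage as JSON so an ingestion job can resume after an outage instead of re-embedding the whole corpus; the local file path, atomic-rename trick, and resume logic are illustrative.

```python
import json
import os
from typing import Dict, List

CHECKPOINT_PATH = "embeddings_checkpoint.json"  # in production: object storage, not local disk

def save_checkpoint(embeddings: Dict[str, List[float]], last_doc_id: str) -> None:
    # Write atomically so a crash mid-write never corrupts the previous checkpoint.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_doc_id": last_doc_id, "embeddings": embeddings}, f)
    os.replace(tmp, CHECKPOINT_PATH)

def load_checkpoint() -> Dict[str, object]:
    if not os.path.exists(CHECKPOINT_PATH):
        return {"last_doc_id": None, "embeddings": {}}
    with open(CHECKPOINT_PATH) as f:
        return json.load(f)

# After an outage, resume ingestion from the last checkpointed document.
state = load_checkpoint()
print("resuming after:", state["last_doc_id"])
```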
When it comes to chaos engineering, regularly stress-test retrieval pipelines under simulated node failures or network partitions to validate reliability assumptions. So whenever we deploy RAG models, we should also run these stress tests. And all these pipelines should be automated and deployed through a CI/CD process, and we need to keep version control mechanisms for all your RAG pipelines.
Next comes security and compliance, which is about trustworthiness for the enterprise. In the current days we need to secure our enterprise data; because of sensitivity and compliance issues, this is non-negotiable. Production-ready RAG systems must integrate zero-trust architecture, end-to-end encryption, audit trails, and data governance integration too.
Zero-trust architecture means enforcing least-privilege access across all RAG components and ensuring no implicit trust between services. End-to-end encryption means encrypting embeddings at rest and in transit to safeguard your sensitive data. Audit trails must be in place, and they need to be compliant with SOC 2 and ISO 27001.
Data governance integration ensures RAG pipelines respect existing enterprise data governance frameworks, including retention policies and access controls.
Cost management. After security and compliance comes cost management, optimizing your resources at scale. Enterprises often face skyrocketing costs as RAG systems scale, and achieving up to 60% cost reduction requires intelligent resource optimization. Autoscaling LLM inference endpoints and retrieval services means resources match demand without over-provisioning. Model tiering, using smaller open-source LLMs for routine queries and reserving premium API calls for high-value use cases, cuts costs.
Vendor negotiations also play a crucial role when we want to reduce costs for enterprise RAG systems. Enterprises managing 10-terabyte-plus repositories should leverage their scale to secure volume discounts with vector database and cloud service vendors.
Workload placement, balancing between on-premises GPUs, spot cloud instances, and managed vector services, minimizes compute and storage expenses.
The real-world case studies show proven patterns at scale. Over the last few years, across more than 50 enterprise implementations, success patterns have emerged: 85% user adoption rates for these RAG models, achieved by integrating seamlessly into existing enterprise workflows like Slack bots, CRM plugins, and knowledge portals, and 200 to 400% ROI in 18 months, driven by productivity gains, reduced support costs, and faster decision making. These implementations also demonstrate proven architecture blueprints: modular pipelines that combine vector databases, orchestrators, and inference layers, allowing gradual scaling without wholesale system rewrites. When it comes to actionable frameworks for platform engineers, to succeed with RAG models in production deployments, platform engineers should adopt the following frameworks.
The first one is vector search design. Your architecture should include a vector database for vector search; this is the first component when you are designing your enterprise RAG system. The vector search design establishes embedding update frequencies, retrieval logic, and ranking algorithms aligned with the business needs.
Next, your deployment should be automated. Use infrastructure as code like Terraform or Helm to standardize RAG deployments across environments. Your deployment should be seamless and automated using your Terraform code.
Then performance benchmarking: create benchmarks measuring end-to-end latency, retrieval accuracy, and throughput under peak load. You need to do your performance testing before deploying your RAG models so that you will be aware of how your responses come back on huge, terabyte-scale data requests.
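A rough sketch of such a benchmark harness: it fires concurrent queries at the pipeline and reports p50/p95/p99 end-to-end latency and throughput. The run_query stub and the load parameters are placeholders for your real pipeline and peak-load profile.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query: str) -> str:
    # Stand-in for one end-to-end RAG call (retrieve + rank + generate).
    time.sleep(0.05)
    return "answer"

def benchmark(num_queries: int = 200, concurrency: int = 20) -> None:
    latencies = []

    def timed(i: int) -> None:
        start = time.perf_counter()
        run_query(f"synthetic query {i}")
        latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, range(num_queries)))
    wall = time.perf_counter() - wall_start

    latencies.sort()
    p = lambda q: latencies[int(q * (len(latencies) - 1))]
    print(f"p50={p(0.50):.3f}s  p95={p(0.95):.3f}s  p99={p(0.99):.3f}s")
    print(f"throughput={num_queries / wall:.1f} queries/sec")

if __name__ == "__main__":
    benchmark()
```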
Next is operational excellence. Define SLAs: what is your availability, 99.9%, and your less-than-two-second response time, and minimize your error rates and latency so that they align with the enterprise expectations. This all comes under operational excellence.
So these, as I discussed, are the main core architecture components in any RAG model. First is ingestion: how you ingest your data. Next is the vector database: how you store and search your query requests. The third one is retrieval: how you retrieve your output. Next is ranking of the retrieved output. And finally, LLM inference. An event-driven architecture integrates these components while maintaining separation of concerns, allowing each element to scale independently based on demand patterns. This modular approach enables enterprises to upgrade individual components without disrupting the entire system. Next comes the performance benchmarking framework for RAG systems.
So platform engineers should focus on these five key metrics: end-to-end query latency under various load conditions; retrieval accuracy compared to ground truth; system throughput at peak concurrent user levels; resource utilization across compute, memory, and storage; and cost per query at different scale points. These benchmarks provide the foundation for continuous optimization and capacity planning.
Security implementation checklist. When we go into the security of data, you need to implement authentication and authorization, you need to protect your data, and you need to keep your data compliant with standards such as ISO 27001. For authentication and authorization, implement OAuth 2.0 or OIDC for user authentication, and enforce role-based access control for all RAG components, integrated with enterprise identity solutions. These are the best practices we need to follow when deploying RAG models at the enterprise level, making sure those RAG models are securely deployed.
For data protection, data at rest and in transit must be encrypted using the AES-256 algorithm. Implement the TLS 1.3 protocol for all service-to-service communications. Apply data masking for sensitive information in logs, alongside your role-based access control.
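A small sketch of masking sensitive values before they reach the logs, using a standard-library logging filter. The regular expressions cover only email addresses and long digit runs and would need extending for your own data classes.

```python
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # crude stand-in for SSNs, card numbers, etc.

class MaskingFilter(logging.Filter):
    """Redact sensitive values from every log record before it is emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        msg = EMAIL.sub("[REDACTED-EMAIL]", msg)
        msg = LONG_DIGITS.sub("[REDACTED-NUMBER]", msg)
        record.msg, record.args = msg, ()
        return True

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag")
logger.addFilter(MaskingFilter())

logger.info("query from jane.doe@example.com about account 123456789012")
# INFO:rag:query from [REDACTED-EMAIL] about account [REDACTED-NUMBER]
```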
For compliance, maintain detailed audit logs for all system interactions, implement retention policies aligned with regulatory requirements, and conduct regular security assessments and penetration testing. Implementing these security measures ensures that RAG systems meet enterprise compliance requirements while protecting data throughout the generation process.
And the last slide, when we come to the conclusion. RAG represents a paradigm shift in enterprise knowledge infrastructure, transforming static knowledge models into dynamic, context-aware systems. For platform engineers, the challenge lies not only in implementing pipelines, but also in how you are going to scale your RAG models. You have to design your architecture to scale both vertically and horizontally, implement the best security practices like we discussed and follow the proper protocols, and optimize them for production. By mastering infrastructure design, performance optimization, reliability engineering, security, and cost management, platform engineers can build production-ready RAG systems that deliver business value, drive adoption, and generate measurable ROI. Enterprises that adopt these strategies will not only overcome the limitations of traditional LLMs, but also unlock a new era of AI-driven knowledge access: reliable, scalable, and future-proof.
That's all.
Any questions?
Thank you.
Thank you for providing this opportunity.