Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, this is Han Joseph Raach.
Today I'm going to talk about building high performance AI
literature processing platforms.
So AI has been a buzzword for the past several years, and let's talk about how AI can be utilized for navigating the complex ecosystem of research out there. Research is happening every day; there is plenty of research going on, and the publications out there number in the thousands and thousands. The challenge is not about publishing research, it's about keeping up with it. Without AI-powered platforms, critical research discoveries, something that may be phenomenal or may be a vital component another research effort could build on top of, may stay buried in the exponentially growing volume of publications. For instance, more than 10,000 biomedical papers are published every day.
So let's talk about RAG, retrieval-augmented generation, as a foundation pattern. Where do we need RAG? There are lots of LLM-powered models, like Gemini, Claude, Grok, and other engines out there. So what is the value addition being done by RAG? Let's talk about it through these slides. The main idea is that RAG prevents hallucinations. When we talk about research discovery, we want accuracy; we want the correct details to be rendered. RAG prevents hallucinations by grounding LLM reasoning in authoritative references.
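As a minimal sketch of that grounding step, assuming hypothetical retrieve() and call_llm() helpers in place of a real vector store and LLM API, the key point is that the prompt is built only from retrieved passages:

```python
# Minimal RAG sketch: ground the LLM answer in retrieved passages.
# retrieve() and call_llm() are hypothetical stand-ins for a vector
# search client and an LLM API; swap in whatever your platform uses.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Placeholder: return the k most relevant passages for the query.
    corpus = {
        "BRCA1 mutations are associated with hereditary breast cancer.":
            {"brca1", "breast", "cancer", "mutation"},
        "TP53 is a tumor suppressor gene frequently mutated in cancers.":
            {"tp53", "tumor", "suppressor", "cancer"},
    }
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(terms & corpus[p]), reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (Gemini, Claude, etc.).
    return f"[answer generated from prompt of {len(prompt)} chars]"

def answer(query: str) -> str:
    passages = retrieve(query)
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer using ONLY the passages below. "
        "If the passages do not contain the answer, say so.\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

print(answer("What gene is linked to hereditary breast cancer?"))
```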
Let's talk about the role of a distributed vector database in there, and its architecture. Unlike keywords, embeddings capture context, and distributed indexes ensure sub-second retrieval at scale. So we are not only concerned about navigating through the complex ecosystem of research; we also want it to happen efficiently. The distributed vector database helps with semantic understanding. We also have distributed indexing, which means sharding across nodes enables parallel processing and fault tolerance. And then there is sub-second retrieval: optimized nearest-neighbor search algorithms maintain speed at scale. We want this entire body of research to be navigated and the right information surfaced in split-second timing. That's what we care about: the reliability, the accuracy, and also the response time.
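A single-shard sketch of what embedding-based retrieval looks like, assuming a toy embed() function in place of a real embedding model; a production deployment would shard the index across nodes and use an approximate nearest-neighbor library, but the retrieval contract is the same:

```python
# Single-shard sketch of embedding-based retrieval with cosine similarity.
# A real deployment would shard the index across nodes and use an ANN
# library for sub-second search over millions of vectors; embed() here
# is only a deterministic toy stand-in for a sentence-embedding model.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

docs = [
    "BRCA1 mutations raise hereditary breast cancer risk.",
    "Aspirin inhibits platelet aggregation.",
    "TP53 is a tumor suppressor gene.",
]
index = np.stack([embed(d) for d in docs])      # shape: (num_docs, dim)

def search(query: str, k: int = 2) -> list[tuple[float, str]]:
    q = embed(query)
    scores = index @ q                          # cosine similarity on unit vectors
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), docs[i]) for i in top]

for score, doc in search("genes linked to breast cancer"):
    print(f"{score:+.3f}  {doc}")
```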
Event-driven microservice architecture. Elastic scaling during document search surges keeps ingestion and inference smooth. The benefits of this are independent scaling of components, fault isolation, and technology flexibility, and we can use a messaging system like Kafka, stateless services, and event-driven workflows. So it's not about exactly how we are going to implement it; we may mix and match different tools and technologies, but we need an architecture that scales and helps us operate without being worried about the volume of transactions.
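As a rough sketch of the event-driven ingestion side, using the kafka-python client and assuming a broker on localhost and an illustrative topic name; each consumer group would run as its own stateless service and scale independently of the producer:

```python
# Event-driven ingestion sketch with kafka-python (pip install kafka-python).
# The broker address and topic name are assumptions for illustration.
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "papers.ingested"   # hypothetical topic name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_paper(paper_id: str, title: str) -> None:
    # Emit an event instead of calling downstream services directly, so
    # embedding, indexing, and entity extraction can each scale on their own.
    producer.send(TOPIC, {"paper_id": paper_id, "title": title})
    producer.flush()

def run_embedding_worker() -> None:
    # One stateless consumer in the "embedding-workers" group.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        group_id="embedding-workers",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        paper = message.value
        print("embedding paper", paper["paper_id"])  # real work would go here

publish_paper("PMID-0001", "Example biomedical paper")
```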
Efficient LLM inference pipelines. Here we are talking about techniques like batching, reuse, and caching. They help us cut cost drastically while enabling instant answers. Batching means combining multiple requests to maximize GPU utilization. We have lots of GPU resources, but they are costly, and we have to use them wisely: if it is needed and if it is efficient, we combine multiple requests to maximize GPU utilization. Response caching means storing common queries and responses so the same computations are not done repeatedly. And model quantization means optimizing model size without sacrificing quality.
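A minimal sketch of request micro-batching, assuming a placeholder run_model() call; the batch size and timeout are illustrative values, not tuned numbers:

```python
# Micro-batching sketch: combine concurrent requests into one model call
# so the GPU processes a full batch instead of many single queries.
import queue
import threading
import time

MAX_BATCH = 8
MAX_WAIT_S = 0.02
request_queue: queue.Queue = queue.Queue()

def run_model(prompts: list[str]) -> list[str]:
    # Placeholder: one batched forward pass on the GPU.
    return [f"answer({p})" for p in prompts]

def batcher() -> None:
    while True:
        first = request_queue.get()           # block until at least one request
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and time.monotonic() < deadline:
            try:
                timeout = max(deadline - time.monotonic(), 0)
                batch.append(request_queue.get(timeout=timeout))
            except queue.Empty:
                break
        outputs = run_model([prompt for prompt, _ in batch])
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)

def infer(prompt: str) -> str:
    reply_q: queue.Queue = queue.Queue(maxsize=1)
    request_queue.put((prompt, reply_q))
    return reply_q.get()

threading.Thread(target=batcher, daemon=True).start()
print(infer("What does BRCA1 do?"))
```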
We talked about RAG to avoid hallucination and bring more accuracy while navigating document search and getting the truth out of the research publications out there. But it's not only about getting the right information, or getting it with split-second timing; it's also about keeping up with research updates. Research changes quickly, and there are more and more research publications out there. Something might have changed, something may be outdated, and new information might be coming in. So we also need to keep up to date with the research out there.
Real-time knowledge synchronization helps with that. It involves techniques like continuous ingestion, incremental updates, conflict resolution, and knowledge integration.
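A minimal sketch of incremental updates with a simple newest-version-wins conflict rule, assuming illustrative record fields and an in-memory dict standing in for the real vector store or knowledge graph:

```python
# Incremental-update sketch: apply only new or changed records and
# resolve conflicts by keeping the newest version of each paper.
from dataclasses import dataclass

@dataclass
class Record:
    paper_id: str
    version: int
    abstract: str

store: dict[str, Record] = {}

def apply_incremental_update(batch: list[Record]) -> list[str]:
    changed = []
    for rec in batch:
        current = store.get(rec.paper_id)
        if current is None or rec.version > current.version:
            store[rec.paper_id] = rec        # new or newer wins (conflict resolution)
            changed.append(rec.paper_id)
        # older or duplicate versions are skipped instead of reprocessed
    return changed

first = [Record("PMID-1", 1, "Initial abstract")]
second = [Record("PMID-1", 2, "Corrected abstract"), Record("PMID-2", 1, "New paper")]
print(apply_incremental_update(first))    # ['PMID-1']
print(apply_incremental_update(second))   # ['PMID-1', 'PMID-2']
```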
Entity extraction at scale. Here I'm talking about models specific to bioinformatics or biomedicine. Domain-specific models extract genes, proteins, and diseases, merging them into knowledge graphs for deeper insights. Entity types such as genes and proteins may have relationships like causes and caused by, and those entity types and relationship types are stored at scale. That means we can navigate easily through these entities and their relationships and find the right information we need.
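A small sketch of entity extraction feeding a knowledge graph, where extract_entities() is a hypothetical stand-in for a domain-specific biomedical NER model and relations are stored as simple triples:

```python
# Sketch: extracted entities merged into a tiny knowledge graph of
# (subject, relation, object) facts, e.g. ("BRCA1", "associated_with",
# "breast cancer"). extract_entities() is a placeholder for a real
# biomedical NER model.
from collections import defaultdict

def extract_entities(sentence: str) -> list[tuple[str, str]]:
    known = {"BRCA1": "GENE", "TP53": "GENE", "breast cancer": "DISEASE"}
    return [(name, label) for name, label in known.items() if name in sentence]

graph: dict[str, set[tuple[str, str]]] = defaultdict(set)

def add_relation(subject: str, relation: str, obj: str) -> None:
    graph[subject].add((relation, obj))

sentence = "BRCA1 mutations are associated with breast cancer."
entities = extract_entities(sentence)
genes = [name for name, label in entities if label == "GENE"]
diseases = [name for name, label in entities if label == "DISEASE"]
for gene in genes:
    for disease in diseases:
        add_relation(gene, "associated_with", disease)

print(dict(graph))   # {'BRCA1': {('associated_with', 'breast cancer')}}
```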
We also need to understand horizontal scaling strategies. Rather than scaling up with bigger machines, we scale horizontally, distributing embeddings and inference across GPU pools. Distributing the work and scaling horizontally helps us avoid cost when capacity is not needed and meet the demand when it is really needed.
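As a rough sketch of sharding a workload across a worker pool, using Python's multiprocessing as an illustration; the worker count and the embed_chunk() placeholder are assumptions, and in production each worker would map to a node or GPU in the pool:

```python
# Horizontal-scaling sketch: shard an embedding workload across a pool of
# workers instead of relying on one big machine.
from multiprocessing import Pool

def embed_chunk(chunk: list[str]) -> int:
    # Placeholder: embed every document in the chunk, return how many were done.
    return len(chunk)

def shard(items: list[str], num_shards: int) -> list[list[str]]:
    # Round-robin split of the document list into num_shards chunks.
    return [items[i::num_shards] for i in range(num_shards)]

if __name__ == "__main__":
    documents = [f"paper-{i}" for i in range(1000)]
    with Pool(processes=4) as pool:
        counts = pool.map(embed_chunk, shard(documents, 4))
    print("documents embedded per worker:", counts)
```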
Caching layers and cost optimization. Caching ensures instant results without invoking the full pipeline, keeping latency low and cost sustainable. For a frequently accessed query, we don't need to recompute everything every minute, or refresh it every 30 minutes or so. We can probably update it with the next incremental update, but until then we can retrieve from the cache to save cost. The cache can be an embedding cache, a retrieval cache, or a response cache.
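A minimal sketch of two of those layers, a response cache with a TTL in front of the full pipeline plus an embedding cache; the TTL value and the run_full_pipeline() and cached_embedding() placeholders are assumptions for illustration:

```python
# Caching-layer sketch: response cache with a TTL plus an embedding cache.
import time
from functools import lru_cache

RESPONSE_TTL_S = 30 * 60            # serve cached answers until the next incremental update
_response_cache: dict[str, tuple[float, str]] = {}

def run_full_pipeline(query: str) -> str:
    # Placeholder for retrieval + LLM inference (the expensive path).
    return f"fresh answer to: {query}"

@lru_cache(maxsize=100_000)
def cached_embedding(text: str) -> tuple[float, ...]:
    # Placeholder embedding; caching avoids recomputing vectors for repeated text.
    return (float(len(text)), float(sum(map(ord, text)) % 997))

def answer(query: str) -> str:
    hit = _response_cache.get(query)
    if hit is not None:
        stored_at, response = hit
        if time.time() - stored_at < RESPONSE_TTL_S:
            return response             # cache hit: no GPU work at all
    response = run_full_pipeline(query)
    _response_cache[query] = (time.time(), response)
    return response

print(answer("What does BRCA1 do?"))    # computed
print(answer("What does BRCA1 do?"))    # served from the cache
```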
Monitoring and observability. End-to-end monitoring captures throughput, latency, and inference quality, and observability ensures issues are caught before they escalate. So we have to capture performance metrics: throughput, latency, error rates, resource utilization, quality metrics, anomaly detection, and distributed tracing. That is how we make sure that whatever we deploy works, meets the demand, and helps us get the right information right on time.
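One way to sketch that instrumentation in Python uses the prometheus_client library; the metric names, the port, and the simulated handle_query() are assumptions for illustration:

```python
# Observability sketch with prometheus_client (pip install prometheus-client).
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

QUERIES = Counter("queries_total", "Total queries processed")
ERRORS = Counter("query_errors_total", "Queries that failed")
LATENCY = Histogram("query_latency_seconds", "End-to-end query latency")

def handle_query(query: str) -> str:
    start = time.perf_counter()
    QUERIES.inc()
    try:
        time.sleep(random.uniform(0.005, 0.05))   # stand-in for retrieval + inference
        return f"answer to {query}"
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)            # metrics exposed for scraping on port 8000
    for i in range(100):
        handle_query(f"query-{i}")
```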
Reliability and error handling. Failures are inevitable in any system, so when we talk about system design, we also need to talk about handling failures. What matters is resilience, and graceful degradation ensures uninterrupted service. Resilience patterns means circuit breakers, retries with backoff, fallbacks, and redundant components. Graceful degradation may mean simpler models as a backup, cached responses when live inference fails, partial results when complete results are unavailable, and clear error communication. So we need to adopt a technique suitable for that particular instance and make sure it's not a complete failure while we work to rectify the problem that is happening in production.
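A minimal sketch of retry with exponential backoff, a simple circuit breaker, and a graceful fallback; the thresholds and the always-failing call_primary_model() are illustrative assumptions, not production values:

```python
# Resilience sketch: retries with backoff, a circuit breaker, and a fallback.
import time

FAILURE_THRESHOLD = 3
failures = 0
circuit_open_until = 0.0

def call_primary_model(prompt: str) -> str:
    raise TimeoutError("primary model unavailable")   # simulate an outage

def call_fallback_model(prompt: str) -> str:
    return f"[simpler backup model] answer to: {prompt}"

def answer(prompt: str, retries: int = 3) -> str:
    global failures, circuit_open_until
    if time.monotonic() < circuit_open_until:
        return call_fallback_model(prompt)             # circuit open: degrade gracefully
    for attempt in range(retries):
        try:
            result = call_primary_model(prompt)
            failures = 0
            return result
        except TimeoutError:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                circuit_open_until = time.monotonic() + 30   # stop hammering for 30 s
                break
            time.sleep(0.1 * (2 ** attempt))           # exponential backoff
    return call_fallback_model(prompt)

print(answer("What does BRCA1 do?"))
```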
Infrastructure as code and CI/CD. This is a common practice in software engineering. Infrastructure as code ensures reproducibility, and CI/CD pipelines validate changes and roll out updates safely. Infrastructure definitions can be Terraform, CloudFormation, or Kubernetes manifests. For automated testing, we can embed the automated tests in the codebase to make sure they run along with the pipeline. The CI/CD pipeline then runs gradual rollouts, canary deployments, and blue-green strategies, with performance fallback and continuous monitoring of the deployed changes.
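A small sketch of a canary gate that a pipeline might run before promoting a rollout; the fetch_error_rate() lookup and the error-rate budget are assumptions, and in practice the numbers would come from the monitoring system:

```python
# CI/CD sketch: compare canary vs. stable error rates before promoting.
import sys

ERROR_RATE_BUDGET = 0.01   # canary may be at most 1 percentage point worse

def fetch_error_rate(deployment: str) -> float:
    # Placeholder: query metrics for the given deployment.
    return {"stable": 0.002, "canary": 0.004}[deployment]

def canary_is_healthy() -> bool:
    return fetch_error_rate("canary") - fetch_error_rate("stable") <= ERROR_RATE_BUDGET

if __name__ == "__main__":
    if canary_is_healthy():
        print("promote canary to full rollout")
    else:
        print("roll back canary")
        sys.exit(1)
```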
Containerization and deployment strategies. When we talk about the runtime environment of any code, we also need to talk about containerization. Microservices run in isolated containers, orchestrated for scaling, resilience, and seamless updates. The benefits of containers: they provide a consistent environment, isolated dependencies, and efficient resource usage, and orchestration gives us features like autoscaling, self-healing, rolling updates, and load balancing. When it comes to deployment strategies, we can adopt blue-green deployments, canary releases, feature flags, et cetera.
We also need to come up with benchmarks, draw a line, and do the performance validation. Benchmarks validate design goals, and testing under load and failure scenarios guides capacity planning. So we have to define those metrics. For example, latency should be less than 50 milliseconds for typical queries; throughput is how many queries should be processed per second; accuracy is how much accuracy we expect. We have to define those metrics, constantly monitor that they are met, and make sure we are on track and deliver what we intend to deliver. And uptime, the system availability, is a very crucial part.
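A minimal sketch of such a benchmark, measuring throughput and p95 latency against the 50 millisecond target mentioned above; run_query() is a placeholder for the real endpoint and the load level is illustrative, not a real load test:

```python
# Benchmarking sketch: p95 latency and throughput vs. a defined target.
import random
import statistics
import time

P95_TARGET_S = 0.050

def run_query(query: str) -> None:
    time.sleep(random.uniform(0.005, 0.040))   # stand-in for a real request

def benchmark(num_queries: int = 200) -> None:
    latencies = []
    start = time.perf_counter()
    for i in range(num_queries):
        t0 = time.perf_counter()
        run_query(f"query-{i}")
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    p95 = statistics.quantiles(latencies, n=20)[-1]   # 95th percentile
    print(f"throughput: {num_queries / elapsed:.1f} queries/s")
    print(f"p95 latency: {p95 * 1000:.1f} ms "
          f"({'PASS' if p95 < P95_TARGET_S else 'FAIL'} vs 50 ms target)")

if __name__ == "__main__":
    benchmark()
```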
Cross-domain applicability. Though I talked about biomedical research, the same patterns can be extended to law, finance, insurance, or any type of knowledge work. For legal, for instance, we can have case law, statutes, and regulatory documents; for finance, market reports, filings, and analyses; technical documents like specifications, manuals, and standards; and historical archives like manuscripts, records, and cultural artifacts. In certain industries we may have plenty of documents.
Conclusion. High-performance platforms, with carefully engineered pipelines like these, give us a blueprint that accelerates discovery across science and industry. It's not about deploying an LLM for our need; it's about engineering an ecosystem around it. Only then can we achieve the accuracy and the speed, and keep up to date to avoid outdated information being transmitted to the critical science community. The architecture patterns we have explored provide a foundation for building scalable, reliable, and efficient AI literature processing platforms that can handle the exponential growth of publications across domains.
Thank you for this opportunity. I hope this presentation has helped a few in the audience. Thank you.