Conf42 Site Reliability Engineering 2021 - Online

Building Near Real-time Fully Managed Analytics Solution with Minimum to No Coding on AWS


Abstract

To create value, companies must derive real-time insights from a variety of data sources that produce data at high velocity and volume, enabling them to react faster in real time to events affecting the business. The need for analysing heterogeneous data from multiple sources (internal/external) is greater than ever. This makes the analytics landscape ever evolving, with numerous technologies and tools, and the platforms more and more complex. Therefore, building a future-proof analytics solution is not only time consuming but also costly, involving selection of the right stack, acquiring talent, and ongoing platform management and monitoring.

In this session, we'll discuss and demo how you can leverage the AWS stack to create a near real-time analytics solution with minimum to no coding for an e-commerce website, with an option to integrate with pre-existing data sources. The solution offers the following advantages:

  • easy to build
  • elastic and fully managed
  • highly available and durable
  • seamless integration with AWS Services
  • pay for what you use

Summary

  • In today's session I am going to talk about streaming and near real-time analytics. Building a future-proof solution is not only time consuming, but costly. In this session we'll discuss and demo how you can leverage AWS products and services to create a near real-time analytics solution with minimum to no coding.
  • Companies must derive real-time insights from a variety of data sources. Gartner in 2019 emphasized that data integration requirements these days demand more real-time streaming, replication and virtualization capabilities. To achieve a better customer experience, organizations need to work with the freshest data possible.
  • The world is moving towards stream-based, near real-time or real-time interaction across systems. As a result, we are seeing customers replace message queues with data streams to provide an immediate boost in the capability of their architecture.
  • Organizations face many challenges as they attempt to build out real-time data streaming capabilities. The AWS solution is easy to set up and use, and has high availability and durability, with data replicated across three Availability Zones by default. With AWS, you only pay for what you use, making the solution very cost effective.
  • Data streaming technology lets customers ingest, process and analyze high volumes of high-velocity data from a variety of sources in real time. The Kinesis services work together and provide flexibility and options to tailor your streaming architecture to your specific use cases.
  • Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It continuously captures gigabytes of data per second from hundreds of thousands of sources. Collected data is available in milliseconds to enable real-time analytics use cases. You only pay for the resources you use, for as little as $0.015 per shard hour.
  • Kinesis Data Firehose captures, transforms and loads streaming data into S3, Redshift, Elasticsearch or Splunk. It enables near real-time analytics with existing business intelligence tools. It is a fully managed service that automatically scales to match the throughput of your data. The whole setup can be done within the 20-minute demo.
  • Every time a user clicks on something, an HTTP request goes to CloudFront. The click information is streamed through the Kinesis data stream and then consumed by Kinesis Data Firehose. Everything is serverless, and the actual data is just a few clicks away in QuickSight.
  • The Conf42 website demo: creating a Kinesis data stream basically asks for the capacity of the stream, and from that it calculates how many shards you need. Then we move on to CloudFront.
  • A CloudFront real-time log configuration captures the clickstream of whatever requests go through CloudFront. We went with a 100% sampling rate, so every request is captured, and it is delivered to the Kinesis data stream. Then we go back to the architecture to see how much has been covered.
  • With Athena, you can get data from different sources, create a workflow, and use the workflow to build a ready data set for your analytics tools to pick up. Every time the workflow runs, your S3 data lake is populated with the latest data, which you can then pull into QuickSight.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.

Hello everyone. In today's session I am going to talk about streaming and near real-time analytics - near real-time analytics, not real-time. In today's world, most companies are looking for near real-time or real-time analytics, which is of course challenging to achieve most of the time. Gone are the days when you would get the data the next day, analyze it over the next week and try to figure out what to do and how to react to a situation like fraud. Things have changed. To create value, companies must derive real-time insights from a variety of data sources that are producing high-velocity data, and of course in huge volumes. Having an analytics solution in place enables faster reaction in real time to events that affect the business. The need for analyzing heterogeneous data from multiple sources, whether internal or external, is greater than ever, making the analytics landscape ever evolving, with numerous technologies and tools, and making the platform more and more complex. So building a future-proof solution is not only time consuming but costly, because it involves quite a lot of things: selecting the right technologies, acquiring the right talent pool, and ongoing platform management, operations and monitoring. What I'm going to talk about is how to make building this platform a bit easier and less expensive, keeping in mind that most of the time an analytics solution is an afterthought: you first build the core features, and only then talk about analytics, by which time you probably have less budget left.

So in this session we'll discuss and demo how you can leverage AWS products and services to create a near real-time analytics solution with minimum to no coding - note, minimum to no coding - for an e-commerce website. Of course it's a demo website; I've built a one-page site just to showcase how the data flows from one end to the other, and how you can integrate with pre-existing data sources if you need to. And most of the time you probably will need to, because you would need to integrate with other back-end systems and do joined-up analytics and reporting. The solution needs a set of advantages, no different from any other. In this case it is easy to build: as I said, there is no coding, or very minimal coding, depending on how exhaustive a feature set you want to build. It is elastic and fully managed: it auto-scales horizontally and vertically and is fully managed by AWS, so it is pretty much a serverless solution. It is, like any other AWS products and services, highly available and durable. It has seamless integration with other AWS services like Lambda, ECS, Fargate or EKS, RDS, and S3, which is again a core aspect of the whole solution as the data lake. And last but not least is the cost: it is pay as you go - if you don't use it, you don't pay for it.

So let's quickly go over the agenda. Over the next 10-15 minutes I'll quickly cover why real-time data streaming and analytics, the principles of data streaming, and near real-time streaming on AWS - what options we have in hand.
At the end I'll go over one use case and a demo covering that use case end to end. So let's quickly turn our attention to why real-time analytics. As I briefly covered earlier, companies must derive real-time insights from a variety of data sources. Gartner in 2019 emphasized that data integration requirements these days demand more real-time streaming, replication and virtualization capabilities. Gone are the days when you do offline processing over days or weeks or months. I think that pretty much sets the scene.

Before I go into the details, I just wanted to take you through a quick case study: Epic Games' Fortnite. Real-time data streaming and analytics keeps gamers engaged, resulting in one of the most successful games currently in the market. For those who are not that familiar with it, Fortnite is set in a world where players can cooperate on various missions, fight back against a mysterious storm, or attempt to be the last person standing in the game's battle royale mode. It has become a phenomenon, attracting more than 125 million players in less than a year, so it is quite popular in that sense. So what is the challenge here? It is a free-to-play game, with revenue coming entirely from in-game microtransactions, meaning its revenue depends on continuously capturing the attention of gamers through new content and continuous innovation. To operate this way, Epic Games needs an up-to-the-minute understanding of gamer satisfaction, helping guarantee an experience that keeps them engaged. That's a challenge, because you need to understand how every gamer is reacting, make sure they're happy all the time and their experience is seamless, and collect data at the same time. The solution: Epic collects billions of records on a daily basis, tracking virtually everything happening in the game - how players interact, how often they use certain weapons, and even the strategies they use to navigate the game universe. More than 14 petabytes of data are stored in a data lake powered by Amazon S3 - so Amazon S3 plays a significant role here - growing by two petabytes per month. It's a massive amount of information.

As you can see, data loses value over time; this was published by Forrester. From left to right, time-critical decisions are made within minutes, and as you move towards the right the data becomes more historical, used for batch processing, business intelligence reports or machine learning training data. The value of data diminishes over time, so to get the most value from data, it must be processed at the velocity at which it is created at the source. Organizations, in pursuit of better customer experience, will inevitably need to start driving towards more reactive, intelligent and real-time experiences. They just can't wait for data to be batch processed and end up making decisions and taking actions too late. Reactivity will differentiate your business. To achieve a better customer experience, organizations need to work with the freshest data possible. I think that's pretty clear. Now let's go further and look at the different use cases for data in an organization, in terms of the timeline for using that data.
As you can see here, messaging between microservices is a classic example where you need millisecond latency; you can't afford minutes there. Responsive analytics, such as web and mobile application notifications, also need to happen within milliseconds, while the micro-interactions between the front end and the back end are happening. Then there is log ingestion - think IoT device maintenance, or CDC, where you capture changes from a source database to a destination - where you can tolerate delays of seconds. Whereas in a typical ETL, data lake or data warehouse scenario, you can have minutes, hours or days of delay in your analytics. Again, this clearly articulates the importance of data over time.

So what is the trend? One of the great things about data streams is that many customers find they can be used for messaging, and they enable the development of real-time analytics applications down the road. As a result, we are seeing customers replace message queues with data streams to provide an immediate boost in the capability of their architecture. Effectively, they're moving away from batch workflows to lower-latency, stream-based applications, and data streams become the event backbone for services: streams have become the backbone of event-driven microservice interaction. Messaging is still there, but it is slowly moving into this real-time streaming mode. And of course we have MSK, the managed Kafka service, one of the AWS services we'll talk a little bit about. We also talked about CDC change streams from databases, and streaming machine learning and real-time automation are also slowly becoming popular. The fundamental message here is that the world is moving towards stream-based - near real-time or real-time - interaction across systems.

So what happens here? Effectively, you ingest data as it is generated, and you process it without interrupting the stream. That is important: while you are processing data, the ingestion - the whole streaming process - should not get disturbed or delayed. So you ingest, then you process the data, and then you create analytics, which can be real-time, near real-time, or completely batch based. The idea is to decouple each of those stages, making sure they are all frictionless, while creating that real-time or near real-time experience for the consumers. Fundamentally, the principles of data streaming are: data must be produced, captured and processed in milliseconds; data must be buffered, enabling parallel and independent I/O; and data must be captured and processed in the order it is produced. So there are three fundamental needs. Number one, in order to be real time, data needs to be produced, captured and processed in milliseconds; if not, you can't react in real time. Second, you need a system that scales up to support the ingestion needs of your business, but also allows you to build your own applications on top of the data collected; otherwise you'll need to chain data feeds together, which adds latency and erodes your ability to react in real time.
Third, ordering is critical, because your application needs to be able to tell the story of what happened, when it happened and how it happened, relative to other events in the pipeline. While we're talking about real time, the sequence of events - the order in which records are processed - is extremely important; otherwise you lose track of when things happened while things are happening really fast. So all three are equally important, as I articulated.

Moving on: what are the challenges of data streaming? Organizations face many challenges as they attempt to build out real-time data streaming capabilities and embark on real-time analytics. Data streams are difficult to set up, tricky to scale, hard to make highly available, complex to integrate into broader ecosystems, error prone and complex to manage, and over time they can become very expensive to maintain. As I mentioned in my introduction, these challenges have often been reason enough for many companies to shy away from such projects; in fact, they always get pushed back for those various reasons. The good news is that at AWS it has been a core focus over the last five-plus years to build a solution that removes those challenges.

So let's talk a little bit about what is available for real-time and near real-time streaming on AWS, and how you address those challenges there. The AWS solution is easy to set up and use, has high availability and durability (by default, data is replicated across three Availability Zones), and is fully managed and scalable, removing the complexity of managing the systems over time and scaling as demand increases. It also comes with seamless integration with other core AWS services such as Elasticsearch for log analytics, S3 for data lake storage, Redshift for data warehousing, Lambda for serverless processing, et cetera. Finally, with AWS you only pay for what you use, making the solution very cost effective. Basically, you pay only for the part you use; if you don't use it, you do not pay for it. That makes the whole solution very cost effective, and cost is of course one of the biggest criteria when deciding to build analytics: how much it is going to cost up front as well as on an ongoing basis in terms of maintenance.

Now let's talk about streaming data architecture. Data streaming technology lets customers ingest, process and analyze high volumes of high-velocity data from a variety of sources in real time - we have been talking about this. You enable the ingestion and capture of real-time streaming data and store it based on your processing requirements (this is essentially what differentiates it from an MQ type of setup), and you process it to tap into real-time insights: you can set alerts and email notifications, trigger other event-driven applications, and finally move the data to a persistence layer. So what are those steps? First, data sources: devices and/or applications that produce real-time data at high velocity. Then stream ingestion: data from tens of thousands of data sources can be written into a single stream; you need a pipe you can push the data through. Then stream storage: data is stored in the order it was received, for a set duration, and can be replayed indefinitely during that time. Kinesis Data Streams is the AWS product that provides this ingestion and storage.
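As a rough sketch of what the ingestion step looks like in code, here is a minimal producer writing one clickstream event with boto3; the stream name, region and field names are placeholders rather than the ones used later in the demo. Note that the partition key is what preserves ordering: all records with the same key land on the same shard and are read back in the order they were written.

    import json
    import boto3

    # Region and stream name are assumptions for illustration only.
    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def send_click_event(user_id: str, product_id: str, action: str) -> None:
        """Write a single click event to the stream."""
        event = {"user_id": user_id, "product_id": product_id, "action": action}
        kinesis.put_record(
            StreamName="demo-clickstream",            # hypothetical stream name
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=user_id,                     # same key -> same shard -> per-user ordering
        )

    send_click_event("user-123", "product-42", "buy")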
With Kinesis Data Streams you can store that streaming data for up to a year, so you can effectively replay it as many times as you want if you need it later. You can decide to keep it for a day, a month, or any duration, up to 365 days. Once it is stored, you then process the data: records are read in the order they are produced, enabling real-time analytics or streaming ETL. We'll cover this again as part of Kinesis Data Firehose, the product we'll be using for our demo. And at the end you store the data for a longer duration; it could be a data lake like S3, a database, or any other solution you might think of.

So what are those near real-time and real-time streaming products within AWS? This is the Kinesis set of products. We made it very easy: customers can collect, process and analyze data and video streams in real time without having to deal with the many complexities mentioned before. The Kinesis services work together and provide flexibility and options to tailor your streaming architecture to your specific use cases. Kinesis Data Streams allows you to collect and store streaming data at scale, on demand. Kinesis Data Firehose is a fast and simple way to stream data into data lakes or other destinations, again at scale, with the ability to execute serverless data transformations as required. Kinesis Data Analytics allows you to build, integrate and execute streaming applications in SQL and Java. These three services work together to enable customers to stream, process and deliver data in real time. Then there is MSK, Amazon Managed Streaming for Apache Kafka, a fully managed service for customers who prefer Apache Kafka or use Apache Kafka alongside Kinesis to enable specific use cases. And at the end there is Kinesis Video Streams, which allows customers to capture, process and store media streams for playback, analytics and machine learning. Out of these five, we'll be using the first two, Kinesis Data Streams and Kinesis Data Firehose, as part of the demo.

Moving on, a little more about Kinesis Data Streams and Firehose. KDS, popularly known as Kinesis Data Streams, is a massively scalable and durable real-time data streaming service. It continuously captures gigabytes of data per second from hundreds of thousands of sources such as website clickstreams (that is the one we'll be using in the demo), database event streams, financial transactions, social media feeds, IT logs and location-tracking events. The data collected is available in milliseconds, enabling real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing and many more. You can make your streaming data available to multiple real-time analytics applications, to S3 or to AWS Lambda within about 70 milliseconds of the data being collected. That is fast; you probably can't get better than that between data being ingested and it being pushed to S3 or Lambda. It is durable and secure, and easy to use: it has the KCL (Kinesis Client Library), connectors and agents, and it integrates easily with Lambda, Kinesis Data Analytics and Kinesis Data Firehose. And it is elastic: a stream can dynamically scale from megabytes to terabytes of data per hour and from thousands to millions of PUT records per second, and you can adjust the throughput of your stream at any time based on the volume of your input data.
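And for a rough idea of what a consumer looks like at the API level, here is a small polling sketch with boto3; the stream name, shard ID and region are again placeholders, and in practice you would more likely use the KCL, a Lambda event source mapping, or Kinesis Data Firehose (as in this demo) instead of hand-rolled polling.

    import json
    import time
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")   # assumed region

    # Start reading new records from one shard of a hypothetical stream.
    iterator = kinesis.get_shard_iterator(
        StreamName="demo-clickstream",
        ShardId="shardId-000000000000",
        ShardIteratorType="LATEST",
    )["ShardIterator"]

    while True:
        response = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            print(json.loads(record["Data"]))        # assumes the producer wrote JSON payloads
        iterator = response["NextShardIterator"]     # continue from where this batch ended
        time.sleep(1)                                # stay well under per-shard read limits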
And of course it is low cost: Kinesis Data Streams has no upfront cost, and you only pay for the resources you use, for as little as $0.015 per shard hour. For the latest pricing you can go to the AWS website and get more detailed information.

Then, moving on to Firehose. Kinesis Data Firehose, again a fully managed service, is the easiest way to reliably load streaming data into data lakes; we'll be using it as part of the demo to load the data into S3. It captures, transforms and loads streaming data into S3, Redshift, Elasticsearch or Splunk, enabling near real-time analytics with the existing business intelligence tools you are already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform and encrypt data before loading it, minimizing the amount of storage used at the destination and increasing security. You can easily create a Firehose delivery stream from the AWS console and configure it with a few clicks - and we'll cover how easy it is to configure, because we will literally finish the demo end to end, for an e-commerce website clickstream data collection, in 15 to 20 minutes. Within that time we will set up Kinesis Data Streams, Kinesis Data Firehose, Lambda, S3 and, at the end, Amazon QuickSight - all within the 20-minute demo. In real life you might spend a few hours to set it up and get the data flowing for the first time. With Kinesis Data Firehose you only pay for the amount of data you transmit through the service, plus data format conversion where applicable; there is no minimum fee or setup fee, like many other AWS services.

Now, we have seen the five steps of the data streaming architecture; let's see how those steps align with the AWS products we have been talking about. In this example, from the left, there are data sources producing millions of records from different devices and applications, and that data is being streamed. The stream is stored in KDS, Kinesis Data Streams. The stream is then processed using Kinesis Data Analytics or Kinesis Data Firehose (or you could use Kinesis Video Streams, which is not shown on the screen here). And at the end you store the data for longer-term analytics. That's pretty much what I wanted to discuss about the AWS streaming solution before getting into the use case and the demo.

Now let's start the demo. Before I go to the browser console and take you through it, I just want to spend a few minutes explaining the use case and the architecture behind the whole demo. It's a very simple use case: I have a demo website listing a set of products, and it has two actions, Buy and View Details. The objective of this demo is to capture user actions: how many people are clicking on Buy, and which product they're buying. As simple as that. Now, quickly, from the left: the users log in or sign up.
Then they browse the products, view the product details, and decide whether or not to buy. That's it. Now let's move to the actual architecture. The simple website is hosted in an S3 bucket and is accessed through CloudFront, and in front of CloudFront you have the web application firewall, popularly known as WAF. Every time a user clicks on something, an HTTP request goes to CloudFront, and that click information is streamed through the Kinesis data stream. From the Kinesis data stream it is consumed by Kinesis Data Firehose, and for every record consumed by Firehose you can trigger a Lambda to do additional processing of the data. Once it is processed by the Lambda, it goes to S3. Of course, Kinesis Data Firehose can send data to many destinations, including S3, Redshift, Elasticsearch and so on; in this case we are pushing all the clickstream information from CloudFront, through the Kinesis data stream, to Firehose and on to S3. Once it lands in S3, we use AWS Glue, the serverless ETL platform, and its crawler to create a data model for the whole analytics platform, and then we view the data through the QuickSight and Athena integration. Again, everything is serverless, and the actual data is just a few clicks away in QuickSight.

The entire flow, depending on how you have automated the setup, can take a few minutes end to end, definitely less than five. In Kinesis Data Firehose there is a minimum buffer interval of one minute, so when the data comes from CloudFront through the Kinesis data stream to Firehose there is a minimum delay of one minute, because it accumulates for a minute and then pushes to the next stage. By the time it arrives in S3 it may be a couple of minutes; then the crawler is triggered, the workflow runs and puts the data in place, and you can do data integration with your other systems, which might take a few more minutes; and then eventually it appears in QuickSight. So let's say within about five minutes of somebody clicking on a product, you see it in QuickSight. The idea in this case is to see, maybe at the end of the day or maybe every hour, when you have launched a product, which product is more popular, which one people are buying more, viewing more, that kind of information. You don't need to build any other analytics platform; you just use the clickstream data and view it in QuickSight, where it can be used by the business users.

With that, I will now move to the browser to show you the entire setup. We'll start with the Kinesis data stream - I've already pre-created it, but I'll show you how to create one - and then we'll move to CloudFront and see how CloudFront is bound to the Kinesis data stream. So let me switch to the browser now. All right, I have Kinesis Data Streams here, and I've already created a data stream called Conf42 website demo. I could create one now, but it takes a few minutes to deploy. It's very simple and straightforward: it basically asks for the capacity of your Kinesis data stream, and that capacity is calculated from the average size of your records and the number of records coming in per second - say you have ten records per second coming in and a single stream to capture them - and from that it works out how many shards you need.
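As a back-of-the-envelope illustration of that sizing step (my own sketch, based on the published per-shard limits of roughly 1 MB or 1,000 records per second for writes and 2 MB per second for shared reads), the arithmetic looks something like this:

    import math

    def estimate_shards(avg_record_kb: float, records_per_second: int, consumers: int = 1) -> int:
        """Rough shard estimate from average record size, record rate and consumer count."""
        incoming_mb_per_sec = avg_record_kb * records_per_second / 1024.0
        write_shards = max(
            math.ceil(incoming_mb_per_sec / 1.0),       # 1 MB/s write limit per shard
            math.ceil(records_per_second / 1000.0),     # 1,000 records/s write limit per shard
        )
        read_shards = math.ceil(consumers * incoming_mb_per_sec / 2.0)  # 2 MB/s shared read per shard
        return max(write_shards, read_shards, 1)

    # The example from the demo: ten small records per second, one consumer -> a single shard.
    print(estimate_shards(avg_record_kb=1, records_per_second=10, consumers=1))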
For more information about shards and so on, you can go to the AWS documentation when you have time. So I've created that; now let's move on to CloudFront. I have created a CloudFront distribution which is pointing to my S3 origin - they are all pointing to the same S3 origin, which is basically a single-page website like this, as simple as that: I have Buy buttons and View Details buttons, and that's pretty much it. Now, in CloudFront, I have gone to the Logs section, where you have real-time configurations. A real-time log configuration captures the clickstream: whatever requests go through CloudFront, it can capture them. Let me create one (I won't save it, because I've already created one) just to show you what information you need to provide. You give it a name and a sampling rate, which is nothing but the percentage of the clickstream data you want to capture - is it 50%, is it 100%? Let's say I put 100%. Then you choose which fields of the request you want to capture: all of these fields that come as part of the HTTP request are available, but the most important one for us is the URI query parameters. Apart from that, you can capture other information like the country, the port, the IP addresses, et cetera. And then you specify the endpoint where the data will be delivered. Remember I created the KDS, the Kinesis data stream Conf42 website demo; here in the other tab, this is the Kinesis data stream I'm connecting to. Once it is connected, back in the CloudFront real-time configuration, as I showed you, you can pick up whatever information you want from the incoming requests. We decided to go with a 100% sampling rate, so every request is captured, and it is delivered to the Kinesis data stream which we already created earlier.

Now let's go back to the architecture and see how much we have covered. I'll just pull up the architecture slide. Here you go: we already have the S3 bucket with the website, we have configured CloudFront, and we have configured the data stream. So now every request is going to CloudFront, and the request stream is also being delivered to the Kinesis data stream. Next we'll set up Kinesis Data Firehose so that it can consume those clickstream records from the Kinesis data stream, process them with a Lambda, and push them to the S3 bucket. Let's go back to the browser. Here is Kinesis Data Firehose; I've already created a delivery stream, so let's see what is in it. As always, this is the source: as we saw in the architecture diagram, we are getting the data from the Kinesis data stream, consuming it, and transforming the records using a Lambda function.
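For reference, a Firehose record-transformation Lambda follows a fixed contract: it receives base64-encoded records and must return each one with its recordId, a result status and re-encoded data. The sketch below illustrates that contract in Python; it is not the exact function used in the demo, and the field handling is an assumption (CloudFront real-time log records are actually tab-separated fields rather than JSON).

    import base64
    import json

    def handler(event, context):
        """Kinesis Data Firehose transformation Lambda: decode, reshape and re-encode each record."""
        output = []
        for record in event["records"]:
            payload = base64.b64decode(record["data"]).decode("utf-8")
            click = json.loads(payload)               # illustrative: pretend the payload is JSON
            transformed = {
                "product_id": click.get("product_id"),
                "product_name": click.get("product_name"),
                "status": click.get("action"),
            }
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",                       # or "Dropped" / "ProcessingFailed"
                "data": base64.b64encode((json.dumps(transformed) + "\n").encode("utf-8")).decode("utf-8"),
            })
        return {"records": output}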
The Lambda function in the demo is nothing complicated: it just takes the input - all those attributes coming in from the request headers - massages it, and picks out the product ID, product name, status, et cetera. Very simple and straightforward, but you could of course have a much more complex Lambda function that creates different alerts and so on if you want. Coming back: I'm putting the output into the destination bucket, which is the conf 42 website demo September 2021 bucket, and I'm adding a prefix so that I know which year, month, day and hour the request came through, for my further analysis. There is some other configuration, like encryption, which you can ignore for the time being. I also have a separate bucket set up for errors: if a processing error happens, the record goes into that bucket.

Now let's go to S3. Since I've already sent some requests, my S3 bucket is already populated with some records. You can see the website prefix, then the year 2021, then the month, and then day ten and day eleven of September; records came in at 01:00 UTC and at 02:00 UTC. So the data is all there in S3. Once it is in S3, if you remember, we have Glue and the crawler. I have a workflow already created here; it goes through S3 and creates a data catalog, a database catalog. So I have a database called conf 42 demo and a table it has created called website, because the prefix of this S3 bucket is "website", so it has picked that up. Back in the console, if you look at this table, it has picked up all those attributes which are part of the request, and it has partitioned the data by year, month, day and hour so that your analytics can run faster, processing based on those partitions. Now I have the data here and the tables created; I just need to trigger this workflow. It again crawls through the latest data which has arrived in the S3 bucket and puts it into the table. As visitors visit the website, the data keeps coming in, and you can schedule your workflow to run as frequently as you want. Every time the workflow runs, your S3 data lake gets populated with the latest data, and the same data can then be pulled into QuickSight. But before I go to QuickSight, I just want to show you how the workflow looks. This is the typical workflow: it crawls and then populates the tables over the data in S3. You can see that there have been multiple runs of this workflow, and if I open one of them it shows that every step went through successfully, including the last run.

All right, now let's go to QuickSight. In QuickSight you can create dashboards from different sources, as you can see on the screen. Out of the box we could have gone straight from S3, but in that case we couldn't have done data integration from different sources, whereas with Athena you can get data from different sources, create a workflow, and use that workflow to integrate them and create a ready data set for your analytics tools to pick up.
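As an aside, once the crawler has created the table, the same data can also be queried directly from code. The sketch below runs an Athena query with boto3 and prints the results; the database name, table name, partition columns and results bucket are assumptions based on what is described in the demo, not confirmed values.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")     # assumed region

    QUERY = """
        SELECT product_name, COUNT(*) AS clicks
        FROM website                      -- table created by the Glue crawler (assumed name)
        WHERE year = '2021' AND month = '09'
        GROUP BY product_name
        ORDER BY clicks DESC
    """

    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "conf42_demo"},                  # assumed database name
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then print each result row (the first row is the header).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows[1:]:
            print([col.get("VarCharValue") for col in row["Data"]])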
I'm not going to create a new data set, because I've already created one, so let me show you that data set. It already has 34 rows in it and was last refreshed twelve minutes ago. Let's see whether refreshing it again increases the row count - I'm not sure whether I made any more requests in between - but while that is happening, I want to show you how the data actually looks in QuickSight. This is the data you can see by day and hour: the dark blue is the data from the 10th of September, which came in at 09:00, 10:00, 12:00 and 01:00, and on the 11th it is at 01:00 and 02:00 in the afternoon. You can then dissect the data however you want: by product name to see which product was clicked more, by hour as you have seen, or by month, although in this case there is only one month of data, the 10th and the 11th. So that's pretty much it.

If I go back to the presentation for a moment: in summary, I think we have covered it pretty much end to end. Looking from the left, the user accesses the website, hosted in S3, through CloudFront. For every request that goes through CloudFront, the request log stream also goes to the Kinesis data stream, and the Kinesis data stream is connected to Kinesis Data Firehose as a consumer. When the records arrive at Firehose, the Lambda is invoked for every record, and you can do anything you want with that record using the Lambda: you can use it to communicate with further downstream systems for specific scenarios, or send a notification through SNS or email, anything you want. Let's say you have a very high-value product which people are clicking on a lot but not buying: you could count those clicks, keep the count in a DynamoDB table, and when the count reaches a certain number, have the Lambda send a notification raising the concern. You can build all those different scenarios. After that, the data goes to S3, again with the prefix we defined - website, year, month, day and hour - and you can define that however you want. Then the Glue job and crawler pick the data up from S3 based on the frequency that has been set, and eventually, through the Athena and QuickSight integration, the data is available in QuickSight.

If you look at the entire end-to-end architecture, there was no coding involved; only the Lambda was used, just to manipulate the clickstream for better clarity of the data that goes through S3 to QuickSight. Apart from that, there is nothing else. Effectively, you know the user behaviour - how they are clicking on the different products - in QuickSight within minutes. That's pretty much it. Thank you for joining in; good to have you guys. If you want to learn more about the AWS analytics platform, QuickSight and the rest, you can go to the AWS website and, based on which area you want to focus on, get additional information. Thank you once again. Have a wonderful day.

Shubhankar Sumar

Senior Solutions Architect @ AWS



