Transcript
            
            
            
            
            
            
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.
            
            
            
Hello everyone. In today's session I am going to talk about streaming and near real time analytics. Note that this is near real time analytics, not real time.

If you look at today's world, most companies are looking for near real time or real time analytics, which is of course challenging to achieve most of the time. Gone are the days when you would get the data the next day, analyze it over the next week, and then try to figure out what to do and how to react to a situation like fraud. Things have changed. To create value, companies must derive real time insights from a variety of data sources that produce high velocity data, and of course in huge volumes. Having an analytics solution in place enables faster reaction, in real time, to events that are affecting the business.

The need for analyzing heterogeneous data from multiple sources, whether internal or external, is greater than ever. That makes the analytics landscape ever evolving, with numerous technologies and tools, and makes the platform more and more complex. Building a futuristic solution is not only time consuming but costly, because it involves quite a lot of things: selecting the right technologies, acquiring the right talent pool, and ongoing platform management, operations and monitoring.

What I'm going to talk about is how to make this platform build a bit easier and less expensive, keeping in mind that most of the time building an analytics solution is an afterthought: you first build the core features, and only then talk about analytics, and by that time you probably have less budget left. So in this session we'll discuss and demo how you can leverage AWS products and services to create a near real time analytics solution with minimum to no coding.
            
            
            
Note: minimum to no coding, for an e-commerce website. Of course it's a demo website; I've built a one-page website just to showcase how the data flows from one end to the other, and at the same time how you can integrate with pre-existing data sources if you need to. Most of the time you probably will need to, because you'll want to integrate with other back end systems and do joint analytics and reporting.
            
            
            
Of course, the solution needs to have a set of advantages, no different from any other. In this case, it is easy to build: as I said, there's no coding, or very minimal coding, depending on how exhaustive a feature set you want to build. It is elastic and fully managed: it auto-scales horizontally and vertically and is fully managed by AWS; it is pretty much a serverless solution. It is, like any other AWS product or service, highly available and durable. It integrates seamlessly with other AWS services like Lambda, ECS, Fargate or EKS, RDS, and S3, which is again a core aspect of the whole solution, the data lake. And last but not least is the cost: it is pay as you go. If you don't use it, you don't pay for it.
            
            
            
So let's quickly go over the agenda. Over the next 10 to 15 minutes, I'll quickly cover why real time data streaming and analytics matters, the principles of data streaming, and near real time streaming on AWS: what options do we have in hand? And at the end I'll go over one use case and a demo covering that use case end to end.
            
            
            
Let's quickly turn our attention to why real time analytics. As I briefly covered earlier, companies must derive real time insights from a variety of data sources. Gartner emphasized this in 2019, saying that data integration requirements these days demand more real time streaming, replication and virtualization capabilities. Gone are the days when you do offline processing in days, weeks or months, right? I think that pretty much sets the scene.
            
            
            
Now, before I go into the details, I just wanted to take you through a quick case study: Epic Games' Fortnite. Real time data streaming and analytics guarantees that gamers stay engaged in this game, resulting in one of the most successful games currently in the market. For those who are not familiar with it, Fortnite is set in a world where players can cooperate on various missions, fight back against a mysterious storm, or attempt to be the last person standing in the game's battle royale mode. It has become a phenomenon, attracting more than 125 million players in less than a year, so it is quite popular in that sense.

So what is the challenge here? It is a free to play game with revenue coming entirely from in-game microtransactions, meaning its revenue depends on continuously capturing the attention of gamers through new content and continuous innovation. To operate this way, Epic Games needs an up to the minute understanding of gamer satisfaction, helping guarantee an experience that keeps players engaged. That's a challenge, right? You need to understand every gamer: how they are reacting, how to make sure they're happy all the time and their experience is seamless, while at the same time collecting data.

So what was the solution? Epic collects billions of records on a daily basis, tracking virtually everything happening in the game: how players interact, how often they use certain weapons, and even the strategies they use to navigate the game universe. More than 14 petabytes of data are stored in the data lake, powered by Amazon S3, so Amazon S3 plays a significant role here, and it grows by two petabytes per month. It's a massive amount of information.
            
            
            
As you can see, data loses value over time; this chart was published by Forrester. Reading from left to right, time critical decisions are made within minutes, and as you move towards the right the data becomes more historical and is better suited for batch processing, business intelligence reports, or machine learning training data. So the value of data diminishes over time. To get the most value from data, it must be processed at the velocity at which it is created at the source. Organizations, in pursuit of a better customer experience, will inevitably need to start driving towards more reactive, intelligent and real time experiences. They just can't wait for data to be batch processed, making decisions and taking actions too late. Reactivity will differentiate your business. To achieve a better customer experience, organizations need to work with the freshest data possible. I think that's pretty clear.
            
            
            
Now, if I go further down and look at the different use cases of data, and what kinds of use cases you would have in an organization in terms of the timeline for using that data: as you can see here, messaging between microservices is a classic example where you need millisecond latency; you can't afford minutes there. Responsive analytics within an application, and mobile application notifications, also need to happen within milliseconds, when things are happening behind the front end and micro-interactions are occurring. Then there is log ingestion, think IoT device maintenance or CDC, where you capture changes from a source and apply them to a destination database; in those scenarios you can accept seconds of delay. Whereas in a typical ETL, data lake or data warehouse scenario, you can have minutes, hours or even days of delay in terms of analytics. Again, this clearly articulates the importance of data over time as we move forward. So what is the trend?
            
            
            
One of the great things about data streams is that many customers find they can be used for messaging, and they enable the development of real time analytics applications down the road. As a result, we are seeing customers replace message queues with data streams to provide an immediate boost in the capability of their architecture; effectively, they're moving away from batch workflows to lower latency streaming applications. Data streams are also becoming the event spinal cord for services: streams have become the backbone of event-driven microservice interaction. Messaging is still there, but it's slowly moving into this real time streaming mode. And of course we have MSK, the managed Kafka service; we'll talk a little bit about that AWS service shortly. Again, there is CDC, change streams out of databases, and streaming machine learning or real time automation, which is also slowly becoming popular. The fundamental message here is that the world is moving towards stream based, near real time or real time, interaction across systems.
            
            
            
So what happens here? Effectively, you ingest data as it is generated, and you process it without interrupting the stream. That is important: while you are processing data, your ingestion, the whole streaming process, should not get disturbed or delayed. So you have ingestion, then you process the data, and then you create analytics, which can be real time, near real time, or completely batch based. The idea is to decouple each one of these stages, making sure they're all frictionless, and at the same time create that real time or near real time experience for the consumers, right?
            
            
            
              right? So fundamentally, if you
            
            
            
              look at the principle of data streaming,
            
            
            
              there are producers which
            
            
            
              data can be produced, captured and processed in millisecond data,
            
            
            
              then data buffered, enabling parallel and
            
            
            
              independent I o. And data must be captured and
            
            
            
              processed in order that they're produced. So there are three fundamental
            
            
            
              need. So number one, in order to
            
            
            
              be real time, data needs to be procured,
            
            
            
              sorry, produced, captured and processed in
            
            
            
              milliseconds. If not, you can't react
            
            
            
              in real time. That is important. So you have to be able to produce,
            
            
            
              capture and process in milliseconds in seconds.
            
            
            
              Second, you need a system that scales up to support the
            
            
            
              ingestion needs of your business, but also allows you to build
            
            
            
              your own application on top of the data collected. Otherwise, you'll need
            
            
            
              to be chaining data fits together
            
            
            
              and which adds latency and erodes your ability to
            
            
            
              react in real time. Third,
            
            
            
              ordering is critical because your application need to be able to tell the
            
            
            
              story of what happened, when it happened,
            
            
            
              how it happened, related to other events in the pipeline.
            
            
            
              So that is also while you're talking about real
            
            
            
              time, the sequence of event
            
            
            
              or sequence of record process also is extremely important.
            
            
            
              Otherwise you lose track of when things happen within the
            
            
            
              when things are happening in really fast.
            
            
            
              So all three are equally important.
            
            
            
              AWS I articulated. Now moving on.
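As a rough illustration of the ordering point, here is a minimal producer sketch with boto3. The stream name is hypothetical, not something defined in this talk; the key idea is that Kinesis preserves ordering per shard, so records that share a partition key are read back in the order they were written.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_click(product_id: str, action: str) -> None:
    """Publish one clickstream event; records with the same partition key
    land on the same shard, so their relative order is preserved."""
    kinesis.put_record(
        StreamName="conf42-website-demo",  # hypothetical stream name
        Data=json.dumps({"product_id": product_id, "action": action}).encode("utf-8"),
        PartitionKey=product_id,           # ordering guarantee is per partition key / shard
    )

publish_click("prod-101", "view")
publish_click("prod-101", "buy")
```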
            
            
            
Now, moving on: what are the challenges of data streaming? Organizations face many challenges as they attempt to build out real time data streaming capabilities and embark on generating real time analytics. Data streams are difficult to set up, tricky to scale, hard to make highly available, complex to integrate into broader ecosystems, error prone and complex to manage; over time, they can become very expensive to maintain. As I mentioned in my introduction, these challenges have often been enough of a reason for many companies to shy away from such projects; in fact, they always get pushed back for those various reasons. The good news is that at AWS, it has been a core focus over the last five plus years to build solutions that remove those challenges.
            
            
            
So let's talk a little bit about what is there for real time and near real time streaming on AWS. How do you address those challenges with AWS? The AWS solution is easy to set up and use; it has high availability and durability, with data replicated by default across three Availability Zones; it is fully managed and scalable, reducing the complexity of managing the systems over time and of scaling as demand increases; and it comes with seamless integration with other core AWS services, such as Elasticsearch for log analytics, S3 for data lake storage, Redshift for data warehousing, Lambda for serverless processing, et cetera. Finally, with AWS you only pay for what you use, making the solution very cost effective. Basically, you pay only for the part you use; if you don't use it, you do not pay for it. That makes the whole solution very cost effective, which is of course one of the biggest criteria when deciding to build analytics: how much it is going to cost, both up front and on an ongoing basis in terms of maintenance.
            
            
            
So let's talk about the streaming data architecture. Data streaming technology lets customers ingest, process and analyze high volumes of high velocity data from a variety of sources in real time, as we have been discussing. You enable the ingestion and capture of real time streaming data, store it based on your processing requirements (this is essentially what differentiates it from an MQ type of setup), and process it. To tap into real time insights, you can set alerts, send email notifications, trigger other event driven applications, and finally move the data to a persistence layer. So what are those steps?
            
            
            
First, your data sources: devices and/or applications that produce real time data at high velocity. Then stream ingestion: data from tens of thousands of data sources can be written into a single stream; you need a pipe through which you can push the data. Once you push the data through, you should be able to store it: data is stored in the order received for a set duration and can be replayed indefinitely during that time. With the Kinesis products, Kinesis Data Streams specifically, you can store that streaming data for up to a year, so you can effectively replay it as many times as you want if you need to in the future; you can decide to keep it for a day, a month, or any duration up to 365 days. Then, once you have stored it, you process that data: records are read in the order they are produced, enabling real time analytics or streaming ETL. We'll cover this again as part of Kinesis Data Firehose, the product we'll be using for our demo. And at the end you store the data for a longer duration, in a data lake like S3, a database, or any other solution you might think of.
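On the retention point above, a data stream defaults to 24 hours of retention and can be extended up to 365 days (8,760 hours); a minimal sketch with boto3, using a hypothetical stream name:

```python
import boto3

kinesis = boto3.client("kinesis")

# Extend retention so records can be replayed for up to a year.
kinesis.increase_stream_retention_period(
    StreamName="conf42-website-demo",  # hypothetical stream name
    RetentionPeriodHours=8760,         # 365 days, the current maximum
)
```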
            
            
            
So what are those near real time and real time streaming products within AWS? This is the Kinesis set of products. We made it very easy: customers can collect, process and analyze data and video streams in real time without having to deal with the many complexities mentioned before. The Kinesis services work together and provide the flexibility and options to tailor your streaming architecture to your specific use cases. Kinesis Data Streams allows you to collect and store streaming data at scale, on demand. Kinesis Data Firehose is a fast and simple way to stream data into data lakes or other destinations, again at scale, with the ability to execute serverless data transformations as required. Then you have Kinesis Data Analytics, which allows you to build, integrate and execute applications in SQL and Java. These three services work together to enable customers to stream, process and deliver data in real time. Then you have MSK, Amazon Managed Streaming for Apache Kafka, a fully managed service for customers who prefer Apache Kafka or who use Apache Kafka alongside Kinesis to enable specific use cases. And at the end you have Kinesis Video Streams, which allows customers to capture, process and store media streams for playback, analytics and machine learning. Out of these five, we'll be using two of them, the first two, Kinesis Data Streams and Kinesis Data Firehose, as part of the demo.
            
            
            
Moving on, I'll cover a little more about Kinesis Data Streams and Firehose. KDS, popularly known as Kinesis Data Streams, is a massively scalable and durable real time data streaming service. It can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as website clickstreams (the one we'll be using as part of the demo), database event streams, financial transactions, social media feeds, IT logs and location tracking events. The data collected is available in milliseconds, enabling real time analytics use cases such as real time dashboards, real time anomaly detection, dynamic pricing and many more. So you can make your streaming data available to multiple real time analytics applications, to S3, or to AWS Lambda within about 70 milliseconds of the data being collected. That is fast, right? You can't really get better than that: taking data as it is ingested and pushing it to S3 or Lambda within 70 milliseconds. It is durable and secure, and easy to use: it has the KCL, the Kinesis Client Library, plus connectors and agents, and it integrates easily with Lambda, Kinesis Data Analytics and Kinesis Data Firehose. It is elastic: you can dynamically scale your applications, and a Kinesis data stream can scale from megabytes to terabytes of data per hour and from thousands to millions of PUT records per second. You can dynamically adjust the throughput of your stream at any time based on the volume of your input data. And of course it is low cost: Kinesis Data Streams has no upfront cost, and you only pay for the resources you use, for as little as about $0.015 per shard hour. For the latest pricing you can go to the AWS website and get more detailed information.
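To make the consumption side concrete, here is a very small polling consumer sketch with boto3 (the stream name is hypothetical). In practice you would more likely use the KCL, a Lambda event source mapping, or Firehose, as described above; this is only meant to show that records come back in per-shard order.

```python
import time
import boto3

kinesis = boto3.client("kinesis")

STREAM = "conf42-website-demo"  # hypothetical stream name

shards = kinesis.list_shards(StreamName=STREAM)["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="LATEST",  # only new records from this point on
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in out["Records"]:
        print(record["Data"])             # records arrive in per-shard order
    iterator = out["NextShardIterator"]   # advance the cursor
    time.sleep(1)                         # simple backoff to respect read limits
```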
            
            
            
Now, moving on to Firehose. Kinesis Data Firehose, again, is a fully managed service and the easiest way to reliably load streaming data into data lakes; we'll be using it as part of the demo to load the data into S3. It captures, transforms and loads streaming data into S3, Redshift, Elasticsearch and Splunk, enabling near real time analytics with the existing business intelligence tools you are already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security. You can easily create a Firehose delivery stream from the AWS console and configure it with a few clicks; and again, we'll cover how easy it is to configure, because we'll literally be finishing the demo end to end, clickstream data collection for an e-commerce website, in 15 to 20 minutes. Within that time we will be setting up Kinesis Data Streams, Kinesis Data Firehose, Lambda, S3 and, at the end, Amazon QuickSight. All of it can be done within the demo time of 20 minutes; in real life you might spend a few hours to set it up and get the data for the first time, when you attempt it for the first time. With Kinesis Data Firehose, you only pay for the amount of data you transmit through the service and, if applicable, for data format conversion. There is no minimum fee or setup fee, as with many other AWS services.
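Besides reading from a Kinesis data stream, as in the demo, Firehose also accepts direct PUTs. A minimal sketch with boto3; the delivery stream name is hypothetical:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Send a single JSON line straight to a Firehose delivery stream.
firehose.put_record(
    DeliveryStreamName="conf42-website-demo-delivery",  # hypothetical name
    Record={
        "Data": (json.dumps({"product_id": "prod-101", "action": "buy"}) + "\n").encode("utf-8")
    },
)
```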
            
            
            
So, we have seen the five steps of the data stream architecture; now we'll see how those steps align with the AWS products, or how they map onto the AWS products we have been talking about. This is an example. From the left, there are data sources producing millions of records from different devices and different applications, and that data is getting streamed. The stream is being stored in KDS, Kinesis Data Streams. Then the stream is being processed using Kinesis Data Analytics or Kinesis Data Firehose, or you could use Kinesis Video Streams, which is not shown on the screen here. And at the end you store that data for longer term analytics. So that's pretty much what I wanted to cover about the AWS streaming solution before getting into the use case and the demo.
            
            
            
Now let's start the demo. But before I actually go to the browser console and take you through the entire demo, I just wanted to spend a few minutes explaining the use case and the architecture behind it. It's a very simple use case: I have a demo website which lists a set of products, and it has two actions, Buy and View Details. The objective of this demo is to capture the user actions, such as how many people are clicking on Buy and which products they're buying. As simple as that.
            
            
            
Now quickly, from the left: users log in or sign up, then they browse the products, then they view the product details, and then they decide whether or not to buy. That's it. Now let's move into the actual architecture.
            
            
            
The simple website is hosted in an S3 bucket, and the website is accessed through CloudFront; in front of CloudFront you have the web application firewall, popularly known as WAF. Now, every time a user clicks on something, an HTTP request goes to CloudFront, and that click information is streamed through Kinesis Data Streams. From the Kinesis data stream it is consumed by Kinesis Data Firehose, and for every record Firehose consumes, you can trigger a Lambda to do additional processing of the data. Once it is processed by Lambda, it goes to S3. Of course, Kinesis Data Firehose can send data to many destinations, including S3, Redshift, DynamoDB and others. In this case we are pushing all that clickstream information from CloudFront, through the Kinesis data stream, to Firehose, and on to S3. Once it lands in S3, we use the serverless ETL platform of AWS, Glue with a crawler, to catalog it and create a data model for the whole analytics platform. And then we view the data through the QuickSight and Athena integration. Again, everything is serverless, a few clicks away from the actual data appearing in QuickSight.
            
            
            
The entire operation, depending on how you have automated the whole setup, could take up to a few minutes, definitely less than five. Here there's a minimum buffer time of one minute: when the data comes from CloudFront through the Kinesis data stream into Firehose, there's a minimum delay of one minute, because Firehose accumulates data for a minute and then pushes it to the next stage. So by the time it arrives here in S3, maybe a couple of minutes have passed; then the crawler is triggered, the workflow runs and puts the data in place, and you can do data integration with your other systems. The whole integration might take a few more minutes, and eventually it will appear in QuickSight. So let's say that within five minutes of somebody clicking on a product, you will see the data in QuickSight. The idea, in this case, is to see, maybe at the end of the day or maybe every hour after you have launched a product, which product is more popular, which one people are buying or viewing more, that kind of information. So you don't need to build any other analytics platform; you just use the clickstream data and view it in QuickSight, which can be used by the business users.
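Once the clickstream lands in S3 and the Glue crawler has catalogued it, the QuickSight side boils down to SQL over the crawled table via Athena. As a rough sketch, the same kind of question ("which product is being bought the most?") could be asked programmatically; the database, table and output location names below are hypothetical, not the ones from the demo.

```python
import boto3

athena = boto3.client("athena")

# Count buy clicks per product over the crawled clickstream table.
athena.start_query_execution(
    QueryString="""
        SELECT product_name, COUNT(*) AS buys
        FROM clickstream_events          -- hypothetical Glue table name
        WHERE status = 'buy'
        GROUP BY product_name
        ORDER BY buys DESC
    """,
    QueryExecutionContext={"Database": "conf42_demo"},                      # hypothetical
    ResultConfiguration={"OutputLocation": "s3://conf42-athena-results/"},  # hypothetical
)
```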
            
            
            
Now, with that, I will move to the browser to show you the entire architecture. We'll start with Kinesis Data Streams; I've already pre-created the stream, but I'll show you how to create it. Then we'll move to CloudFront and see how CloudFront is bound to the Kinesis data stream. So let me switch to the browser now.
            
            
            
All right. So I have Kinesis Data Streams here, and I've already created a data stream, conf42 website demo. I could create one now, but it takes a few minutes to deploy. It's very simple and straightforward: it basically asks you for the capacity of your Kinesis data stream, and the capacity is calculated from the average size of the records and the number of records coming in per second. So let's say you have ten records per second coming in and only one consumer reading the stream; it will then calculate how many shards you need. For more information about shards and everything else, you can go to the AWS documentation when you have time. So I've created that.
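The console's shard estimator follows from the documented per-shard limits (roughly 1 MB/s or 1,000 records/s for writes, and 2 MB/s for reads). A small sketch of that arithmetic, purely for illustration:

```python
import math

def estimate_shards(avg_record_kb: float, records_per_sec: int, consumers: int) -> int:
    """Rough shard estimate based on per-shard limits:
    1 MB/s or 1,000 records/s for writes, 2 MB/s for reads."""
    write_kb_per_sec = avg_record_kb * records_per_sec
    read_kb_per_sec = write_kb_per_sec * consumers
    return max(
        math.ceil(write_kb_per_sec / 1024),  # write bandwidth limit
        math.ceil(records_per_sec / 1000),   # write record-count limit
        math.ceil(read_kb_per_sec / 2048),   # read bandwidth limit
        1,
    )

# The talk's example: ~10 small records/s with one consumer -> a single shard.
print(estimate_shards(avg_record_kb=2, records_per_sec=10, consumers=1))
```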
            
            
            
Now let's move on to CloudFront. I have created a CloudFront distribution which is pointing to my S3 origin. This is my S3 origin; the behaviors are all pointing to the same S3 origin, which is basically a single-page website like this, as simple as that. I have Buy buttons and View Details buttons. That's pretty much it.
            
            
            
Now, having created the distribution, let's work with it. What I've done is go to the Logs section; under Logs you have real time configurations. A real time log configuration basically captures the clickstream: whatever request goes through CloudFront, it can capture it. Now let me walk through creating one, though I won't save it because I already created one; I'll just show you what information you need to provide when you create the configuration.
            
            
            
So: give it a name, and give a sampling rate. This is nothing but the percentage of the clickstream data you want to capture. Is it 50%? Is it 100%? Let's say I put 100%. Then, choose which fields of the request you want to capture. You can capture all of these fields that come as part of the HTTP request, but the most important one for us is the URI query parameters. Apart from that you can capture other information like the country, the port, IP addresses, et cetera. And then, what is the endpoint, where will the data get pushed? Remember, I created the KDS, the Kinesis data stream, conf 42 website demo, which is nothing but this one here in the other tab; that is the Kinesis data stream I'm connecting to. Once it is connected, okay, back to the CloudFront real time configuration: as I showed you, you can pick up whatever information you want from the requests coming in. We decided to go with a 100% sampling rate, meaning every incoming request is considered, and of course it is getting delivered to this Kinesis data stream which we already created earlier.
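For reference, the same real time log configuration can also be created programmatically. A minimal sketch with boto3; the role ARN, stream ARN and field list here are illustrative, not the exact values used in the demo:

```python
import boto3

cloudfront = boto3.client("cloudfront")

# Hypothetical ARNs; the IAM role must allow CloudFront to write to the stream.
cloudfront.create_realtime_log_config(
    Name="conf42-website-demo-rtlog",
    SamplingRate=100,  # capture 100% of requests
    Fields=["timestamp", "c-ip", "cs-uri-stem", "cs-uri-query", "c-country"],
    EndPoints=[{
        "StreamType": "Kinesis",
        "KinesisStreamConfig": {
            "RoleARN": "arn:aws:iam::123456789012:role/cloudfront-rtlog-role",
            "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/conf42-website-demo",
        },
    }],
)
```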
            
            
            
              to the architecture bank and see how much
            
            
            
              you have covered.
            
            
            
              I'll just pull together the architecture slides. Here you go.
            
            
            
              So here we have already have the s three bucket with the
            
            
            
              website. We have configured the cloud front, we have configured the
            
            
            
              data stream. So now the requests are all coming from,
            
            
            
              every request is going to the cloud front. The stream is also getting
            
            
            
              delivered to kinesis data stream. Now what we'll
            
            
            
do, we'll set up Kinesis Data Firehose so that it can consume those
            
            
            
              click stream records from Kinesis data stream and
            
            
            
then process them with Lambda and push them to the
            
            
            
S3 bucket. Let's go back to the
            
            
            
              browser.
            
            
            
So we have Kinesis Data Firehose. I've already created
            
            
            
one; let's see what information we have in that.
            
            
            
              So as always,
            
            
            
this is the source, which is, as we have seen in the architecture
            
            
            
diagram, the Kinesis data stream. We are getting the source from the Kinesis data stream.
            
            
            
              Then we are consuming that, transforming the records
            
            
            
              using a lambda function. And the lambda function is nothing but
            
            
            
              just taking the input, see all those attributes
            
            
            
which are coming from the request headers, and then massaging it,
            
            
            
              taking the product id, product name, status, et cetera. Very simple,
            
            
            
              straightforward, but you can have of course much more complicated lambda
            
            
            
function which can create different alerts, et cetera, if you want.
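
As a sketch of the kind of transformation Lambda described here (not the exact function from the demo), the example below assumes the real-time log fields arrive tab-separated in the order they were configured, and that the website sends hypothetical query parameters such as product_id, product_name and status.

```python
# Hedged sketch of a Kinesis Data Firehose transformation Lambda for
# CloudFront real-time log records. Field order and query parameter
# names are assumptions, not confirmed details of the demo.
import base64
import json
from urllib.parse import parse_qs

# Must match the Fields list in the real-time log configuration.
FIELDS = ["timestamp", "c-ip", "c-country", "c-port", "cs-uri-stem", "cs-uri-query"]

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Each Firehose record carries one tab-separated log line, base64 encoded.
        raw = base64.b64decode(record["data"]).decode("utf-8").strip()
        values = dict(zip(FIELDS, raw.split("\t")))

        # Massage the click stream: pull the interesting query parameters out.
        query = parse_qs(values.get("cs-uri-query", ""))
        row = {
            "timestamp": values.get("timestamp"),
            "country": values.get("c-country"),
            "product_id": query.get("product_id", [None])[0],
            "product_name": query.get("product_name", [None])[0],
            "status": query.get("status", [None])[0],
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(row) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```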
            
            
            
Coming back, I'm then putting that into
            
            
            
              the destination bucket which is conf 42 website demo
            
            
            
              September 2021. And then I'm
            
            
            
just creating a prefix for the bucket so that I know which
            
            
            
year, month, day and hour the request came through
            
            
            
              for my further analysis. And I have some
            
            
            
other configuration which, for the time being,
            
            
            
you can ignore, like encryption,
            
            
            
et cetera. Now I also have a separate
            
            
            
bucket for all the errors. In case a processing
            
            
            
error happens, it will go into that bucket.
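
For readers who prefer the API view, here is a hedged sketch of how a delivery stream like this could be wired up with boto3: the Kinesis data stream as the source, the transform Lambda as a processor, a year/month/day/hour prefix on the destination bucket, and a separate error prefix. All names and ARNs below are placeholders, not the demo's exact resources.

```python
# Hedged sketch: Firehose delivery stream with Kinesis source, Lambda
# transform, and a dated S3 prefix. Placeholders throughout.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="conf42-website-demo-firehose",   # placeholder name
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/conf42-demo",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-source-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::conf42-website-demo-september-2021",
        # Prefix objects by year/month/day/hour so downstream analytics can prune.
        "Prefix": "website/!{timestamp:yyyy}/!{timestamp:MM}/!{timestamp:dd}/!{timestamp:HH}/",
        # Records that fail processing land under a separate error prefix.
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:us-east-1:123456789012:function:clickstream-transform",
                }],
            }],
        },
    },
)
```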
            
            
            
Now let's
            
            
            
go to S3. Since I've already sent
            
            
            
some requests, my S3 bucket is already populated with
            
            
            
some of the records. If you see here, under the website prefix I have the year
            
            
            
              2021, and then I have month
            
            
            
              and then I have day ten and day eleven events
            
            
            
of September. And at 01:00
            
            
            
UTC I have requests that came in, and then at
            
            
            
02:00 UTC more records came
            
            
            
              in.
            
            
            
So now the data is all there in S3. Once it goes to
            
            
            
S3, if you remember, we have the Glue crawler.
            
            
            
              So I have a workflow already created here.
            
            
            
              And what it does, it actually goes
            
            
            
through the S3 bucket and creates a data catalog, a database catalog.
            
            
            
              So I have the database called Conf 42 demo,
            
            
            
              and I have the table which it
            
            
            
              has created called website. Because the
            
            
            
prefix of this
            
            
            
S3 bucket is website here. So it has picked that up. And
            
            
            
then, back to the console here, if you see in this table,
            
            
            
              it has actually picked up all those attributes
            
            
            
which are there as part of the request. It's all picked
            
            
            
up, and it has done partitioning based on year, month,
            
            
            
day and hour, so that your
            
            
            
analytics can run faster and do
            
            
            
the processing based on those partitions. And those are unique.
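
To make the partition pruning concrete, here is a small sketch of querying that table through the Athena API, assuming the crawler registered the partitions as year, month, day and hour as described, and that the Lambda output included a product_name column. The database, table and result-bucket names are written as illustrative identifiers, not the demo's exact ones.

```python
# Hedged sketch: an Athena query that prunes on the year/month/day/hour
# partitions. Identifiers and the results bucket are placeholders.
import boto3

athena = boto3.client("athena")

query = """
SELECT product_name, COUNT(*) AS clicks
FROM website
WHERE year = '2021' AND month = '09' AND day = '10'   -- only scan one day's partition
GROUP BY product_name
ORDER BY clicks DESC
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "conf42_demo"},                   # illustrative database name
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # placeholder results bucket
)
print(execution["QueryExecutionId"])
```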
            
            
            
              Now I have the data here,
            
            
            
              I have the tables created.
            
            
            
              Then I just need to trigger this workflow. What it
            
            
            
does is again crawl through the latest data
            
            
            
which has come into the S3 bucket and put it into the table.
            
            
            
So as visitors visit the website, the data keeps
            
            
            
coming in, and you can schedule your
            
            
            
              workflow to run as
            
            
            
              frequently as you want. And then every time the
            
            
            
              workflow runs, your s three data lake
            
            
            
              will get populated with the latest data and
            
            
            
then the same data we can pull from QuickSight.
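
As a hedged sketch of the same idea done through the API instead of the console, the snippet below triggers a Glue workflow on demand and sets up a scheduled trigger so the crawl runs as frequently as you want. The workflow, trigger and crawler names are placeholders.

```python
# Hedged sketch: triggering and scheduling the Glue workflow with boto3.
# Names are placeholders, not the demo's exact resources.
import boto3

glue = boto3.client("glue")

# On-demand run, roughly equivalent to pressing "Run" in the console.
run = glue.start_workflow_run(Name="conf42-website-demo-workflow")
print("workflow run id:", run["RunId"])

# A scheduled trigger can re-crawl as often as you like, e.g. hourly,
# so the data lake keeps picking up the newest click stream objects.
glue.create_trigger(
    Name="hourly-crawl",                                       # placeholder trigger name
    WorkflowName="conf42-website-demo-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",                              # every hour, on the hour
    Actions=[{"CrawlerName": "conf42-website-demo-crawler"}],  # placeholder crawler name
    StartOnCreation=True,
)
```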
            
            
            
But before I go to QuickSight, I just want to show you what
            
            
            
the workflow looks like. Right. So this is a typical workflow. It is crawling and then
            
            
            
it's populating the database from S3,
            
            
            
the tables from S3.
            
            
            
So I can see here that there were multiple runs
            
            
            
for this workflow. And if I open one of these, it tells you
            
            
            
every step has gone through successfully, including the last time I ran it here.
            
            
            
              All right, now let's go to Quicksight.
            
            
            
So, in QuickSight you can create a QuickSight dashboard
            
            
            
              from different sources. As you can see on the screen
            
            
            
out of the box. We could have done it from S3 directly, but the only
            
            
            
thing is, in that case we couldn't have done the data integration from different sources,
            
            
            
              whereas in our case, in Athena,
            
            
            
              you can get data from different sources, you can create
            
            
            
              a workflow, and then
            
            
            
              using the workflow, you can integrate that and create a real
            
            
            
              data set for your analytics tools to pick up.
            
            
            
So I'm not going to create one, because I already created the
            
            
            
              data set, and I'll show you the data set which I created.
            
            
            
              This is the data set I created.
            
            
            
It already has 34 rows in it.
            
            
            
I just refreshed it
            
            
            
about twelve minutes back. So let's
            
            
            
              see whether doing a refresh again,
            
            
            
it increases the row count. I am not sure whether
            
            
            
I did any more requests in between, but while that is happening,
            
            
            
              what I want to do is I want to show you how the data is
            
            
            
              actually coming into the
            
            
            
              quicksight. Right?
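
The refresh that was just triggered in the console can also be done programmatically. Here is a minimal sketch of kicking off a SPICE ingestion for the data set via boto3; the account ID and data set ID are placeholders, not the demo's.

```python
# Hedged sketch: refreshing a QuickSight SPICE data set with boto3.
# Account ID and data set ID are placeholders for illustration.
import time
import boto3

quicksight = boto3.client("quicksight")

ACCOUNT_ID = "123456789012"                     # placeholder AWS account id
DATA_SET_ID = "conf42-website-demo-dataset"     # placeholder data set id

ingestion = quicksight.create_ingestion(
    AwsAccountId=ACCOUNT_ID,
    DataSetId=DATA_SET_ID,
    IngestionId=f"manual-refresh-{int(time.time())}",  # unique id for this refresh
)
print(ingestion["IngestionStatus"])
```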
            
            
            
              So this is the data which you
            
            
            
              can see in the quicksight by
            
            
            
              day and hour. You can see the
            
            
            
dark blue is the data from the 10th, which came in on
            
            
            
10 September at 09:00, 10:00, 12:00 and 01:00,
            
            
            
whereas on the 11th, it is at 01:00
            
            
            
              and 02:00 in the afternoon. And then you can do
            
            
            
whatever dissection of the data you
            
            
            
              want. If you want product name, you can get the product name,
            
            
            
              which product was clicked more. You can
            
            
            
              do by hour as you have seen,
            
            
            
              or you can do by month. But in this case, there's only one month of
            
            
            
              data available on the 10th and 11th.
            
            
            
              You can see that. So that's pretty
            
            
            
              much. So if I go back to the presentation
            
            
            
for a moment.
            
            
            
So in summary, I think we have covered pretty much end to
            
            
            
              end. If you take a look from the left,
            
            
            
              the user is accessing the website which is
            
            
            
hosted in S3, through CloudFront. For every request
            
            
            
which goes through CloudFront, the request
            
            
            
log stream also goes to the Kinesis data stream. And the
            
            
            
Kinesis data stream is connected to Kinesis Data Firehose
            
            
            
as a consumer. And when
            
            
            
the records come to Firehose, the Lambda gets kicked in
            
            
            
for every record. And you can do anything you
            
            
            
              want with that record using the lambda.
            
            
            
And you can use that Lambda to communicate with further downstream
            
            
            
systems if you want, for any specific scenario you have. Or
            
            
            
you can send a notification via SNS or email; you can trigger an email, anything you
            
            
            
              want as such. So let's say you have a very high
            
            
            
              value product which people are clicking quite a lot but not
            
            
            
buying. So you can have a scenario where you count those clicks
            
            
            
and keep the count in a DynamoDB
            
            
            
table. When the count reaches a certain number,
            
            
            
Lambda can send a notification raising a concern.
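
A minimal sketch of that scenario is shown below: a counter kept in DynamoDB and an SNS notification once a threshold is crossed. The table name, topic ARN, threshold and helper name record_click are all illustrative, not part of the demo.

```python
# Hedged sketch of the "clicked a lot but not bought" alerting scenario.
# Table, topic and threshold are placeholders; record_click is a
# hypothetical helper that the transform Lambda could call per record.
import boto3

dynamodb = boto3.client("dynamodb")
sns = boto3.client("sns")

TABLE = "product-click-counts"                                   # placeholder table name
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:click-alerts"    # placeholder topic ARN
THRESHOLD = 100                                                  # illustrative limit

def record_click(product_id: str) -> None:
    # Atomically increment the per-product click counter.
    result = dynamodb.update_item(
        TableName=TABLE,
        Key={"product_id": {"S": product_id}},
        UpdateExpression="ADD clicks :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    clicks = int(result["Attributes"]["clicks"]["N"])

    # Raise a concern once the product has been clicked "too often" without a purchase.
    if clicks == THRESHOLD:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="High-value product clicked but not bought",
            Message=f"Product {product_id} reached {clicks} clicks without a purchase.",
        )
```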
            
            
            
Right, so you can do all those different scenarios. And then after that
            
            
            
it goes to S3. And again we set up a prefix
            
            
            
which is website, year, month, day and
            
            
            
hour. And you can do whatever you want to do there; you can define
            
            
            
that. Then it is picked up by the
            
            
            
Glue crawler job, and the crawler picks up the data from S3 based
            
            
            
on the frequency which has been set
            
            
            
up. And then eventually, through the Athena and QuickSight
            
            
            
integration, the data is available in QuickSight. So if you see the
            
            
            
entire end-to-end architecture, there was no coding involved; only the Lambda was
            
            
            
              used just to manipulate the
            
            
            
click stream coming in, for better clarity of the
            
            
            
data which goes through S3 to QuickSight. Apart from that, there's nothing
            
            
            
else. Effectively, you know the user
            
            
            
              behavior, how they are clicking the different
            
            
            
products, in QuickSight, in minutes.
            
            
            
That's pretty much it. Thank you for joining
            
            
            
              in. Good to have you guys.
            
            
            
              And if you want to learn more about AWS
            
            
            
analytics platform, there is QuickSight of course, and those are all available.
            
            
            
You can go to the AWS website and, based
            
            
            
on which area you want to focus, you can get additional information.
            
            
            
              Thank you once again. Have a wonderful day.