Conf42 Internet of Things (IoT) 2025 - Online

- premiere 5PM GMT

IoT Data Lakehouse: Scalable Architectures for Connected Device Analytics

Abstract

Learn how IoT Lakehouse architectures solve massive sensor data challenges by unifying streaming telemetry, device logs, and analytics in scalable platforms that power predictive maintenance, smart city monitoring, and connected device intelligence.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Bhanudeepti Chinta. I'm a data engineering leader who has worked on large-scale data platforms across cloud, analytics, and machine learning for over 14 years. Over the years, I've worked extensively with high-volume, high-velocity data systems, and one pattern I keep seeing is that organizations don't fail at IoT because they can't collect the data. They struggle because the cost and complexity of managing IoT data grows faster than the value they extract from it. Today I want to talk about how we can rethink IoT data warehousing using modern lakehouse architecture and Snowflake feature stores, not just to scale technically, but to fundamentally change the cost equation.

IoT data platforms today are under immense pressure to do more than just scale. They need to scale economically as connected devices continue to grow across industries like manufacturing, transportation, utilities, and smart cities. Organizations are realizing that traditional data architectures were not designed for continuous telemetry at this scale. In this talk, I'll walk through the architectural patterns and real-world approaches that help teams reduce cost while still enabling real-time analytics, machine learning, and business insights.

First, the IoT data challenge. IoT ecosystems generate data at unprecedented scale and velocity. Sensors stream telemetry continuously, often at millisecond intervals, and within months this data can reach petabyte scale. At the same time, this same data must support very different consumers: operations teams want real-time dashboards, data scientists need historical data for modeling, and business users want aggregated insights. Supporting all of these workloads simultaneously is where many IoT platforms begin to strain.

Coming to the unique characteristics of IoT data: what makes it different is the combination of high velocity, massive volume, and constant schema evolution. Device changes and firmware updates introduce new fields, and different vendors produce inconsistent telemetry formats. Any successful IoT architecture must handle continuous ingestion, adapt to schema changes gracefully, and retain large historical datasets without repeatedly reprocessing or rewriting the data.

So how do we bridge the gap? With lakehouse architecture. Traditional data warehouses deliver strong performance and consistency, but they quickly become expensive and inflexible given the semi-structured nature of IoT data. Data lakes are cost effective, but they lack reliability, governance, and transactional guarantees. Lakehouse architecture bridges this gap by combining the best of both worlds: warehouse-grade reliability on top of low-cost cloud object storage. For IoT, this isn't just an architectural improvement, it is a cost containment strategy.

Coming to the core lakehouse technologies: Delta Lake, Apache Iceberg, and Apache Hudi bring transactional semantics, schema evolution, and efficient metadata management to cloud storage. These capabilities are critical for IoT workloads, where data is continuously appended, occasionally updated, and queried across long time ranges. Without these features, IoT platforms tend to accumulate hidden operational and storage costs over time.
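To make the table-format point concrete, here is a minimal PySpark sketch of a transactional telemetry table with in-place schema evolution. This example is not from the talk: it assumes the open-source delta-spark package, and the table and column names are illustrative.

```python
# A minimal sketch of a transactional IoT telemetry table, assuming PySpark
# with the open-source delta-spark package; names are illustrative.
from datetime import date, datetime

from pyspark.sql import Row, SparkSession

spark = (
    SparkSession.builder.appName("iot-telemetry")
    # Wire in Delta Lake's transactional table format.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Raw telemetry, partitioned by event date to match time-oriented IoT queries.
spark.sql("""
    CREATE TABLE IF NOT EXISTS telemetry_bronze (
        device_id  STRING,
        event_ts   TIMESTAMP,
        event_date DATE,
        payload    STRING  -- raw JSON, preserved exactly as received
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# A firmware update adds a battery_pct field; mergeSchema evolves the table
# in place instead of forcing a rewrite of historical data.
batch = spark.createDataFrame([Row(
    device_id="sensor-42",
    event_ts=datetime(2025, 1, 15, 9, 30),
    event_date=date(2025, 1, 15),
    payload='{"temp_c": 21.5}',
    battery_pct=87,
)])
(batch.write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .saveAsTable("telemetry_bronze"))
```

The mergeSchema append is the cost lever the talk highlights: a new field from a firmware update lands without reprocessing or rewriting the historical partitions.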
So what does the lakehouse physical architecture look like? It separates storage, table formats, and compute. Cloud object storage provides virtually unlimited, low-cost capacity. Table formats manage the metadata and transactional behavior. Compute engines like Spark or Trino scale independently to process heavy workloads. This separation is essential for IoT platforms because it allows teams to scale compute only when needed, instead of paying continuously for fixed infrastructure.

Coming to the architectural considerations: IoT platforms must be designed with streaming at the core. Data typically flows through Kafka, Kinesis, or Event Hubs, and is processed in real time using Spark or Flink. Cost efficiency comes from decisions like time-based partitioning aligned to IoT's temporal nature, lifecycle management across hot, warm, and cold storage tiers, and accommodating schema evolution without rewriting historical data. In practice, many IoT cost overruns stem from poor decisions in these areas.

The medallion architecture is where these issues get resolved. It provides a structured way to manage IoT data. Bronze tables capture raw sensor data exactly as received, preserving fidelity for auditing and replay. Silver tables apply validation, normalization, and standardization, and gold tables expose business-ready datasets optimized for reporting and dashboards. This layered approach is powerful because it reduces redundant processing and ensures that higher-cost compute is applied only where business value exists.

Coming to real-time and batch integration: IoT platforms must support both. Instead of maintaining separate systems, modern architectures increasingly adopt a Kappa-style approach, where a single streaming pipeline serves both real-time and batch use cases. Incremental processing and materialized views reduce operational complexity and significantly lower long-term cost.

Machine learning integration is where we talk about feature engineering. Machine learning is where IoT data delivers significant value, but it's also where cost quietly explodes. Feature engineering often gets duplicated across teams and pipelines. The Snowflake feature store addresses this by enabling reusable, governed features that remain consistent across training and inference. This reduces compute cost, improves model reliability, and accelerates experimentation.
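The talk doesn't include code, but the feature-reuse point can be roughly sketched with the snowflake-ml-python feature store API. Treat everything below as illustration: the database, warehouse, and 24-hour device-health features are invented, and API details may vary across library versions.

```python
# A rough sketch of reusable, governed device features, assuming the
# snowflake-ml-python feature store API; all names here are hypothetical.
from snowflake.ml.feature_store import (CreationMode, Entity, FeatureStore,
                                        FeatureView)
from snowflake.snowpark import Session

# Connection parameters elided; fill in real credentials.
session = Session.builder.configs({
    "account": "...", "user": "...", "password": "...",
}).create()

fs = FeatureStore(
    session=session,
    database="IOT_DB",                 # hypothetical database
    name="TELEMETRY_FEATURES",         # schema backing the feature store
    default_warehouse="ANALYTICS_WH",  # hypothetical warehouse
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

# One shared entity, so every team joins device features the same way.
device = Entity(name="DEVICE", join_keys=["DEVICE_ID"])
fs.register_entity(device)

# Rolling 24-hour aggregates, defined once and refreshed on a schedule,
# so training and inference read identical feature values.
feature_df = session.sql("""
    SELECT device_id,
           AVG(temp_c)    AS avg_temp_24h,
           MAX(vibration) AS max_vibration_24h
    FROM telemetry_silver
    WHERE event_ts >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
    GROUP BY device_id
""")

device_health = FeatureView(
    name="DEVICE_HEALTH_24H",
    entities=[device],
    feature_df=feature_df,
    refresh_freq="1 hour",  # Snowflake keeps the view refreshed
)
fs.register_feature_view(device_health, version="1")
```

Registering the aggregates once is what removes the duplicated feature pipelines the talk describes: each team reads the governed view instead of re-deriving its own copy.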
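Likewise, before the case studies, here is a rough sketch of the bronze-to-silver hop from the medallion discussion above, written as a single Kappa-style Spark Structured Streaming job. The Kafka topic, telemetry schema, and validation thresholds are illustrative, not from the talk.

```python
# A rough sketch of the bronze-to-silver medallion hop as one Kappa-style
# streaming pipeline, assuming Spark Structured Streaming with Kafka and
# Delta Lake; topic, schema, and thresholds are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = (
    SparkSession.builder.appName("iot-medallion")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

telemetry_schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("temp_c", DoubleType()),
])

# Bronze: land raw Kafka records exactly as received, for audit and replay.
bronze = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "device-telemetry")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload",
                      "timestamp AS ingest_ts"))

# Silver: parse, validate, normalize, and partition by event date so
# time-range queries stay cheap.
silver = (bronze
          .withColumn("parsed", F.from_json("payload", telemetry_schema))
          .select("parsed.*", "ingest_ts")
          .where(F.col("device_id").isNotNull())
          .where(F.col("temp_c").between(-60.0, 120.0))  # drop sensor glitches
          .withColumn("event_date", F.to_date("event_ts")))

(silver.writeStream.format("delta")
       .option("checkpointLocation", "/chk/telemetry_silver")
       .partitionBy("event_date")
       .outputMode("append")
       .toTable("telemetry_silver")
       .awaitTermination())
```

Because the same streaming table serves dashboards and batch backfills alike, there is no second pipeline to build or pay for, which is the Kappa-style saving the talk points to.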
There are a few case studies I've included here that tie directly into this scalable architecture. In manufacturing environments, IoT lakehouse architecture enables predictive maintenance by analyzing sensor telemetry for early failure signals. Instead of reacting to equipment breakdowns, organizations can plan maintenance proactively. This reduces downtime, avoids emergency repairs, and leads to measurable operational savings.

Another case study that supports this is connected vehicle fleet optimization. For connected vehicle fleets, real-time telemetry supports route optimization, driver behavior scoring, vehicle health monitoring, and demand prediction. These insights directly improve safety, efficiency, and utilization while reducing fuel costs and unplanned downtime.

Smart city infrastructure is another strong case, where data is used to optimize traffic flow, monitor environmental conditions, and enable open data initiatives. A unified lakehouse platform allows governments to manage diverse data sources and schemas while maintaining governance, reliability, and cost control.

Utility meter analytics is another beautiful case study. It ingests smart meter data from millions of households. Lakehouse platforms handle the schema heterogeneity, compress the time-series data, and enable advanced analytics such as energy theft detection, peak demand forecasting, and grid optimization, all while keeping infrastructure costs manageable.

Looking ahead, IoT data platforms will evolve toward lower-latency streaming, warehouse-class time-series performance, stronger governance, and edge analytics that reduce bandwidth and centralized processing costs. The most successful platforms will be those that are not just data-aware, but cost-aware by design.

To close: IoT success is not about collecting more data, it's about designing systems that align cost with value. When built correctly, IoT data platforms don't just scale, they become sustainable strategic assets that power real-time decisions and long-term innovation. Thank you, and I'm open to any questions you have.
...

Bhanudeepti Chinta

Engineering Manager, Data @ Dave

Bhanudeepti Chinta's LinkedIn account


