Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, Tim Spann here, senior solutions engineer.
My talk today is on Smart Cities Unleashed.
So I've worked with a lot of data companies, worked with lots of
different types of data in motion, sensors, all types of stuff.
Now I'm working with Snowflake, NiFi, Iceberg, Polaris, Streamlit,
Casey, some Flink and Kafka, some other cool stuff in there.
Every week I put out a free newsletter in a lot of different formats.
Click or scan, get there, get lots of free content.
Never a charge.
You can go on GitHub and find it.
Pretty much every week, unless there's something going on that interrupts it.
But we're pretty consistent.
You've got to look at all the back issues.
Today we're gonna cover a couple of things.
We'll do an intro, some overview, look at different types of data.
Show you a lot of NiFi.
If everything's running (I'm recording this on a Sunday), we'll see if the demo
works, and I'll give you some resources so you can continue on your path to
unlocking your data for smart cities.
You require a team of products built around a solid data platform.
Things like Python, NiFi, and Iceberg help you solve these complex problems,
whether the data is structured, semi-structured, or unstructured.
We'll find all these formats when we're working with cities.
Because you've got things like cameras, documents, standard static data, and
things like transit data, which is often in the GTFS format, a binary format based on Google protocol buffers.
This is used by most transit systems around the world.
Fortunately they're very consistent, and the MobilityData people keep a good list of them.
We can actually automate grabbing all the transit systems in the world, but people tend to work on one city for their smart city.
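To make that concrete, here is a minimal Python sketch of reading a GTFS-realtime feed with the gtfs-realtime-bindings package; the feed URL is a placeholder, not a feed from the talk.

```python
# Hypothetical sketch: read a GTFS-realtime vehicle positions feed.
# Assumes `pip install gtfs-realtime-bindings requests`; the URL is a placeholder.
import requests
from google.transit import gtfs_realtime_pb2

FEED_URL = "https://example-transit-agency.example/gtfs-rt/vehicle-positions"  # placeholder

feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(requests.get(FEED_URL, timeout=30).content)

# Each entity may carry a vehicle position with lat/long and a timestamp.
for entity in feed.entity:
    if entity.HasField("vehicle"):
        v = entity.vehicle
        print(v.vehicle.id, v.position.latitude, v.position.longitude, v.timestamp)
```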
I put together an example architecture and did a little of the work on this.
We'll see if people are interested and whether we want to go further on this.
So we've got data from the real world.
I can have NiFi running on edge devices, Kafka feeds coming in from all over.
You're going to be working with different providers and open source groups and communities.
When you're getting this data, it's going to be living on real machines; it could be on moving bicycles, trucks, lots of stuff.
Data sources we've looked at recently: sensors in JSON format, transit data
in GTFS, traffic data in JSON, some raw data we might have to clean up.
This could be documents and other things, images.
And fortunately, once we get this into Snowflake with Cortex AI, we can do a
lot of the complex processing, regardless of whether that's documents or whatever.
We've got Kafka dropping into Snowpipe.
We've got raw data.
We're getting data from our marketplace, including free weather
data we could use for stuff.
Got a couple quick Snowsight charts and queries there just to show
stuff, but we could do it in Jupyter Notebooks or whatever.
If we want to, we could store everything in just an S3 data lake with Iceberg.
Still have it part of this big puzzle.
Again, we're talking about semi-structured data.
It could be things like air quality data from OpenAQ, which is important depending on your city.
We've got data about location, time, different sensors.
Sometimes this data is coming from other systems.
Could be Avro, Parquet, ORC, all Apache formats.
JSON or XML, very common.
I'm getting some RSS feeds, which is a type of XML.
Hierarchical data.
So this is complex.
We might drop it in RAW.
But to query it, it's much better in a more traditional table format.
We could do some transformation.
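As a rough illustration of that transformation, here's a hedged Python sketch: land the JSON in a VARIANT column and then query it in tabular form with LATERAL FLATTEN. The connection values, table, and field names are made up for the example.

```python
# Hypothetical sketch: land raw JSON in a VARIANT column, then query it in tabular form.
# Table and column names are made up; connection values are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="demo_wh", database="demo_db", schema="demo",
)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS raw_air_quality (payload VARIANT, loaded_at TIMESTAMP_NTZ)")

# Pull nested fields out of the VARIANT payload so they query like ordinary columns.
cur.execute("""
    SELECT
        payload:location:city::STRING   AS city,
        m.value:parameter::STRING       AS parameter,
        m.value:value::FLOAT            AS reading
    FROM raw_air_quality,
         LATERAL FLATTEN(input => payload:measurements) m
""")
for row in cur.fetchall():
    print(row)
```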
Maybe getting logs, see what's going on in other systems.
Key value data is everywhere.
Now in the new world of AI, we've also got unstructured data.
And this is a ton of different formats.
Lots of different text, documents, PDFs, spreadsheets; there are just thousands.
Some more fun ones: images, which we grab off cameras, same with video and audio.
Again, a lot of things happen for smart cities where you may be tracking what's going on, seeing if there are disturbances.
You could also have to take email from different sources and process
that, so it's available and I know what's going on in the city
with regulations and other stuff.
Fortunately with Snowflake, we can put all these weird formats in the
VARIANT type and be able to use them.
Again, grabbing city cameras: a huge source of data.
And there's a lot of value in there, especially when we start doing analytics
on it and seeing what's in there.
It could help us know, are the buses on time?
What's slowing them down?
Is it people double parking?
Those sorts of analytics we could start doing.
Now, the most fun and the most valuable for the future is once we
get this into a structured format.
Whether that's a Snowflake table, or Snowflake hybrid tables, which are pretty cool; they let us do transactional kinds of stuff on there too.
An Iceberg table is a great way to share with different systems, companies, what have you.
There are the old-style relational tables from all those old vendors.
Of course, there's new relational stuff like the newest versions of PostgreSQL and all their 10,000 variants.
And of course, structured files like comma-separated values or tab-separated.
There are a lot of variations there.
Some are closer to semi-structured than fully structured, but
it depends on what you have.
With Iceberg, we can start appending data as it comes in, whether that's
through Snowpark with Python or Java or Scala, or through NiFi with
the PutIceberg processor, depending on which version of NiFi you have.
Get that data in there.
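Here's a small, hypothetical Snowpark Python sketch of that append pattern; the connection settings and the CITY_EVENTS table are placeholders, not the exact tables from the talk.

```python
# Hypothetical sketch: append incoming records to a table (Iceberg or native) with Snowpark Python.
# Connection parameters and the table name are placeholders.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "my_account", "user": "my_user", "password": "...",  # placeholders
    "warehouse": "demo_wh", "database": "demo_db", "schema": "demo",
}).create()

rows = [("abc-123", "Newark", 40.7357, -74.1724)]
df = session.create_dataframe(rows, schema=["EVENT_ID", "CITY", "LAT", "LON"])

# mode("append") adds rows without replacing what is already in the table.
df.write.mode("append").save_as_table("CITY_EVENTS")
```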
Just to talk real quick about the Snowflake AI Data Cloud: in the middle are all these different workloads; it's not just the data warehouse, data lake stuff.
Unistore is cool; we can do, like I said, that hybrid approach of analytical and relational-style data in the same table.
So we can do really fast stuff, get that data available for reports right away.
And, like always, data engineering right in the center.
Python, Java, busting out that code.
Maybe AI helped you write it or made your code better.
Still need that data engineering.
Adding, NiFi, OpenFlow there.
Apps, this is really cool.
With Snowflake, I could write some apps in Python that are really solid
and share them with other companies.
And of course, AI and ML are just solving a lot of problems.
And what's cool is, if I build these apps and they become useful, I could just put
them on the marketplace and either give them away, in a sense of community,
or maybe sell them and make some money.
And everything is governed, so even when I'm collaborating, I don't have to worry
about losing data or something happening.
You control all that, very cool.
The architecture itself is pretty amazing.
With Snowgrid, we could share this too, you know, across different clouds and different cloud regions around the world.
It's a great way to be able to collaborate and still lock down data if
it's required by governments.
All the cloud services make sure our metadata is managed.
We have full administration, optimization, security, data catalog, all of that there.
And as much of it as can be automated is, so you don't have to worry about that.
And then we've got AI from Cortex as the next layer, where you can work with chat APIs.
There are model registries with a ton of models in there, but they're validated, so this is good, and we've got security everywhere, so none of these models are messing up your stuff.
And there's the studio to make it really easy; you don't have to be a data scientist to be able to work with these models.
And you need to be able to run apps, so we've got elastic compute.
You can do serverless things, you can do things on demand or wherever you need it, whether that's on a CPU or a GPU.
If it's a more complex app, we support running containers in a controlled, secure environment, so you can deploy your apps there.
Again, there's support for the major languages: you can do almost everything in SQL, but of course Python and Java and Scala are there.
And there's as much storage as you need for all these types of data that it ingests, whether that's in the cloud, managed, unmanaged, or even touching your on-premise systems if you want to set up those connections.
And next up we talk about Apache NiFi.
NiFi is an open source project that's been taken in-house by Snowflake to improve
and make enterprise-secure and ready.
If you haven't seen NiFi before, you might have seen some of my other talks.
I really like NiFi.
It makes things a lot easier for doing some of the mundane data engineering
and data tasks that, you might not want to have to do every time.
But it's got all the things you need for enterprise wide systems.
It guarantees you get your data.
There's buffering with back pressure, and you can control that
so things don't get shut down.
If you need to slow things down to make it more controlled for downstream systems
or to just save money, you can do that.
You can prioritize queuing of data through the system.
You can control per data flow in a data engineering pipeline all the quality
of service things you need to control.
There's latency, how much loss you can tolerate, throughput, error handling, those kinds of things.
But we let you know everything that happened along the way
with full data provenance.
This lineage of your pipeline lets you know, for each individual row of data or record, or flow file as we call it, that comes through the system, everything that happened to it: size, all kinds of metadata.
Very cool.
And we can push data, pull data, schedule data.
Any kind of process you need to do is in here.
Including, if it's not there, you could write your own in
Java or Python and deploy it.
There's hundreds of different processing abilities there, whether that's
opening things, closing, storing, conversion, transformation, enrichment.
You'll see it's all visual.
Very nice.
It's very easy to start off with templates that are available on the internet
or through your own company, and then just go.
There are lots of different security options, whichever one makes sense for your organization.
Designed to be extended.
It clusters very easily.
It scales very linearly.
It supports full version control.
You're doing real development here.
It's not some throwaway drag and drop tool.
What's even better in today's world, I can move all that binary
unstructured image data plus tabular data and understand it natively.
This tool doesn't just work with JSON files.
It's not a stretch to push images through here.
We've been doing it forever.
Zip files of images.
Zip files of data.
Whatever you want to put through there.
Enrich it.
You can do very cool, simple event processing.
This isn't Flink, but I can do a lot here.
It's great for routing to multiple places, depending on what the data is.
I need to send data to seven different places, depending on certain scenarios?
No problem, no code.
Feed this right into some kind of central messaging system, or just
land it in your AI Data Cloud.
All the modern protocols are there.
Kafka, Pulsar, all those fun things.
With the new 2.0 that's recent, it is designed for real-time integration and AI.
Python has been improved, so you can very easily add Python components.
I'll show you a couple I built.
Everything's parameterized.
We work with the latest JDK, so it's faster, more durable, all
those new features in there.
Got a rules engine to help you.
Things for all the different clouds, some new Azure ones.
And it works with a lot of those SaaS systems you might like to work with, like Zendesk
and Slack and Salesforce and you name it.
I can use an existing database table as my schema to control data loading.
Or I can work with AWS Glue or Confluent, a ton of different registries.
There's support for OpenTelemetry if you're using that.
The architecture is pretty standard for how these sorts of clustered
systems work, but it's very scalable, very survivable, very solid.
I've done real-time demos with this for eight years and have people in
production that are mission critical.
If you haven't heard, the original NiFi came out of the NSA and they use this
for extremely mission critical systems in live environments out in the field.
Provenance, we mentioned that: lineage, we track everything.
You can visually look at it, you can see all the metadata.
This is really helpful, not just for debugging, but to see if data got lost,
what's going on in the real world.
Lots of reasons why you want to know.
Again, we mentioned all that unstructured data, and the ability to do a lot
of things with it: identify it, chunk it, store it places, send it.
Dissect it, convert from HTML to text, extract pieces, uncompress,
compress, all that kind of fun stuff.
Call different ML models, whether that's through REST or through a native API
or through Python, whatever it is.
For all that more structured data out there, the ability to work with all
kinds of record oriented data very easily, including pulling apart logs,
syslogs, working with all kinds of standard data out there, being able to
do that without knowing all the fields.
So fields change, versions of things change, you don't
have to change your code.
Pretty amazing.
One record at a time.
Batches of records.
Micro batches.
Whatever makes sense for your use case.
Like I mentioned, I've extended it with some Python processors of my own.
This one pulls out your company list for any kind of data you push through it.
So say you've got a document.
You want to see what kind of companies are referenced.
This one does it for you.
Certainly you can write your own.
This is just using some standard spaCy, PyTorch, and NLP to do it.
You could use other AI models if you want.
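For a feel of what one of these looks like, here is a rough sketch in the shape of a NiFi 2.x Python processor that pulls organization names out of flow file text with spaCy. It's illustrative only, not the author's actual processor, and it assumes the spaCy model is available to the runtime.

```python
# Hypothetical sketch of a NiFi 2.x Python processor that tags ORG entities with spaCy.
# Not the talk's actual processor; assumes the en_core_web_sm model is installed.
import json
import spacy
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult


class ExtractCompanyNames(FlowFileTransform):
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1"
        description = "Extracts organization names from flow file text using spaCy NER."
        dependencies = ["spacy"]  # installed by NiFi for the processor's environment

    def __init__(self, **kwargs):
        super().__init__()
        self.nlp = spacy.load("en_core_web_sm")  # assumed to be available

    def transform(self, context, flowfile):
        text = flowfile.getContentsAsBytes().decode("utf-8", errors="ignore")
        orgs = sorted({ent.text for ent in self.nlp(text).ents if ent.label_ == "ORG"})
        # Replace the content with the company list and add a count attribute.
        return FlowFileTransformResult(
            relationship="success",
            contents=json.dumps({"companies": orgs}),
            attributes={"company.count": str(len(orgs))},
        )
```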
I've got one that'll caption your images as they come through the pipeline,
using Salesforce's pretty awesome BLIP image captioning.
They've been improving this one recently, so I probably have to take a look
at it again and see if there's a better model to pick.
Standard image classification with ResNet-50.
Pretty easy one.
This one comes in handy a lot.
It uses the Nominatim library to get your addresses into lat/long.
This is helpful with all the geo abilities in Snowflake and other systems.
When I'm working with cities, you need to know where you are in the city.
So sometimes you have a lat/long in the data, sometimes you don't.
This is one of the libraries to help you with that.
You could also go the other way.
We can also map IPs.
Again, in a smart city you often have different Wi-Fi or
different networking protocols that we can map to where things are.
That is an important type of data we need with a city.
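A minimal sketch of that address-to-coordinates step, using geopy's Nominatim wrapper; the address and user agent are placeholders.

```python
# Hypothetical sketch: turn an address into lat/long (and back) with geopy's Nominatim wrapper.
# Nominatim's usage policy requires a descriptive user_agent; values here are placeholders.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="smart-city-demo")

# Forward geocoding: address to coordinates.
loc = geolocator.geocode("Performing Arts Center, Newark, NJ")
if loc:
    print(loc.latitude, loc.longitude)

# Reverse geocoding: coordinates back to an address.
place = geolocator.reverse((40.7357, -74.1724))
if place:
    print(place.address)
```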
Is this enough data?
Let's see if we can get into the demo.
I'm hoping I didn't talk so long that I've timed everything out; that has happened before.
Maybe a little long-winded here, sorry about that.
I've been able to load some of my data into a Snowflake table, and
there are a lot of different ways to do that; however you like to do it,
there is a connector there for you: ODBC, JDBC, Python, Spring, everything.
So we've got some data in.
This is from the TRANSCOM system, which is a regulatory body that handles New
Jersey, New York, and Connecticut: three big metropolitan areas over here and a lot of data.
This says things like what's going on with utility work, and at what location.
Again, it's important with your smart city to know: if you want to
send a robot somewhere, don't send it where there's already traffic
or, you know, a road down.
And what's nice with Snowflake, I can use any of the Python libraries, but we've got one with
Streamlit here that can show me where there are some problematic spots.
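Something like this Streamlit sketch could produce that map; the table and column names are assumptions, not the exact ones used in the demo.

```python
# Hypothetical sketch of the Streamlit view: plot incident lat/longs from a Snowflake query.
# Table and column names are placeholders; connection values are too.
import pandas as pd
import snowflake.connector
import streamlit as st

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="demo_wh", database="demo_db", schema="demo",
)

df = pd.read_sql("SELECT lat, lon FROM transcom_events WHERE lat IS NOT NULL", conn)

st.title("Problem spots")
# st.map expects lowercase lat/lon columns; Snowflake returns uppercase names.
st.map(df.rename(columns=str.lower))
```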
And as you can see, there's a lot going on in Manhattan, which is not surprising.
But just to give you a quick look at that, let's see if our NiFi is still running here.
We are very security conscious, so things will time out to keep us secure.
So I've got a transit feed from that TRANSCOM system I mentioned.
Now, I don't have this running continuously because this is just a demo.
In your smart city, you may want this running pretty frequently, down to the minute; it depends.
Again, we mentioned that TRANSCOM feed, so we're going to run it once.
This is the NiFi UI.
This shows me live things running.
Here is how I get an HTTP feed, which is the TRANSCOM data here.
I'm going to take that XML data, that RSS feed, and convert it into JSON format, which I like better.
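Outside of NiFi, the same XML-to-JSON conversion looks roughly like this in Python; the feed URL is a placeholder, and in the demo this step is handled by NiFi processors.

```python
# Hypothetical sketch of the conversion outside NiFi: RSS (XML) items to JSON records.
# The feed URL is a placeholder.
import json
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/transcom/events.rss"  # placeholder

with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
    root = ET.fromstring(resp.read())

records = []
for item in root.iter("item"):
    # Flatten each RSS <item> into a simple dict of tag -> text.
    records.append({child.tag: (child.text or "").strip() for child in item})

print(json.dumps(records, indent=2))
```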
We can look at that lineage and see what just came by.
And we can look at the details of that and see all the metadata.
You can see what we called and how we called it.
And then we can actually look at the content that came through as well.
And as you can see we converted it into JSON.
And it's got all this data about what's going on.
There, just to give you an idea when we go through this system,
we could see what's happening.
I'm splitting out the individual records, and here I'm just making
sure it's a clean format, and I'm going to add a timestamp and a unique ID.
NiFi generates unique IDs for me, and then I'm also going to add the lat/long.
There's a point field in there, if you looked at those records closely when
they were coming through, and it's in a nice format, but really, everybody else uses lat/long.
As you saw in that Snowflake chart here, I wanted lat and long.
I just parse that out here and do that quick, very simple, fast enrichment as
the data is traveling through the system.
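That enrichment step, sketched in plain Python for illustration; the field names (point, latitude, longitude) are assumptions about the record shape, not the exact TRANSCOM fields.

```python
# Hypothetical sketch of the per-record enrichment: add a timestamp, a unique ID,
# and split a "lat,long" point string into separate fields. Field names are assumptions.
import uuid
from datetime import datetime, timezone

def enrich(record: dict) -> dict:
    record["event_ts"] = datetime.now(timezone.utc).isoformat()
    record["uuid"] = str(uuid.uuid4())
    point = record.get("point", "")          # e.g. "40.7357,-74.1724"
    if "," in point:
        lat, lon = point.split(",", 1)
        record["latitude"] = float(lat)
        record["longitude"] = float(lon)
    return record

print(enrich({"point": "40.7357,-74.1724", "description": "Utility work"}))
```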
And then I'm sending it into another component to break it down further, and sending that into Snowflake.
Now I don't know if I still have my VPN connection open.
Snowflake is perhaps the most secure data platform I've ever seen.
And I've worked with a lot.
So it does have a lot of layers of security there.
So sometimes that requires some extra work.
But, as we see here, it comes in.
We keep getting more.
I can bring in another window over here.
We can do a query, see what's going on with our data platform here, and we've got a bunch of data.
So I'm just putting this in a database, one of the ones I have here, just a
demo one, because this is just a demo.
You can see how we created the table, pretty simple, and we can see what's going on.
Nothing too complex, but we can see that more records are coming in.
And it's a pretty straightforward system here.
One thing I didn't show you is the descriptions; you get a longer one.
Again, I can use Cortex AI in a query and build this up for RAG, or ask questions on this:
what's going on with the Performing Arts Center, and be able
to get those questions answered.
It's pretty nice here.
I can create a bunch of stuff right from here and just query it if we need to.
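A hedged example of what a Cortex AI call in a query could look like, using the SNOWFLAKE.CORTEX.COMPLETE function; the table, column, and model names are placeholders, and model availability varies by region.

```python
# Hypothetical sketch: call Cortex AI from SQL to summarize long incident descriptions.
# Table, column, model, and connection values are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="demo_wh", database="demo_db", schema="demo",
)
cur = conn.cursor()

cur.execute("""
    SELECT description,
           SNOWFLAKE.CORTEX.COMPLETE(
               'mistral-large',
               'Summarize this traffic incident in one sentence: ' || description
           ) AS summary
    FROM transcom_events
    LIMIT 5
""")
for description, summary in cur.fetchall():
    print(summary)
```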
But let's see what's going on with my run; I'm trying to see if this timed out.
It did not time out.
So we're loading more data in there, but that's just one type of system
that makes sense to be reading.
We also grab things from other sources: transit data, and this one's Ireland.
I did some work for Irish Rail there.
And this is pulling in a GTFS feed.
If we look here, I kept a copy just in case I needed to reload
it and their system was down; I did this a couple of days ago.
If we look at this, it is a very large file that contains a bunch of zip files,
which we uncompress and unpack.
Let's see if we still have our things here.
And you can see all those files we pull out.
And then I split them up and assign a primary key.
Then I can update a lookup table here.
So the first time I load it, it's automated.
There's no hard coding other than that they all have different primary keys,
so I've got to set those; some places you don't need to.
Pretty straightforward.
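Here is a rough Python sketch of that unpack-and-key step for a static GTFS zip; the file name is a placeholder and the surrogate key column is an assumption for illustration.

```python
# Hypothetical sketch: unpack a static GTFS zip and assign a surrogate primary key
# to each record before loading. The local path is a placeholder.
import csv
import io
import uuid
import zipfile

GTFS_ZIP = "irish_rail_gtfs.zip"  # placeholder

with zipfile.ZipFile(GTFS_ZIP) as z:
    for name in z.namelist():           # e.g. routes.txt, stops.txt, trips.txt
        if not name.endswith(".txt"):
            continue
        with z.open(name) as f:
            reader = csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig"))
            for row in reader:
                row["surrogate_id"] = str(uuid.uuid4())   # added primary key
                # load `row` into the matching table here
        print(f"processed {name}")
```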
Again, I could pull things out of Kafka, don't have to know anything specific.
And right here, I'll just say, oh, it's a JSON file, and I'm going
to insert it into, say, Oracle.
And it's going to be an insert.
Very easy to do that.
I don't have to know all the fields or what have you.
Pretty straightforward.
I'm not going to run through a million things here.
But let's get back to the slides.
We don't need all these running.
Let's get to our slides, and we'll get back to it.
Hopefully that was enough of a demo there to give you a little feeling
of what we can do with the different data around smart cities, and do
that very quickly and agilely.
A tool like NiFi lets me build all these data pipelines without having to
manually test, run, and deploy things or check the code; it's very easy in
that UI, very easy to run this.
I've got links to a number of articles I've written around different data
related to smart cities, whether that's things like real-time
buses, trains, planes, air quality, street cameras, what have you.
And what's cool is, from Brazil, from Boston, from Halifax, all
over the world, it's the same stuff.
If you want to try NiFi on your own, you can just download it; it is a Java app.
Or you can run it more easily in Docker.
You have to have a login, and it always uses SSL.
We care about security; it's very easy.
Let's automate everything, data everywhere.
I hope you enjoyed it. I'll be doing some more talks with Conf42; these guys are awesome.
Enjoy the rest of the talks, thank you.