Mike Flaxman is the Chief Data Scientist at OmniSci, the makers of a spectacular GPU-based database.
WHAT IS GPU PROCESSING? HOW DOES IT WORK?
To answer that, let’s talk about basic computer architecture.
If you’re lucky, your computer might have 20 CPUs. If you have an average gamer’s graphics card, you could have 20,000 GPU elements. GPUs have orders of magnitude more processors. However, they’re not as general-purpose as CPUs. They do discrete bits of work that are more or less the same weight. If you can figure out how to subdivide a problem into 20,000 pieces of equal weight, they will do it in less than a snap of a finger.
GIS data often has that characteristic. With raster pixels, for instance, we want to do some logic across layers that’s equivalent for each pixel. On the vector side, we’ve got geospatial features. Let’s say we want to buffer them. The first step in buffering is independent of other features. We can do a lot of that work in parallel. Essentially, the idea is to use GPU to accelerate database operations and then the geospatial database operations. Since you’ve got a lot more processing units, you divide the work into much finer-grained pieces.
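The idea of cutting a raster into equal pieces that need no coordination can be sketched in a few lines of Python. This is only a CPU stand-in: the thread pool plays the role of the GPU's many processing units and gives no real speedup. The layer names, the suitability rule, and every function name here are assumptions made for illustration, not anything from OmniSci.

```python
from concurrent.futures import ThreadPoolExecutor

def suitability(slope, landcover):
    """Hypothetical per-pixel rule: the same small amount of work for every pixel."""
    return 1 if slope < 15 and landcover == 2 else 0

def process_chunk(chunk):
    # Each chunk is a list of (slope, landcover) pairs; no chunk depends on another.
    return [suitability(s, lc) for s, lc in chunk]

def split_equal(items, n_chunks):
    """Split items into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def overlay(slope_layer, landcover_layer, n_workers=4):
    """Apply the per-pixel rule across two aligned layers, chunk by chunk."""
    pixels = list(zip(slope_layer, landcover_layer))
    chunks = split_equal(pixels, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves chunk order, so the output lines up with the input pixels.
        results = list(pool.map(process_chunk, chunks))
    return [value for chunk in results for value in chunk]

# Two tiny flattened raster layers (made-up values).
slope = [5, 20, 10, 30, 12, 8]
landcover = [2, 2, 1, 2, 2, 2]
suitable = overlay(slope, landcover)
print(suitable)
```

Because every pixel gets the same rule and no pixel needs its neighbors, the number of workers changes only how the work is divided, not the answer.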
A couple of things are outstanding about OmniSci.
One is that it makes that power accessible to you with regular SQL. It uses equivalent syntax to PostGIS for those who are familiar with it. The other one is that if you want to buffer something, you can use ST_Buffer, and you don’t need to know all of the secret sauce that goes into running that on GPU. You ask a query and get a snappier response, roughly 10 to 100 times faster than conventional GIS.
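As a rough illustration of what "regular SQL" means here, the sketch below builds a PostGIS-style buffer query as plain text in Python. The table name `power_lines`, the column name `geom`, and the helper `buffer_query` are hypothetical; only ST_Buffer itself comes from the discussion above, and the exact argument conventions should be checked against the database documentation.

```python
def buffer_query(table, geom_column, distance):
    """Build a PostGIS-style buffer query as SQL text (hypothetical helper)."""
    # Reject anything that is not a plain identifier so the text stays safe to send.
    for name in (table, geom_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT ST_Buffer({geom_column}, {float(distance)}) AS buffered FROM {table}"

# Hypothetical table and column names, for illustration only.
query = buffer_query("power_lines", "geom", 10)
print(query)
```

The point is that the query says what to compute and nothing about how the work is spread across the GPU.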
On the back end, it leverages the power of the graphics processors we have increasingly on personal machines, and in data centers.
WHAT DOES THE STACK LOOK LIKE?
OmniSci is a three-layer product.
The core of it is an open-source database doing geospatial operations. It speaks SQL.
There is a rendering layer above that, and the rendering layer is a special purpose renderer on GPU. It takes the results of a query and renders it in situ. It can handle billions of polygons or billions of points gracefully with millisecond rendering times.
And then there is a dedicated front end tool that’s called Immerse. It’s a React dashboard app. From an end-user perspective, most of the experience is in the Immerse environment. You add maps, data, and various chart types. There’s no programming required, but everything is wired together.
If you put a histogram and a map on the same dashboard and you select something on the histogram, just like you expect in GIS, that selection propagates. It works for large numbers of chart types, too. It allows people with limited geospatial background to create dashboards useful for their particular workflow.
Naturally, it requires that some people with geospatial talent on the back end put together the right data in the right coordinate systems to feed the front end. That’s essentially the stack; a front end dashboard environment that speaks to the back end in SQL.
There’s an additional SQL call that’s kind of a render map call, and you can either get back from the back end database query results in the form of tables, or rendered results in the form of maps and charts.
HOW DO YOU PUSH DATA INTO IT?
The good news for geospatial folks is that all major geospatial types are direct imports.
You can bring in shapefiles, Esri file geodatabases, GeoJSON, and CSV files that are quasi-spatial or text-based spatial formats. These directly import and show up on your map just as they would in GIS.
The additional part that is useful for large data is that you can point not only to data on disk but also to data on remote cloud services, such as Amazon S3. If you have a massive bucket of shapefiles sitting on S3, you can directly import those from your iPad or over a low-bandwidth network, because the transfer goes directly from Amazon to OmniSci. The data import works like conventional GIS.
In processing terms, there are two levels. If you’re stacking map layers that work like conventional GIS, you can add in any number of map layers from any number of sources. There are additional data types, however, and the most common ones are various tabular databases. There are ODBC connections and JDBC connections to different back ends, there are streaming connections, and you can even attach to a Kafka streaming server.
Say, you’ve got event data from a GPS feed coming in. You put it through Kafka, and those things then get updated on your map on a streaming basis. Similar to GIS, but with a broader diversity of data types.
OmniSci is temporal, not just geo. For everything that has a geo timestamp on it, there’s a special magic that happens on the front end. It can integrate things on time without any programming. As long as the timestamps in the data are in ISO, or can be made into ISO, you can integrate things on time, put in a timeline and animate things over time.
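To show why ISO timestamps make this kind of integration easy, here is a minimal Python sketch that parses ISO 8601 strings and groups events into hourly frames, the sort of bucketing an animated timeline needs. The sample events and the function name `hourly_counts` are made up for illustration; this is not how Immerse does it internally.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(iso_timestamps):
    """Group ISO 8601 timestamps into hourly buckets and count events per bucket."""
    counts = Counter()
    for text in iso_timestamps:
        moment = datetime.fromisoformat(text)  # parses 'YYYY-MM-DDTHH:MM:SS'
        bucket = moment.replace(minute=0, second=0, microsecond=0)
        counts[bucket] += 1
    return sorted(counts.items())

# Hypothetical GPS events, e.g. as they might arrive from a stream.
events = ["2020-06-01T08:15:00", "2020-06-01T08:47:30", "2020-06-01T09:05:00"]
frames = hourly_counts(events)
for bucket, count in frames:
    print(bucket.isoformat(), count)
```

Once every source shares one timestamp format, layers from different feeds can be lined up on the same time axis without custom parsing for each one.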
WHAT OTHER GIS TOOLS DOES OMNISCI PLUG INTO?
The main interface comes with a magical built-in Jupyter notebook integration.
For everything you create in OmniSci, you can hit a Jupyter notebook button, flip over and look at it from the data science point of view. You’re in a scripting environment, but you’re already authenticated, you already have access to the tables that you’ve just seen on your dashboard. You can get to those tables with a variety of Python-based tools.
Those include, for instance, Esri libraries to push and pull things into the ArcGIS universe, or open GIS tools for pushing and pulling from web feature services. The tool can speak to the Python notebook environment and through that to a vast diversity of additional data science and GIS tools.
You can add OGC web map services as base layers and have any number of those available. OmniSci can consume OGC data in that format.
You can think of this as a GIS temporal accelerator layer. If there is utility in getting the data back out as raw data, you can do that. Most of the time, however, because the data volumes are enormous, it makes more sense to render things on the back end, for instance, render map tiles, and consume those on the front end side in your GIS.
WHAT’S THE DIFFERENCE BETWEEN CPU AND GPU PROCESSING?
In traditional CPU processing, your software is often written for a single CPU. If it’s a raster, it goes through your file from top left to the bottom right, and it performs a sequence of operations on every pixel.
In a GPU database, there is a direct access path from the file to the GPU, which is something handled by the graphics card. You’re issuing commands to the GPU to do something, for instance, with every pixel in its buffer.
The difference from a GIS processing point of view is in that parallelization. The most expensive thing in both CPU and GPU these days is not necessarily processing, but data transfer. It can take a fair bit of time to stream your large file onto your CPU. The time is significant, and you want to avoid going back and forth.
In a lot of cases, my GIS workflow used to be to save intermediate file results and then manage all of those intermediate files. What I find myself doing in a GPU database is creating a series of views, and they are evaluated on demand.
It’s lazy evaluation.
You can quickly create a whole stack of views, preview any of them to make sure that you’re not going off the rails, that you’re getting the expected results, but then you can set up a workflow with an extensive process. You can be doing that either by looking at a piece of it or looking at the whole.
When you zoom out in OmniSci, you’re computing the subset of the visible pixels first, and then you’re computing the rest of them. It’s a workflow that keeps things interactive while making sure that you’re getting everything right visually. What you want to avoid doing is writing everything to disk or writing all those intermediate files to disk. You can let the GPU database manage the set of views, and when you’ve got something that’s good, you then persist that to disk. That command history is all aggregated on the object. You don’t lose the ability to go back and edit.
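The stack-of-views workflow can be tried out with any SQL database. The sketch below uses Python's built-in SQLite as a stand-in, not OmniSci, and the table and view names are hypothetical. It shows that a view stores only its definition, gets evaluated when queried, and is written out as a table only when you decide the result is worth keeping.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # SQLite stand-in; table and view names are made up
con.executescript("""
    CREATE TABLE parcels (id INTEGER, area_m2 REAL, landuse TEXT);
    INSERT INTO parcels VALUES
        (1, 500.0, 'forest'), (2, 1500.0, 'forest'), (3, 2500.0, 'urban');

    -- Each view stores only its definition; nothing is computed yet.
    CREATE VIEW forest AS SELECT * FROM parcels WHERE landuse = 'forest';
    CREATE VIEW large_forest AS SELECT * FROM forest WHERE area_m2 > 1000;
""")

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in con.execute(sql)]

# Preview the top of the stack: the whole chain is evaluated on demand.
before = ids("SELECT id FROM large_forest ORDER BY id")

# New data shows up in the views immediately, because no intermediate result was stored.
con.execute("INSERT INTO parcels VALUES (4, 3000.0, 'forest')")
after = ids("SELECT id FROM large_forest ORDER BY id")

# Only when the result looks right do we materialize it as a table.
con.execute("CREATE TABLE large_forest_saved AS SELECT * FROM large_forest")
saved = ids("SELECT id FROM large_forest_saved ORDER BY id")
print(before, after, saved)
```

The view definitions stay editable the whole time, which is what replaces the pile of intermediate files.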
I find myself doing a lot more exploratory data analysis interactively, making sure that I’ve looked at every corner of my data, and that I’m confident the analysis is working correctly. Persisting to disk is less common than in traditional GIS, where you’re doing one operation at a time, or you use something like ModelBuilder as the software to build a workflow.
In a GPU database, you’re doing more stuff visually, and the persistence is the database, so you ultimately persist things. It makes good sense to avoid doing so until you need to. The database is the slowest part, and everything else is interactive.
COMPLEX GEOSPATIAL PROCESSING TASKS, DOES THE USER NEED TO HANDLE THOSE?
It depends on the complexity of the geospatial problem.
I just completed a project with a major telco—a line of sight analysis and radiofrequency mapping. We were casting tens of millions of rays from cell phone towers out into the landscape. Some problems are embarrassingly parallel, such as the initial steps of each ray operation. For each ray, we looked at what it intersects, if it’s going through vegetation, buildings, or free air. The next step is order dependent. If you want to look at the fall off of frequency, you need to do that in the order that the signal would follow from the antenna to the destination. In GIS, that would be something like a cost distance function.
In OmniSci, that’s a window function in the database. The name window function comes from database history, and it is very different from GIS. They’re windows on a one-dimensional line. You can do a cumulative sum. Those things are built into the database, and it’s a little different from the traditional GIS operation.
If you learn and use that function or that set of functions, you can operate in parallel on many thousands of features, but with a sequence of operations on each feature. If you do it that way, you get the full benefits of parallelism, in which case you do need to think about the staging.
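A cumulative sum along each ray is a small example of this kind of window function. The sketch below runs it in Python's built-in SQLite (version 3.25 or later is assumed for window function support) rather than in OmniSci, and the table `ray_segments` with its per-segment loss values is invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # SQLite stand-in; requires SQLite >= 3.25
con.executescript("""
    -- Hypothetical table: one row per segment a ray passes through, in order.
    CREATE TABLE ray_segments (ray_id INTEGER, step INTEGER, loss_db REAL);
    INSERT INTO ray_segments VALUES
        (1, 1, 2.0), (1, 2, 6.0), (1, 3, 1.0),
        (2, 1, 1.0), (2, 2, 1.0);
""")

# PARTITION BY keeps rays independent of each other (the parallel part);
# ORDER BY step enforces the antenna-to-destination sequence within each ray.
rows = con.execute("""
    SELECT ray_id, step,
           SUM(loss_db) OVER (PARTITION BY ray_id ORDER BY step) AS cumulative_loss_db
    FROM ray_segments
    ORDER BY ray_id, step
""").fetchall()

for row in rows:
    print(row)
```

The partition is what lets the database treat many thousands of rays at once, while the ordering inside each partition preserves the sequence the physics requires.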
Then there are other operations that are simply too hard.
For an arbitrary viewshed analysis or any other global operation, GIS ultimately needs to look at all the data. Those things work the same in OmniSci as they would in conventional GIS, but they benefit from the underlying platform speed. It is a modern platform. It’s faster, with parallel reads and writes from disk. You get some benefits under the hood that you don’t need to worry about, even if you’re doing a global operation that’s on the lower end of OmniSci’s performance.
My rule of thumb is to say that it’s ten times faster than conventional GIS for more complex problems. It’s fast because it’s running on a super modern platform. But the speedups you get for the “embarrassingly parallel” stuff are 100 times or more.
It does change the way you think about doing a lot of things. For things that used to be too heavy to do accurately, for instance, you can now crank up the accuracy threshold. You can take on data that you previously would have had to generalize or smooth before you could handle it.
WHAT CAN OMNISCI DO FOR FOLKS WITH LARGE AMOUNTS OF LIDAR DATA?
One of the most fun projects I’ve had a chance to work on was for a major California utility that has hundreds of thousands of miles of power lines.
It would not surprise anybody to learn they had issues with fire, especially fires caused by contact between vegetation and power lines in high wind events. We built a power line fire risk model for them. It was an interesting project because it took in LIDAR data to describe the physical structure of the trees and also remote sensing data to get at the vegetation health and wind data. Historical, forecast, and current data were all major contributing factors.
The risk of a tree falling over goes up with the square of the wind speed. You need to understand the variation in wind speed but then that is applied to data that you extract from LIDAR.
Another use case would be a utility company that contracted with a third party to develop their LIDAR data. It came in the form of 5 million individual LIDAR files sitting on an Azure Blob store. That’s a difficult data set to manage. We were able to import that data and do analysis on it, for instance, to separate vegetation from power lines and then look at the spatial relationship between the vegetation and the power lines at scale. That process was done analytically, very similar to the way you might do it in conventional GIS.
But having that scaling ability meant we could deal with a data source that was tens of billions of LIDAR data points. For analysis purposes, we binned them into one-meter slices. A tree became a set of one-meter slices because the main physics model for describing how trees interact with the wind is a pendulum model that wanted to know the weights of things vertically. We sliced the world into one meter by one meter. We could call them voxels—but they were hexagons, so let’s name them hex-voxels.
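The vertical binning step can be sketched in a few lines of Python. This covers only the one-meter slicing by height, not the hexagonal horizontal grid, and it uses point counts per slice as a rough proxy for weight, which is an assumption rather than the project's actual method. The sample points and the function name `slice_counts` are made up.

```python
import math
from collections import Counter

def slice_counts(points, slice_height_m=1.0):
    """Count LIDAR returns per vertical slice; slice k covers [k, k+1) * slice_height_m."""
    counts = Counter()
    for x, y, z in points:
        counts[math.floor(z / slice_height_m)] += 1
    return dict(sorted(counts.items()))

# Hypothetical returns from a single tree: (x, y, height above ground in meters).
tree_points = [
    (0.1, 0.2, 0.4), (0.0, 0.1, 1.2), (0.2, 0.0, 1.9),
    (0.1, 0.1, 2.5), (0.3, 0.2, 2.6), (0.2, 0.3, 2.7),
]
profile = slice_counts(tree_points)
print(profile)
```

The result is a vertical profile per tree, which is the shape of input a pendulum-style model needs.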
We could apply a wind attenuation function to all of those, in parallel. You can take the direction of the wind and say, “What if the wind moved from the south to the northeast? And what if it increased by 20%?” You can look at the output risk model in real-time, calculated against those billions of features from a GIS perspective underneath the hood.
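For the "what if it increased by 20%" question, the square relationship between wind speed and tree-fall risk mentioned earlier gives a quick worked example. The sketch below assumes risk scales purely with the square of wind speed and ignores wind direction and attenuation. The baseline values and the function name `scaled_risk` are invented for illustration.

```python
def scaled_risk(baseline_risk, wind_factor):
    """Scale a baseline risk by the square of the relative change in wind speed."""
    return baseline_risk * wind_factor ** 2

# Hypothetical baseline risk scores for three hex-voxel columns.
baseline = [0.10, 0.25, 0.40]

# Scenario: wind speed increases by 20%, so the multiplier is 1.2 squared.
scenario = [scaled_risk(r, 1.2) for r in baseline]
for before, after in zip(baseline, scenario):
    print(f"{before:.2f} -> {after:.3f}")
```

Under that assumption a 20% increase in wind raises risk by about 44%, and because each feature is scaled independently, the same calculation applies in parallel across billions of features.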
It’s a risk suitability analysis model that anyone would be familiar with. A regression run against historical data of how these different factors contribute to risk. But from the user experience point of view, it lets you get ahead of the fire. Now you’ve got a risk model that can not only show you current conditions and update as frequently as your data comes in, but you can also use it in a scenario forecast sense. What if the wind shifts, what if we deployed resources over here and not over there?
The number one benefit of OmniSci is the speed and scale to take forms of analysis that were traditionally done as a project, often delivered in paper reports, and turn them into something live. Now you can make it into a real-time dashboard experience that shows current conditions or into a scenario tool that lets people change what they can manage and confront based on what’s going on in the live environment.
Apart from the environmental examples, you can find similar use cases on the factory floor. In cases where traditionally you might have built a static model and delivered that one-off, you can now provide a data science workflow or spatial data science workflow that can respond to changes in events.
That’s an exciting new area for GIS people to get into. Those use cases are high value. They rely on spatial analysis skills that a lot of folks already have. It’s cranking it up by bringing in a real-time component, and the ability to deliver that out to enterprise quickly.
HOW USER FRIENDLY IS THIS?
In the utility company case, different subgroups within the utility needed access in different ways.
There was a data science group internally; they wanted analytic access, and they got it through dashboards and Jupyter notebooks running Python.
There was a whole vegetation management section of the enterprise. They wanted the ability to zoom into the high-quality visualizations looking at the full detail raster data because they can see things in that data that you and I couldn’t see. They’ve spent decades dealing with power line interactions. They want to look at the model in considerably more detail, prioritizing, or designing a work program.
They look at the data and bring in their internal knowledge of locations and areas or stuff we don’t have in the data—they have it in their heads. They’re combining those two things to set out a work program. The delivery for them was through dashboards that could render all this LIDAR stuff on the back end but can be accessed through a web browser.
That was an interactively intense engagement, but it was serving a model that was already built. Building it is the back-end group’s task: the data scientists who are responsible for building and updating these models. They use the front end also, but for feature engineering and feature discovery. They make sure their model is still on track and performing. They typically go with a Jupyter-based workflow, flipping over to the interactive graphics to check that the model is performing correctly or the data is useful. Then they flip back and do stats or data science on it.
These two groups would be fairly common in enterprise these days; a technical group that’s building and maintaining models, and business workgroups that are applying and deploying those models.
THE FUTURE IS CUSTOMIZED INTERACTION ON THE SAME DATA
The concept is empowering but may be unfamiliar to GIS folks that are used to delivering the product in cartographic format.
It’s about designing a user experience.
You give somebody access to the geospatial data and your preliminary cartography, but you also provide them with access to a dashboard. They take it from there. And if their job is workflow order management for vegetation contractors, they will use what you built. Maybe re-jigger it to make their workflow optimal for what they do, day to day.
You build a template or base geospatial dashboard that many people will customize for a lot of different purposes.
One thing I’ve noticed is that nobody can agree on what the relative size and position of the elements should be in one of these dashboards. That’s a good thing, right? It means that people are optimizing things for what they want to look at and what they need to look at.
It’s a more comfortable division of labor if the end users have more power to reorganize their dashboards as needed. The delivery mechanism is that you save your dashboard and set the permissions to share. It’s simple from the producer’s point of view. Underneath is a database. If you want to version things and deploy things carefully, you have that capability. You can run multiple versions of dashboards that are driven from multiple generations of data if you need to be cautious about that.
That’s also a use case for government where they need a record of decision. They need to be able to archive and show that we made this decision on this date given this information in front of us. In any use case where you need that record of decision, it’s essential to have the data management capabilities underneath to do that. You’re versioning not only data, but interfaces or dashboards on that data, and increasingly the data science models as well.
EMPOWERING USERS TO DO MORE BY DESIGNING USER EXPERIENCES
We are no longer just providing a deliverable result, but we are delivering user experiences. I’m sure a lot of geospatial people breathe a sigh of relief when I say that users can design their dashboards.
I remember doing public meetings with maps, having up to 20 versions of something before it would be approved. It was all done on paper plots with a long round trip. It would often take a week to get back in front of the stakeholders and present the next version of something and get it batted back and forth. Having those as digital workflows is hugely beneficial.
What about the customization requests that can drive you batty? Change this font, make this a little bigger, change this color, all of that. With OmniSci, you deliver base cartography, but you give people control over those end-use format issues: they can manipulate the dashboard and make the map part bigger and the table part smaller, or vice versa. You still need to provide them with decent cartography to start with. But you’re able to accommodate a lot of downstream customization. You don’t need to worry about that part. Just make sure the initial experience is good, so that people see all the data.
When I’m delivering such a thing, I need to be careful about the map scale. That’s true of delivering anything on a dashboard these days. You need to make sure when people zoom all the way in or all the way out that the cartography still works. Once you get beyond that, you can be comfortable in passing along those dashboards.
You alleviate the demand for custom apps. If people can customize themselves, they will not be coming back to you immediately for a custom app. There’s still a need and there are use cases for custom apps, but they become less frequent if the user can do reasonable customization.
The non-spatial equivalent might be delivering something as a well-formatted Excel document. People can customize that. In the case of Excel, it’s scarier because they can break the formulas, but in the case of a dashboard, you maintain control of the business logic part. They’re mostly just controlling the cartography and layout.
That’s a reasonable division of labor. Just make sure that the geospatial analyst has done the stats. Most of the time, you don’t want the users trying to redo the stats. They aren’t data scientists, but you’re able to deliver something that could go out to the enterprise. That’s a key benefit of delivering in dashboards.
This is a new experience for me because I spent the early part of my career delivering paper map plots. It’s a critical business benefit for GIS folks, because it gets more people engaged with your work.
DO WE NEED TO USE CPUs AGAIN?
There are still use cases for CPUs, but they lie where the data scale horizontally is so enormous that it doesn’t fit modern GPUs.
I think of OmniSci as being a kind of accelerated data analytics, but you can set it on top of a data lake. We are working on a foreign storage interface. The idea there is to be able to point to data at rest inside a data lake and just extract out the bits you need to generate a particular analysis or dashboard.
Today, a large GPU server or cluster can handle 10 to 100 billion rows of data. Once you get beyond 100 billion rows of data, you’re in territory where you do need Sparkgeo, or some other way to scale out horizontally to the size of the data. I’m sure there are people out there who are doing that. There’s still that universe, and that borderline will continue to shift over time as different technologies shift. The open-source benefit on the OmniSci database core is that you could move it and install it wherever you like. If you’re smart, you’ll move it to where the data is. Data is big. You can then get the analytics done close to the data.
I work a lot with Sentinel-2 data, which comes from the European Space Agency and is hosted by AWS in Frankfurt. Where do I do my data processing? AWS Frankfurt, of course. I move the software to where the data is, so I can do the analysis close to the data.
As the data volumes get huge, we have to think differently about the relationship between data and software. The future is that you move your tools to where your data is. You don’t start a project by moving all the data to your local machine, because it’s too big. You don’t want it on your desktop, anyway. It’s a complex universe out there.
All these tools are changing all the time, but we are getting to the point where moving data to your software is a bad strategy. Whether it’s with GPU accelerated technology or CPU scale-out, you’re going to be doing that close to where the data lives.
For the interactivity, having the power of GPU to render this stuff makes a lot of sense. The observation from OmniSci is that it often makes sense to render the data where you’ve done the analysis and just send the interactive visualization to the client.
Machine learning is another future direction. We’re doing a lot of work on better ways to bring in messy geospatial data.
We have a nonprofit client that does open-source intelligence. They look for nuclear missile sites in North Korea. You don’t get a map with those. You’ve got to do the analysis work. Their analysts discovered by looking at the data that dead-end roads that end in mountains and appear to go nowhere are a good intelligence tell for where nuclear missile sites might be.
They wanted to analyze where all the dead-end roads in North Korea were. We partnered with Planet that has a new product that’s applying machine learning to their daily planetary imagery, extracting out roads and buildings. We took raw remote sensing data, extracted out roads, we had them changing monthly, and then we did suitability analysis based on which roads were changing over time.
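One simple way to think about finding dead-end roads is as a graph problem: treat intersections and road ends as nodes, road segments as edges, and look for nodes touched by exactly one segment. The Python sketch below illustrates that idea only; the project's actual method is not described here, and the tiny road network and the function name `dead_ends` are hypothetical.

```python
from collections import defaultdict

def dead_ends(road_segments):
    """Return nodes that connect to exactly one other node (degree 1)."""
    neighbors = defaultdict(set)
    for a, b in road_segments:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return sorted(node for node, linked in neighbors.items() if len(linked) == 1)

# Hypothetical network: a loop A-B-C with a spur running C-D-E that stops at E.
roads = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
spur_ends = dead_ends(roads)
print(spur_ends)
```

Running the same check on each monthly road extraction and comparing the results would flag dead ends that appear or grow over time, which can then feed the suitability analysis.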
The part that’s new here is fronting GIS with machine learning. You’ve got this raw data out there, you’re using machine learning to interpret and extract features out, and then you’re building analysis on top of that, iterating. In the immediate future, I see a lot more of that.
We’re building tools to better integrate those front end uses of machine learning in our process. Right now, a lot of people are using this on the back end to do machine learning once you have the features built, but we’re also looking at machine learning to help build the pipeline of current geospatial data.
DO WE NEED A BUNCH OF NEW SKILLS? OR WILL THE COMPLEXITY OF THESE TOOLS BE HIDDEN AWAY FROM USERS?
There’s probably a meeting in the middle.
I would say that the biggest change, at least that I experienced, in this new world is the temporal side. I was trained in traditional geospatial stuff; I didn’t have a lot of courses or background in time series analysis. The Sentinel imagery we just talked about arrives roughly every five days, so you end up with a four-year time series of imagery.
It’s not that there aren’t people in the world to teach you how to do time series analysis. It’s just that geospatial traditionally hasn’t had to deal with temporal. That’s the new part that’s dealing analytically with both space and time, but the tooling for that already exists in those two domains. The messy part is whether you can get it to inter-operate.
OmniSci’s interface side certainly has got some of that. Analytically, it’s a deep area. There are deep learning models that only work on time series, for instance. Others will only work on snapshots but work spatially. As we move forward, in terms of skill set, the most important thing probably is to add temporal to your toolkit if you’re not already familiar with it. The access to the underlying tooling is moving to you. You don’t need to move so much to it.
The other piece that everybody is probably already aware of is the change in data science, where you’re using machine learning to extract value from information. Often, when you don’t get the expected result, the answer isn’t to change the model. It’s to change the data.
GIS people are good at it and aware of it, but it’s interesting. You’re not necessarily spending your time building a better technical model or changing the model architecture. At a certain point, once you’ve got the right model architecture, you’re changing the data.
Several cases have come up famously in data science, where the models are giving inaccurate results. And it’s almost always the case that the training data was biased in some way that people hadn’t spotted initially. That’s going to be as true in geo as anywhere else. We need people that can pay attention to that stuff, and use the whole set of tools already in their tool belt to make sure that the data that’s provided to these models is correct so we’re not training models on nonsense.
It’s not entirely new, but it’s going to be an aspect of the relationship between geo and data science for the coming few years. We’ve got a bunch of people being trained in data science, but not having a background in geo or the analytical tools that we’ve got in our field. As we engage with them more, the common ground is training data and high-quality data, because these models are hungry for data. They’re not like a regression that you can train on 12 samples. These are models that want a million samples.
Geospatial folks have a good idea of how to provide those million samples, but also some sophistication about making sure that there’s quality control in that process, and that the stuff we’re training machine learning models on is valid data. We’ve got a lot to contribute to that field. It’s going to be a long-term interaction in the next 20 plus years.
Are you heading over to OmniSci to have a look at some demos? Take a look and kick the tires because there are about a dozen interactive demos there. You can get a good sense of the Immerse side immediately without installing anything. Do you see anything that you could use OmniSci for?