Simeon’s background is in software engineering. He spent most of his career working with engineers and scientists, building tools to make them more efficient and more effective in what they do. He helps take scientific and engineering innovations and turn them into commercial products.
WHAT IS APACHE SPARK?
It’s a framework for Big Data applications, and that means different things to different people.
For some, it’s a data engineering tool used for processing large volumes of data on a regular schedule. For others, it’s an analysis tool for interactive exploration of data.
It’s horizontally scalable because it uses a construct that allows you to take the work on your laptop and deploy it to a larger compute infrastructure, such as a cluster.
Is this the concept of distributed databases?
It talks to distributed databases, but this is distributed computation.
Imagine taking the work you’re doing on your laptop with a small amount of data. Now try the same operation with data that won’t fit into your computer. Apache Spark can do that without reconstructing or reformatting how you’ve expressed your computation.
Say you wanted to count words in a document. You’d probably do it with some software locally on your laptop.
What if you wanted to count all the words in Wikipedia and monitor the number of word changes over time? That will be a larger compute job for your computer, and Apache Spark harnesses the power of larger compute infrastructures without you having to be an expert in distributed computing.
You don’t have to rethink your logic for the programming you’ve done or the algorithms you’ve created. Pass everything over to Apache Spark. It’ll figure out and transform what you have into how it should be in a massively parallel environment.
That’s why Apache Spark was chosen as the foundation for RasterFrames. It’s a tool for people who wouldn’t usually have access to an engineering team to restructure their analyses for a cluster environment. Apache Spark provides APIs that are friendly to both data engineers and “generic” analysts.
Here, data engineers would be software engineers who regularly work to restructure data for other people to analyze. Analysts would be anyone with a quantitative background: cartographers, environmental scientists, data scientists, or machine learning experts.
With its SQL interface (Apache Spark SQL), Apache Spark provides a unified SQL and data frame view into your data and compute job.
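The word-count example above can be made concrete with a small sketch. This is plain Python, not the Spark API: the partition and distributed_count helpers are hypothetical stand-ins for what a cluster framework would do with your data. The point it illustrates is that the counting logic itself (count_words) stays the same whether it runs once on a laptop or once per partition.

```python
from collections import Counter
from functools import reduce


def count_words(text: str) -> Counter:
    """The 'laptop' logic: count the words in one chunk of text."""
    return Counter(text.lower().split())


def partition(lines: list, n: int) -> list:
    """Split the input into n partitions (hypothetical stand-ins for cluster workers)."""
    return [lines[i::n] for i in range(n)]


def distributed_count(lines: list, n_partitions: int) -> Counter:
    """Run the same count_words logic per partition, then merge the partial counts."""
    partials = [count_words(" ".join(part)) for part in partition(lines, n_partitions)]
    return reduce(lambda a, b: a + b, partials, Counter())


lines = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
]
local = count_words(" ".join(lines))                 # everything in one pass
spread = distributed_count(lines, n_partitions=2)    # same logic, split and merged
```

Both routes produce the same counts. In a real Spark job the splitting, shipping, and merging would be handled by the framework rather than by helper functions like these.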
WHAT IS RASTERFRAMES?
It’s what makes geospatial raster data a first-class citizen within the Spark ecosystem at a technical level.
Spark is agnostic as to the type of problem you’re going to solve. It’s geared towards typical business analysis or scalar data analysis. Geospatial data is more challenging because, without a unified interface over it such as the one RasterFrames provides, you have to deal with significant technical and esoteric constructs to manage the data.
An analyst who wants to measure something about plant health, or detect changes in a construction site, has to deal with too much “bookkeeping”.
When was this raster taken? Where is it? Where on the planet does it exist? What sensors were used in creating the image?
We manage these things and do the “bookkeeping” throughout the analysis inside RasterFrames. We only expose it when the user asks for it or when it’s absolutely necessary for them to think about it.
The focus is on the data, and on the type of processing analysts are looking to construct.
IS THIS A MULTI-DIMENSIONAL ARRAY OF DATA?
A RasterFrame is a data frame with geospatial raster data in it, and the data frame is the lingua franca of data scientists. It’s a tabular structure, like a database table, that’s interoperable with SQL.
Think of it as a large spreadsheet where the data on a particular row of a data frame is a specific place on the planet at a specific point in time. The columns are different layers on a map. You can do operations in this spreadsheet across columns and aggregations throughout the rows, such as computing average luminance for a location.
When it’s across a row, you can combine bands, like taking the RGB channels of an image and combining them into a composite of some sort. That spreadsheet is then stretched out across multiple computers. Now you’re starting to understand how you can think about it as a small table or a small spreadsheet.
But in fact, it’s being distributed across a lot of compute hardware.
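The spreadsheet picture can be sketched in a few lines of plain Python. Everything here is a simplified assumption for illustration: the Row layout, the tiny 2 x 2 tiles, and the brightness function (a simple mean of the three bands, not a true luminance formula) are hypothetical and are not the RasterFrames API. It shows one operation across a row (combining bands) and one aggregation down the rows (an average for the location).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Row:
    """One row: a place on the planet at a point in time (hypothetical layout)."""
    extent: tuple      # (xmin, ymin, xmax, ymax)
    timestamp: str
    red: list          # each band is a small 2-D tile of cell values
    green: list
    blue: list


def brightness(row: Row) -> list:
    """Across a row: combine the three bands cell by cell into one composite tile."""
    return [
        [(r + g + b) / 3 for r, g, b in zip(r_line, g_line, b_line)]
        for r_line, g_line, b_line in zip(row.red, row.green, row.blue)
    ]


def tile_mean(tile: list) -> float:
    """Average of all cells in one tile."""
    return mean(cell for line in tile for cell in line)


gray = [[30, 60], [90, 120]]
rows = [
    Row((0, 0, 10, 10), "2020-06-01", gray, gray, gray),
    Row((0, 0, 10, 10), "2020-07-01",
        [[0, 0], [0, 0]], [[30, 30], [30, 30]], [[60, 60], [60, 60]]),
]

# Column-wise work within each row, then an aggregation over all rows.
composites = [brightness(row) for row in rows]
average_brightness = mean(tile_mean(tile) for tile in composites)
```

In a distributed setting the list of rows would be spread across machines, but the per-row function and the aggregation would be expressed the same way.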
TELL ME MORE ABOUT THE SPATIAL BOOKKEEPING IN RASTERFRAMES
We go to extreme lengths to make life easier.
Let’s say you have one RasterFrame containing data over a city at one resolution. You have another data set over the same city at a different time, and it’s at a different resolution.
Imagine trying to get those two tables merged as a join, as it is called, and have each of those rows represent the same place on the ground. To do that, you have to ensure that the resolution and the alignment of the images on both tables or RasterFrames are the same. The row that results from the join should be coherent for space and resolution.
Each column has metadata associated with it. A row corresponds to an area on the planet. That extent is expressed in a coordinate reference system, or CRS (the way pixels are projected onto the surface of the earth). When you join two sets of disparate raster data, they need to line up: a pixel on one must cover the same area as a pixel on the other.
The metadata encoded inside these columns does all that automatically. It keeps track of the resolution, the CRS, and the extent. Users can focus on what interests them instead of on bookkeeping, and it’s still there for them if they need it.
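To give a feel for the bookkeeping involved in lining up two rasters, here is a minimal sketch of nearest-neighbor resampling in plain Python. It is an illustration under stated assumptions, not how RasterFrames implements alignment: the resample_nearest function is hypothetical, both extents are assumed to be in the same CRS (so no reprojection is done), and row 0 of a tile is assumed to be its northern edge.

```python
def resample_nearest(tile, src_extent, dst_extent, dst_cols, dst_rows):
    """Resample tile (covering src_extent) onto a dst_cols x dst_rows grid over dst_extent.

    Extents are (xmin, ymin, xmax, ymax) in the same CRS; row 0 is the top (ymax) edge.
    Target cells whose centre falls outside the source are filled with None.
    """
    src_rows, src_cols = len(tile), len(tile[0])
    sxmin, symin, sxmax, symax = src_extent
    dxmin, dymin, dxmax, dymax = dst_extent
    src_w = (sxmax - sxmin) / src_cols    # source cell width
    src_h = (symax - symin) / src_rows    # source cell height
    dst_w = (dxmax - dxmin) / dst_cols    # target cell width
    dst_h = (dymax - dymin) / dst_rows    # target cell height

    out = []
    for r in range(dst_rows):
        y = dymax - (r + 0.5) * dst_h     # centre of the target cell
        line = []
        for c in range(dst_cols):
            x = dxmin + (c + 0.5) * dst_w
            sc = int((x - sxmin) // src_w)    # source column containing x
            sr = int((symax - y) // src_h)    # source row containing y
            inside = 0 <= sr < src_rows and 0 <= sc < src_cols
            line.append(tile[sr][sc] if inside else None)
        out.append(line)
    return out


coarse = [[1, 2],
          [3, 4]]
extent = (0.0, 0.0, 20.0, 20.0)

# Bring the coarse 2 x 2 tile onto a finer 4 x 4 grid over the same extent.
aligned = resample_nearest(coarse, extent, extent, dst_cols=4, dst_rows=4)

# Same tile, but the target grid is shifted east and only half overlaps.
shifted = resample_nearest(coarse, extent, (10.0, 0.0, 30.0, 20.0), dst_cols=2, dst_rows=2)
```

Once both tables are on the same grid, a row-by-row join can pair cells that cover the same ground. Tracking the extent, resolution, and CRS that make this possible is exactly the metadata described above.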
ONCE THE DATA IS IN THE RASTERFRAMES, CAN YOU JOIN OTHER DATA WITH IT?
The first thing somebody does after they successfully load an image is to combine it with some other data. Soil samples, building footprints, or census data all have a spatial component: they’re mapped to a place on the ground, and we can do a spatial join. We know where the image was taken, and it’s in a table form.
If there is another table containing our vector data (a polygon with attributes associated with it), you can do a spatial join between the two using operators similar to what you might find in PostGIS: things like ST_Intersects, ST_Contains, or ST_Overlaps. These geospatial predicates are also available in RasterFrames.
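The idea of a spatial join can be sketched in plain Python. This is a deliberately simplified assumption: the intersects function below only handles axis-aligned bounding boxes, whereas a predicate such as ST_Intersects works on arbitrary geometries, and the table layouts and names (tile_id, parcel, bbox) are hypothetical.

```python
def intersects(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def spatial_join(raster_rows, vector_rows):
    """Pair every raster row with every vector row whose box intersects its extent."""
    return [
        {**raster, **vector}
        for raster in raster_rows
        for vector in vector_rows
        if intersects(raster["extent"], vector["bbox"])
    ]


raster_rows = [
    {"tile_id": "A", "extent": (0, 0, 10, 10)},
    {"tile_id": "B", "extent": (10, 0, 20, 10)},
]
vector_rows = [
    {"parcel": "farm-1", "bbox": (2, 2, 4, 4)},
    {"parcel": "farm-2", "bbox": (18, 8, 25, 12)},
    {"parcel": "farm-3", "bbox": (30, 30, 40, 40)},
]

joined = spatial_join(raster_rows, vector_rows)
pairs = [(row["tile_id"], row["parcel"]) for row in joined]
```

Each joined row carries both the raster columns and the vector attributes, so the attributes can be used directly in the rest of the analysis. The nested loop is only for clarity; a real engine would avoid comparing every pair.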
IS THIS AN SQL ONLY SOLUTION? DOES IT SPEAK OTHER LANGUAGES?
One benefit of building something on Apache Spark is the capability of deployment in multiple languages. The one we focus on the most is Python. The underlying implementation is in Scala, which runs on the Java Virtual Machine. You can write your jobs in Python, Scala, Java, or SQL and treat many operations as functions within an SQL query.
ISN’T ALL THIS DEMOCRATIZING DUMBING THINGS DOWN?
The goal is to get the minutiae, the plumbing, or the bookkeeping out of the way of science.
Most of our users are data scientists. They spend 80% of their time cleaning data and only 20% on modeling and analysis, which is what they really want to do.
Any improvement we can make to the ergonomics of working with the data, such as automatically handling the bookkeeping or helping to assess the quality of the data, is time they can redirect to modeling and analysis.
We’re not taking a job away from professionals with highly specialized degrees; they’re the ones who understand how to interpret the data. We’re providing the tools to manipulate and transform the data so they can apply their science or expertise to it.
WHO IS RASTERFRAMES FOR?
Going back to democratization, we wanted to make data available to process for quantitatively oriented people. They might not know about the intricacies of CRS, extents, or cell types and things like that.
The heart of the process of developing a model or doing an analysis is exploring the data that’s there.
What do I have to work with? What is the quality? What kind of distributions does it fall under?
Doing that on “Big Data” can be difficult. You don’t want to have to launch a batch job (a job that runs offline while you’re out getting coffee) to find cloud cover over a small area.
You want to be able to query things interactively, with REPL-like interactivity. A REPL (read-eval-print loop) is a command-line tool for typing in a language like Python or SQL. You type in the commands to do the analysis, and then you transform that in other ways again and again until you find what you need.
You may find that for your analysis, you need a little bit of A, a bit of B, and a whole lot of C. You figure out the different ways of doing things by exploring and experimenting interactively. Once you’ve settled on your method, you can scale it up and run it in a batch job. That might take you hours of compute time and dozens or hundreds of computers if you’re doing it over a very large-scale area.
Spark works in both interactive and batch modalities, whereas a lot of tooling out there only works well in one. It also has streaming capability, which we’ve not introduced into RasterFrames yet.
SOLVE THE PROBLEM IN THE SMALL BUT EXECUTE IT ON THE BIG
Here’s a recent example. Astraea was locating utility-scale solar farms across the planet. Initially, we trained our model in the United States because there’s good data about the US. We trained it by finding locations that are definitely solar farms and areas that are definitely not solar farms and fed them to the machine learning algorithm.
It worked well in the US. We then re-deployed the same model to Brazil, China, and some other places. Depending on the region, we got false positives.
Where there’s tilling in the ground, creating a grid-like structure, it can throw off the algorithm. For some reason, radish farms also trigger the algorithm. We go to those locations and pull out radish farms by specifying to the algorithm that this is not a solar farm.
You can iteratively develop the algorithm to do things in a reproducible way. If it ran over the United States, you could shift it to another place in the world.
Chipping is carving out small samples of imagery, which doesn’t have to be geospatial. The samples are regularly sized squares of imagery showing things you’re interested in, or things you can already identify, that you then use to train a model. They’re chips in the sense that they’re carved-out chunks of data of a regular size.
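Carving an image into regularly sized chips can be sketched as follows. This is a minimal plain-Python illustration, not the RasterFrames implementation: the chips function is hypothetical, the image is a small nested list, and skipping partial squares at the edges is one design choice among several (padding is another).

```python
def chips(image, size):
    """Yield (row_offset, col_offset, chip) for each full size x size square in image.

    Partial squares at the right and bottom edges are skipped so that every
    chip has the same shape.
    """
    n_rows, n_cols = len(image), len(image[0])
    for r in range(0, n_rows - size + 1, size):
        for c in range(0, n_cols - size + 1, size):
            chip = [line[c:c + size] for line in image[r:r + size]]
            yield r, c, chip


# A 4 x 5 "image" whose cell values encode their own position (10 * row + col).
image = [[10 * r + c for c in range(5)] for r in range(4)]
result = list(chips(image, size=2))
```

Keeping the offsets alongside each chip means its position in the source image is not lost, which matters when the chips are later matched against labels or locations.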
CAN WE EXTRACT ALL TENNIS COURTS FROM OPENSTREETMAP?
RasterFrames uses standards as much as possible. A well-known standard is an image format called a GeoTIFF. There is a way of formatting a GeoTIFF that falls under a newer standard called a Cloud Optimized GeoTIFF (COG). It’s a GeoTIFF with a specific prescribed internal tiling and a certain number of zoom levels within that. Structuring a file in that way allows us to do what we call range reads.
Another way of looking at chipping is that if you have a 10-gigabyte file, but you just want a little corner of it, you don’t want to download the entire file. You want five chips over in one corner of the image and quickly get to those bits of data.
RasterFrames pulls out the pixels you need, saving you network bandwidth and time. Suppose you want to find all the radish farms in the US. In that case, you don’t want to have to download entire regions or images that might have stuff that’s not useful to you. You go to OpenStreetMap and find all the farms in an area that might have radishes.
Chipping allows you to fetch only the data relevant to your machine learning model or the type of analysis you’re doing, which gets you to an answer quicker.
The chips are returned to the database as a data frame. Not as a tiny image, but as a data frame that you could immediately integrate into whatever else you’re working on.
You can feed them directly into machine learning, whether for training a model or for scoring with it. Or you can read them, tweak them, normalize them, or do whatever else you might want during the standard ETL processing for machine learning. Save the result either as individual small GeoTIFFs or in a Big Data format, such as Parquet, for subsequent analysis.
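The benefit of internal tiling for range reads can be illustrated with a little arithmetic. The sketch below is an assumption-laden illustration rather than a GeoTIFF reader: the tiles_for_window function is hypothetical, and the image size, tile size, and window are made-up numbers. It works out which internal tiles a small pixel window touches, which are the only tiles a reader would need to request.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size):
    """Return (tile_col, tile_row) indices of the internal tiles a pixel window touches."""
    first_col = col_off // tile_size
    first_row = row_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        (tc, tr)
        for tr in range(first_row, last_row + 1)
        for tc in range(first_col, last_col + 1)
    ]


# Hypothetical numbers: a 10,000 x 10,000 pixel image stored in 512 x 512 tiles.
image_size, tile_size = 10_000, 512
tiles_per_side = -(-image_size // tile_size)    # ceiling division
total_tiles = tiles_per_side ** 2

# A 300 x 300 pixel window near the top-left corner of the image.
needed = tiles_for_window(col_off=400, row_off=100, width=300, height=300,
                          tile_size=tile_size)
```

With these numbers the window touches only a couple of tiles out of several hundred, so fetching just those byte ranges avoids downloading the rest of the file.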
WHAT ABOUT VARYING RESOLUTIONS AND DIFFERENT TIMEFRAMES?
Tile servers, as an example, can be an optimization around not having to render your vector data all the time.
In analysis, we encourage the user to use vector data first, if they have it, because of its arbitrary resolution. That said, regardless of the source of the raster data, it typically has some sort of temporal component to it. For RasterFrames, that’s just another column in the table that defines the date/time this raster is associated with.
WHAT IS RASTERFRAMES NOT FOR?
It’s not a tool for making maps. It’s not a replacement for Esri or QGIS or cartographic tooling. This is a quantitative analysis tool. You can create map layers, but that’s not its strong suit.
Now, if you wanted to prepare the data in a format to create map layers with, it’s excellent at that part of the processing chain.
WHY DO YOU THINK GEOSPATIAL IS THE NEXT BIG DATA SET?
I would argue it’s already a Big Data set.
Commercial satellite providers produce somewhere between 100 and 200 terabytes of imagery a day, a monstrous amount of information. Sentinel-2 has five years of daily refresh data. We have 40+ years of Landsat data. It’s a massive amount, particularly in the temporal dimension, where you can do longitudinal studies.
What other data set do we have that goes 40 years back? It’s probably full of untapped capacity or information. But we’ve not had the capacity, until now, to scale that up to a global perspective without a significant amount of engineering effort.
WHERE ARE WE ON THE HYPE CYCLE AROUND EARTH OBSERVATION DATA?
Depends on who you ask.
For an environmental scientist, an urban planner, or somebody who has been using geographic data for decades, this is just a great new data set they’re already working with to incorporate into their analyses.
But how does a large Fortune 500 company monetize this data? At Astraea, we’re finding handholds to exploit it.
I’d say we are just coming out of the trough of disillusionment.
It all starts off with an innovation trigger, goes steeply up to the peak of inflated expectations, and then drops to the trough of disillusionment. Then it moves on to the slope of enlightenment and heads off to the plateau of productivity.
Progressive businesses see how, through simple integration of raster data, they can tackle tasks that might have required lots of regular manual labor.
Examples include energy applications and easement monitoring: determining whether plant growth is encroaching upon power lines, or whether some rogue person is building a house on owned property. Those are problems that we’re seeing come up in the industry and will soon be strong contenders for good business use cases.
WHEN WILL RASTERFRAMES GET TO THE PLATEAU OF PRODUCTIVITY?
One goal of RasterFrames is to make the raster imagery look like any other data.
As a corporation, you might analyze accounts, transactions, or sales data; those are different data types. The tooling is agnostic to what the contents are. It provides you the fundamental operations to manipulate the data: combining columns, doing statistical analysis, those sorts of things.
RasterFrames is adding those capabilities through a concept known as Map Algebra. It’s the fundamental construct for doing things like computing the distance or difference between two sets of times or combining imagery bands to create something that highlights vegetation. This will make geospatial raster data just like any of your other data. At that point, it will become mundane, boring, and unsexy.
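As an illustration of combining bands to highlight vegetation, here is a minimal sketch of a cell-by-cell ("local") Map Algebra operation in plain Python. The choice of index is an assumption: the text does not name one, so the sketch uses the widely known normalized difference of near-infrared and red, NDVI = (NIR - Red) / (NIR + Red). The local_op and normalized_difference functions and the sample values are hypothetical and are not the RasterFrames API.

```python
def local_op(fn, tile_a, tile_b):
    """Apply a two-argument function cell by cell across two equally sized tiles."""
    return [
        [fn(a, b) for a, b in zip(line_a, line_b)]
        for line_a, line_b in zip(tile_a, tile_b)
    ]


def normalized_difference(a, b):
    """(a - b) / (a + b), with None where the denominator is zero."""
    return (a - b) / (a + b) if (a + b) != 0 else None


# Made-up reflectance values for a 2 x 2 tile in two bands.
nir = [[0.50, 0.40],
       [0.30, 0.00]]
red = [[0.10, 0.40],
       [0.10, 0.00]]

ndvi = local_op(normalized_difference, nir, red)
```

Higher values indicate a stronger near-infrared response relative to red, which is characteristic of healthy vegetation. The same local_op helper could apply any other two-band formula, such as a simple difference between two dates.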
IF YOU HAD TO LEARN A PROGRAMMING LANGUAGE, WHAT WOULD IT BE?
Python. Not because it’s the best language in the world, but because it has the best ecosystem in the world for this type of domain or any kind of analysis.
R would be the second contender. We’ve thought about adding R support to RasterFrames, but we just haven’t had the demand. 95% of the touch points we have with RasterFrames are people who use Python, understand Python, and are comfortable with Python.
After Python, I suggest learning Pandas. It’s the natural data frame library. Spark DataFrames are modeled after Pandas DataFrames, and there’s interoperability between the two.
Did you think Simeon was going to choose SQL as the language of choice for geospatial data? Were you surprised he picked Python?
I can’t wait to see imagery indistinguishable from other data types and access to geodata and functionality being democratized. It’s hard to even imagine the kinds of economies that will evolve from this in 20 years. One thing is for sure; if we allow people to contribute, they will. If we ask them to participate and give them the tools, they will.