An Introduction To Cloud-Native Geospatial Vector Formats
In a previous post, we discussed cloud-native geospatial file formats for raster data. For an in-depth discussion of cloud-native geospatial, we recommend this podcast episode.
Cloud-native refers to a paradigm of storing data for efficient and direct access from the cloud, without the need for a dedicated server or local caching. The geospatial community is in the midst of something of a cloud-native revolution, driven by the rapid growth of the Cloud-Optimized GeoTIFF (COG) raster format.
The logical next step is to develop a vector (point, line, and polygon) format with all, or most, of the best features in COGs. In this post, we will review several of the most prominent formats in development and show examples of how you can work with these formats for yourself.
COGs contain internal overviews, which are lower-resolution copies of the original that speed up visualization at different scales, and an internal tile layout that acts as a spatial index. Both can be accessed via HTTP Range Requests, a way of fetching only a portion of a file at a time rather than the entire thing.
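To make the Range Request idea concrete, here is a minimal sketch using only Python's standard library. The URL is a placeholder and the helper name `build_range_request` is ours, chosen for illustration. The code only constructs the request; sending it requires a server that supports range requests.

```python
import urllib.request


def build_range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build a GET request for bytes start..end (inclusive) of a remote file."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # The Range header asks the server for part of the file only.
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical URL, used only for illustration.
request = build_range_request("https://example.com/data/image.tif", 0, 16383)
print(request.get_header("Range"))  # bytes=0-16383

# Sending it would look like this. A server that honors the header answers
# with status 206 (Partial Content) and only the requested bytes:
# with urllib.request.urlopen(request) as response:
#     first_chunk = response.read()
```

A client typically reads the start of the file first to learn where things are stored, then issues further range requests for just the parts it needs.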
There are several challenges to making a vector format that acts like a COG and is compatible with many big data frameworks. See this great blog post by Chris Holmes for an in-depth explanation of the challenges and possible solutions for cloud-optimized vectors.
What makes FlatGeobuf a truly cloud-native format is its spatial indexing scheme.
FlatGeobuf orders and indexes features so that data stored close together in the file are also close together in space. What this means is that you can use HTTP Range Requests to access just the portion of the file that contains data from a specified area of interest.
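The idea of keeping spatial neighbours next to each other in a file can be sketched in a few lines of plain Python. FlatGeobuf itself uses a packed Hilbert R-tree; the sketch below swaps in the simpler Z-order (Morton) curve purely for illustration, and the sample points and function names are made up.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def morton_key(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat pair to a grid cell, then to a Z-order key."""
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave_bits(x, y, bits)


def in_bbox(lon: float, lat: float, bbox: tuple) -> bool:
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Made-up sample features: (name, lon, lat), deliberately shuffled.
points = [
    ("Tokyo", 139.69, 35.69),
    ("Paris", 2.35, 48.86),
    ("Sydney", 151.21, -33.87),
    ("Osaka", 135.50, 34.69),
    ("Amsterdam", 4.90, 52.37),
    ("Nagoya", 136.91, 35.18),
    ("Brussels", 4.35, 50.85),
]

# Writing features in key order keeps spatial neighbours adjacent in the file.
ordered = sorted(points, key=lambda p: morton_key(p[1], p[2]))
names = [name for name, _, _ in ordered]
print(names)

# A bounding box around Japan now matches one contiguous run of records,
# which a reader could fetch with a single range request.
japan_bbox = (129.0, 30.0, 146.0, 46.0)
hits = [i for i, (_, lon, lat) in enumerate(ordered) if in_bbox(lon, lat, japan_bbox)]
print(hits)  # [1, 2, 3]
```

A Z-order curve has jumps at cell boundaries, so nearby features can occasionally land far apart; that is one reason real formats tend to prefer a Hilbert curve combined with a tree index.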
To see an example of FlatGeobuf in action, take a look at this demo.
Spatial indexing isn’t a new technology, but this implementation in FlatGeobuf enables streaming of data directly from cloud storage (think streaming video on YouTube, where small portions of the video load at a time; same concept!).
Understanding the technology behind FlatGeobuf is interesting, but not necessary for working with the format in a practical sense. The important concept to understand is that FlatGeobuf enables faster reading and streaming capabilities.
This opens up the possibility of storing your data in cloud storage, like Amazon’s S3 buckets, and accessing the data directly from a client without ever having to read the entire file and incur the associated time and storage costs.
Spatial indexing lets you query only the data within a relevant bounding box via HTTP range requests.
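In practice, a bounding-box read might look like the sketch below, which uses the `bbox` argument of `geopandas.read_file`. The helper name, file name, and coordinates are hypothetical. Whether only part of a remote file is actually transferred depends on your GDAL build and GeoPandas version, so treat this as an illustration of the `bbox` argument rather than a guarantee about network behavior.

```python
def read_features_in_bbox(path_or_url, bbox):
    """Read only the features that intersect bbox = (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    if min_x > max_x or min_y > max_y:
        raise ValueError("bbox must be (min_x, min_y, max_x, max_y)")
    # Imported here so the check above works even without GeoPandas installed.
    import geopandas

    # GeoPandas hands the bounding box to the underlying reader as a spatial filter.
    return geopandas.read_file(path_or_url, bbox=bbox)


# Hypothetical file and bounding box, for illustration only:
# gdf = read_features_in_bbox("counties.fgb", (-105.3, 39.6, -104.6, 40.0))
```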
The FlatGeobuf page specifies that the format has read/write support in GDAL (3.1+), Fiona (1.8.18+), and QGIS (3.16+).
Apache Parquet is a free and open-source column-oriented data format which is used in many cloud computing frameworks, and is the standard for many data lakes and warehouses. It has emerged as the go-to data format for many cloud computing applications because of its speed, convenience, and versatility.
GeoParquet is a format being developed under the Open Geospatial Consortium (OGC) and built on the standard Parquet format, with the goal of bringing geospatial data to cloud computing. On top of the columnar Parquet format, GeoParquet specifies which columns hold the spatial data, the coordinate or spatial reference system, and the encoding of the geometries.
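As a rough illustration, GeoParquet records this information as a JSON document stored under a `geo` key in the Parquet file metadata. The sketch below builds a trimmed-down example and reads it back with the standard library. The values, including the version string, are illustrative, and the field names reflect our reading of the specification, so check the current spec before relying on them. A real file carries more fields, such as the CRS.

```python
import json

# What a writer might store under the "geo" key of the Parquet file metadata.
# Trimmed, illustrative example; not a complete or authoritative document.
geo_json = json.dumps({
    "version": "0.4.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "bbox": [-180.0, -90.0, 180.0, 90.0],
        }
    },
})


def describe_geometry_columns(geo_json: str) -> dict:
    """Return the primary geometry column and the encoding of each geometry column."""
    meta = json.loads(geo_json)
    encodings = {name: col["encoding"] for name, col in meta["columns"].items()}
    return {"primary": meta["primary_column"], "encodings": encodings}


print(describe_geometry_columns(geo_json))
# {'primary': 'geometry', 'encodings': {'geometry': 'WKB'}}
```

Because this is just extra metadata on an ordinary Parquet file, tools that know nothing about geospatial data can still read the other columns as usual.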
An advantage of this is that data can be read into many cloud data infrastructures that already have support for the Parquet format, meaning you can read, analyze, and write geospatial data using the exact same tools as you would with tabular data.
Currently, GeoParquet does not have a stable long-term release. It is actively being developed, and read and write support is rapidly being added to geospatial libraries, some of which we covered in a previous blog post on geospatial Python. For the most up-to-date list of features and support, see the GeoParquet GitHub page.
Want to start using GeoParquet? GeoPandas has great read/write support, as shown in the following code snippet.
```python
import geopandas
gdf = geopandas.read_parquet("<filename>.parquet")  # read GeoParquet into a GeoDataFrame
gdf.to_parquet("<filename>.parquet")  # write the GeoDataFrame back to GeoParquet
```
See this for an example of working with GeoParquet in GeoPandas and creating an interactive map with Folium.
The last format we will touch on is the Cloud Optimized Shapefile, which is still in its conceptual stage. When discussing FlatGeobuf, we mentioned its optimized spatial index. Spatial indexing isn’t a new piece of tech. In fact, the Shapefile format, developed in the early 1990s, can carry a spatial index in a sidecar file: the proprietary “.sbn” file or an open-source “.qix” alternative. (The “.shx” file, by contrast, is a positional index of record offsets rather than a spatial one.)
The main advantage of creating a Cloud Optimized Shapefile format is that it would have full backwards compatibility, and that almost all geospatial software already has Shapefile reading capability, making it highly versatile. However, the Shapefile is an older format with several other issues that make it challenging to scale on the cloud (see http://switchfromshapefile.org/).
Cloud-optimized vector formats will be useful across many different applications, from the multi-million-dollar commercial satellite company to the individual freelance GIS analyst. For example, companies like Planet and Carto support the open development of GeoParquet for use in their cloud systems and user interfaces, but these formats can also help you in your solo GIS work.
Overviews and tiling mean you don’t have to load an entire dataset every time you want to access the data, making rendering quicker and easier on processors. This is just the beginning; cloud-optimized vector is in its adolescence, and the best is yet to come.
So, which of these formats is the emerging gold standard? Well… the jury is still out on that one. This is an exciting time to be working in the geospatial field. The cloud-native geospatial movement is rapidly developing solutions to work with location data at scale.
The formats that we use two years from now might not even be in development yet! Overall, FlatGeobuf is a fantastic format with the most documentation, but GeoParquet is quickly becoming the leader of the pack.
Thank you for reading this far and thank you to the fantastic open-source geospatial communities that are driving progress for the greater good.