The Implications Of Storing Rasters In Databases
This episode explores the intriguing question of putting rasters in a database. An insightful conversation about why you should or shouldn’t put rasters in a database – and how to do it when you need to.
About The Guest
Paul Ramsey is a renowned advocate for open-source geospatial software and co-founder of PostGIS. With expertise in databases for maps and locations, he specializes in a niche yet vital area of geospatial technology.
The Challenges of Storing Rasters in Databases
The prospect of storing rasters in a database stem from its inherent promise of fast data handling. And given the bulky nature of rasters and the need for swift data retrieval, it seems logical to leverage a database’s capabilities. However, the voluminous nature of raster data poses unique challenges. Here are some of the cons of storing rasters in databases.
Difficult To Manage Database
As a database scales up, it becomes increasingly challenging to manage. Usually, non-spatial databases that are considered large often weigh in at around 20 gigabytes. But raster data, easily tips the scales at terabytes. This not only affects operational aspects like backup and restoration but also introduces added complexity around data replication and storage.
Furthermore, rasters are already indexed data structures, making them accessible without the additional layer of database integration. Therefore, in terms of speed, integrating rasters into a database offers little to no gain.
Underutilized Database Capabilities
Databases are fundamentally built to handle dynamic data and ensure fault tolerance, promising never to lose your data regardless of the circumstances. However, this strength is underutilized with rasters, which are rarely modified. This mismatch of requirements and capabilities further accentuates the difficulties of raster-database integration.
Cloud Native Raster Formats
To optimize the storage and retrieval of rasters without the complexities of database integration, cloud-native raster formats emerge as a viable solution. These formats store data in convenient, easy-to-access structures, enabling efficient retrieval of relevant portions of the data. This eliminates the need to download massive datasets for small requirements and simplifies data management.
Benefits Of Storing Raster Data in Databases
Despite the challenges, there are legitimate reasons to consider raster-database integration. Rasters excel at representing surface information across large areas, such as rainfall or elevation. Geospatial professionals working with vectors can benefit from querying rasters and vectors simultaneously, as it can yield valuable insights. By storing rasters and vectors in the same database, complex analytics using SQL tools become possible, enhancing the capabilities of GIS operators. This fusion opens up new avenues for geospatial analysis, making it a compelling area of exploration despite its challenges.
An alternative approach to storing rasters in databases involves using out-of-database (out-DB) rasters. This method entails storing references or pointers to external pixels instead of the actual pixels. A metadata shell is created in the database, containing all necessary information about the underlying data. This flexibility makes it possible to leverage cloud-native raster files through your database, allowing for seamless access and analysis of large raster datasets without the need for extensive local storage.
Connecting To Out-DB Rasters With GDAL
For out-DB rasters, GDAL plays a crucial role in connecting the database with the actual raster data, which could be stored anywhere from a local file system to a cloud-based storage location such as an S3 bucket or Azure blob. In this regard, GDAL acts as a bridge, creating virtual files that point to the data stored in various locations.
However, while the concept of virtual files offers impressive flexibility, it comes with a significant caveat. Since the data isn’t local to your computing environment, accessing it may involve a certain degree of latency. This latency can be even more pronounced if the database and the geospatial data aren’t located within the same cloud.
In-DB Rasters refer to raster data that is stored directly within the database. In this process, each pixel of your raster data is stored in the database, making the data immediately available for analysis.
Rasters can be loaded into post-GIS using raster2PGSQL, which can handle a multitude of raster formats. It allows you to set some flags determining how the data will be loaded into the database and then generates a SQL file that can be piped directly into your database.
PostGIS Raster Module
PostGIS Raster Module, is an extension of the PostGIS system that caters to the needs of handling raster data. It provides tools for reading raster information, populating the Z dimension of vectors with data from underlying rasters, and more. The module was designed optimistically, envisioning users leveraging more advanced raster functionalities such as:
With the PostGIS raster module, you can convert vector data—points, lines, or polygons—into raster geometry. Conversely, a categorical raster can be transformed into a polygon, facilitating conversion between these two formats. This conversion flexibility is particularly useful for creating masks, processing elevation data, and other applications where raster and vector data need to interact.
Raster Map Algebra
Raster Map Algebra functionality allows you to perform mathematical operations on raster pixels, such as adding two rasters together, subtracting one from another, or creating a new raster from calculated results. While the potential seems limitless, the practicality of this feature faces some execution thread constraints since PostGIS, being an extension of Postgres, can only leverage the core capabilities provided by the Postgres database.
Parallelism In PostGIS Raster Operations
While raster data processing is inherently parallelizable, it’s not a good fit with traditional database models. SQL-based systems, typically operate on a row-by-row basis and lack the detailed knowledge about the data structure that raster-native frameworks possess. And even though the architecture of PostGIS can facilitate some parallelism, it can’t spread the workload across an extensive number of cores. The level of parallelism is dependent on what the core database (POSTGRES) provides.
Tools like GeoTrellis or Google Earth Engine, designed explicitly for raster data, can efficiently distribute the processing load across multiple nodes. However, the database, unlike these specialized tools, is not aware of the intricate data structure of rasters. It merely views the data as a sequence of rows, making it difficult to exploit the inherent parallelism of raster operations.
Exporting Raster from SQL
When it comes to exporting raster data, SQL offers some flexibility. For instance, it supports visual formats like PNG and JPEG, along with the GeoTIFF format. Although it does not support all GDAL output formats, the flexible GeoTIFF and two visual formats usually suffice for most use-cases.
Should You Store Rasters in A Database?
Whether you should store your rasters in a database depends on your specific needs. If your rasters are non-visual data and your questions involve mapping your vectors to these rasters, it might be worthwhile. This is particularly beneficial if the questions you’re asking are ever changing, as SQL provides flexibility in handling different questions. However, if you’re only running the same query repeatedly, you might find it more efficient to create a script that accesses the raw files directly, rather than storing them in a database. Ultimately, the choice depends on your unique use-case and needs.
Dynamic Vector Tiles
Blog posts by Paul about Rasters in the Database