Spatial Databases: Distributed Database Systems
In this episode, the topic is distributed database systems. The talk covers what distributed databases are, how they work, and why they matter. While this sounds like a technical topic, the show eases you into it, and the guest does an amazing job of making the concepts digestible and easy to understand.
About The Guest
Mo Sarwat is an expert in enterprise-scale geospatial systems and an associate professor in computer science at Arizona State University. He is the creator of Apache Sedona, an enterprise-scale geospatial data system. His work focuses on how to build robust software systems that can manage, query, and interact with geospatial data at an enterprise scale.
What Is A Distributed Database?
In a distributed database, data is partitioned or replicated and stored in multiple locations. This differs from centralized databases that store data at one ‘central’ location and have one server that handles all queries on the data. Distributed databases have multiple data centers and multiple servers that are distributed over a network.
A key property of distributed databases is transparency: the details of the distribution are hidden from the user of the system. Applications running on a distributed database still run the same queries and workloads as they would on a centralized database. The only difference is that the distributed system internally takes each query and distributes its execution based on where the partitioned data resides.
What Is Parallel Processing In Distributed Databases?
With parallel processing, a task is split into smaller subtasks that can be executed independently across the multiple servers of a distributed database. Parallelizing the execution of a task across multiple compute nodes results in faster and more efficient execution. Splitting a task into 10 smaller parts that can be solved simultaneously can yield up to a 10-times speedup, and execution gets faster still as more compute nodes are added to the network.
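To make this concrete, here is a minimal Python sketch (the function names are my own, not from any library) that splits a summation into independent chunks and runs them on a worker pool. The threads stand in for the servers of a distributed system; in a real deployment each chunk would live on a different machine.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(records):
    # Stand-in for real per-node work: sum one partition's records.
    return sum(records)

def parallel_sum(records, n_workers=4):
    # Split the task into n_workers independent subtasks and run them
    # concurrently; each worker plays the role of one server in the network.
    chunks = [records[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(process_partition, chunks)
    # Combine the partial results into the final answer.
    return sum(partials)
```

Calling `parallel_sum(list(range(1000)))` returns the same result as a sequential `sum`, but each worker only touches a quarter of the records.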
What Is Load Balancing?
In load balancing, every compute node that is responsible for one partition of a distributed database should do the same amount of work as the others in the network. Ideally, this means that each node in the distributed network should store the same number of records.
One way to achieve this is to store incoming records in a round-robin fashion across the nodes, so that each node ends up with roughly the same number of records.
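As a quick illustration, round-robin assignment can be sketched in a few lines of Python (the function name is hypothetical):

```python
def round_robin_partition(records, n_nodes):
    # Record i goes to node i mod n_nodes, so node sizes differ by at most one.
    partitions = [[] for _ in range(n_nodes)]
    for i, record in enumerate(records):
        partitions[i % n_nodes].append(record)
    return partitions
```

Distributing 10 records across 3 nodes this way yields partitions of 4, 3, and 3 records, so no node holds more than one record above its fair share.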
With geospatial data, it is preferable to use a partitioning scheme that takes into account the geospatial distribution of the data. This is beneficial when running geospatial queries because records that are spatially close to each other will live on the same server and do not need to be pulled from multiple servers. The problem is that geospatial data is hardly ever uniform. A small geographical area such as a big city may have a far higher concentration of geospatial records than a vast desert area. This data skew unbalances the workload because some partitions end up with a lot of data while others have very little.
In parallel query processing, the partition with less data finishes its work quickly because it has little to process, while the overloaded partitions become the bottleneck. When partitioning geospatial records, the method used should therefore account for both data skew and spatial proximity. Otherwise, running parallel queries on the data will result in terrible performance.
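To make the skew problem concrete, here is a minimal Python sketch (all names are hypothetical, not from any library). A fixed-grid partitioner piles a dense "city" cluster into a single cell, while a simple equal-count split along one axis, loosely in the spirit of skew-aware spatial partitioners, keeps partition sizes balanced without scattering nearby points.

```python
def grid_partition(points, cell=1.0):
    # Fixed grid: each point goes to the cell that contains it.
    # Dense areas overload one cell; sparse areas leave cells nearly empty.
    buckets = {}
    for x, y in points:
        key = (int(x // cell), int(y // cell))
        buckets.setdefault(key, []).append((x, y))
    return buckets

def equal_count_partition(points, n_parts):
    # 1-D sketch of a skew-aware partitioner: sort by location, then cut
    # into chunks with (nearly) equal record counts, so nearby points stay
    # together and every partition gets a similar amount of work.
    pts = sorted(points)
    size = -(-len(pts) // n_parts)  # ceiling division
    return [pts[i:i + size] for i in range(0, len(pts), size)]

# A dense "city" cluster plus two sparse "desert" points:
city = [(0.1 * i, 0.1 * i) for i in range(9)]
points = city + [(10.0, 10.0), (20.0, 20.0)]
```

With this sample data, the fixed grid puts 9 of the 11 records in a single cell, while the equal-count split yields balanced partitions of 4, 4, and 3 records.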
What Are The Benefits Of A Distributed Database?
A distributed database system brings several benefits to a network. One of them is ensuring the availability of the system. Let’s take Uber as an example. Uber receives tons of requests per second. If all customer and driver data is stored in one location and served only through a single node, the compute power of such a node may not be able to service all the requests that it receives.
The system may become unavailable because a single node can only handle a limited number of requests per second. But with a distributed database system, the network benefits from higher total compute power to process the requests, which boosts the availability of the system.
Using a distributed database to store data close to where it will be used greatly boosts the performance of a system. For Uber, all the driver and customer data for a city will be stored on the same server because it will be accessed by users in that city. Since data is close to where it is being used, the system's performance will be better. Additionally, running batch analytics on the data benefits from the parallel processing capabilities of distributed databases: parallelizing the execution across different nodes scales the analytics that are run on the data.
What Is Apache Sedona?
Apache Sedona is an open-source project under the Apache Software Foundation. It's designed to enable scalable and efficient analysis of large datasets and provides tools for working with geospatial data in a distributed computing environment. If you are running spatial SQL queries that do analytics on data stored in a centralized database or a data lake, but you wish to run the analytics at scale in a distributed system, Apache Sedona can help you achieve that. When the data is loaded into Sedona, it is repartitioned based on its geospatial properties, which lets the analytics benefit from the parallel processing of distributed database systems.
Why Do Enterprises Have A Lot Of Unused Data?
According to a 2022 Gartner report, 97% of data collected in enterprises remains unused. This is quite ironic because there is a notion today that data is the new oil. However, being able to derive value from only 3% of the data does not mean that the other 97% has no value. There is still a lot of value to be extracted from the unused data. The big challenge could be that enterprises and organizations lack the tools to put the data into the right context and derive business value from it.
Similar to oil, which is only beneficial when put to use in the right context, placing data in the right context makes a big difference. Assigning geospatial and temporal aspects to data can open possibilities for deriving business value because you can then tell a story with the data.
For so long, the geospatial community has focused on GIS technology. But alongside these efforts, the geospatial community can make a greater impact by helping organizations to put their data into the right context. In that way, there will be less unused data, and more usable and valuable data.
During the conversation, we mention PostgreSQL and PostGIS a few times, which are topics we have covered in previous podcast episodes:
Servicing Dynamic Vector Tiles from PostGIS
An introduction to PostgreSQL
Toward the end of the conversation, we touch on the idea of cloud-native geospatial formats. If you are interested in understanding this concept, you might find these two previous episodes helpful:
Cloud Optimized Point Clouds
Cloud Native Geospatial
If you have any questions or comments, please feel free to reach out! I would love to hear from you.