How to Use DuckDB to Access Parquet Files for Geospatial Data Analysis

Introduction

Are you working with geospatial data and looking for an efficient way to query Parquet files stored on S3? Look no further! DuckDB offers a powerful solution for querying large datasets, and it’s particularly useful for geospatial data analysis. This blog post is inspired by a YouTube tutorial from “Learn Data with Mark,” which provides an excellent walkthrough on how to use DuckDB to query Parquet files on S3. Let’s dive in!

For more geospatial goodness, check out our podcast!

What is DuckDB?

DuckDB is an in-process analytical database designed to be easy to set up and use. It’s particularly useful for data scientists and analysts who need to run complex queries on large datasets. For those dealing with geospatial data, DuckDB can be a game-changer.

Setting Up DuckDB

  1. Download and Install: First, download and install the DuckDB CLI on your system.
  2. Initialize: Start the DuckDB CLI with a database file so that anything you create is persisted to disk, for example, mark.doc.db.
duckdb mark.doc.db

Installing Necessary Extensions

Before you can query Parquet files on S3, you need to install and load the httpfs extension, which lets DuckDB read files that are not stored locally.

INSTALL httpfs;
LOAD httpfs;

Configuring S3 Credentials

To access S3, you’ll need to configure your credentials. Ensure that your IAM policy grants s3:ListBucket and s3:GetObject access.

-- Replace with your actual credentials
SET s3_access_key_id='your-access-key';
SET s3_secret_access_key='your-secret-key';
-- If your bucket is not in the default region, set it explicitly
SET s3_region='your-region';

Querying Geospatial Data

Now that you’re all set up, let’s write a query to count the number of geospatial records in a Parquet file for a specific month, say June 2022.

SELECT COUNT(*) FROM 's3://your-bucket/your-geospatial-data-for-June-2022.parquet';

Two Approaches to Data Querying

Creating a Table

If you’re going to run multiple queries on the same dataset, it’s efficient to create a table, which makes a local copy of the data on your disk.

CREATE TABLE local_geospatial_data AS SELECT * FROM 's3://your-bucket/your-geospatial-data-for-June-2022.parquet';
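
Subsequent queries then run against the local copy rather than S3. For example, repeating the count from earlier against the table created above:

SELECT COUNT(*) FROM local_geospatial_data;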

Creating a View

For ad-hoc queries, creating a view is more flexible. It doesn’t pull any data but creates a wrapper around the query.

CREATE VIEW geospatial_view AS SELECT * FROM 's3://your-bucket/your-geospatial-data-for-June-2022.parquet';
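
Queries against the view look exactly the same, but keep in mind that each one re-reads the underlying Parquet file from S3:

SELECT COUNT(*) FROM geospatial_view;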

Performance Considerations

  • Local Table: Queries run faster but consume disk space.
  • View: More flexible for ad-hoc queries, but every query re-reads the data from S3, so full scans can be slower.

Conclusion

DuckDB offers a robust and efficient way to query Parquet files on S3, making it an excellent choice for geospatial data analysis. Whether you opt for creating a table or a view depends on your specific needs and the nature of your queries. Either way, DuckDB provides a powerful toolset for your geospatial data analysis needs.

Happy querying!

Frequently asked questions about accessing Parquet files using DuckDB:

1. How Do I Install DuckDB?

  • Answer: You can install DuckDB using various package managers. For Python, you can use pip:
  pip install duckdb

For other languages and operating systems, you can find installation instructions on the official DuckDB GitHub page.

2. How to Read a Parquet File from S3?

  • Answer: To read a Parquet file from S3, you first need to set your S3 credentials in DuckDB. Then you can use SQL queries to read the file.
  SET s3_access_key_id='your-access-key';
  SET s3_secret_access_key='your-secret-key';
  SELECT * FROM read_parquet('s3://your-bucket/your-file.parquet');

3. What Are the Required Permissions for S3?

  • Answer: You’ll need an IAM role with s3:GetObject and s3:ListBucket permissions for the S3 bucket where your Parquet files are stored.
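
  As a rough sketch, a minimal IAM policy granting those permissions might look like the following (the bucket name is a placeholder):

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::your-bucket",
          "arn:aws:s3:::your-bucket/*"
        ]
      }
    ]
  }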

4. How to Optimize Query Performance?

  • Answer: DuckDB supports projection and predicate pushdown, which means only the columns your query selects and the rows that can satisfy your filters are read, improving performance. Make sure to filter your queries appropriately to take advantage of this feature.
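
  For example, selecting only the columns you need and filtering on a value lets DuckDB skip the rest of the file (the column names here are hypothetical):

  SELECT id, geometry
  FROM read_parquet('s3://your-bucket/your-file.parquet')
  WHERE id > 1000;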

5. Can I Write to a Parquet File?

  • Answer: Yes, DuckDB allows you to write query results back to a Parquet file using the COPY command.
  COPY (SELECT * FROM your_table) TO 'your-file.parquet' (FORMAT 'parquet');

6. How to Handle Large Parquet Files?

  • Answer: DuckDB is optimized for in-memory analytics but can also spill to disk, allowing it to handle large Parquet files. However, performance may be impacted depending on the size of the file and available system resources.
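
  If memory becomes a bottleneck, you can cap DuckDB’s memory usage and point it at a directory for temporary spill files; the values below are examples to adjust for your system:

  SET memory_limit='4GB';
  SET temp_directory='/tmp/duckdb_spill';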

7. Is Predicate Pushdown Supported?

  • Answer: Yes, DuckDB supports predicate pushdown for Parquet files. This means that only the data that satisfies the conditions of your query will be read from the Parquet file, improving query performance.
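
  You can confirm this with EXPLAIN, which shows the filter being applied as part of the Parquet scan rather than afterwards (the file path and column name are placeholders):

  EXPLAIN SELECT * FROM read_parquet('s3://your-bucket/your-file.parquet') WHERE id > 1000;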

8. How to Create a View on a Parquet File?

  • Answer: You can create an SQL view on top of a Parquet file to simplify your queries.
  CREATE VIEW my_view AS SELECT * FROM read_parquet('path/to/your/file.parquet');

9. Are Joins Supported Between Tables and Parquet Files?

  • Answer: Yes, you can perform SQL joins between a DuckDB table and a Parquet file. Just read the Parquet file into a query and use it in a JOIN operation with another table.
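
  As a minimal sketch, assuming a hypothetical local table named regions with a region_id column that also exists in the Parquet file:

  SELECT r.region_name, COUNT(*) AS records
  FROM read_parquet('s3://your-bucket/your-file.parquet') p
  JOIN regions r ON p.region_id = r.region_id
  GROUP BY r.region_name;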

10. How to Handle Nested Data in Parquet?

  • Answer: DuckDB supports querying nested structures like arrays or maps in Parquet files. You can use SQL functions like UNNEST to flatten these structures for easier querying.
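
  For instance, assuming a hypothetical tags column that holds a list of strings, UNNEST expands it into one row per element:

  SELECT id, UNNEST(tags) AS tag
  FROM read_parquet('s3://your-bucket/your-file.parquet');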

I hope these detailed answers help you in working with Parquet files using DuckDB. If you have any more questions or need further clarification, feel free to ask!


If you found this blog post useful, don’t forget to like and share it. For more tutorials and tips on geospatial data analysis, check out our podcast!

About the Author
I'm Daniel O'Donohue, the voice and creator behind The MapScaping Podcast (a podcast for the geospatial community). With a professional background as a geospatial specialist, I've spent years harnessing the power of spatial data to unravel the complexities of our world, one layer at a time.