Alex Leith is a Digital Earth Architect, and in this episode you will learn what “Infrastructure as code” is — hint: it is the opposite of the “clicky-clicky” — and so much more!
Connect with Alex here: https://auspatious.com/
Recommended listening
- Cloud Optimized Point Clouds
- Cloud Native Geospatial
- Planet Scale Tiled Maps Without A Server
- What Is Modern GIS
In Conversation
The Digital Earth Architect
Daniel: Alex, welcome to the podcast. Would you introduce yourself before we head into the conversation?
Alex: I’m a geospatial professional with a long-standing passion for open geospatial. I’ve found myself working with Earth observation data on a number of digital Earth platforms — Digital Earth Australia, Digital Earth Africa, and now Digital Earth Pacific. I’ve started a bit cheekily calling myself a Digital Earth Architect. My role is more around geospatial technology, cloud infrastructure, and architecture considerations like security and maintainability — almost a niche of information technology around big data, the digital Earth, and cloud native.
What Cloud Native Means
Daniel: Let’s start with a definition. How do you think about cloud native, and how is it different from what we had in the past?
Alex: Initially people used the cloud as a buzzword and just lifted and shifted their current ways of working onto a cloud platform. Over time people started understanding these “as a service” concepts — like a database as a service — and a whole bunch of practices came out organically once people understood how to use the cloud better. Things like using an object store to store data rather than thinking about spinning discs and file systems. Some of it requires a paradigm shift, but once you’ve got that shift in your mind it makes things much faster and simpler, even though under the hood it’s more complex — and it’s much easier to get big.
Servers as Goldfish, Not Pets
Daniel: Could you rip apart my understanding and explain how these “as a service” concepts fit in?
Alex: About 10 years ago I worked for a local council and we wanted to deploy a public web mapping system. I’d been exploring AWS at home, and I discovered it’s really easy — the pricing is just there as a shopping list, whereas a managed service provider takes a week and sends you a giant incomprehensible document. So I deployed a server, installed software manually, and learned as I went. Later I started using an autoscaling group. Instead of having a pet — a server you name and care for, and would be devastated if it died — you start thinking of servers like a goldfish: if it dies, you just put another one in the bowl and nobody notices. An autoscaling group has a recipe for what to do with a server, so each one launches, follows the recipe, and is added to the pool.
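Editor's sketch: the goldfish idea can be modelled in a few lines of Python. This is a toy model only, not the AWS Auto Scaling API; the `Server` and `AutoScalingPool` names and the recipe steps are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Server:
    server_id: int
    healthy: bool = True
    installed: list = field(default_factory=list)


class AutoScalingPool:
    """Toy model of an autoscaling group: keeps `desired` healthy servers."""

    def __init__(self, desired, recipe):
        self.desired = desired
        self.recipe = list(recipe)
        self._ids = count(1)
        self.servers = []
        self.reconcile()

    def _launch(self):
        # Every new server follows the same recipe, so none of them is special.
        server = Server(next(self._ids))
        for step in self.recipe:
            server.installed.append(step)
        return server

    def reconcile(self):
        # Drop unhealthy servers, then launch replacements up to the target.
        self.servers = [s for s in self.servers if s.healthy]
        while len(self.servers) < self.desired:
            self.servers.append(self._launch())


pool = AutoScalingPool(desired=3, recipe=["install geoserver", "copy config"])
pool.servers[0].healthy = False  # one goldfish dies
pool.reconcile()                 # another one goes in the bowl
```

After the reconcile step the pool is back to three servers, each built from the same recipe, and nothing depends on which individual server was lost.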
Alex: My self-managed GeoServer instance for that council had PostgreSQL on it, and it started having performance issues. I really didn’t want to be a DBA, so I moved to a database as a service — Amazon’s RDS — which manages backups, disaster recovery, scaling, and minor version patches. I never had a problem again with the boring part of that database. The lesson is: rather than learning how to administer a PostgreSQL server, it’s better to get a handle on combining many services together to work smarter, not harder.
Infrastructure as Code
Daniel: You mentioned scripting a server — a recipe written out as code. Walk me through infrastructure as code.
Alex: A colleague of mine used to call manual configuration the “clicky-clicky” — so don’t do clicky-clicky, put it into infrastructure as code. You write code, in something like Terraform or the AWS CDK, that defines the infrastructure you want — a network, a database, a server, DNS — as a declarative set of instructions. It slows you down at the start, but then you have a simple, repeatable environment, and you know how it’s configured because it’s there in the code. It’s not “what button did Alex click six months ago.” You can parameterize it for staging and production, and you can treat it like any other code — put it in a git repository with continuous integration and continuous deployment. A pull request process forces code review of changes to your infrastructure, and you can be confident the git repo is exactly what’s deployed in production.
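Editor's sketch: the declarative idea can be shown in plain Python without Terraform or the CDK. The `render` and `plan` functions, the resource names, and the example values below are all hypothetical. The point is only that the desired state lives in code, is parameterized per environment, and can be compared against what is actually deployed.

```python
def render(environment):
    """Return the desired infrastructure for one environment as plain data."""
    sizes = {"staging": "small", "production": "large"}
    return {
        "network": {"cidr": "10.0.0.0/16"},
        "database": {"engine": "postgres", "size": sizes[environment]},
        "server": {"count": 1 if environment == "staging" else 3},
        "dns": {"name": f"maps.{environment}.example.com"},
    }


def plan(desired, actual):
    """List the changes needed to make `actual` match `desired`."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))
        elif name not in desired:
            changes.append(("delete", name))
        elif desired[name] != actual[name]:
            changes.append(("update", name))
    return changes


# Simulate drift caused by manual "clicky-clicky" changes in staging.
deployed = render("staging")
deployed["server"] = {"count": 2}
del deployed["dns"]

changes = plan(render("staging"), deployed)
```

Here `changes` lists a missing DNS record to create and a server setting to update, which is the kind of diff a pull request reviewer would see before anything is applied.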
Containers and Kubernetes
Daniel: How does this tie into containerization?
Alex: A Docker container comes from a Dockerfile, which is a recipe: start with this operating system, install some things, copy some files. A Docker image is the at-rest version — a kind of canned environment ready to be opened up and run. The advantage is a known environment you can launch once, a thousand times, or millions of times. If you do your development work in a Docker container, your development environment becomes closer to production and staging, so it’s much harder for devs and operations to blame each other. A container shouldn’t be writing state to the file system — you store data in an object store and the database, so you can throw the container away just like the goldfish.
Daniel: And Kubernetes, how does that fit in?
Alex: The history of computing is all about abstractions: from hardware to software, from machine code to programming languages. The cloud abstracts hardware, even entire data centers, and Kubernetes is an abstraction layer for clouds. You create a Kubernetes cluster with a management layer and some compute, and you just ask it to run resources, such as a web application that publishes as a service, or a data processing workload. You can run a small container that uses half a CPU, or a big one that wants 60 CPUs and 500 GB of memory. Kubernetes abstracts the work you want to run from deploying the servers it runs on. It is ridiculously complex, but once it’s set up it’s hugely empowering.
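Editor's sketch: the separation between "the work" and "the servers" can be illustrated with a toy first-fit scheduler. This is not how the real Kubernetes scheduler works internally (it filters and scores nodes in a more sophisticated way), and the node sizes and workload names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_free: float
    mem_free_gb: float


@dataclass
class Workload:
    name: str
    cpu: float
    mem_gb: float


def schedule(workloads, nodes):
    """First-fit placement: returns {workload name: node name or None}."""
    placement = {}
    for w in workloads:
        placement[w.name] = None  # stays None if no node has room
        for n in nodes:
            if n.cpu_free >= w.cpu and n.mem_free_gb >= w.mem_gb:
                n.cpu_free -= w.cpu
                n.mem_free_gb -= w.mem_gb
                placement[w.name] = n.name
                break
    return placement


nodes = [Node("small-node", 2, 8), Node("big-node", 64, 512)]
workloads = [
    Workload("web-app", 0.5, 1),        # half a CPU
    Workload("mosaic-tile", 60, 500),   # 60 CPUs and 500 GB of memory
    Workload("second-tile", 60, 500),   # does not fit anywhere yet
]
placement = schedule(workloads, nodes)
```

The person submitting `web-app` or `mosaic-tile` only states what the workload needs. Which node it lands on, and what happens when nothing fits, is the platform's problem.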
Processing Africa at Scale
Daniel: Can you give an example of the workloads you’re running?
Alex: One workload is creating an annual mosaic. We take Sentinel-2 data and build a geomedian — in simple terms, a data-robust, cloud-free annual mosaic. For Africa, which is a fifth of the Earth’s land surface, there’s about 400 terabytes of Sentinel-2 data captured each year. For each year going back to about 2018, we process 400 terabytes over something like a thousand servers, using about 15 terabytes of memory and thousands of CPUs — and we know it cost us something like $4,000 to process a year. So we can go to senior executives and say, here’s the workflow, we think four years will cost us $16,000. It’s reliable and repeatable because we use containers, and using spot instances — bidding for unused cloud resources at about a tenth of the price — means we get that scale cheaply. It’s a small team doing this, and it’s pretty fantastic and empowering.
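Editor's sketch: the budgeting arithmetic Alex describes is simple enough to write down. The per-year cost, data volume, and spot ratio are the rounded figures quoted above; the on-demand comparison additionally assumes the whole bill is spot compute, which the conversation does not state.

```python
COST_PER_YEAR_USD = 4_000       # "something like $4,000 to process a year"
DATA_PER_YEAR_TB = 400          # Sentinel-2 over Africa, per year
ON_DEMAND_TO_SPOT_RATIO = 10    # spot at "about a tenth of the price"


def estimate(years):
    """Estimated processing cost in USD for a number of annual mosaics."""
    return years * COST_PER_YEAR_USD


four_years = estimate(4)
cost_per_tb = COST_PER_YEAR_USD / DATA_PER_YEAR_TB

# Assumption: if every dollar were spot compute, the same run at
# on-demand prices would cost roughly ten times as much.
on_demand_equivalent = COST_PER_YEAR_USD * ON_DEMAND_TO_SPOT_RATIO
```

With these figures, four years come to $16,000, which works out to about $10 per terabyte processed.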
Serverless and Event-Driven Pipelines
Daniel: Can you explain serverless functions?
Alex: Serverless has a time and a place. For the big mosaic tiles, each one can take 500 GB of memory and 20 minutes, so you can’t use serverless for that. But for smaller, well-defined units of work, it’s perfect. In Digital Earth Africa, we store a copy of Sentinel-2 and Landsat in the Cape Town data center, and we need to copy it from the source. When a scene lands in a bucket in Oregon, it creates a notification, we subscribe to it with an AWS Lambda function, and that function copies the files over and creates its own notification saying a scene has arrived in Cape Town.
Alex: That’s an event-driven framework. The really nice outcome is that a running job doesn’t need to know what happens next — it just says “I’ve finished, I’ll create a notification,” and zero or more queues subscribe to it. It separates the work from the downstream work. And we can make those notifications public — we’re building open data platforms, so if someone else is building a business on the data, they can listen to our notifications and know when there’s a new scene, without building their own infrastructure.
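Editor's sketch: the publish and subscribe pattern can be shown with an in-memory stand-in. This does not use AWS SNS, SQS, or Lambda; the `Topic` class, the message fields, and the scene key are hypothetical. It only shows that the publisher never needs to know who is listening.

```python
class Topic:
    """In-memory stand-in for a notification topic."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        # Zero or more subscribers receive the message.
        for handler in self._subscribers:
            handler(message)
        return len(self._subscribers)


oregon = Topic()      # "a scene has landed at the source"
cape_town = Topic()   # "a scene has arrived in Cape Town"
copied = []
indexed = []


def copy_scene(message):
    # Stands in for the function that copies files between regions,
    # then announces its own completion without knowing who listens.
    copied.append(message["key"])
    cape_town.publish({"key": message["key"], "region": "cape-town"})


oregon.subscribe(copy_scene)
cape_town.subscribe(lambda m: indexed.append(m["key"]))

delivered = oregon.publish({"key": "S2A_example_scene", "region": "oregon"})
```

Adding another downstream consumer is one more `subscribe` call on `cape_town`; `copy_scene` does not change.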
Cloud-Native Data Formats and Disintermediating the Data
Daniel: Do you see an advantage for non-developers in just starting to use cloud-native file formats?
Alex: It’s a really big deal. Digital Earth Africa has an Africa-wide digital elevation model stored in Cape Town, but I can add it as a URL into QGIS from here in Hobart and it renders at the Africa scale, or I can zoom right into Kilimanjaro at full resolution. It’s a 60 GB file, but I don’t have to download it; I just stream the data I need over the network. That’s the opportunity in cloud-native geospatial. With GeoParquet, I took a file with 10 million rows, gave it some path-rows, and counted all the Landsat scenes over the Pacific. It lazy-loads, so it doesn’t download the whole file, and it returned in less than a minute.
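Editor's sketch: streaming "just the data I need" rests on reading byte ranges instead of whole files. The toy below stands in for a remote file with an in-memory buffer and counts the bytes read. The fixed `TILE_SIZE` layout is an assumption for illustration; a real Cloud Optimized GeoTIFF stores its tile offsets in the file header.

```python
import io

TILE_SIZE = 1024  # bytes per tile in this toy layout (an assumption)


def range_header(offset, length):
    """HTTP Range header value for `length` bytes starting at `offset`."""
    return f"bytes={offset}-{offset + length - 1}"


class CountingReader:
    """Stands in for a remote file and counts the bytes actually read."""

    def __init__(self, data):
        self._file = io.BytesIO(data)
        self.bytes_read = 0

    def read_range(self, offset, length):
        self._file.seek(offset)
        chunk = self._file.read(length)
        self.bytes_read += len(chunk)
        return chunk


def read_tile(reader, tile_index):
    """Read one tile without touching the rest of the file."""
    return reader.read_range(tile_index * TILE_SIZE, TILE_SIZE)


remote = CountingReader(bytes(100 * TILE_SIZE))  # a 100-tile "file"
tile = read_tile(remote, 42)
header = range_header(42 * TILE_SIZE, TILE_SIZE)
```

Only one tile's worth of bytes is read from the 100-tile file, and `header` is the value a client would send in an HTTP `Range` request for the same span.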
Alex: My ridiculous term of the month is “disintermediating the data.” Rather than having OGC web services or open data portals where you learn a whole new API language, if our applications and code can go straight to the data without an intermediary, things are simpler and better. That’s what Cloud Optimized GeoTIFFs are — I can query a STAC API or a GeoParquet file to find which COGs to load, then read just the bits I need into memory. It makes the tools and code I write simpler, and it means less of other people’s servers between me and the data. So my call to action for data custodians is to consider these analysis-ready, cloud-optimized data formats — if you release your data that way, people can just stream it and use it.
Daniel: Reproducibility comes up again and again here.
Alex: There’s a throwaway statistic that when someone does a PhD, they spend 80% of their time organizing data before they can start the science. If we can remove that 80% and let them spend 100% of their time using data, that’s gold. Jupyter notebooks are a great way of doing your science iteratively, and with cloud-native formats and open APIs like STAC, I don’t need access to a supercomputer — I can run a notebook on my laptop that finds all the Sentinel-2 data over Tasmania, do an analysis, and share it with someone who can run it anywhere. That’s hugely empowering.
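Editor's sketch: a notebook that "finds all the Sentinel-2 data over Tasmania" usually starts with a STAC item search. The function below only builds the request body and sends nothing. The collection id `sentinel-2-l2a`, the approximate bounding box, and the date range are assumptions; the right values depend on the STAC API being queried.

```python
import json


def build_search(collection, bbox, start, end, limit=100):
    """Build the JSON body for a STAC API item search (POST /search)."""
    west, south, east, north = bbox
    return {
        "collections": [collection],
        "bbox": [west, south, east, north],
        "datetime": f"{start}/{end}",  # RFC 3339 interval
        "limit": limit,
    }


# Rough bounding box around Tasmania (approximate, for illustration only).
TASMANIA_BBOX = [143.5, -44.0, 149.0, -39.0]

body = build_search(
    "sentinel-2-l2a",  # hypothetical collection id
    TASMANIA_BBOX,
    "2023-01-01T00:00:00Z",
    "2023-12-31T23:59:59Z",
)
payload = json.dumps(body)
```

Because the search is plain data, it can be shared alongside the notebook and rerun by anyone with network access to the same API.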

