Anita Graser is a legendary open-source geospatial Python expert. Drawing on her extensive knowledge of the subject, she explains why Python is a great language and how we can all get started learning it. Stick around to see the benefits and learn why Python may or may not be an option for your GIS project.
Anita comes from a GIS background. She’s been working with QGIS and Python since 2008 as an integration solution to automate mapping and to look at data in different fashions, not just from the command line or in graphs but also in maps.
It wasn’t always clear that Python would be the best language to learn for GIS, not until ArcPy and PyQGIS came out around 12 years ago. These two Python interfaces showed us that Python is versatile, easy to learn, and well suited to manipulating data. And who in the GIS world wouldn’t want a flexible tool for wrangling data from a file or a database into something usable? Python does precisely that.
It’s also easy to interface Python with PostgreSQL and PostGIS, and from there the possibilities for automating workflows with scripts are endless. If you use a model builder, for example, you can export models as Python scripts or write them from scratch, whichever workflow you prefer. There is also plenty of opportunity to build extensions for desktop GIS and server-side GIS applications using Python plugins, in open-source as well as proprietary systems. There are many reasons why Python is now the universal language of GIS: it’s the glue that holds things together.
Once you know Python and realize its usefulness for geospatial data manipulation, you are no longer just pushing the buttons provided to you. You are in control and have the freedom to create your own tools and processes. Code also has an element of self-documentation that’s hard to find elsewhere. You can’t forget to document a certain parameter when you’re writing code, and you can look it up later if you need to go back. This is helpful when you inherit someone else’s workflow.
Python is widely adopted in the geospatial world, and as such, geospatial processes written in Python are shareable and repeatable. There may be environment variables that need to be tweaked and data that also needs to be shared, but it is possible to share your work and let others use your code and build on top of it.
If you already know some programming language, it’s possible to get into geospatial and apply Python specifics as you go because it’s not a hard language to learn.
If you don’t have a programming background, you’d be smart to cover the basics first, such as loops, functions, and classes.
In both cases, most users, especially GIS people, do better if they have geospatial-specific motivation and inspiration. They want to see something on a map, really quickly. They want their first steps into this new, unknown territory to be related to what they do in geospatial.
A good introduction to writing Python code is to create a model in a graphical model builder and then export it as a Python script. You can play around with the data and the different parameters you feed into the script and see how they affect the outcome.
This also gives you an understanding of how Python code is structured and how the different components are chained together.
When people see that they’re not tied to the standard tools in the graphical interface, they realize how flexible programming is and how much they can get out of a model builder. This is real motivation.
Model builder scripts are only the first step. Once you start executing things outside of the program, like manipulating parameters, you’ll come across things you can’t solve quickly with a model builder. Knowing Python and how you can program something from scratch is a great motivation.
GeoPandas is a relatively new, open-source library that adds spatial capabilities to another library called Pandas. Pandas has been around since 2008 and was designed to make data analysis easy.
Pandas uses a concept called data frames. They’re tables of data, or time series of data if indexed by timestamp. Pandas acts like a database in that it uses indexes to filter the data.
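To make the idea of a timestamp-indexed data frame concrete, here is a minimal sketch. The column name and the values are made up for illustration and are not from Anita’s work.

```python
import pandas as pd

# Hypothetical daily temperature readings (values are invented).
timestamps = pd.date_range("2021-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {"temperature": [10.5, 11.0, 9.8, 10.1, 12.3]},
    index=timestamps,
)
df.index.name = "timestamp"

# Slicing by timestamp label uses the index, similar to querying
# an indexed column in a database. Both end points are included.
two_days = df.loc["2021-01-02":"2021-01-03"]

# Filtering on a column value works with a boolean condition.
warm_days = df[df["temperature"] > 10.2]

print(two_days)
print(warm_days)
```

Because the table is indexed by timestamp, it behaves like a time series, and date-based selections need no extra parsing.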
It comes with convenient functions to read and write files with missing numbers. If you have null values (no measurements have been recorded in a time series, for example), Pandas gives you options to calculate values for those rows or correctly interpret the null value in the same way a database would.
This could be the last observed value or the interpolation between the previously observed value and the following value that’s in the data set. Who doesn’t want these functions when working with real-world data?
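The two gap-filling strategies described above can be sketched roughly as follows, assuming a small, invented time series with two missing measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings; NaN marks days without a measurement.
index = pd.date_range("2021-01-01", periods=5, freq="D")
readings = pd.Series(
    [10.0, np.nan, np.nan, 16.0, 18.0], index=index, name="temperature"
)

# Option 1: carry the last observed value forward into the gaps.
last_observed = readings.ffill()

# Option 2: interpolate between the previous and the following
# observation, weighted by the time between them.
interpolated = readings.interpolate(method="time")

# Missing values are skipped by default in aggregations,
# similar to how a database treats NULL.
mean_of_observed = readings.mean()

print(last_observed)
print(interpolated)
print(mean_of_observed)
```

Which strategy is appropriate depends on the data: carrying values forward suits readings that change in steps, while interpolation suits quantities that change gradually.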
The Pandas library also comes with the ability to pivot and reshape tables, group data, merge tables, and plot.
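As a rough illustration of grouping, pivoting, and merging, here is a short sketch. The station names, cities, and values are hypothetical.

```python
import pandas as pd

# Hypothetical observations in "long" format: one row per station and day.
observations = pd.DataFrame(
    {
        "station": ["A", "A", "B", "B"],
        "day": ["Mon", "Tue", "Mon", "Tue"],
        "temperature": [10.0, 12.0, 8.0, 9.0],
    }
)
# Hypothetical lookup table with one row per station.
stations = pd.DataFrame({"station": ["A", "B"], "city": ["Vienna", "Graz"]})

# Group: mean temperature per station.
mean_by_station = observations.groupby("station")["temperature"].mean()

# Pivot: reshape to one row per station and one column per day.
wide = observations.pivot(index="station", columns="day", values="temperature")

# Merge: attach the city to each observation, like a SQL left join.
merged = observations.merge(stations, on="station", how="left")

print(mean_by_station)
print(wide)
print(merged)
```

Plotting is left out here to keep the sketch free of extra dependencies; the `plot` method on a Series or DataFrame additionally requires Matplotlib.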
There’s a lot you can then do in Python that generally requires a database. You can write a standalone script and no longer depend on a database or having to carry out your data analysis in cookie-cutter ways.
In 2013, GeoPandas entered the scene and made it possible to store geometries in data frames (much like Postgres and PostGIS) by building on Pandas and on existing geospatial libraries such as Fiona, which reads and writes geospatial file formats.
GeoPandas is a fantastic tool for geospatial programmers because it’s easy to write standalone code that can be used outside of the typical desktop GIS environment. It’s also a good choice for non-GIS programmers who are familiar with Pandas, and it makes it easier to build geospatial capabilities into existing Python codebases without the need to install desktop environments like QGIS or ArcGIS.
Good programmers take what’s working (GeoPandas) and build on top of it or extend it. There is no need to reinvent the wheel every time. Use what’s already working and build a component yourself that will solve your particular problem. If Fiona has been reading your geospatial file formats for years, integrate that. Assemble compatible modules, but keep in mind that modules evolve and versions keep changing, so remember to check their compatibility!
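A minimal sketch of what such standalone code can look like is shown below. The city coordinates are approximate, and the choice of UTM zone 33N (EPSG:32633) as the metric coordinate system is an assumption made for this example.

```python
import geopandas as gpd
import pandas as pd

# An ordinary Pandas table with approximate coordinates (illustrative only).
cities = pd.DataFrame(
    {
        "name": ["Vienna", "Graz"],
        "lon": [16.37, 15.44],
        "lat": [48.21, 47.07],
    }
)

# Turn it into a GeoDataFrame by adding a geometry column of points.
gdf = gpd.GeoDataFrame(
    cities,
    geometry=gpd.points_from_xy(cities["lon"], cities["lat"]),
    crs="EPSG:4326",  # longitude/latitude in WGS 84
)

# Reproject to a metric coordinate system so distances come out in metres.
gdf_utm = gdf.to_crs(epsg=32633)

# Planar distance between the two points, converted to kilometres.
vienna = gdf_utm.geometry.iloc[0]
graz = gdf_utm.geometry.iloc[1]
distance_km = vienna.distance(graz) / 1000

print(f"Approximate distance: {distance_km:.0f} km")
```

The GeoDataFrame still behaves like a regular Pandas data frame, so filtering, grouping, and merging work as usual, with geometry-aware methods available on top.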
You should always follow the installation instructions of the respective library you use. Its maintainers know the current working configuration best. In the case of GeoPandas, use the Conda installation. (Python installations come with pip for installing packages. Pip, however, doesn’t work with some of the GeoPandas dependencies, particularly on Windows.)
Conda is therefore recommended by GeoPandas to cover all major operating systems. You can run Conda from the command line or use a desktop application, Anaconda, with a graphical user interface. It lists available packages, and you can click the ones you want. It will automatically resolve dependencies and install the correct versions to ensure a working environment.
Once you’ve done your setup, Anaconda offers multiple IDEs (integrated development environments) and editors. Spyder and PyCharm are two options: Spyder is free, and PyCharm has a free community edition. PyCharm has the advantage of sharing its layout with IntelliJ, a popular Java IDE that Java developers are familiar with. It has convenient functions for refactoring, which make it easier to keep code readable and self-explanatory.
Python has proven to be a reliable companion to data scientists from various backgrounds. Libraries like GeoPandas bridge the gap between non-spatial data scientists and people with geospatial expertise. They can work together on integrating spatial analysis capabilities with the machine learning, deep learning, and AI that most data scientists work with.
For research, there is considerable potential to improve reproducibility, particularly with technologies such as Jupyter Notebooks. You can record and analyze step by step and show the intermediate results and the plots you might generate for a report or for a scientific paper in the context of that code.
In the past, you wrote a script, you ran it, and it dumped images into a directory. You then had to look at both the code and the output directory to find a figure, decide if it made sense, and work out what was going on.
In Jupyter Notebooks, you execute one part of the notebook, called a cell, and the output appears immediately under that cell. It can be text or interactive graphics, such as a Leaflet map or a plot. This makes it easier to debug issues and understand the flow of the data analysis. If you’ve ever had the honor of inheriting someone else’s data processing workflow, you’ll appreciate being able to step through and manage the code this way.
Python’s popularity is still on the rise, and there aren’t many contenders on the horizon. There is something for everyone in Python. It’s easy to get into as a beginner, and it can be efficient, especially if you write the performance-critical parts in Cython, which compiles to C under the hood and is much more performant. Once you get into Python, there aren’t too many reasons why you’d want to abandon it.
People from the Java community, and people who work in Big Data settings (Hadoop and Spark), have started to build a bridge to Python. PySpark allows Python to interface with these Java Virtual Machine worlds and Big Data settings. Python will be around for a long time, and I encourage people to learn it.
If you’re working with a pre-established system that’s built on a Java-based language, it’s not recommended that you introduce this interface without a valid reason to do so. You’d be better off sticking to the Java world, which has libraries for geospatial use such as GeoTools. Mixing and matching languages without need isn’t a good idea.
If you are starting from scratch, and your work is related to data science, then use Python.
Scala is efficient and advanced. Knowing Scala and Java is immensely helpful – they are related and can be used in combination with each other. Either of those would be able to solve challenges for large datasets that need to be manipulated effectively in distributed computing environments.
If you work with movement data, you need a specific tool. If you have vehicles, people, or goods that move, and you need to track them or analyze their data, there is a library called MovingPandas that you should turn to.
Do you feel inspired to start learning Python? Would you be able to use the flexibility it offers in what you do on a day-to-day basis? I’d love to know your take on Anita’s advice.