Derek studied architectural design and civil and structural engineering at Stanford. He got into data analytics, urban planning, and policy after finishing his graduate degree. He was hired back as a lecturer to increase community engagement, research, and education in the San Francisco Bay Area. Over the past ten weeks of the academic quarter, he’s been focusing on COVID-19 rapid response with local community and government partners around the Stanford area.
WHAT IS THE SAFEGRAPH DATA CONSORTIUM?
SafeGraph is a data vendor whose data sets are broadly available to customers. For the COVID-19 crisis, the SafeGraph team put together a data consortium of researchers and academics to use their data to produce intelligence for COVID-19 response. There’s also a Slack workspace where academics discuss daily what kinds of additional data sets and features they’d like to see released regularly.
The consortium members share work in progress and discuss the research itself directly with one another. It’s wonderful to see the result of open and transparent discussion amongst academics and between academics and a data vendor. All this will improve the way this data can be leveraged to understand how human movement and interactions might relate to COVID-19 exposure and spread in our communities.
The data consortium gives us an access key to free data. Some data products they provide through the consortium are the same as what can be purchased publicly on their data portal. But some unique data sets have been explicitly refined for understanding COVID-19. At the end of this process, they will have turned that back around and made it a general product available to the public.
WHAT’S THE DATA LIKE?
The SafeGraph data can be described as two separate products.
The first product is called places. On its own, it is a self-contained, compiled, and refined data set of places where economic transactions happen in the US. In the Bay Area, for example, we’ve got over 100,000 individually tagged businesses, retail establishments, and parks. For the typical geospatial user, the places come with well-geocoded latitude-longitude coordinates, plus building footprints and boundaries that are continually refined over time. Regardless of what you might want to know about how visits happen to these places, the places data set alone is already valuable for practical and research work.
Places stands as its own product and is something that the research community made use of even pre-COVID-19. It helps us understand what business development and accessibility to different amenities look like in the Bay Area.
The layer on top of that, and the one that gets us excited, is called patterns data. It’s a large weekly or monthly data set that you can download. Each row in the data set is a place. For example, a Starbucks will show up in every weekly patterns data set if it exists in the Bay Area. The fields for that location hold counts of devices, drawn from the cell phone movement data SafeGraph has access to, that visited the specific location.
The visit data is disaggregated by a few different qualities and characteristics. For example, you can break it down into hourly and daily visits, and you can break it down by the origins of visitors. If you see 100 visits to Starbucks, there will be a JSON object field that shows block group IDs for the origin neighborhoods, where those visitors came from, and it will break up that number 100 into smaller numbers.
Maybe ten of those came from a specific block group that you can find in GIS, and 12 came from another one. Understanding the movement from point A to point B, especially for those who want to think about the travel that happened, helps to develop an origin-destination matrix. It also gives information about the device pool that SafeGraph has, which of course, isn’t 100% of a population.
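To make the origin breakdown concrete, here is a minimal sketch of how a JSON origin field could be turned into an origin-destination table. The place IDs, block group IDs, and counts are made up for illustration, and the row layout is an assumption rather than SafeGraph’s actual schema.

```python
import json
from collections import defaultdict

# Hypothetical rows: (place_id, JSON string of origin block group -> device count).
# All identifiers and counts here are illustrative, not real SafeGraph values.
rows = [
    ("starbucks_001", '{"060855116081": 10, "060855116082": 12}'),
    ("grocery_002", '{"060855116081": 7}'),
]


def build_od_matrix(rows):
    """Return {origin_block_group: {place_id: device_count}}."""
    od = defaultdict(dict)
    for place_id, origins_json in rows:
        origins = json.loads(origins_json)  # parse the JSON object field
        for block_group, count in origins.items():
            od[block_group][place_id] = od[block_group].get(place_id, 0) + count
    return dict(od)


od_matrix = build_od_matrix(rows)
for origin, destinations in od_matrix.items():
    print(origin, destinations)
```

Each outer key is an origin neighborhood and each inner key is a destination place, which is the shape an origin-destination analysis would start from.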
But SafeGraph lets us know how big that pool is. If we look at the US overall, it’s roughly a ten-to-one ratio. If there are 300 million or so people in the US, then for any given month of data that SafeGraph provides, they’re looking at about 30 million devices. Academics can use those ratios to extrapolate true visit counts out of the limited sample that we see as numbers in the data set.
As a result, both research and practical work benefit from a reliably maintained and produced data set of places, along with the visit counts to and from those places.
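The extrapolation is simple arithmetic. A minimal sketch, assuming a uniform sampling rate across the population (a simplification; the function name is ours, not part of any SafeGraph tooling):

```python
def extrapolate_visits(sample_visits, population, devices_in_sample):
    """Scale a sampled visit count up to an estimated population-wide count.

    Assumes every person is equally likely to be in the device pool,
    which is a simplification of how real samples behave.
    """
    if devices_in_sample <= 0:
        raise ValueError("devices_in_sample must be positive")
    return sample_visits * population / devices_in_sample


# Using the round numbers from the text: 300 million people, 30 million devices.
estimated = extrapolate_visits(100, 300_000_000, 30_000_000)
print(estimated)
```

With the ten-to-one ratio, 100 observed visits correspond to an estimate of about 1,000 actual visits.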
All those pieces coming together means that we can build a record of where people move around in a specific locality. For COVID-19, we can then talk about the public health and community spread implications of that movement.
CAN WE EXTRAPOLATE OUT AND TALK ABOUT THE ENTIRE US?
The consortium provides the data as a US-wide data set, which is substantial.
In the course environment, our work is predominantly focused on the nine counties of the Bay Area. As a teaching team, we provide the students with a processing step. We remove the data for the rest of the US and bring the data set down in size to the places in the Bay Area and the possible origins of visitors to those places. We do this so it’s easier to pass the data back and forth on our servers and do analysis. But the original data already covers the full US.
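One way to sketch that filtering step: a census block group ID begins with a two-digit state code and a three-digit county code, so the first five characters identify the county. The field name `place_block_group` and the sample rows are hypothetical, and this is not the teaching team’s actual script.

```python
# County FIPS codes for the nine Bay Area counties (state 06 = California).
BAY_AREA_COUNTY_FIPS = {
    "06001",  # Alameda
    "06013",  # Contra Costa
    "06041",  # Marin
    "06055",  # Napa
    "06075",  # San Francisco
    "06081",  # San Mateo
    "06085",  # Santa Clara
    "06095",  # Solano
    "06097",  # Sonoma
}


def in_bay_area(block_group_id):
    """True if the block group ID starts with a Bay Area county code."""
    return block_group_id[:5] in BAY_AREA_COUNTY_FIPS


# Hypothetical rows; "place_block_group" is an assumed field name.
rows = [
    {"place_id": "a", "place_block_group": "060855116081"},  # Santa Clara County
    {"place_id": "b", "place_block_group": "360610001001"},  # New York County
]

# Keep only places located in the Bay Area; visitor origins are left untouched.
bay_area_rows = [r for r in rows if in_bay_area(r["place_block_group"])]
print([r["place_id"] for r in bay_area_rows])
```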
I agree that this data unlocks this consideration of urban planning and COVID-19 response questions instantly for the whole US.
IS THE SIMPLE SOLUTION TRACKING DATA?
For most people, when they think about the relationship between GIS and COVID-19, the obvious solution is tracking data.
At the highest level, if there is any spatial relationship between where we are in our societies, where we go, who we interact with, and how the disease spreads, then our measurements and our policy tools must also have a geospatial component to capture that effect. It’s not a question of whether spatial data and spatial analysis matter for COVID-19. The problem is understanding exactly which tools can be used for specific insights and decisions.
It probably matters that a specific location is of a certain size, and that cramming 100 people into that bar or that grocery store has a different outcome than having just ten people there through careful social distancing compliance. Likewise, whether ten different census block groups frequent the same grocery store, or whether it’s a local corner grocery store visited by just one census block group, is likely to reflect a degree of mixing that can also intensify disease spread.
These are precisely the kinds of measurements and observations that can be directly made using the SafeGraph data. In its absence, you may only be able to postulate about questions of concentration of people in space and time and across different census block groups. SafeGraph allows us to observe what happened in specific localities and then potentially do the work to connect that to case growth we see a few weeks later.
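As a rough illustration of a mixing measurement, the sketch below summarizes one place’s origin breakdown by how many block groups visit it and how dominant the largest one is. This indicator is our own simple assumption for illustration, not a metric defined by SafeGraph or the consortium, and the counts are made up.

```python
def mixing_summary(visitor_origins):
    """Summarize how many neighborhoods mix at a place.

    visitor_origins maps origin block group ID -> device count.
    Returns the number of distinct origins and the share of visits
    that come from the single largest origin.
    """
    total = sum(visitor_origins.values())
    if total == 0:
        return {"n_origins": 0, "top_share": 0.0}
    return {
        "n_origins": len(visitor_origins),
        "top_share": max(visitor_origins.values()) / total,
    }


# Illustrative counts only.
corner_store = {"bg_A": 40}
supermarket = {f"bg_{i}": 10 for i in range(10)}

print(mixing_summary(corner_store))
print(mixing_summary(supermarket))
```

A place with many origins and a low top share draws people from many neighborhoods at once, which is the kind of mixing described above.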
From our perspective, and the perspective of many consortium members, we’ve been seeing promising results in research that connects SafeGraph movement data, and specific indicators constructed from it, with the case growth we’re seeing in county health data across the US.
ARE DENSITY AND MOVEMENT OF PEOPLE EQUALLY IMPORTANT IN TERMS OF THE VIRUS?
The visit locations that SafeGraph data covers tend to be indoor locations.
It is still a hypothesis. But we would all agree that, from what we’ve seen of stories about COVID-19, indoor transmission is likely to be greater than outdoor transmission because of how particles behave in an air-conditioned or ventilated space compared to the free-flowing air outside. It may just be coincidental that the view SafeGraph provides is of human movement in the environments that are the most important from a disease transmission perspective.
That is still, of course, a hypothesis. The SafeGraph data doesn’t help us track the route people took from point A to point B. We only know where they might have started from and what time of day they showed up at a Starbucks. There could be many important factors for the route itself.
For example, if this is a place like San Francisco where public transit is often used, it could be that buses and trains are important vehicles for disease spread. That is invisible to us in terms of SafeGraph data. It would rely on us bringing in other kinds of insights, like getting ridership data from a local transit agency or doing the network analysis to infer the potential modes of travel.
It’s fair to establish here, and it’s certainly come up in the consortium, that we can get some view into the questions of the relationship between human movement and disease spread, but it tends to be concentrated on indoor establishments, which is what SafeGraph data provides. Whether or not we like it, that’s what we have. A lot of us feel confident that what we have is getting at the core of what is risky human interaction.
WHAT WOULD BE THE PERFECT DATA TO WORK WITH?
Let’s try a brief thought experiment here. Let’s say we could have the perfect data, tracking the entire population 100% of the time down to millimeter accuracy. Would that be a silver bullet then? Can we solve the problem? Can we answer the questions? Or are we still left with other challenges?
Contact tracing tools will eventually be able to use Bluetooth data, or other forms of cell phone data, to figure out the full visit pattern of somebody who ended up getting the disease.
Maybe we could call that a bronze bullet here. There is so much more insight that can be gained once that kind of data becomes available to analysts than what we’re talking about with SafeGraph here.
SafeGraph is an incredible proxy, but one of the practical problems with it, besides it just being a sample set, is that it tends to provide a lot of the movement behavior, aggregated up to the entire size of the business establishment or a census block group. The problem you run into there is an ecological inference problem, where you only know averages or summaries of data for a larger group of people.
But the coronavirus works directly at the person-to-person level. Ultimately, we’re unable to track individual people longitudinally using a data set like SafeGraph, and probably for a very good reason. Contact tracing, with all the privacy considerations attached to it, would be a game-changer from my perspective, because it would let us see the true individual movement and interaction behavior at that fine level of granularity.
It’s a bronze bullet, in our perfect data experiment, because it will certainly include gaps in the full understanding.
For example, contact tracing doesn’t necessarily sound to me like it’s entirely GIS data. It may take the form of Bluetooth distance between your device and other devices that also have the app and then create a trace of who was within your vicinity in the recent past. This is very useful for the act of contact tracing, but that could potentially ignore the geospatial implications of where those interactions happened, whether they were in a tight indoor establishment or an outdoor park. Knowing what the trace of these movements was exactly in a geospatial grid would clue us into other characteristics of the urban environment that may be important to understand.
A view of contact tracing won’t tell us that the access people have to backyard space for recreation is potentially a considerable driver of how likely they are to go outside to parks, or into streets, to get that recreational engagement. The important systems change there is to think about public or private access to open space. You wouldn’t get that from any of these bullets we’re talking about; it takes a systems perspective.
Even in a thought experiment of perfect GIS knowledge, there can still be missing pieces of a holistic systems understanding of all the levers and factors that affect human wellbeing, human movement, and, in this case, public health.
WHY IS IT SO DIFFICULT TO LINK MOVEMENT IN SPACE WITH THE SPREAD OF DISEASE?
A lot of the research we want to do with SafeGraph data is confidence-building: establishing that there is, in fact, a relationship between human movement in space and disease spread. But constructing answers to those questions assumes not only that you have useful movement data, which we think we have from SafeGraph to some level of detail, but that you have case data as well.
We need data about the disease spread.
That’s a whole other can of worms. SafeGraph cannot provide us the solution to that problem. In the Bay Area, we have nine counties that are reasonably good in terms of data access. What you’ll find right now (speaking at the end of June 2020) is dashboards from individual county health departments showing daily data about cumulative cases down to the zip code level. In other views on the dashboard, you get breakdowns by race, ethnicity, age, and so forth. The zip code level daily data is promising for bringing that case data, our understanding of disease spread, in line with the level of granularity of SafeGraph data.
If there is something that matters in terms of movement, we can trace that movement back to specific census block groups, at least from SafeGraph’s perspective. Through analysis of the patterns data, we can then say that this census block group had this many visits to these types of establishments in a given week and dwelled there for this amount of time. Then the signal we want to see or measure for that census block group is what case growth and testing looked like some weeks into the future.
In many places in the US, and possibly elsewhere even more so, you can’t get data about case behavior and case growth at anything lower than the county level. You then have a geospatial disconnect between county-level aggregations and much richer SafeGraph data about places and census block groups. No matter how good SafeGraph data gets and how many of these silver bullets we can get, if we can’t link that back to the same granularity of case outcomes, then we’ll always have that disconnect and an inability to refine specific questions.
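Pairing a week of movement with case growth some weeks later can be sketched as below. The lag length, the weekly layout, and all numbers are assumptions for illustration; as noted elsewhere in this conversation, how long cases take to show up in reported data is itself an open question.

```python
def pair_with_lagged_growth(visits_by_week, cumulative_cases_by_week, lag_weeks):
    """Pair each week's visit count with new cases reported lag_weeks later.

    Both inputs are lists indexed by week for one area. New cases in a week
    are the difference in cumulative cases from the previous week.
    Weeks whose lagged outcome is not yet observed are skipped.
    """
    pairs = []
    for week, visits in enumerate(visits_by_week):
        future = week + lag_weeks
        if future == 0 or future >= len(cumulative_cases_by_week):
            continue
        new_cases = cumulative_cases_by_week[future] - cumulative_cases_by_week[future - 1]
        pairs.append((visits, new_cases))
    return pairs


# Illustrative numbers for a single area.
visits = [100, 80, 60, 50]
cumulative_cases = [5, 8, 14, 18]

pairs = pair_with_lagged_growth(visits, cumulative_cases, lag_weeks=2)
print(pairs)
```

The resulting pairs are what a later statistical step would consume: movement on one side, a delayed health outcome on the other.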
Only two out of nine counties in the Bay Area provide us zip code data, and we’re always trying to get that data from all counties. Thus, we have to do a bit of GIS work. We take our SafeGraph understanding of movement, along with population from census statistics, and we aggregate it up to zip codes, because zip codes are bigger than census block groups.
With that, we have apples-to-apples comparisons we can make. We have decent daily data on both sides of the equation: movement data and case growth data. We still have questions about just how long cases take to manifest and then make their way through the testing system to be reported by these counties. To make matters worse, these counties still have unique ways in which they’re reporting that data. But this has been a useful starting point to find some of these relationships.
In some of our recent work, we took SafeGraph movement data at the zip code level and attached to it some census characteristics like income, the number of people per household, and the number of people per room in a household. With the variation in those measures, we’ve been able to explain over 75% of the case growth, over some amount of time, across zip codes in a specific county in the Bay Area.
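Scaling block group numbers up to zip codes can be sketched with a crosswalk table that says what share of each block group falls in each zip code. The crosswalk, the block group names, and the weights here are hypothetical; a real crosswalk would come from GIS overlay work or a published relationship file.

```python
from collections import defaultdict


def aggregate_to_zip(values_by_block_group, crosswalk):
    """Sum block group values into zip code totals.

    crosswalk maps block group ID -> list of (zip_code, weight) pairs,
    where the weights for one block group are expected to sum to 1.
    Block groups missing from the crosswalk are ignored.
    """
    totals = defaultdict(float)
    for block_group, value in values_by_block_group.items():
        for zip_code, weight in crosswalk.get(block_group, []):
            totals[zip_code] += value * weight
    return dict(totals)


# Illustrative values and a made-up crosswalk.
visits = {"bg1": 100, "bg2": 50, "bg3": 20}
crosswalk = {
    "bg1": [("94305", 1.0)],
    "bg2": [("94305", 0.5), ("94301", 0.5)],  # block group split across two zips
    "bg3": [("94301", 1.0)],
}

zip_visits = aggregate_to_zip(visits, crosswalk)
print(zip_visits)
```

The same function works for population counts, so movement and census figures end up on the same zip code footing as the case data.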
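For readers less familiar with what “explaining” a share of case growth means, the usual measure is R-squared: the fraction of variation in the outcome that a fitted model accounts for. The sketch below uses a single predictor and toy numbers, purely to show the calculation; the work described here used several predictors, and none of these values come from it.

```python
def r_squared(x, y):
    """Fit y = intercept + slope * x by least squares and return R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variation left after the fit, versus total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data: one predictor per zip code and a case growth outcome.
predictor = [1, 2, 3, 4]
case_growth = [2, 1, 4, 3]

r2 = r_squared(predictor, case_growth)
print(round(r2, 2))
```

An R-squared of 0.75 would mean three quarters of the zip-to-zip variation in case growth lines up with the model’s inputs, leaving a quarter unexplained.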
That’s not 100%.
Maybe contact tracing as a tool can go way further if you end up knowing that somebody contracted the disease. Still, a predictive tool with explanatory power is better than blindly making policy decisions about whether you’re shutting down the entire economy or turning it all back on at the same time. I hope that in the coming weeks and months of work across the consortium, and here in the Bay Area, we can put these predictive tools in the hands of decision-makers, so they can make more informed decisions about the various spatial ways in which these policy tools and the health outcomes ultimately play out.
WHAT’S AFFECTING CASE GROWTH IN DIFFERENT AREAS?
We are still at an early stage in our research, but I can say that a lot of the fixed characteristics that we’re getting from census data (income, age distribution, race, ethnicity, and wealth) affect case growth.
We wouldn’t necessarily expect those characteristics to have a lot of explanatory power on where case growth happens, but in fact, they do. They have higher explanatory power than any of the SafeGraph visit behavior. It appears that if you have higher income, however that manifests, you have a greater ability to shelter in place and avoid the impacts of COVID-19.
The age distribution in your block group, your language ability in terms of communicating with health departments and government entities, and these kinds of fixed socio-economic and demographic community characteristics tell quite a bit of the story of what equitable distribution of impact looks like. We see similar findings not just in the Bay Area but across the US through the consortium.
SO THIS IS NOT A PURELY SPATIAL ISSUE, IS IT?
Spatial data can be an effective communication tool to bring awareness of a pandemic. Those systemic issues of inequality in our systems are also symptoms of underlying urban spatial arrangements and decisions. Many of the conversations that are hot right now concern racial inequality, which can stem, besides racism, from underlying conditions or structural decisions that were made in the past and are still with us today, whether we like it or not.
These affected the urban arrangements of where housing is, where development goes, and where transportation goes. These get locked into the geography of the places we live in, and they’re working in the background all the time, affecting our opportunities and our livelihoods.
Spatial data is the magnifying glass we need to put to history and to outcomes we can measure right now. I would say that it’s not a sign of despair that there are systemic forces. It only strengthens my insistence that if we can put tools in the hands of students, policymakers, and community activists to tell the spatial stories, then we can first illuminate and make it clear to everybody how inequality takes root in measured outcomes in our societies.
Going back to the COVID-19 example, we’re looking at case growth in neighborhoods with predominantly people of color, low incomes, and pre-existing health conditions. That’s a spatial story. We need to keep building these spatial tools so that those voices can be heard. Those objective truths can’t leave the conversations where the real decision making happens.
HOW MUCH BETTER PREPARED ARE WE FOR THE NEXT VIRUS?
In the Bay Area, and across the US, we’re starting to see counties make vastly different independent decisions about how best to reopen the economy. A lot gets lost in the chaos of policy decisions that are made without institutional memory of what the consequences were of past actions.
The SafeGraph data and the research that’s coming out of the consortium, along with many other efforts across the world, are putting down an objective record of our best understanding of how a given activity or action led to an outcome. In our society, that activity is perhaps visits to grocery stores or a bar, and we do have this predictive power on what case growth looks like.
Before having this data and consortium, a lot of counties were sometimes making wise and sometimes random choices about blunt policy decisions, like sheltering in place or reopening entire swaths of the economy.
We’re preparing to get these tools in the hands of our local decision-makers so they have more than just blunt options. With more refined insights, you have more refined policy tools. For example, mobile testing in specific neighborhoods where we see case growth, or closing down and monitoring a subset of businesses as opposed to an entire industry sector. Those will be useful right away. As we speak, with the sobering surge of cases we’re witnessing, we’re still seeing the erratic diversity of choices that counties across the US are making with the reopening of the economy.
We want to get these tools in place. They will be useful as soon as counties make more well-informed decisions based on these data tools. The feedback loop will continue to reinforce their efficacy. It’s just a matter of having set these kinds of structures in place where researchers can work with each other and work with data and work with policymakers.
It would have been great if that happened back in January 2020. But maybe we needed a bit of a kick to see that there’s such a crisis that demands this kind of connection and collaboration and data-driven decision making.
The good news is that we’re already learning that lesson, at least here in places like the Bay Area in the US.
It’s always heart-warming to learn that community spirit and academic collaboration are alive at times like these. Is there any way you’d be able to contribute to the consortium? Do reach out to Derek.