Tory Smith is the product manager of Mapbox’s computer vision and augmented reality platform.
When he joined Mapbox, the company was developing a product that uses a camera as a sensor to understand the environment in a driving context: moving through the road network, understanding the surrounding area, and bringing aspects of driving into the frame of reference of what a driver sees, rather than having them look down at a map.
For the past three years, he’s been working on various ways of using computer vision and augmented reality together to build new types of experience, focusing specifically on navigation and getting from point A to point B.
WHY ARE CAMERAS AND COMPUTER VISION GOOD TOOLS TO UNDERSTAND THE LOCATION CONTEXT?
Cameras are everywhere.
Pretty much every driver in the United States has at least one camera with them, whether it’s integrated into the vehicle itself or if it’s on a device they carry with them.
Cameras are on every single laptop that ships and they’re also included in IoT devices. They’re ubiquitous.
The hardware to build a camera is simple and it’s inexpensive to manufacture them.
It’s easy to connect them to the Internet of Things and process what they capture not just right at the camera head but also in the cloud, or with a mix of both.
Cameras record the visible spectrum of light, which is something very intuitive for humans to work with. It means that there are tons of computer vision applications (including open source) that can do incredible things with cameras today.
HOW DOES THE CAMERA HELP US UNDERSTAND LOCATION CONTEXT?
Depends a lot on what you’re doing.
Images are information rich. There are millions of pixels, and every one of them has a bunch of color information.
It’s an enormous problem space. Most of the algorithms out there to understand the environment and then build augmented reality applications within them are built on some assumptions that constrain the types of things you might encounter.
Imagine a simple AR application to make a pizza show up on a dining table.
Your computer vision application looks for things it thinks are tables, or at the very least, a totally level flat surface that’s rectangular or square. It then figures out a surface that you want to render your pizza upon.
One tool we use for AR navigation is looking out into the distance. Assuming you have a forward-facing camera, it will look for where the horizon is in this scene and then figure out where the vanishing point is.
In other words, where on the horizon do all the straight lines intersect?
If you look out into the distance along a long straight highway, like out in the American West, the left lane markings and the right lane markings will converge at one point. If you take that point, you can use basic geometry to work out where each pixel corresponds to in three dimensions.
In computer vision, we look for patterns that allow us to assign the camera pixels to three-dimensional spaces. Each pixel transforms into a world coordinate you need in order to make any basic AR application work.
Cameras are a powerful tool for doing that.
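The vanishing point idea can be sketched in a few lines. This is a minimal illustration, not Mapbox's implementation: it assumes each lane marking has already been detected and reduced to two pixel coordinates, and the function names are hypothetical. Each lane marking becomes a line in the image, and the vanishing point is where the two lines cross.

```python
def line_through(p, q):
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through two pixels."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def vanishing_point(left_lane, right_lane):
    """Intersect two lane-marking lines, each given as a pair of (x, y) pixels.

    Returns the (x, y) pixel where the lines meet, or None if the lines
    are parallel in the image and never intersect.
    """
    a1, b1, c1 = line_through(*left_lane)
    a2, b2, c2 = line_through(*right_lane)
    w = a1 * b2 - a2 * b1
    if abs(w) < 1e-9:
        return None
    x = (b1 * c2 - b2 * c1) / w
    y = (c1 * a2 - c2 * a1) / w
    return (x, y)


# Illustrative pixel values for a 1280x720 frame (made up for this example).
left = ((100, 700), (600, 400))
right = ((1180, 700), (680, 400))
print(vanishing_point(left, right))
```

With these symmetric sample lanes the intersection lands on the vertical center line of the frame, which is what you would expect from a forward-facing camera on a straight road.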
WHAT DO WE NEED TO KNOW IN TERMS OF LOCATION? XYZ COORDINATES?
Depends on what you’re building.
If we talk about applications associated with moving around in the world, typically along the road network, then there are a couple of distinct steps that you’d have to take.
You want to end up with something that makes sense while objects show up in the environment where you expect them to be. An example could be a parking space you’re trying to direct a driver to.
For the most straightforward type of AR, you can make an object show up at the side of the road.
But suppose the object also needs to be tied to a global coordinate. Identifying where the curve is relative to the camera is very different from identifying exactly where that xyz coordinate sits on the entire Earth.
You’d need to understand where objects are relative to the camera and where that camera is relative to the Earth. Cameras are helpful in this because so many of them are attached to devices that have GPS sensors on them.
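Chaining the two relationships (object relative to camera, camera relative to Earth) might look like the sketch below. It assumes a flat-Earth approximation that is only reasonable over short distances, a heading measured clockwise from north, and a hypothetical function name; none of this comes from the interview.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, a common approximation


def offset_to_lat_lon(cam_lat, cam_lon, heading_deg, forward_m, right_m):
    """Convert a camera-relative offset into approximate global coordinates.

    heading_deg is the direction the camera faces, clockwise from north.
    forward_m and right_m locate the object relative to the camera.
    """
    h = math.radians(heading_deg)
    # Rotate the camera-relative offset into north/east components.
    north = forward_m * math.cos(h) - right_m * math.sin(h)
    east = forward_m * math.sin(h) + right_m * math.cos(h)
    # Small-offset approximation: meters to degrees at the camera's latitude.
    lat = cam_lat + math.degrees(north / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat)))
    )
    return lat, lon


# A parking space 30 m ahead and 3 m to the right of a north-facing camera.
print(offset_to_lat_lon(37.0, -122.0, 0.0, 30.0, 3.0))
```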
WHAT ELSE DO WE NEED TO CALCULATE THE VIEWPOINT?
The camera has six degrees of freedom.
The first three are straightforward — xyz coordinates. Or latitude, longitude, and elevation. Those give you a point in space.
For a camera we also need to define the Euler angles — the yaw, the pitch, and the roll — especially if we care about what the camera is pointed at.
This is the full six degrees of freedom state, also referred to as the pose.
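As a rough sketch, the pose can be held in a small data structure and the three angles turned into a rotation matrix. The class name, field names, and the yaw-pitch-roll (Z-Y-X) rotation order are assumptions made for this example; real systems pick their own axis conventions.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """Hypothetical six-degrees-of-freedom camera state."""
    latitude_deg: float
    longitude_deg: float
    elevation_m: float
    yaw_deg: float    # rotation about the vertical (z) axis
    pitch_deg: float  # rotation about the lateral (y) axis
    roll_deg: float   # rotation about the forward (x) axis


def rotation_matrix(pose):
    """3x3 rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll)."""
    y = math.radians(pose.yaw_deg)
    p = math.radians(pose.pitch_deg)
    r = math.radians(pose.roll_deg)
    cy, sy = math.cos(y), math.sin(y)
    cp, sp = math.cos(p), math.sin(p)
    cr, sr = math.cos(r), math.sin(r)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


pose = Pose(37.0, -122.0, 15.0, yaw_deg=90.0, pitch_deg=0.0, roll_deg=0.0)
for row in rotation_matrix(pose):
    print([round(v, 3) for v in row])
```

The first three fields place the camera; the matrix built from the last three says which way it is looking.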
To begin, you need to know where the camera is and where it’s pointed. You get your latitude, longitude, and elevation, in most conditions, with GPS.
Satellites will not give you a pose. Figuring out where the camera is pointed relies on other sensors that are often also available.
Every mobile device created today has an IMU (Inertial Measurement Unit), which helps us understand how the phone is pointed because it can measure gravity. That’s how your phone knows whether it should be in portrait mode or landscape mode when you’re taking a picture. It can sense the direction of gravity.
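Here is a minimal sketch of how a gravity reading can yield roll, pitch, and a portrait or landscape decision. The axis convention (x to the right of the screen, y toward the top, z out of the screen) and the function names are assumptions for illustration, and real devices filter the signal rather than using one raw sample.

```python
import math


def roll_pitch_from_gravity(ax, ay, az):
    """Estimate roll and pitch in degrees from one accelerometer sample.

    Assumes the device is not accelerating, so the reading is gravity only.
    """
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    return math.degrees(roll), math.degrees(pitch)


def screen_orientation(ax, ay):
    """Pick portrait or landscape from the gravity components in the screen plane."""
    return "portrait" if abs(ay) >= abs(ax) else "landscape"


print(roll_pitch_from_gravity(0.0, 0.0, 9.81))  # device lying flat
print(screen_orientation(0.0, 9.81))            # gravity along the long edge
print(screen_orientation(9.81, 0.0))            # gravity along the short edge
```

Note that gravity alone says nothing about yaw: rotating the device around the vertical axis leaves the reading unchanged, so that angle has to come from somewhere else.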
There are several things computer vision can do (indoor or outdoor) to further understand the camera’s pose, specifically the yaw, the pitch, and the roll.
In a driving context, assuming this is a forward-facing camera, there won’t be a lot of roll because the cameras are mounted, facing forward. There won’t be that much pitch either unless you’re really bouncing up and down a lot.
You just have to figure out the yaw, which is the rotation about the z-axis, the one pointed straight up in the air. Are we looking a bit to the left? A bit to the right? Are we pointed straight forward?
The minimum viable sensor set for driving to make these AR applications work is a camera, a GPS sensor, and an IMU.
WHAT IS AUGMENTED REALITY?
When most people think of augmented reality, they probably think about games, like Pokémon Go.
Augmented reality understands the physical environment around the user to the degree that you can add virtual components to it in ways that make sense.
If you’re walking around your house with your camera pointing at a table, you can understand where the table is and make a pizza show up on the table. It’s not in the table, it’s not floating above the table, it’s actually on the surface of the table.
To achieve this, step one is to understand the 3D environment. Where are the main surfaces? What are the physical constraints of that environment?
Step two is to create objects, which we refer to as rendering. How do we render objects inside of that environment that behave by following the basic rules of physics? So much that someone looking at that screen or experiencing AR will believe a pizza is sitting on the table the same way they would expect it to.
Before you buy a couch, you want to see if it fits in your apartment. You can render it in AR in your home and say, “It fits, it looks good. I like how it looks with the rest of the color and design scheme I have for this room.”
ARE YOU DOWNLOADING BASE MAPS TO UNDERSTAND THE 3D STRUCTURES OR CALCULATING ON THE FLY?
For some applications, it doesn’t really matter where you are. Most tables have the same basic attributes — they’re flat with a 2D surface suspended several feet above what we identify as the floor.
But there are some applications where it matters where you are. In the Pokémon Go application, you can create content that corresponds with public spaces from the world’s map. If you’re in San Francisco, there might be many Pokémon in Alamo Square Park, or in Golden Gate Park, or in the Presidio.
In these scenarios, game makers care a lot about where the user is. With GPS, Pokémon isn’t just standing on any ground, but on the ground in a particular area you know.
It’s a great way to encourage congregation.
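A location-aware check like this can be sketched as a simple distance test against a zone. The haversine formula is a standard way to get the distance between two GPS coordinates; the function names, the circular zone, and the coordinates below are all made up for illustration and are not how any particular game does it.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, a common approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def in_zone(user_lat, user_lon, zone_lat, zone_lon, radius_m):
    """True if the user's GPS fix falls inside a circular content zone."""
    return haversine_m(user_lat, user_lon, zone_lat, zone_lon) <= radius_m


# Placeholder coordinates: a zone center and a user standing nearby.
print(in_zone(37.0002, -122.0001, 37.0, -122.0, radius_m=150.0))
```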
CAN WE USE AUGMENTED REALITY TO NAVIGATE THE REAL WORLD?
Yes, let’s use driving as an example.
There are parallels between the experiences you can create in AR and those in the autonomous vehicle world.
For driving, there are already several computer vision algorithms out there (many of them open source) that can identify lane markings, cars, or pedestrians on the road.
A straightforward application would be that every time there’s a car in front of you, a red dot shows up above the car in 3D. When there’s a pedestrian on the road, a little exclamation point shows up above their head to send a warning to the driver.
These work without any location context.
If you need to guide a driver, the real-world location and real-world destination are essential. You not only need to understand the environment around the driver in its local context (where the camera is relative to what the driver is looking at), but you have another two things to figure out:
- Where is the driver relative to the Earth?
- Where are the different components of what the driver is looking at, relative to the driver?
Let’s say you’re driving across town, and you’ve almost reached your destination. You want to highlight the Starbucks you’re going to and see the drive-through entrance and have that floating in the environment. For that to work, you need to understand the pose of the camera and the real-world coordinates of that Starbucks — whether it’s the building itself, the entrance to the parking lot, or a parking space on the street.
You can render the contents of a 2D map into the real world so that a driver pointing their camera at a place understands where something is, as long as you know the coordinates of that object.
The onus is on the AR app maker to figure out the camera’s pose in that situation. And if they can do that, the object will show up in AR exactly where the driver expects to see it.
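Once the pose is known, placing an object on screen comes down to a camera projection. The sketch below uses a basic pinhole model and assumes zero pitch and roll (the driving simplification mentioned earlier), a heading measured clockwise from north, and an object position already expressed as east/north/up meters relative to the camera. The function name, focal length, and image size are illustrative assumptions.

```python
import math


def project_to_pixel(east_m, north_m, up_m, heading_deg, focal_px, cx, cy):
    """Project a camera-relative point onto the image with a pinhole model.

    Returns (u, v) in pixels, or None if the point is behind the camera.
    """
    h = math.radians(heading_deg)
    # Express the point in camera axes: forward, right, down.
    forward = north_m * math.cos(h) + east_m * math.sin(h)
    right = east_m * math.cos(h) - north_m * math.sin(h)
    down = -up_m
    if forward <= 0:
        return None
    u = cx + focal_px * right / forward
    v = cy + focal_px * down / forward
    return (u, v)


# A spot on the ground 20 m ahead and 2 m to the right of a camera
# mounted 1.5 m above the road, facing north, in a 1280x720 frame.
print(project_to_pixel(2.0, 20.0, -1.5, 0.0, focal_px=1000.0, cx=640.0, cy=360.0))
```

If the yaw estimate is off by even a few degrees, the computed pixel shifts sideways, which is why getting the pose right is the hard part.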
HOW CAN DIGITAL OBJECTS INTERACT WITH THE PHYSICAL WORLD?
Are you looking at a simple map with latitude and longitude coordinates for the building? If so, you’ve got a point, and you can create a big balloon that floats at that location above the ground.
If you’re after something more advanced, you need more map information, such as the road network.
Where is the road relative to that building? What is the 3D representation of this building? Can you render it as a rectangular prism in the environment?
If you can understand those eight different coordinates that define the eight corners of that rectangular prism, you can highlight the distinct faces of that object.
It’s not just a point coordinate you need but an understanding of what physical three-dimensional space you expect to be occupied by the building you’re trying to highlight.
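As a small illustration, the eight corners can be generated from a building footprint and a height. This sketch assumes an axis-aligned footprint in local east/north meters, which real buildings rarely have, and the function name and dimensions are hypothetical.

```python
def prism_corners(east_min, east_max, north_min, north_max, base_m, height_m):
    """Return the eight (east, north, up) corners of an axis-aligned box."""
    return [
        (e, n, u)
        for u in (base_m, base_m + height_m)   # bottom face, then top face
        for n in (north_min, north_max)
        for e in (east_min, east_max)
    ]


# A 20 m by 10 m footprint, 12 m tall, sitting on the ground.
corners = prism_corners(0.0, 20.0, 0.0, 10.0, base_m=0.0, height_m=12.0)
for corner in corners:
    print(corner)
```

The first four corners form the bottom face and the last four the top face; the side faces follow by pairing matching corners from each.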
WHAT CAN WE DO WITH AR? WHAT IS POSSIBLE WITH THIS KIND OF TECHNOLOGY TODAY?
Until now, we’ve seen AR in video games, flight simulators, and Star Fox games.
A pilot who wears a heads-up display (HUD) sees the environment enriched with UI. When he spots another plane in the sky, it’ll have a green square around it, “Friendly.” Or a red square around it, “Enemy.” It might even have text around it, “Enemy Spitfire” or “Friendly Hurricane.” He can keep his eyes on where they’re supposed to be, but the environment is augmented with information.
For drivers, our end game is to build augmented reality experiences.
This brings me to the third challenge — heads-up display.
Once you understand where you are, you look at the environment and understand where the environment is relative to you. And then, as with the heads-up display, the screen is no longer fixed.
It’s projected in between where the driver’s eyes are and what the driver is looking at. In a lot of cases, the application would be painted onto the windshield. When you’re doing that, you also need to understand where the driver’s head is. You track three different things simultaneously, which means you need sensors for all three things.
This is difficult to do.
We need to provide information to the driver that corresponds to what they’re looking at without them looking down at the cluster or looking into the center stack where a screen would typically be.
IS THERE A CROSSOVER BETWEEN AUTONOMOUS VEHICLES AND AUGMENTED REALITY? OR ARE THEY TWO COMPLETELY SEPARATE THINGS?
20 years ago, you didn’t even have GPS in your car.
Today, you probably have a couple of cameras: a backup camera, a forward-facing camera with computer vision built into it, and a radar sensor looking ahead, identifying other vehicles in the environment.
More advanced sensors understand the speed of your vehicle as you drive.
Plus, you have GPS.
All of the above can solve the first two of our problems: where the vehicle is and where other objects in the environment are relative to the vehicle. There’s no AR here. These features are solely there for safety, to stop you from drifting from one lane to another or to help you back out of your parking space.
Yet, the same sensors supply information for an AR application. Some already do, for safety. On some vehicles, as you back up, the ultrasonic sensors see things that are out of the driver’s field of vision. They render rectangles showing the extent of, say, a wall that’s right there.
AR will be just as useful for vehicles driving themselves, perhaps as textual information for the driver, “The car is coming to a stop,” or “The building where you’re dropping off your package is coming up on the right.”
ISN’T ALL THIS CLUTTERING THE VISION? SHOULDN’T THIS BE MAKING THINGS SIMPLER?
It’s not always clutter that you get. If you take backup cameras, as an example, they give you a view that you’d never be able to see.
Think of a car full of people that’s about to change lanes. The driver looks left, then right, and considers the blind spots. The “A” pillar is at the front, connecting the windshield to the side windows. The “B” pillar is where the backseat passengers are.
With AR, you could show the driver an uninterrupted view of the blind spot, as if the camera were located just outside of the vehicle. Tesla is already experimenting with using cameras to replace mirrors. This gives more freedom in reducing blind spots that generally occur if the driver is constrained to looking at a true mirror from inside the vehicle.
SO WE REMOVE THE “A” AND “B” PILLARS FROM THE WORLD TO SEE THROUGH?
One of the earliest applications of this is on fighter jets.
The pilot of an F-22 needs to know if there’s another plane or another object directly below. Fighter jets don’t have rearview mirrors. If a pilot wants to know what’s directly beneath without doing a barrel roll, they have advanced helmets and sensors to do that.
A lot of the same innovations can apply to driving scenarios.
WHERE DO YOU SEE PROBLEMS WITH ADOPTION? ARE THEY TECHNICAL OR CULTURAL?
The number one thing for driving use cases is probably a hardware problem. It’s challenging to implement a heads-up display. Still, progress is being made with wearables right now. They can be fun and immersive experiences.
Take Oculus, which is VR, not AR. You put it on and you’re inside a virtual environment that tracks your head movement.
Wouldn’t it be fascinating to experience something not through a screen but directly augmenting your vision?
Imagine Google Glass, or something similar, and connecting a real-time understanding of what someone is looking at with the physical world around them, and then drawing objects into that environment. You can build many powerful experiences on top of that if a user no longer has to hold something but has free use of their limbs.
For a game like Beat Saber, you’d wear something on your head but hold two sabers in your hands to interact with that environment. That’s not yet possible.
You need a screen to look at to see the AR experience. If you’re a driver and you don’t have a heads-up display, you’d have to look somewhere that’s not directly ahead of you. That’s obviously a safety concern.
So we either put that screen as close as possible to the real world or design it to be only used in certain situations. Your backup camera does the same. With it, you see much better. It only turns on and shows the screen when you’re backing up, never moving forward.
For driving applications, safety is the number one concern. We want to give the driver a natural experience while they can use their hands for something else.
For now, we’re still waiting on what’s mostly a hardware and systems challenge.
IT’S NOT EASY TO CONVINCE THE PUBLIC TO WEAR SOMETHING ON THEIR FACES
Anyone who’s experienced a prototype of augmented reality navigation via a heads-up display loves it. It gives you the same overlays and information you get in a video game, adds extra context, and allows you to keep your eyes on the road.
The earliest heads-up displays showed the speed in the windshield without looking down at a speedometer. Maximizing eyes on the road is ideal and a clear-cut advantage.
In the gaming space, they use wearables all the time.
Cultural pushback might come as elitism — a perception that only a few people can afford or access the technology, which was the case with Google Glass. As these things become more democratized and affordable, they’ll be as ubiquitous as smartphones.
15 years ago, when iPhones and Android devices first started becoming something that everyone had, there was a similar attitude toward people with the very first iPhones who browsed the internet on the go. Now we take them for granted. There will be a similar gradual adoption after the initial discomfort. We can overcome that if the real value is demonstrated and experienced by the broader public.
IS THIS STILL A GEOSPATIAL PROBLEM? DO WE STILL NEED A BACKGROUND MAP?
Most drivers today don’t use navigation when they drive in familiar places unless they need to know about live traffic.
The map is still vital if you’re going somewhere you’re not familiar with or need guidance.
Apps that give you AR guidance on finding a parking spot or on how much parking costs per hour in that area don’t need more detail on the map than we already have.
Many of the applications we’ve been working on at Mapbox use the same road network to drive the AR application as a traditional two-dimensional navigation application.
There are some exceptions, like Google’s app for walking with directions that appear in AR.
Point your phone ahead of you as you walk, and it’ll figure out where you are and highlight a walking path to take. The way they implemented this is something that most companies wouldn’t be able to do.
They image match what your phone sees against their base of Street View imagery.
They solve the pose problem (the camera’s location and orientation) by comparing the real-time imagery you’re looking at with the stored historical imagery they have from their Street View project.
That’s a titanic amount of data on the server side to compare your real-time observations against in order to give you a precise location, even if your GPS isn’t very good.
However, this is not something that would scale well to all use cases.
Vehicles, for example, move much faster and the environment changes quickly. To compare real-time video with cloud imagery is expensive.
Also, the position you get in a vehicle as you’re driving through a city is not just a raw GPS reading; it involves map matching, too. The system looks at your past few seconds of GPS fixes and then does dead reckoning using the IMU on your device. It snaps, so to speak, each of those GPS fixes onto the road network and makes assumptions about the way a car would have to travel through there.
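The snapping step can be sketched as projecting a GPS fix onto the nearest road segment. This is a deliberately simplified illustration with hypothetical function names: it works in local x/y meters, looks at one fix at a time, and ignores the history of fixes, the IMU, and the driving rules that a production map matcher would use.

```python
import math


def snap_to_segment(p, a, b):
    """Closest point to p on the segment from a to b (all (x, y) in meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return (float(ax), float(ay))
    # Fraction along the segment, clamped so the result stays on the road.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_to_road(p, polyline):
    """Snap a fix to the nearest point on a road given as a list of vertices."""
    best, best_dist = None, math.inf
    for a, b in zip(polyline, polyline[1:]):
        q = snap_to_segment(p, a, b)
        d = math.hypot(p[0] - q[0], p[1] - q[1])
        if d < best_dist:
            best, best_dist = q, d
    return best, best_dist


# An L-shaped road and a noisy fix 6 m off the first leg.
road = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
print(snap_to_road((40.0, 6.0), road))
```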
Google did this differently for walking directions because if you walk around a park, you don’t have to follow the rules of a car. You can change directions quickly. You can turn the phone quickly. You don’t have to walk on a path; you can walk off it.
If you’re a pedestrian, the approach of looking at 3D imagery makes a lot more sense.
For driving, there are also some assumptions we can make to constrain the problem of solving that pose to build a generalized understanding of how a vehicle will move through the road network and how it may interact with different objects in that road network.
There are a few things that stuck out for me.
First, the idea of local and global. An excellent example of local augmented reality is the filters you’re used to seeing on social media — bunny ears on your head. This is not augmented reality in a global context. It’s local to what’s happening on the screen. It identifies your head and then puts the rabbit ears on top of it.
Then, there’s global augmented reality, where we want features to interact with what’s happening in the real world. Tory’s work focuses on the navigation aspect of augmented reality. Digital objects can interact with reality in a global context, which is fascinating.
AR is an experience for people. It’s personal, relevant, and relative to you. The way I experience the internet (based on my search history, the data Google knows about me, my location, my age, and the rest) differs from the way you experience it. I look at the internet as something that folds itself around me, personalizes itself, and adjusts itself to me as I move through it based on these things that Google knows.
I wonder what will happen when we travel around in these filter bubbles, not just for the internet and in our digital lives, but also when we take it out of the digital world and overlay it over the physical world. I wonder what this personalization of physical space and personalizing experiences will mean for our shared understanding of locations and if that common understanding of place will be further diluted.