Steve Hwan's Photography Blog: A Brief Tangent: Computing Depth Maps

I think a depth map is an integral part of the Lytro photo processing. I may express some discontent about this, so I want to back up and talk about what I think are weaknesses in computing depth maps.

Note that Lytro keeps their depth algorithm secret, so I know nothing about Lytro's depth algorithm.

I'll just go over what I think are the basics of a common algorithm and its problems.

Reconstructing a Scene From Two Cameras

Images from two cameras

Start with a scene with two cameras (L and R). Each takes an image of the world (this scene has 4 balls with different shades of gray). We know where our cameras were and what direction they were pointed, and we want to use the images to figure out where the balls were. In particular, we want to know their depth.

Find the position of one ball. The ball could be anywhere on this ray.

Let's say we're trying to figure out the position of the second lightest gray ball.

We can imagine a virtual image plane in front of camera R that is the same width as our image. We shoot a ray from the camera through the ball's position in the image and out into the world.

We know the ball is somewhere in this direction. But there is some ambiguity as it could be anywhere along the ray.

I didn't illustrate it here, but there is also an ambiguity in the size of the ball, depending on how far away it is.
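
To make the ambiguity concrete, here is a minimal sketch in a 2D top-down view. Everything in it is my own illustration: the function names, the camera placement, and the focal distance of 1 are all assumptions. It builds the ray through one image position and then checks that two points at very different distances along that ray project back to the same spot in the image.

```python
import math

def pixel_ray(cam_pos, forward, u, focal):
    """Ray through image-plane offset u, in a 2D top-down view.

    forward is a unit vector; focal is the distance from the camera
    to the virtual image plane. All names here are illustrative.
    """
    fx, fy = forward
    rx, ry = fy, -fx  # unit vector running along the image plane
    dx = focal * fx + u * rx
    dy = focal * fy + u * ry
    n = math.hypot(dx, dy)
    return cam_pos, (dx / n, dy / n)

def point_on_ray(origin, direction, t):
    """Point at distance t along the ray."""
    return (origin[0] + t * direction[0], origin[1] + t * direction[1])

def project(cam_pos, forward, focal, point):
    """Image-plane offset at which the camera sees a world point."""
    fx, fy = forward
    rx, ry = fy, -fx
    px, py = point[0] - cam_pos[0], point[1] - cam_pos[1]
    depth = px * fx + py * fy  # distance along the viewing direction
    return focal * (px * rx + py * ry) / depth

cam, fwd, focal = (0.0, 0.0), (0.0, 1.0), 1.0
origin, direction = pixel_ray(cam, fwd, 0.5, focal)
near = point_on_ray(origin, direction, 1.0)
far = point_on_ray(origin, direction, 5.0)
# Both points land on the same image offset, so one camera cannot tell them apart.
print(project(cam, fwd, focal, near), project(cam, fwd, focal, far))
```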

Adding information from camera L.

We can use the other camera to resolve the ambiguity.

Shoot out some rays from camera L to triangulate the location. When we're looking for the point in the other image, we don't necessarily know anything about the structure of the scene.

We should shoot rays across the whole image in L. For now, let's just consider 4 rays corresponding to the ball positions as seen from L and think about which is the most likely to correspond to the one in R. In this example, shooting out other rays would just correspond to a white point in the image.

For example, for the same ball, the color/value should be about the same. So we can pick the ray corresponding to the closest color/value.
There can be many refinements to this algorithm - first finding crosses and T-junctions, looking for other features, examining straight lines to find a vanishing point...

But for now, just to get the idea across, I'll take the naive color correlation.
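
The naive color correlation can be sketched in a few lines. The gray values below are made up for illustration (0 is black, 1 is white), and `best_match` is a hypothetical helper name, not anything from a real library.

```python
def best_match(target_value, candidate_values):
    """Index of the candidate whose gray value is closest to the target."""
    return min(range(len(candidate_values)),
               key=lambda i: abs(candidate_values[i] - target_value))

# Hypothetical gray values seen along the four rays from camera L.
values_from_l = [0.2, 0.4, 0.6, 0.8]
# Value of the ball we picked in R, slightly off because of lighting or noise.
value_in_r = 0.62

match = best_match(value_in_r, values_from_l)
print("ray", match, "in L is the most likely match")
```

With values this well separated the choice is easy. The trouble starts when several candidates are about equally close.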

Full solve for all four balls.

We can do this to solve for the positions of all four balls.
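
Once a ray from R has been paired with a ray from L, the position is where the two rays cross. Here is a rough 2D sketch of that triangulation step. The camera positions and the ball position are assumed values for illustration, and the function does not check whether the crossing point is in front of both cameras.

```python
def cross(a, b):
    """2D cross product (z component of the 3D cross product)."""
    return a[0] * b[1] - a[1] * b[0]

def intersect_rays(o1, d1, o2, d2, eps=1e-12):
    """Point where the lines o1 + t1*d1 and o2 + t2*d2 meet.

    Returns None if the rays are parallel, in which case there is
    no single crossing point to triangulate.
    """
    denom = cross(d1, d2)
    if abs(denom) < eps:
        return None
    diff = (o2[0] - o1[0], o2[1] - o1[1])
    t1 = cross(diff, d2) / denom
    return (o1[0] + t1 * d1[0], o1[1] + t1 * d1[1])

# Assumed setup: cameras one unit either side of the origin.
cam_l, cam_r = (-1.0, 0.0), (1.0, 0.0)
ball = (0.5, 4.0)  # ground truth, used only to build the two rays
dir_l = (ball[0] - cam_l[0], ball[1] - cam_l[1])
dir_r = (ball[0] - cam_r[0], ball[1] - cam_r[1])

recovered = intersect_rays(cam_l, dir_l, cam_r, dir_r)
print(recovered)  # second coordinate is the depth
```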

Weaknesses of This Algorithm

There's an inherent weakness in this though. This algorithm is completely dependent on being able to find the corresponding points between the two images. There is a lot that can get in the way of that.

Uniform color.

For example, the scene may have areas that just have one uniform color. Maybe someone's face. Maybe someone's clothing. Maybe a background wall. But when there's too much of the same color, it's difficult to find the corresponding point in the other image.
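
Here is a small sketch of why uniform areas are a problem. It assumes a simplified setup that I haven't described above: both images are reduced to a single row of pixel values, and the matching point in the other image is just shifted sideways by some unknown number of pixels. The matching cost is the sum of absolute differences over a small window. The pixel values and function name are made up.

```python
def sad_costs(ref_row, other_row, x, half, max_shift):
    """Sum-of-absolute-differences cost for each candidate shift.

    Compares the window around x in ref_row with the window around
    x + shift in other_row. The caller must keep all windows in bounds.
    """
    patch = ref_row[x - half : x + half + 1]
    costs = []
    for shift in range(max_shift + 1):
        cand = other_row[x + shift - half : x + shift + half + 1]
        costs.append(sum(abs(a - b) for a, b in zip(patch, cand)))
    return costs

# Textured row: the other image is the same row shifted right by 2 pixels.
right_row = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0, 5, 3]
left_row = [0, 0] + right_row[:-2]
textured = sad_costs(right_row, left_row, x=3, half=1, max_shift=4)

# Uniform row: every pixel has the same value in both images.
flat = sad_costs([5] * 12, [5] * 12, x=3, half=1, max_shift=4)

print(textured)  # one shift stands out with the lowest cost
print(flat)      # every shift ties, so the depth is undetermined
```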

Greatly exaggerated noisy image.

An image could just have too much noise. Maybe you're shooting in low light and the image sensor can't clean the thermal noise. Though the Lytro camera can shoot at ISO 3200, I didn't really trust it above 1000. If there is too much noise, you can get false hits and it becomes difficult to find the feature points at all.


Motion blur.

The objects could be moving. Not only does this make it harder to find corresponding points between the two images, but it introduces the additional problem that some pixels actually hold information for multiple objects at different depths.

Relating This to Light Fields

Rays as a light field.

So far, I've been describing the algorithm for finding the depth from images from two cameras. But I think that's analogous to having a light field that is just a collection of rays between two planes. One pixel has multiple rays coming out.

The light field will have many more rays than this, and will not center on the same two focus points.

But I would argue the idea is the same. Though again, I'm just saying this is one way to use a light field to compute depth.
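
As a rough illustration of the analogy, here is a 2D sketch of a two-plane parameterization. The placement of the planes at z = 0 and z = 1 and the function names are my assumptions. A ray is described by where it crosses each plane, and the depth of a scene point is where two rays that hit it cross each other.

```python
def ray_position(u, s, z):
    """Lateral position at depth z of a ray crossing z=0 at u and z=1 at s."""
    return u + (s - u) * z

def crossing_depth(ray_a, ray_b, eps=1e-12):
    """Depth at which two rays, each given as (u, s), cross.

    Returns None for parallel rays, which never meet.
    """
    (ua, sa), (ub, sb) = ray_a, ray_b
    slope_diff = (sa - ua) - (sb - ub)
    if abs(slope_diff) < eps:
        return None
    return (ub - ua) / slope_diff

# Two assumed rays that both pass through the point x = 0.5 at depth z = 4.
ray_a = (-1.0, -0.625)
ray_b = (1.0, 0.875)

z = crossing_depth(ray_a, ray_b)
print(z, ray_position(ray_a[0], ray_a[1], z))
```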

Improvements

This is a very old and basic algorithm, and there are a lot of ways to improve it.

Slightly blur/smooth/de-noise the image first to deal with a noisy image (like high ISO).
Have a second pass to incorporate assumptions about the scene. For example we might assume that neighboring pixels are probably at the same or similar depth. If there are multiple matches, pick the one that maintains coherency.
A slight variation (I can't remember which SIGGRAPH I saw this in, but it was a very long time ago): instead of starting from one camera and searching for the corresponding point in the other camera to establish depth, propose a depth, project both images onto a plane at that depth, and look for coherent sections.
Try to search for some reliable markers, such as crosses and T-junctions that appear in both images and give more weight to those.
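
The coherency pass from the list above could look something like this in its crudest form. It assumes a first pass has already produced, for each pixel along a row, a list of equally good candidate depths. This sketch walks the row and picks the candidate closest to the depth chosen for the previous pixel. The function name and data are hypothetical, and a real implementation would look at neighbors in every direction instead of only the one to the left.

```python
def pick_coherent(candidates_per_pixel):
    """Choose one depth per pixel, preferring the one nearest its left neighbor.

    The first pixel has no neighbor yet, so its first candidate is used.
    """
    chosen = []
    for candidates in candidates_per_pixel:
        if not chosen:
            chosen.append(candidates[0])
        else:
            previous = chosen[-1]
            chosen.append(min(candidates, key=lambda d: abs(d - previous)))
    return chosen

# Made-up candidates: pixels 1 and 2 each have more than one plausible depth.
candidates = [[2], [2, 5], [0, 3, 6], [3]]
depths = pick_coherent(candidates)
print(depths)
```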

And I'm sure there are many other ways to refine the algorithm. At the core though, these improvements still fundamentally rely on being able to find correlations between colors/values or patterns between them. They may be less vulnerable, but they are still vulnerable to the weaknesses I described above.

This is not to say this is the only way to find a depth map from images (or light fields). If more is known about the contents or structure of the scene, that could be exploited. For instance:

Straight lines could be analyzed to look for architectural features and vanishing points. Then make some assumptions about the size of buildings, doors, windows, etc. to get the scale. Though this is not a general solution, and wouldn't help with people, nature, landscapes, etc.
Some intensive pattern recognition, probably with the help of machine learning, to identify objects (head, eye, doorway, cat, finger, flower, ...) in the scene, along with knowledge of the approximate sizes of these objects and the focal length of the lens, could tell us the world space position of the object (even with a single image).
An extension to the previous idea, if there are some assumptions about the 3D structure of the object (round, flat) and the placement of the light, a program could fine tune the depth.
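
The known-size idea comes down to similar triangles in a pinhole camera: an object of real size X at distance Z appears with size x = f * X / Z in the image, so Z = f * X / x. Here is a minimal sketch, assuming the focal length is expressed in pixels and the object has already been recognized and measured in the image. The numbers are invented.

```python
def depth_from_known_size(focal_px, real_size_m, image_size_px):
    """Distance to an object of known size under a pinhole camera model."""
    if image_size_px <= 0:
        raise ValueError("image size must be positive")
    return focal_px * real_size_m / image_size_px

# Assumed numbers: a head about 0.24 m tall that spans 120 pixels,
# seen through a lens with a focal length of 1000 pixels.
distance_m = depth_from_known_size(1000.0, 0.24, 120.0)
print(distance_m)
```

The result is only as good as the size estimate, which is why this works better for objects with predictable dimensions.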

Back to Lytro

Again, I have no idea what algorithms Lytro uses. But judging from the errors I've seen with the depth maps I've gotten for my photos, they share a lot of the basic weaknesses I've described above, and I haven't seen evidence of these more content-based methods.

While the theory behind finding a depth map isn't too complicated, there are some complications with using real world data with all of its noise. In Lytro's case, I think the depth map might be used to get faster processing for changing the depth of field, but it is unreliable and results in inaccurate renderings. Though it would likely be slower, I think the image could be rendered directly from the light field data and might not be subject to the same problems.

Monday, November 14, 2016
