Note that Lytro keeps their depth algorithm secret, so I know nothing about it. I'll just go over the basics of a common class of algorithm and its problems.
Reconstructing a Scene From Two Cameras
|Images from two cameras
Start with a scene with two cameras (L and R). Each takes an image of the world (this scene has 4 balls with different shades of gray). We know where our cameras were and what direction they were pointed, and we want to use the images to figure out where the balls were. In particular, we want to know their depth.
|Find the position of one ball. The ball could be anywhere on this ray.
We can imagine a virtual image plane in front of the camera that is the same width as our image. To locate a ball, we shoot a ray from the camera through the ball's position on that image plane and out into the world.
We know the ball is somewhere in this direction, but there is some ambiguity: it could be anywhere along the ray.
I didn't illustrate it here, but there is also an ambiguity in the size of the ball, depending on how far away it is.
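As a sketch of that back-projection (toy Python, with invented names like `pixel_to_ray` and a 2-D world: x across the scene, z into it), any distance t along the ray is consistent with the same pixel, which is exactly the depth and size ambiguity:

```python
import numpy as np

def pixel_to_ray(cam_pos, focal_len, px, image_width):
    """Back-project a pixel on a virtual image plane into a world-space ray.

    Assumes a pinhole camera looking down +z, with the image plane at
    distance `focal_len` in front of the camera (illustrative names only).
    """
    # x offset of the pixel on the image plane, centered on the optical axis
    x = px - image_width / 2.0
    direction = np.array([x, focal_len])
    direction = direction / np.linalg.norm(direction)  # unit direction
    return cam_pos, direction  # points on ray: cam_pos + t * direction, t >= 0

# A ball seen at pixel 80 of a 100-pixel-wide image could be anywhere
# along this ray -- a small ball close up or a large ball far away:
origin, d = pixel_to_ray(np.array([0.0, 0.0]), focal_len=50.0, px=80, image_width=100)
near = origin + 10 * d   # close (and therefore small) ball
far = origin + 100 * d   # distant (and therefore large) ball, same pixel
```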
|Adding information from camera L.
Shoot out some rays from camera L to triangulate the location. When we're looking for the corresponding point in the other image, we don't necessarily know anything about the structure of the scene, so in principle we should shoot rays through every pixel of L's image. For now, let's just consider the 4 rays corresponding to the ball positions as seen from L and ask which is most likely to correspond to the ball in R. In this example, any other ray would just correspond to a white point in the image.
For example, for the same ball, the color/value should be about the same in both images, so we can pick the ray corresponding to the closest color/value.
There are many possible refinements to this algorithm: first finding crosses and T-junctions, looking for other features, examining straight lines to find a vanishing point, and so on.
But for now, just to get the idea across, I'll take the naive color correlation.
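To make the naive color correlation concrete, here's a toy Python sketch (hypothetical helper names; it assumes rectified cameras so corresponding points lie on the same image row, and triangulates with the standard depth = f·B/disparity relation for rectified stereo):

```python
import numpy as np

def match_by_value(row_r, row_l, px_r):
    """Naive color correlation: for a pixel in the R image, pick the pixel
    in the same row of the L image whose gray value is closest."""
    target = row_r[px_r]
    return int(np.argmin(np.abs(row_l - target)))

def depth_from_disparity(px_r, px_l, focal_len, baseline):
    """Triangulate: depth = focal_len * baseline / disparity (rectified pair)."""
    disparity = px_l - px_r
    return focal_len * baseline / disparity

# One image row: 4 balls as gray values on a white (1.0) background,
# shifted one pixel between the two cameras.
row_r = np.array([1.0, 0.2, 1.0, 0.5, 1.0, 0.8, 1.0, 0.35, 1.0])
row_l = np.array([1.0, 1.0, 0.2, 1.0, 0.5, 1.0, 0.8, 1.0, 0.35])

px_l = match_by_value(row_r, row_l, px_r=1)  # finds the 0.2 ball in L
depth = depth_from_disparity(1, px_l, focal_len=50.0, baseline=0.1)
```

Because each ball here has a distinct gray value, the closest-value match is unambiguous; the weaknesses below show up as soon as that stops being true.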
|Full solve for all four balls.
Weaknesses of This Algorithm
There's an inherent weakness here, though: this algorithm is completely dependent on being able to find corresponding points between the two images, and there is a lot that can get in the way of that.
|Greatly exaggerated noisy image.
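To see how fragile the correspondence step is, here's a toy sketch (hypothetical 1-D image rows, exaggerated noise): with enough sensor noise, the closest-value match can land on the wrong pixel entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean row: one 0.2 ball on a white background, shifted one pixel in L
row_r = np.array([1.0, 1.0, 0.2, 1.0, 1.0])
row_l = np.roll(row_r, 1)

# On the clean images, the closest value is the true correspondence (index 3)
best_clean = int(np.argmin(np.abs(row_l - row_r[2])))

# Add heavy sensor noise to L; the closest value may now be a wrong pixel
noisy_l = np.clip(row_l + rng.normal(0.0, 0.4, row_l.shape), 0.0, 1.0)
best_noisy = int(np.argmin(np.abs(noisy_l - row_r[2])))  # no longer reliable
```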
Relating This to Light Fields
|Rays as a light field.
The light field will have many more rays than this, and they will not all pass through the same two focus points.
But I would argue the idea is the same. Though again, this is just one way to use a light field to compute depth.
Improvements
This is a very old and basic algorithm, and there are a lot of ways to improve it:
- Slightly blur/smooth/de-noise the image first to deal with a noisy image (like high ISO).
- Have a second pass to incorporate assumptions about the scene. For example we might assume that neighboring pixels are probably at the same or similar depth. If there are multiple matches, pick the one that maintains coherency.
- A slight variation (I can't remember which SIGGRAPH I saw this in, but it was a very long time ago): instead of starting from one camera and searching for the corresponding point in the other camera to establish depth, propose a depth, project both images to a plane at that depth, and look for coherent sections.
- Try to search for some reliable markers, such as crosses and T-junctions that appear in both images and give more weight to those.
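A minimal sketch of that depth-proposal variation (a toy 1-D "plane sweep" with invented names and integer disparities; real implementations warp full 2-D images through homographies rather than shifting rows):

```python
import numpy as np

def plane_sweep(row_l, row_r, baseline, focal_len, depths):
    """For each candidate depth, shift the R row by the disparity that depth
    implies and score per-pixel agreement with L; the best-scoring depth wins.
    Toy version: np.roll wraps at the edges, a simplification."""
    best_depth = np.zeros(len(row_l))
    best_err = np.full(len(row_l), np.inf)
    for z in depths:
        disparity = int(round(focal_len * baseline / z))
        shifted = np.roll(row_r, disparity)  # reproject R at this depth
        err = np.abs(row_l - shifted)        # per-pixel disagreement
        better = err < best_err
        best_depth[better] = z
        best_err[better] = err[better]
    return best_depth

# One 0.2 ball, offset by one pixel between the two rows
row_l = np.array([1.0, 1.0, 0.2, 1.0, 1.0, 1.0])
row_r = np.array([1.0, 0.2, 1.0, 1.0, 1.0, 1.0])
depth_map = plane_sweep(row_l, row_r, baseline=0.1, focal_len=50.0,
                        depths=[2.5, 5.0])  # the ball scores best at 5.0
```

Note that on the flat white background every proposed depth matches equally well, which is the same correspondence ambiguity as before, just seen from the other direction; this is where the coherency assumptions above come in.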
This is not to say this is the only way to find a depth map from images (or light fields). If more is known about the contents or structure of the scene, that could be exploited. For instance:
- Straight lines could be analyzed to look for architectural features and vanishing points, then combined with assumptions about the sizes of buildings, doors, windows, etc. to get the scale. This is not a general solution, though, and wouldn't help with people, nature, landscapes, etc.
- Some intensive pattern recognition, probably with the help of machine learning, could identify objects (head, eye, doorway, cat, finger, flower, ...) in the scene. Combined with knowledge of the approximate sizes of those objects and the focal length of the lens, this could tell us the world-space position of an object (even from a single image).
- As an extension to the previous idea, if some assumptions can be made about the 3D structure of the object (round, flat) and the placement of the light, a program could fine-tune the depth.
Back to Lytro
Again, I have no idea what algorithms Lytro uses. But judging from the errors I've seen in the depth maps generated for my photos, they share a lot of the basic weaknesses described above, and I haven't seen evidence of these more content-based methods.
While the theory behind finding a depth map isn't too complicated, there are complications in using real-world data with all of its noise. In Lytro's case, I think the depth map might be used to speed up processing when changing the depth of field, but it is unreliable and results in inaccurate renderings. Though it would likely be slower, I think the image could be rendered directly from the light field data and might not be subject to the same problems.