
The Virtual Stereo Camera Rig


This article explains the basics of stereoscopy and the geometry of a virtual stereoscopic camera rig for 3D image synthesis.

The second part shows how the geometrically interdependent values within this optical system are decoupled to derive reasonable parameters.

Basic setup

Letting both eyes of a viewer see two different images from a slightly different perspective can create a perception of depth.

When looking at an object, the brain instructs the eyes to adjust to the distance and matches the two images with respect to that object. In other words, it makes sure the object is perceived sharply and that its images from both eyes converge in the same (perceived) spot. The eyes also rotate inward during this process. However, as we will see, this effect is unimportant for crafting a working geometrical model, and the so-called toe-in is not discussed further in this article.

There are two simple experiments to expose the convergence process:

  • Stretch out your arm and point at an object in the room at twice the distance of your finger,
  • focus on the distant object,
  • alternatingly close the left and the right eye,
  • see how your finger jumps left and right.

For the second experiment (which may take several attempts and a little more concentration) just focus on the finger and observe the distant object jump instead.

You may have noticed that the finger jumps to the left and the remote object jumps to the right when switching from the left to the right eye: the so-called parallax is zero at the convergence distance, and that is where its direction changes.

The eyes initially see two separate images. Both images of the object converge to one image as the result of an adjustment process: the viewer finds two images and an interpretation of them that allows the consistent perception of the object of interest.

In order to find a geometric model, the image planes at the desired convergence depth of two parallel perspective projections can simply be moved into the same position, as illustrated in the figure below. Since the object remains in the same spot, the two image planes must be moved in opposite directions.

Convergence

The bottom images show the boundaries of the field of view of each eye in a plan view.

Horizontal shear is applied along the depth axis to achieve congruence at a specific convergence depth. That is, the volumes representing the field of view of each eye are skewed. Another way to conceive of this geometry (assuming rectangular projections of pyramidal volumes, as is typical for 3D computer graphics) is to cut pyramidal wedges off a wider pyramidal volume, thereby removing, at the left end of one eye's image and the right end of the other's, the vertical stripe that has no matching portion in the image seen by the other eye.
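To make the construction concrete, the horizontal bounds of one eye's sheared viewing volume can be written in glFrustum style. The following is a minimal sketch with hypothetical names (fovPerEye is that camera's own horizontal FOV, eyeOffset its signed horizontal position, negative for the left eye; convergence and near are distances in front of the cameras):

function shearedFrustumBounds(
	fovPerEye: number, eyeOffset: number, convergence: number, near: number
): { left: number, right: number } {

	// symmetric extent of this eye's image at the near plane
	const halfWidth = near * Math.tan( 0.5 * fovPerEye );

	// the shear shifts the window toward the rig center, so that both
	// windows coincide at the convergence distance
	const shift = - eyeOffset * near / convergence;

	return { left: - halfWidth + shift, right: halfWidth + shift };
}

At the convergence distance the two shifted windows land in the same spot; the vertical bounds stay symmetric.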

Parametrization

As we have seen, it is fairly easy to construct a stereoscopic camera system. This section will discuss how to parametrize this system in a reasonable way.

When two parallel perspective camera projections are set up and shear is applied to achieve convergence at a specific depth, the shear changes the projected FOV of each camera, distorting the inner angle in the horizontal direction. This distortion has to be accounted for in order to get back to an angle that actually exists somewhere in the final system, but the corrected angle would still depend on both the convergence distance and the translation of the cameras: the shear is the quotient of these values.
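To make this dependence explicit (a sketch using symbols that do not appear in the original text): write $d$ for the horizontal camera distance, $c$ for the convergence distance, $\theta$ for each camera's horizontal FOV before shearing, and $s$ for the shear. The shear adds $s$ to the slopes of both horizontal frustum boundaries, so the inner angle of the sheared volume becomes

$$ s = \frac{d/2}{c}, \qquad \theta' = \arctan\!\left(\tan\frac{\theta}{2} + s\right) + \arctan\!\left(\tan\frac{\theta}{2} - s\right) $$

which depends on $d$ and $c$ only through their quotient.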

Our model is symmetric, so one may ask whether there is a formulation for a field of view considering both eyes, as this would allow for a meaningful definition of a corresponding input.

For this purpose it makes good sense to use the area jointly covered by both eyes in front of the convergence plane, which then becomes the area covered by either one of the eyes behind it, as illustrated below.

Effective FOV

This definition accurately captures the field of view for both eyes and also makes it possible to create an equivalent image without the stereo effect.

We ignore the two extra regions "in the corners of the eyes" which cannot be perceived consistently: These are typically small and most of their collective area will probably be near clipped. If they can contain objects, they should never draw the user's attention and custom clipping should be used to prevent that from happening where appropriate.

The horizontal inner angle of the combined field of view can serve as a useful input: it has a well-defined unit, is easy to understand, and works the same way without the stereo effect.

In order to decouple other parameters we will want to keep the field of view constant. To achieve this goal, it is crucial to understand the geometric coupling of the values in the system:

Naive Convergence Control

The animation shows that changing the distance to the convergence plane (marked green) by modifying the skew also changes the field of view (depicted as dashed black lines). The origin of the frustum moves, so the near clipping plane, which sits at a fixed distance from the cameras (marked yellow), changes its distance with respect to the origin of the frustum.

Similar effects can be observed when changing the horizontal translation of the camera as illustrated below.

Naive Camera Translation

The moving origin of the viewing volume is a problem. To solve it, we use it as the center of the system and translate the cameras along the depth axis. We also specify the distances of the near and far clipping planes and the convergence distance with respect to the new origin. These distances can then be adjusted so that the planes remain in the same spot:

-- Given:
--
--    hFov		- horizontal inner FOV		[radians]
--    dFocus		- convergence depth		[scene units]
--
--    hCamDistance	- horiz. camera distance	[scene units]


widthPerDepth := 2 * Math.tan( 0.5 * hFov )


dTranslation := hCamDistance / widthPerDepth

-- applies to
-- camera positions & depth clipping bounds, and
-- convergence distance.


hShearZ := 0.5 * hCamDistance / ( dFocus + dTranslation )

-- applies to the projection matrix
-- multiply with 'near' for glFrustum-style L/R bounds
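
For reference, the listing above can be transliterated into runnable code. This is a hedged sketch (function and variable names such as stereoRigParameters and frustumShiftAtNear are made up here, not taken from any library), assuming near, far and dFocus are measured from the new origin along the viewing direction:

function stereoRigParameters(
	hFov: number,          // horizontal inner FOV              [radians]
	dFocus: number,        // convergence depth                 [scene units]
	hCamDistance: number,  // horizontal camera distance        [scene units]
	near: number,          // near clipping depth from origin   [scene units]
	far: number            // far clipping depth from origin    [scene units]
) {
	const widthPerDepth = 2 * Math.tan( 0.5 * hFov );
	const dTranslation = hCamDistance / widthPerDepth;

	// camera positions and camera-relative depth values
	const cameraX = [ - 0.5 * hCamDistance, 0.5 * hCamDistance ];
	const cameraZ = - dTranslation;
	const nearFromCamera = near + dTranslation;
	const farFromCamera = far + dTranslation;
	const focusFromCamera = dFocus + dTranslation;

	// horizontal shear per unit depth; multiplied with the camera-relative
	// near distance it yields the offset of the glFrustum-style L/R bounds
	// (applied with opposite signs for the two eyes, toward the rig center)
	const hShearZ = 0.5 * hCamDistance / focusFromCamera;
	const frustumShiftAtNear = hShearZ * nearFromCamera;

	return { cameraX, cameraZ, nearFromCamera, farFromCamera, hShearZ, frustumShiftAtNear };
}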

With these modifications in effect, the convergence distance can be specified independently of the FOV angle and the translation of the cameras, as illustrated in the next two animations:

Decoupled Convergence Control

Decoupled Camera Translation

The convergence distance in scene units is a meaningful input as-is, although the user might be interested in the scale of scene units with respect to physical units.

The camera translation is just the interpupillary distance of the viewer scaled to scene units. However, this scale is by no means a fixed value but actually depends on several factors:

In the physical world, the screen occupies a certain area of the viewer's field of view. As long as the distance to the screen remains constant, the corresponding viewing angle does not change.

For best realism, one would use the same FOV angle for the projection. However, since this angle is unknown (unless using a head-mounted device calibrated to its user), the two angles may very well differ. Also, the human brain adapts to different fields of view quite easily; looking through binoculars temporarily narrows the field of view, for instance. A simulation of this effect can be found in many gaming applications.

As a consequence of these effects, intentional or not, the screen is no longer just a window into another world but also applies some kind of zoom, while its size in the optical system remains constant. Therefore the scale between scene units and physical units changes with the FOV angle. When this phenomenon is ignored, zooming in can yield parallax that is orders of magnitude too large, and zooming out will make the parallax vanishingly small.

The scale of a scene can only be specified when there is a (known or assumed) physical FOV angle to anchor it to. When an application does not use the entire screen for 3D output, the fraction of the screen that is used must also be considered:

-- Given:
--
--    hFov		- horizontal inner FOV		[radians]
--
--    eyeDistance	- interpupillary distance	[meters]
--    oneMeter		- when hFov == hScreenFov	[scene units]
--    hScreenFov	- characterizes the display	[radians]
--    screenWidth	- full screen width		[pixels]
--    viewportWidth - part of the screen used		[pixels]
--

scale := oneMeter * ( tan( 0.5 * hFov ) * viewportWidth ) /
                    ( tan( 0.5 * hScreenFov ) * screenWidth )

hCamDistance := eyeDistance * scale
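
The same listing as a hedged, runnable sketch (again with made-up names); its result feeds directly into the hCamDistance input of the rig parametrization above:

function camDistanceFromPhysicalSetup(
	hFov: number,          // horizontal inner FOV                         [radians]
	eyeDistance: number,   // interpupillary distance                      [meters]
	oneMeter: number,      // scene units per meter when hFov == hScreenFov
	hScreenFov: number,    // horizontal angle occupied by the full screen [radians]
	screenWidth: number,   // full screen width                            [pixels]
	viewportWidth: number  // part of the screen used                      [pixels]
): number {
	const scale = oneMeter * ( Math.tan( 0.5 * hFov ) * viewportWidth ) /
	                         ( Math.tan( 0.5 * hScreenFov ) * screenWidth );
	return eyeDistance * scale; // hCamDistance in scene units
}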

The animation below illustrates zooming out by widening the FOV while accounting for the changing scale.

Zooming Out, Scale Considered.
