Saturday, April 28, 2007

Better Video -- Gamma and A/D converters

Okay folks, this is going to get somewhat detailed, because I think I have a half-decent and possibly new idea here. As you read along, just keep in mind that the overall goal is to make a ramp-compare A/D converter that has really large dynamic range (16 bits) and goes fast so we can minimize the frame scan time.
[The referenced wikipedia article has some significantly wrong bits. Just reading the referred articles shows the problems.]
First we're going to talk about gamma. Most digital sensors generate digital values from the A/D converter which are linear with the amount of light received by the pixel. One of the first steps in the processing pipeline is to convert this e.g. 12 bit value into a nonlinear 8 bit value. You might wonder why we would go to all that trouble to get 12 bits off the sensor, only to throw away 4 of them.

Consider just four sources of noise in the image for a moment:
  1. Readout noise. This noise is pretty much constant across varying light levels. For the purposes of discussion, let's suppose we have a standard deviation of 20 electrons of readout noise.
  2. kTC noise. Turn off a switch to a capacitor, and you unavoidably sample the state of the electrons diffusing back and forth across the switch. What you are left with is kTC noise, e.g. 28 electrons in a 5 fF well at 300 K. Correlated double sampling (described below) can cancel this noise.
  3. Photon shot noise. This rises as the square root of the electrons captured.
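  4. Quantization noise. This is the difference between the true signal and what the digital system is able to represent. The variance is 1/12 of the square of the step size between quantization levels, so the standard deviation is the step size divided by the square root of 12 (about 0.29 steps).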
You can't add standard deviations but you can add variances. To add these noise sources, take the square root of the sum of the squares. So, if we have a sensor (such as the Kodak KAF-18000) with 20 electrons of readout noise, and a full well capacity of 100,000 electrons, read by a 14 bit A/D with a range that perfectly matches the sensor, then we will see total noise which is dominated by photon shot noise. I've done a spreadsheet which lets you see this here.
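As a rough sketch of that sum (not the spreadsheet itself), the Python below adds the variances for the example sensor. The function name and its defaults are illustrative assumptions. It also assumes that kTC noise has already been cancelled by correlated double sampling, and that quantization error is uniform, so its variance is the step size squared over 12.

```python
import math

def total_noise_electrons(signal_e, read_noise_e=20.0,
                          full_well_e=100_000, adc_bits=14):
    """Combine independent noise sources by adding variances (in electrons)."""
    step_e = full_well_e / 2**adc_bits   # one A/D step, in electrons
    read_var = read_noise_e**2           # readout noise, constant with signal
    shot_var = signal_e                  # Poisson: variance equals the mean
    quant_var = step_e**2 / 12.0         # uniform quantization error
    return math.sqrt(read_var + shot_var + quant_var)

for signal in (100, 1_000, 10_000, 100_000):
    noise = total_noise_electrons(signal)
    print(f"{signal:>7} e-  noise {noise:7.1f} e-  SNR {signal / noise:6.1f}")
```

With these numbers the shot noise term overtakes the readout term once the signal passes 400 electrons (20 squared), and the quantization term stays small throughout.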

Amazingly enough, we can represent the response of this sensor in just 7 bits without adding significant quantization noise. This is why an 8-bit JPG output from a camera with a 12-bit A/D converter really can capture nearly all of what the sensor saw. JPG uses a standard gamma value, which is tuned for visually pleasing results rather than optimal data compression, but the effect is similar. 8-bit JPG doesn't have quite the dynamic range of today's sensors, but it is pretty good.

The ramp-compare A/D converters described in my last blog entry work by comparing the voltage to be converted to a reference voltage which increases over time. When the comparator says the voltages are equal, the A/D samples a digital counter which rises along with the analog reference voltage. Each extra bit of precision doubles the time taken to find the voltage value. When we realize that much of that time is spent discerning small differences in large signal values that will subsequently be ignored, the extra time spent seems quite wasteful.
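Here is a minimal software model of a linear ramp-compare conversion, just to make the doubling concrete. The function name and the idealized, noise-free comparator are assumptions for illustration; a real column A/D latches a shared counter rather than looping per pixel.

```python
def ramp_compare(pixel_voltage, full_scale_voltage, bits):
    """Return (code, compares_used) for an idealized linear ramp.

    The reference steps upward one level per compare; the code is the
    counter value at the first step where the reference reaches the pixel.
    """
    levels = 2**bits
    step = full_scale_voltage / levels
    for count in range(levels):
        reference = (count + 1) * step
        if reference >= pixel_voltage:
            return count, count + 1
    return levels - 1, levels  # pixel above full scale: saturated

# A full-scale pixel needs the whole ramp, so the sweep length is 2**bits.
for bits in (8, 10, 12):
    code, compares = ramp_compare(1.0, 1.0, bits)
    print(bits, "bits:", compares, "compares")
```

Because the ramp is shared by every column, it has to sweep the whole range on each row even when most pixels trip their comparators early.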

Instead of having the reference voltage linearly ramp up, we could have the reference voltage exponentially ramp up, so that the A/D converter would generate the 8b values from the gamma curve directly. The advantage would be that the ramp could take 2^8=256 compares instead of, say, 2^12=4096 compares -- a lot faster!

It's not quite so easy, however. In order to eliminate kTC noise, the A/D converter actually makes two measurements: one of the pixel value after reset (which has sampled kTC noise), and another of the pixel value after exposure (which has the same sample of kTC noise plus the accumulated photoelectrons). Because the kTC sample is the same, the difference between the two has no kTC noise. This technique is called correlated double sampling (CDS), and it is essential. Because gamma-coded values are nonlinear, there is no easy way to subtract them -- you have to convert to linear space, then subtract, then convert back. As I mentioned, for a typical 5 fF capacitance, kTC noise at room temperature is 28 electrons, so this can easily dominate the noise in low illumination operation.
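The 28 electron figure can be checked from the usual expression sqrt(kTC)/q, and the cancellation can be shown with a toy model. This is only a sketch: the function names are made up for illustration, and the model ignores readout and shot noise so that only the shared reset sample is visible.

```python
import math
import random

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19

def ktc_noise_electrons(capacitance_f, temperature_k=300.0):
    """RMS reset (kTC) noise on a capacitor, expressed in electrons."""
    charge_rms = math.sqrt(BOLTZMANN_J_PER_K * temperature_k * capacitance_f)
    return charge_rms / ELECTRON_CHARGE_C

def correlated_double_sample(photoelectrons, capacitance_f=5e-15, rng=random):
    """Toy CDS: both samples share one reset offset, so it subtracts out."""
    reset_offset = rng.gauss(0.0, ktc_noise_electrons(capacitance_f))
    reset_sample = reset_offset                     # measured right after reset
    exposed_sample = reset_offset + photoelectrons  # measured after exposure
    return exposed_sample - reset_sample            # linear-space subtraction

print(f"kTC noise: {ktc_noise_electrons(5e-15):.1f} e-")
print(f"CDS result for 100 e-: {correlated_double_sample(100):.3f} e-")
```

The subtraction only works this simply because both samples are linear in electrons, which is exactly the problem with gamma-coded values.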

So what we need is an A/D that produces logarithmically encoded values that are easy to subtract. That's easy -- floating point numbers!

If we assume we have a full well capacity of 8000 electrons and we want the equivalent of 10b dynamic range but only need 6b of precision, then the floating-point ramp-compare A/D does the following:
Mantissa 6 bits, 8 e- step size
64 steps of 8 e- to 512 electrons, measure kTC noise
64 steps of 8 e- to 512 electrons
32 more steps of 16 e- to 1024
32 more steps of 32 e- to 2048
32 more steps of 64 e- to 4096
32 more steps of 128 e- to 8192

That's just 256 compares, and gets 10b dynamic range, so it's 4x faster than a normal ramp-compare.
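The schedule above can be written out as a list of reference levels. This is a sketch under the stated numbers (8 e- first step, 64 steps in the first segment, 32 in each later one). The function names are hypothetical, and the comparator is idealized.

```python
def float_ramp_thresholds(first_step_e=8, first_segment_steps=64,
                          later_segment_steps=32, segments=5):
    """Reference levels, in electrons, for the piecewise-linear ramp."""
    thresholds = []
    level = 0
    step = first_step_e
    for segment in range(segments):
        steps = first_segment_steps if segment == 0 else later_segment_steps
        for _ in range(steps):
            level += step
            thresholds.append(level)
        step *= 2  # each later segment doubles the step size
    return thresholds

def convert(signal_e, thresholds):
    """Index of the first reference level that reaches the signal."""
    for code, level in enumerate(thresholds):
        if level >= signal_e:
            return code
    return len(thresholds) - 1  # saturated

signal_ramp = float_ramp_thresholds()
reset_ramp_steps = 64  # the extra 64-step ramp that measures the reset level
print(len(signal_ramp) + reset_ramp_steps, "compares, top level", signal_ramp[-1])

# Codes decode to linear electron counts through the table, so the reset
# sample can be subtracted directly.
reset_e = signal_ramp[convert(30, signal_ramp)]
exposed_e = signal_ramp[convert(3030, signal_ramp)]
print("CDS difference:", exposed_e - reset_e, "e-")
```

Decoding through the table is what makes the subtraction easy: every code maps straight back to a linear electron count.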

In the last blog post, I described how you could do sequential, faster exposures per pixel to get increased dynamic range (in highlights, not shadow, of course). For example, each faster exposure might be 1/4 the time of the exposure before. The value from one of these faster exposures would only be used if the well had collected between 2000 and 8000 electrons, since if there are fewer electrons the next longer exposure would be used for more accuracy, and if there are more electrons the well is saturated and inaccurate.

One nice thing about having a minimum of 2000 electrons in the signal you are sampling is that the signal-to-noise ratio will be around 40, mainly due to photon shot noise. kTC noise will be swamped, so there is no need for correlated double sampling for these extra exposures. 40:1 is a good SNR. For comparison, you can read tiny white-on-black text through a decent lens with just 10:1 SNR.

If you make the ratio between exposures larger, say 8:1, then you either lose SNR at the bottom portion of the subsequent exposures, or you need a larger well capacity, and in either case the A/D conversion will take more steps. These highlight exposures are very quick to convert because they don't need lots of high-precision LSB steps.

When digitizing the faster exposures, the ramp-compare A/D converters just do:
64 steps of 64 e- to 4096
32 more steps of 128 e- to 8192

That's 96 compares and gets another 2 bits of dynamic range.

1 base exposure and 3 such faster exposures would give 16b equivalent dynamic range in 544 compares, which is faster than the 10b linear ramp-compare A/D converters used by Micron and Sony. Now as I said in my previous post, this is a dream camera, not a reality. There is a lot of technical risk in this A/D scheme. These ADCs are very touchy devices. For example, 8000 electrons on a 5 fF capacitor is just 0.256 volts and requires distinguishing 0.256 millivolt signal levels. If the compare rate is 50 MHz, you get just 20 ns to make that quarter-millivolt distinction. It's tough.
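The arithmetic in that paragraph, spelled out as a quick check. The variable and function names are illustrative only, and the voltage figures simply apply V = Q/C to the 5 fF capacitance assumed in the text.

```python
ELECTRON_CHARGE_C = 1.602176634e-19

def electrons_to_volts(electrons, capacitance_f=5e-15):
    """Voltage produced by a number of electrons on the sense capacitor."""
    return electrons * ELECTRON_CHARGE_C / capacitance_f

base_compares = 64 + 64 + 4 * 32   # reset ramp plus the five signal segments
fast_compares = 64 + 32            # each shorter exposure
total_compares = base_compares + 3 * fast_compares
dynamic_range_bits = 10 + 3 * 2    # 2 extra bits per 4x-shorter exposure

full_well_volts = electrons_to_volts(8000)
lsb_volts = electrons_to_volts(8)  # smallest step on the ramp
compare_time_ns = 1e9 / 50e6       # time per compare at 50 MHz

print(total_compares, "compares for", dynamic_range_bits, "bits")
print(f"{full_well_volts:.3f} V full well, {lsb_volts * 1e3:.3f} mV per step, "
      f"{compare_time_ns:.0f} ns per compare")
```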

But, the bottom line is that this scheme can deliver a wall of A/Ds which can do variable dynamic range with short conversion times. The next post will show how we'll use these to construct a very high resolution, high sensitivity, high frame rate sensor for reasonable cost.

Friday, April 20, 2007

Better Video -- A/D converters

Most of what I shoot with my camera is my kids and extended family, vacations and so forth. I need a better camera, one that can do DSLR quality stills, and also better-than-HDTV video. I'm going to write about a few things I'd like to see in that better camera.

Wall of A/D Converters

Modern DSLRs have a focal plane shutter which transits the focal plane in 4 to 5 ms. This shutter is limited to 200k to 300k operations, or at most about 166 minutes of video at 30 frames per second, so it's incompatible with video camera operation.

Video cameras typically read their images out in 16 to 33 ms with what is known as an electronic rolling shutter. The camera has two counters which count through the rows of pixels on the sensor. Each row pointed to by the first counter is reset, and each row pointed to by the second counter is read. The time delay between the two counters sets the exposure time, up to a maximum of the time between frames, which is usually 33 ms.
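A small model of the two counters may help. It is only a sketch: the function name, the event tuples, and the one-row-per-tick timing are assumptions for illustration, not how any particular sensor is sequenced.

```python
def rolling_shutter_events(rows, row_time_ms, exposure_rows):
    """List (time_ms, action, row) events for a two-counter rolling shutter.

    The reset counter leads the read counter by `exposure_rows` rows, so
    every row is exposed for exposure_rows * row_time_ms.
    """
    events = []
    for tick in range(rows + exposure_rows):
        reset_row = tick                 # first counter
        read_row = tick - exposure_rows  # second counter trails behind
        if reset_row < rows:
            events.append((tick * row_time_ms, "reset", reset_row))
        if 0 <= read_row < rows:
            events.append((tick * row_time_ms, "read", read_row))
    return events

for event in rolling_shutter_events(rows=4, row_time_ms=1.0, exposure_rows=2):
    print(event)
```

Every row gets the same exposure time, but the exposure of the last row starts rows * row_time_ms after the first, which is the source of the skew described next.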

A lot can happen in 33 ms, so the action at the top of the frame can look different from that at the bottom. In video, since the picture is displayed with the bottom delayed relative to the top, this can be okay, but it looks weird in still shots. ERS is even worse in most higher resolution CMOS sensors which can take a hundred or more ms to read out.

It turns out there is a solution which serves both camps. Micron and Sony both have CMOS sensors (Micron's 4MP 200 FPS and Sony's 6.6MP 60 FPS) designed to scan the image out in about the same time as a DSLR shutter. Instead of running all the pixels through a single or small number of A/D converters, they have an A/D converter per column, and digitize all the pixels in a row simultaneously. The A/D converters are slower, so there is still a limit to how fast the thing can run, but it is feasible (the Micron chip does it) to read the sensor in 5 ms.

These A/D converters are cool because they allow good-looking stop motion like a focal plane DSLR shutter, they can be used for video, and here's the kicker: you get the capability of 200 frame-per-second video!

Currently these A/D converters have 10 bits of precision. Sony's chip can digitize at 1/4 speed and get 12 bits of precision, matching what DSLRs have delivered for years. We can do better than that.

The basic idea is to combine multiple exposures. Generally this is done by doing one exposure at, say, 16ms, and then another at 4ms immediately afterwards, and combining in software. The trouble with this technique is that there is a minimum delay between the exposures of whatever the readout time of the sensor frame is -- call it 5 ms. Enough motion happens in this 5 ms to blur bright objects which one would otherwise expect to be sharp.

Instead, let's have all the exposures at each pixel be done sequentially with no intervening gaps. Three counters progress down the sensor: a first reset, a second which reads and then immediately resets, and a third which just reads. The delay between the first and second waves is 16 times greater than the delay between the second and third waves. The sensor alternates between reading the pixels on the second and third wave rows, and alternates between resetting the first and second wave rows.

Because one exposure is 16x the other, we get 4 more bits than the basic A/D converter would give us otherwise. If the base A/D converter is 10 bits, this would get us to 14 bits. We don't want to have more than a 16x difference, because pixels that just barely saturate the long-exposure A/D have just 6 bits of precision in the short-exposure A/D. 5 bits or less might look funny (you'd see a little quantization noise right at the switchover where darker pixels had less).
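The bit counting here is just base-2 logarithms of the exposure ratio. A sketch, with hypothetical function names:

```python
import math

def combined_bits(base_bits, exposure_ratio, exposures=2):
    """Dynamic range, in bits, of several back-to-back exposures."""
    return base_bits + (exposures - 1) * math.log2(exposure_ratio)

def switchover_bits(base_bits, exposure_ratio):
    """Significant bits left in the shorter exposure at the point where
    the longer exposure just saturates."""
    return base_bits - math.log2(exposure_ratio)

print(combined_bits(10, 16))    # 10-bit A/D, 16x ratio
print(switchover_bits(10, 16))  # precision right at the switchover
print(switchover_bits(10, 32))  # a 32x ratio would leave only 5 bits
```

The tradeoff is visible in the two functions: a larger ratio adds range in the first and removes precision in the second by the same amount.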

But we can do still better. These column-parallel 10 bit A/D converters work by sharing a common voltage line which is cycled through 1024 possible voltage levels by a single D/A converter. So a 1000 row sensor has to cycle through 1024000 voltages in 5 ms -- the D/A is running at an effective 205 MHz. I'm pretty sure they actually run at 1/2 to 1/4 this clock speed and take multiple samples during each clock cycle. Each column A/D is actually just a comparator which signals when the common voltage exceeds the pixel voltage. If we're willing to have just 9 bits of precision, the thing can run 2 times faster. In low light, that gives us ample time for 4 successive exposures down the sensor (not just two), each, say, 8x smaller than the one before. Now we have 9+3+3+3=18 bits of dynamic range, good for about 14 stops of variation in the scene, with at least six significant bits everywhere but the bottom of the range.
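The D/A rate follows directly from rows, levels per row, and readout time. A quick sketch (the function name is illustrative, and it assumes one full ramp per row with no overhead between rows):

```python
def dac_rate_hz(rows, bits, readout_time_s):
    """Effective D/A update rate when every row gets one full ramp."""
    return rows * 2**bits / readout_time_s

ten_bit = dac_rate_hz(rows=1000, bits=10, readout_time_s=5e-3)
nine_bit = dac_rate_hz(rows=1000, bits=9, readout_time_s=5e-3)
print(f"10 bit: {ten_bit / 1e6:.1f} MHz, 9 bit: {nine_bit / 1e6:.1f} MHz")

# At a fixed D/A rate, halving the ramp length frees time for a second
# pass over each row within the same 5 ms.
print("ramps per row at the 10 bit rate:", ten_bit / nine_bit)
```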

Why bother? Well, if the sensor has a decent pixel size and reasonably low readout noise (I'm thinking of the Micron sensor, but can't say numbers), then an e.g. 16 ms shot with an f/2.8 lens should capture an EV 4 interior reasonably well (here's the wikipedia explanation of EV). That's a dim house interior, or something like a church. Using the 18b A/D scheme above, we could capture an EV 18 stained glass window in that church and a bride and groom at the same time, with no artificial lighting, assuming the camera is on a tripod. That's pretty cool.

The fact that it takes twice as long (e.g. 10 ms instead of 5 ms) to read the sensor is fine. You'd only do this in low light, where your exposures will have to be long anyway. Even if you could read the sensor in 5 ms, if the exposure is 16 ms you can't possibly have better than 60 frames per second anyway. And people who want slow-motion high-resolution video with natural lighting in church interiors are simply asking for too much.

When the scene doesn't need the dynamic range (say, you are outdoors), you can drop down to 12 bits and run as fast as the 10b column-parallel A/Ds allow in the Micron and Sony chips. This gives you 8 stops of EV range, similar to what most DSLRs deliver today. If you want extra-high frame rates (400 fps full frame), drop to 9 bits of precision.

Actually, if the camcorder back end can handle 8x the data rate, you can imagine very high frame rates (and correspondingly short exposures) done by dropping to 8 or 7 bits of precision, and binning the CMOS pixels together or using a subset of the sensor rows. I think 432-line resolution at 8000 fps would be a pretty awesome feature on a consumer camcorder, even if it couldn't sustain that for more than a second or two after a shutter trigger. By using a subset of the sensor columns or binning CMOS pixels horizontally, you might get the back end data rates down to 1-2x normal operation. That'd be amazing: normal TV resolution, sustained capture of 8000 fps video. Looking at it another way, it gives an idea of how hard it is to swallow the data off a sensor such as I am describing. (I'm getting ahead of myself, talking about resolution here, but bear with me.)

Side note: you don't have and don't want an actual 18b number to represent the brightness at a pixel. Instead, the sensor selects which of the 4 exposures was brightest but not saturated. The output value is then 2 bits to indicate the exposure and 9 bits to indicate the value. This data reduction step happens in the sensor: If the maximum exposure time at full frame rate is 16 ms, then the sensor needs to carry just 1 ms worth of data from the first wave of pixel readouts to the second and later waves... at most about 1/20 of the total number of pixels. That's 560 KB of state for an 8 MP chip. Since the chip is CMOS, that's a pretty reasonable amount of state to carry around.
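One way the 2+9 bit word could be packed, as a sketch. The function names, the longest-exposure-first ordering, and the convention that the top code (511) means "saturated" are all assumptions made for illustration.

```python
def encode_pixel(readings, saturation_code=511):
    """Pack the longest unsaturated exposure into an 11-bit word.

    `readings` holds 9-bit codes ordered longest exposure first. The top
    2 bits of the result select the exposure, the low 9 bits are its code.
    """
    for index, code in enumerate(readings):
        if code < saturation_code:
            return (index << 9) | code
    # Every exposure saturated: report the shortest one at full scale.
    return ((len(readings) - 1) << 9) | saturation_code

def decode_pixel(word, exposure_ratio=8):
    """Recover a linear brightness in units of the longest exposure's LSB."""
    index = word >> 9
    code = word & 0x1FF
    return code * exposure_ratio**index

word = encode_pixel([511, 511, 300, 37])  # two longest exposures saturated
print(word, decode_pixel(word))
```

Decoding scales the code by the exposure ratio raised to the exposure index, which is the same mantissa-and-exponent idea as a small floating-point number.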

Stay tuned for an even better place to stuff that 560 KB.