Color Image Fidelity Metrics Evaluated Using Image Distortion Maps

Xuemei Zhang, Hewlett Packard Labs
Brian Wandell, Stanford University

Abstract:

Several color image fidelity metrics are evaluated by comparing the metric predictions to empirical measurements. Subjects examined image pairs consisting of an original and a reproduction. They marked locations on the reproduction that differed detectably from the original. We refer to the distribution of error marks by the subjects as image distortion maps.

The empirically obtained image distortion maps are compared to the predicted visible differences calculated using (1) the widely used root mean square error (point-by-point RMS) computed in uncalibrated RGB values, (2) the point-by-point CIELAB $\Delta E_{94}$ values (CIE, 1994), and (3) S-CIELAB $\Delta E_{94}$, a spatial extension of the CIELAB $\Delta E$ metric. The uncalibrated RMS metric did not predict the perceptual image distortion data well. The point-by-point CIELAB $\Delta E_{94}$ metric provided better predictions, and the S-CIELAB metric, which incorporated the spatial color sensitivity of the eye, gave the most accurate predictions.

None of the metrics provided an excellent fit to the data. Image areas with poor predictions were concentrated in regions containing large negative local contrast. When these areas were excluded from our data analysis, both S-CIELAB and CIELAB predictions had much better agreement with the perceptual data. This suggests that the next step in improving color image fidelity metrics is to redefine color difference formulas such as CIELAB $\Delta E_{94}$ in terms of local contrast.

Introduction

Measurement of perceptual image fidelity is an important issue in digital image reproduction. When a reproduction appears identical to the original, we say it has perfect perceptual image fidelity. In current imaging applications, printed and displayed reproductions have visible distortions when compared to the original. To develop methods for improving image fidelity, it is useful to have metrics to measure fidelity. These metrics must account for the characteristics of the human visual system.

Metrics for predicting the visibility of color changes of large uniform targets have been used widely to describe tolerances for color reproduction of large samples in the paint and dye industry. The CIELAB metric is a standard that specifies how to transform physical image measurements into perceptual differences. The metric was derived from perceptual measurements of color discrimination of large uniform targets. Though not perfect, the metric has been in use for twenty years, and it has served as a satisfactory tool for measuring perceptual difference between large uniform patches of colors. A modification of the $\Delta E$ formula was released by CIE in 1994 based on new experimental data. The new formula was found to predict color difference slightly better than the old formula [1]. Hence, in this paper we will use the CIE94 formula to calculate $\Delta E$ values.

The CIELAB metric is not well suited to measuring image fidelity. Many studies have found that color discrimination and appearance depend on the spatial pattern of the image [2,3,4,5,6,7]. For example, the human visual system is less sensitive to color differences in fine details than in large patches, yet the CIELAB color metric predicts the same perceptual difference for the two cases, since there is no spatial variable in the CIELAB transformation.

In a spatial extension of CIELAB, named S-CIELAB [8], the spatial-color sensitivity of the human eye is included in the metric. The S-CIELAB metric incorporates the different spatial sensitivities of the three opponent color channels by adding a spatial pre-processing step before the standard CIELAB $\Delta E$ calculation. This spatial extension is designed to account for human spatial-color sensitivity and thus improve the performance of the CIELAB $\Delta E$ metric for patterned targets.

To test how well the S-CIELAB metric predicts image fidelity of natural images, we measured perceptual image distortion for a set of color images. These images were displayed on a CRT monitor, and reproduced using either (a) a simple halftoning algorithm or (b) a simple image compression algorithm [9]. In this paper we compare how well several color metrics predict the perceptual image fidelity.

Color image fidelity metrics to be evaluated

The image fidelity metrics evaluated in this paper are (1) the widely used root mean square error (point-by-point RMS) computed in uncalibrated RGB values, (2) the point-by-point CIELAB $\Delta E_{94}$ values [10], and (3) S-CIELAB $\Delta E_{94}$, a spatial extension of CIELAB.

RMS on RGB

The RMS error values were computed as point-by-point vector length of the RGB difference image between an original image and its reproduction. For example, the point-by-point RMS error at position (i,j) of the images is:

\begin{displaymath}RMS_{ij}=\sqrt{(\Delta R_{ij})^2 + (\Delta G_{ij})^2 + (\Delta B_{ij})^2} \qquad (1)
\end{displaymath}

where $\Delta R$, $\Delta G$, and $\Delta B$ represent the difference in R, G, and B values between the original color image and the reproduction.
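As a minimal sketch of this computation (not the authors' code), the point-by-point RMS map can be obtained with NumPy. The function name `rms_error_map` and the assumption that frame-buffer values are scaled to [0, 1] are illustrative choices, not details given in the text.

```python
import numpy as np

def rms_error_map(original_rgb, reproduction_rgb):
    """Per-pixel RMS error: vector length of the RGB difference at each pixel.

    Both inputs are arrays of shape (height, width, 3) holding frame-buffer
    RGB values. The [0, 1] scaling used in the example is an assumption.
    """
    diff = np.asarray(original_rgb, dtype=float) - np.asarray(reproduction_rgb, dtype=float)
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Tiny example: a 2x2 image pair that differs at a single pixel.
original = np.zeros((2, 2, 3))
reproduction = np.zeros((2, 2, 3))
reproduction[0, 0] = [0.3, 0.4, 0.0]
rms_map = rms_error_map(original, reproduction)
```

The result has one value per pixel, which is the same format as the distortion maps used later in the analysis.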

The RMS metric does not include any information about the device used to present the images. Therefore, the RMS value computed using the above equation is uncalibrated. Using uncalibrated image values to measure perceptual difference is poor practice, because the displayed image can differ depending on the display hardware. Still, because the RMS measure is commonly used in this fashion, we included it in this analysis.

CIE $\Delta$E metric

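CIELAB is a CIE-recommended standard color metric space [11]. A color is defined in this space by the coordinate values $L^\ast$, $a^\ast$, and $b^\ast$, which are transformations of the CIE tri-stimulus values X, Y, and Z:

\begin{displaymath}\begin{array}{rcl}
L^\ast & = & 116 \left(\frac{Y}{Y_n}\right)^{1/3} - 16 \\
a^\ast & = & 500 \left[\left(\frac{X}{X_n}\right)^{1/3} - \left(\frac{Y}{Y_n}\right)^{1/3}\right] \\
b^\ast & = & 200 \left[\left(\frac{Y}{Y_n}\right)^{1/3} - \left(\frac{Z}{Z_n}\right)^{1/3}\right]
\end{array}
\end{displaymath}

where $\frac{X}{X_n}, \frac{Y}{Y_n}, \frac{Z}{Z_n} > 0.01 $, and $X_n$, $Y_n$, $Z_n$ define the white point.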
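The transformation can be sketched in a few lines of Python. This is an illustration of the standard cube-root CIELAB definition for a single color, not the authors' implementation; the function name `xyz_to_lab` and the example white point are hypothetical.

```python
def xyz_to_lab(xyz, white):
    """Cube-root CIELAB transform for one color.

    Only the cube-root branch is implemented, so the ratios X/Xn, Y/Yn,
    Z/Zn must exceed 0.01, matching the condition stated in the text.
    """
    x, y, z = (c / w for c, w in zip(xyz, white))
    if min(x, y, z) <= 0.01:
        raise ValueError("cube-root form only covers ratios above 0.01")
    fx, fy, fz = x ** (1 / 3), y ** (1 / 3), z ** (1 / 3)
    lightness = 116.0 * fy - 16.0
    a_star = 500.0 * (fx - fy)
    b_star = 200.0 * (fy - fz)
    return lightness, a_star, b_star

white_point = (95.0, 100.0, 108.0)  # hypothetical display white (Xn, Yn, Zn)
lab_white = xyz_to_lab(white_point, white_point)
lab_gray = xyz_to_lab(tuple(0.5 * c for c in white_point), white_point)
```

The white point itself maps to $L^\ast = 100$ with $a^\ast = b^\ast = 0$, and a color at half the white-point intensity stays neutral but maps to a lightness well above 50, which illustrates the compressive non-linearity.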

The CIE X, Y, Z values contain information about the level of light absorption by each of the three types of cone photoreceptors. The XYZ values are close to being linear transformations of the light absorption levels of the long, middle, and short wavelength sensitive cones on the human retina. The non-linear $L^\ast a^\ast b^\ast$ transformation takes into account the findings that perceptual difference between two colors is nonlinear and better predicted by the contrast difference between the two colors [12]. This nonlinearity can be quite large in some regions of the color space, resulting in strongly elongated iso-sensitivity contours in the XYZ space. The $L^\ast a^\ast b^\ast$ transformation attempts to make the iso-sensitivity contours circular in the $L^\ast a^\ast b^\ast$ space.

The CIELAB space was intended to be a perceptually uniform color space, so that equal distances in the color space represent equal perceived differences in appearance. Color difference is defined as the Euclidean distance between two colors in this color space:

\begin{displaymath}\Delta E_{ab}^\ast = \sqrt{(\Delta L^\ast)^2 + (\Delta a^\ast)^2 +
(\Delta b^\ast)^2} \qquad (2)
\end{displaymath}

The CIELAB transformation is relatively simple, but it does not result in a perfectly uniform perceptual space [13]. The goal of the transformation was to make iso-sensitivity contours circular and about the same size everywhere in the color space, and the CIELAB transformation achieves this goal only approximately. In addition, visual environmental factors such as the ambient illumination and background color also affect color discrimination. In 1994, the $\Delta E$ formula was modified based on new experimental data, and in an effort to allow easy parametric correction of the formula. The new color difference formula is calculated as a weighted distance between two colors in the lightness, chroma, and hue space ($L^\ast$, $C^\ast$, $H^\ast$):

\begin{displaymath}\Delta E_{94} = \sqrt{\left(\frac{\Delta L^\ast_{ab}}{k_L S_L}\right)^2 + \left(\frac{\Delta C^\ast_{ab}}{k_C S_C}\right)^2 + \left(\frac{\Delta H^\ast_{ab}}{k_H S_H}\right)^2}
\end{displaymath}

The symbols $\Delta L^\ast_{ab}$, $\Delta C^\ast_{ab}$, and $\Delta H^\ast_{ab}$ represent the differences between the two colors to be compared along the lightness, chroma, and hue dimensions, $S_L$, $S_C$, and $S_H$ represent weighting factors calculated from the chroma coordinates of the ``standard'' of the two colors compared, and $k_L$, $k_C$, and $k_H$ are parameters specific to experimental conditions [1]. In this paper, we will use the CIE94 formula to compute CIELAB $\Delta E$ values, and the result will be denoted as $\Delta E_{94}$. We set the $k_L$, $k_C$, and $k_H$ values to 1. The values $S_C$ and $S_H$ are calculated from the chroma coordinates $C^\ast_{ab}$ of the colors in the original (un-distorted) images.
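A hedged sketch of the CIE94 calculation for one pair of colors follows. The weighting functions $S_L = 1$, $S_C = 1 + 0.045\,C^\ast_{ab}$, and $S_H = 1 + 0.015\,C^\ast_{ab}$ are the standard CIE94 values and are an assumption here, since the text does not print them; the function name `delta_e94` is illustrative.

```python
import math

def delta_e94(lab_standard, lab_sample, k_l=1.0, k_c=1.0, k_h=1.0):
    """CIE94 color difference between a standard color and a sample.

    The weights are computed from the chroma of the standard, which in the
    setting of this paper would be the color from the original image.
    """
    l1, a1, b1 = lab_standard
    l2, a2, b2 = lab_sample
    c1 = math.hypot(a1, b1)
    c2 = math.hypot(a2, b2)
    d_l = l1 - l2
    d_c = c1 - c2
    # Squared hue difference: what remains of the (a*, b*) distance once the
    # chroma difference is removed. Clamped at 0 to absorb rounding error.
    d_h_sq = max((a1 - a2) ** 2 + (b1 - b2) ** 2 - d_c ** 2, 0.0)
    s_l = 1.0
    s_c = 1.0 + 0.045 * c1  # assumed standard CIE94 chroma weight
    s_h = 1.0 + 0.015 * c1  # assumed standard CIE94 hue weight
    return math.sqrt(
        (d_l / (k_l * s_l)) ** 2
        + (d_c / (k_c * s_c)) ** 2
        + d_h_sq / (k_h * s_h) ** 2
    )

neutral_diff = delta_e94((50.0, 0.0, 0.0), (52.0, 0.0, 0.0))
chroma_diff = delta_e94((50.0, 10.0, 0.0), (50.0, 20.0, 0.0))
```

For neutral colors the weights all equal 1 and the result reduces to the lightness difference. For chromatic standards the chroma and hue terms are scaled down, so the same Euclidean distance in $a^\ast b^\ast$ yields a smaller $\Delta E_{94}$.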

The $\Delta$E calculations require specification of a white point. For the calculations in this study, the white point is set to be the white point of the display used in the Image Distortion Map experiment. This is not the best way to choose a white point, because the white point is supposed to depend on individual images, not the display device. However, since the display white was always visible on the screen when subjects did the Image Distortion Map experiment (as a white button on the gray background), this choice of white point was considered acceptable.

S-CIELAB: Adding spatial sensitivity to CIELAB

A spatial extension to CIELAB was developed to account for how spatial pattern influences color appearance and color discrimination. The new spatial color metric is called Spatial-CIELAB, or S-CIELAB [8].

To make the spatial extension to CIELAB modular, the extension is added as a spatial pre-processing of the images before computing CIELAB differences. The purpose of the extension is to remove the image components that cannot be seen by the eye. S-CIELAB consists of three processing steps. First, the original and distorted images, which are represented in a device-dependent space, are converted into a device-independent representation consisting of one luminance and two chrominance color components for each image [2,3]. Second, each component image is passed through a spatial filter that is selected according to the spatial sensitivity of the human eye for that color component. Third, the filtered images are transformed into the CIE-XYZ format so that the CIELAB color difference formula can be applied to give an S-CIELAB $\Delta E_{94}$ map, which tells us where the visible distortions are in the image and how large they are.

The interpretation of the S-CIELAB $\Delta E$ values is the same as the interpretation of the standard CIELAB $\Delta E$ values, i.e., a distortion of 1 $\Delta E_{94}$ unit is at threshold visibility under optimal viewing conditions. Under the less-controlled viewing conditions found in practice, distortions with $\Delta E_{94}$ values of around 2 or below are generally not visible.

Because of the design of the spatial filters, the S-CIELAB predictions are the same as the CIELAB predictions for large uniform targets. S-CIELAB will typically predict lower visibility of color differences for textured regions, however. Qualitatively, these predictions are consistent with measurements of human spatial-color sensitivities.

Summary

We chose the above three metrics to evaluate because they represent a baseline metric (RMS on RGB), a metric with color sensitivity added (CIELAB $\Delta E_{94}$), and a metric that further adds spatial-color sensitivity (S-CIELAB $\Delta E_{94}$). The comparison will reveal how much we gain by progressively refining our metrics to incorporate more perceptual factors that influence color difference perception.

Other factors that affect perceptual fidelity measurements include, but are not limited to, the adaptation of the eye to ambient illumination [14,15,16,17,18], contrast masking effect of spatial patterns [19,20,21], and higher level cognitive processes such as memory and attention [22]. Other color image metrics exist that take into account one or more of these effects, such as DCTune [23]. For this paper, however, we will limit our scope to the three metrics described above, because they are all simple metrics with the calculation tools readily implemented, and they generate predictions in the format of pixel-by-pixel distortion maps, which is consistent with the format of the data we will use.

Perceptual image distortion map

To test the accuracy of color image fidelity models, it is necessary to have a data set of experimental measurements establishing where and how subjects perceive image reproduction errors on real images. In this paper we use a set of measurements of perceived reproduction errors between six natural images and reproductions of these images created using (a) digital halftoning (void and cluster), and (b) image compression (JPEG-DCT).


  
Figure 1: The image distortion map for an original and its reproduction. (A) The original color image is shown in grayscale. (B) The halftoned reproduction using void-and-cluster matrix is shown in grayscale. (C) The image distortion map measured by pooling the data from all observers is shown.

In this experiment, subjects identified regions in halftoned or compressed images that appeared to be different from the original image. They were asked to mark all image regions that had visible distortions with a digital marker stamp available in different sizes, the smallest of which was circular with a diameter of 10 pixels, or 0.4 degree of visual angle. Subjects were encouraged to use the smallest stamp size whenever they could (it was also the default stamp size at the beginning of each image presentation). They were instructed to mark all regions that had visible differences until all differences had been covered by marks. Because of the fixed stamp shape and sizes, most of the time it was impossible to avoid stamping identical regions of the two images. The smallest marker size limits the spatial resolution of the measured image distortion map, and we account for this in the data analysis.

A total of 24 subjects performed the task for all image pairs. The error marks produced by the subjects were pooled for each pair of images. From these pooled data we calculate the probability of a mark covering each pixel in a halftoned or compressed image, which we call image distortion maps. Figure 1 shows the image distortion map for an original and its reproduction. The probability that a pixel is marked is represented by the gray level: Light regions correspond to frequently marked areas (high visible differences) and dark regions correspond to infrequently marked areas (low visible difference). More details about the method and procedures of this experiment can be found in another paper [9].
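The pooling step can be sketched as follows, assuming each observation is stored as a binary mask with 1 wherever a stamp covered the pixel. The storage format and the function name `image_distortion_map` are assumptions made for illustration; the actual data format is described in [9].

```python
import numpy as np

def image_distortion_map(mark_masks):
    """Pool binary mark masks into a per-pixel probability of being marked.

    mark_masks has shape (n_observations, height, width), with 1 where a
    stamp covered the pixel and 0 elsewhere.
    """
    masks = np.asarray(mark_masks, dtype=float)
    return masks.mean(axis=0)

# Four hypothetical observations of a 2x2 image.
marks = np.array([
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[0, 0], [0, 1]],
])
distortion_map = image_distortion_map(marks)
```

Each entry of the result is the fraction of observations in which that pixel was covered by a mark, which is the quantity displayed as gray level in Figure 1.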

This experiment provides a large set of perceptual distortion data on real images in a calibrated format. This data set has several benefits when used to evaluate image fidelity metrics. First, the data were collected for images, not simple spatial frequency patterns, so that the evaluation result can generalize better to practical situations. Second, people can identify where the distortions are on an image very well, so marking regions of distortion is a more natural task than rating an image as a whole for its fidelity. Third, the image distortion maps contain a large amount of perceptual data. Instead of having single number ratings, we have a full map of empirical distortion measures for each image. These maps provide a large amount of information that we can use to evaluate the accuracy of theoretical image fidelity metrics.

Evaluation of metrics

From the monitor calibration data, we computed the CIE XYZ representations of each image as shown on the CRT display. These XYZ values were used to compute the point-by-point CIELAB and S-CIELAB error values. Point-by-point RMS errors were computed from the frame buffer values and needed no calibration information.

Data selection

To evaluate how well each metric measures the perceptual image fidelity, we compare the probability that an image location was marked (areas with visible distortion) with the distortion computed from the metric.

The empirical image distortion maps cannot be compared directly to the distortion values calculated from metrics. Some image regions have very small or no visible differences between the original and the reproduction, but they were close to regions of large perceptual error. These regions may be covered by marks intended to cover the nearby high distortion regions as a result of the fixed stamp sizes and shape. Data from such regions in the image distortion map do not accurately reflect the perceptual difference between the two images at those locations. We therefore exclude these points before comparing the data from the image distortion maps to the metric predictions.

To determine which pixel locations are likely to have been marked due to proximity to pixels with large perceptual error, we compare the metric values at each pixel location with the largest metric value in its 10-pixel diameter circular surround (the size of the smallest marker). For a target pixel location, we search for nearby pixels with significantly larger predicted perceptual error. If such a pixel exists, then the probability of the target pixel's location being marked is likely to be affected more by perceptual distortion levels of its neighbors than its own distortion level. We exclude such target points from the data analysis.

Specifically, let $M_{i,j}$ represent the perceptual distortion value computed by a particular metric at pixel location (i,j), and let $M_c$ represent a criterion value for this metric. We then select points on the image distortion map that satisfy the following:

\begin{displaymath}[\max_{{(p-i)}^2 + {(q-j)}^2 \leq 5^2} M_{p,q}]- M_{i,j} \leq M_c. \qquad (3)
\end{displaymath}

Pixel locations with a distortion value significantly smaller than that of at least one neighbor in the 5-pixel radius circular neighborhood (according to the criterion $M_c$) are not used in the data analysis; the probability of these points being marked is likely to be affected by a neighboring point with much larger perceptual distortion. Ideally, the selection criterion $M_c$ should be determined by the ``true'' perceptual distortion values in the image. However, such data are not available to us. Therefore, we use the theoretical perceptual distortion values instead to select $M_c$. The criterion values $M_c$ we used were chosen separately for each metric, so that evaluation of each metric is not dependent on predictions of other metrics. For S-CIELAB $\Delta E$ values, we used $M_c = 0.5$. For CIE94 $\Delta E$, we used $M_c = 1.5$. For the RMS metric, we used $M_c = 0.02$. The selection of these criterion values is somewhat arbitrary. The only constraint we have is that the proportion of image points selected using these criteria should not be higher than a few percent (which is roughly the percentage of points that can be independently marked given the size and shape of the markers). The above criterion values were selected to keep approximately the same number of image points in the data analysis for the three metrics. Varying the criterion values up and down within the above constraint did not change the analysis result significantly.
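The selection rule can be sketched with NumPy as below. This is an illustrative implementation under stated assumptions, not the authors' code: the neighborhood is taken to be the disk of radius 5 pixels described in the text, pixels outside the image are ignored, and the function name `select_pixels` is hypothetical.

```python
import numpy as np

def select_pixels(metric_map, criterion, radius=5):
    """Boolean mask of pixels kept for analysis.

    A pixel is kept when the largest metric value within `radius` pixels
    exceeds the pixel's own value by no more than `criterion`.
    """
    m = np.asarray(metric_map, dtype=float)
    height, width = m.shape
    # Pad with -inf so that out-of-image neighbors never win the maximum.
    padded = np.pad(m, radius, mode="constant", constant_values=-np.inf)
    local_max = np.full_like(m, -np.inf)
    for dp in range(-radius, radius + 1):
        for dq in range(-radius, radius + 1):
            if dp * dp + dq * dq <= radius * radius:
                shifted = padded[radius + dp: radius + dp + height,
                                 radius + dq: radius + dq + width]
                local_max = np.maximum(local_max, shifted)
    return (local_max - m) <= criterion

# One large error at the center of an otherwise error-free 11x11 map.
error_map = np.zeros((11, 11))
error_map[5, 5] = 10.0
selected = select_pixels(error_map, criterion=0.5)
```

In this toy example the peak itself is kept, every other pixel within 5 pixels of the peak is rejected, and pixels farther away (such as the corners) are kept because no much larger value lies in their neighborhood.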

Because the data selection method depends on perceptual distortion measures by different metrics, we obtain a different selection of image pixels for different metrics. Assuming that the metric predicts perceptual distortion accurately, the selected image pixel locations should have relatively independent probabilities of being marked by the subjects. Thus, we can treat subjects' marks at these selected pixel locations as independent binary choices.


  
Figure 2: Selection of image pixels for metric evaluation. (A) The predicted error map based on the CIELAB metric for one test image. (B) The pixels selected for data analysis are represented as black dots. Only pixels representing relatively large local error values are selected. The pixels are widely distributed across the entire image.

Figure 2 shows the image points selected for data analysis for a perceptual error map predicted by the CIELAB $\Delta E_{94}$ metric. The selections were computed for all image pairs used in the experiment and for all metrics tested.

For all three metrics, approximately 3-6% of the image pixel locations were selected for data analysis. This still leaves us with a large number of data points that we can use to evaluate the metrics. Each image was presented to the subjects 35-40 times, and between 3500 and 11000 pixels of each image distortion map (depending on the size of the image) were selected for data analysis.

For all three metrics, the data selection improves the correlation between the data and metric predictions. An example is shown in Figure 3. Each point in panel (A) shows a selected image pixel. The horizontal axis measures the S-CIELAB value of that pixel, and the vertical axis measures the probability that a subject marked that pixel. Panel (B) shows the same distribution but this time for points rejected from the analysis. Notice that in panel (A) there are no points with small $\Delta E_{94}$ values and large probability of being marked. Panel (B) contains many points in this category, and these were rejected in the selection process. We suspect that such points are accurate reproductions that are marked only because of their proximity to points with large error.


  
Figure 3: Empirical probability plotted as a function of S-CIELAB $\Delta E_{94}$ values, for (a) all selected image pixels (upper plot), and (b) all rejected image pixels (lower plot). These plots are typical of all images and all three metrics.

Method of comparison

We have examined the agreement between metrics and data by treating the metrics as signal detectors that try to predict the subject's responses given the image data. In the experiment, subjects made binary decisions about whether distortions were visible at various image locations. Where the human observer decides there is image distortion, a good metric should predict distortion; and where the human observer does not see distortion, the metric should not. By setting a threshold level for each metric so that any distortion measures above the threshold will be predicted as visible, we can calculate the hit rate (HR) and false alarm rate (FAR) of each metric as a signal detector of the perceptual data. Let $M_{i,j}$ represent the distortion measure by a metric at image location (i,j), $M_c$ represent a threshold level for this metric, and $D_{i,j}$ represent the binary decision each subject made about the visibility of distortion at image point (i,j) (visible=1, invisible=0). Then the hit rate and false alarm rate are defined as:

\begin{displaymath}\begin{array}{rcl}
HR & = & P(M_{i,j} > M_c \mid D_{i,j} = 1) \\
FAR & = & P(M_{i,j} > M_c \mid D_{i,j} = 0)
\end{array}
\end{displaymath}

where P(X) represents the probability of the event X for all image points and all subjects.

A good metric should have high hit rates and low false alarm rates. In a system with noise, it is easy to see that the hit rate of 1 and false alarm rate of 0 will not happen at the same time. If the metric outputs are monotonically related to the probability of detection, both the hit and false alarm rate will increase as threshold levels decrease. A good metric will have much larger hit rates than false alarm rates at all threshold levels. Therefore, plotting the hit rate as a function of false alarm rate (the ROC curve, [24]) for each metric describes how sensitive each metric is in predicting the data. This method does not require us to assume a particular function for converting metric output to probability of being marked. We will plot the ROC curves for the RMS, CIELAB $\Delta E_{94}$ and S-CIELAB $\Delta E_{94}$ metrics as a means to compare them with the experimental data.
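A minimal sketch of the ROC computation is given below. It assumes the subjects' decisions have been flattened into one binary array, with the metric value at the corresponding pixel repeated for each subject, and that both decision classes are present. The function name `roc_points` and the toy data are illustrative, not taken from the experiment.

```python
import numpy as np

def roc_points(metric_values, decisions, thresholds):
    """Return (false_alarm_rate, hit_rate) pairs, one per threshold.

    metric_values: metric output for every (pixel, subject) observation.
    decisions: matching binary decisions (1 = marked as visibly distorted).
    """
    m = np.asarray(metric_values, dtype=float).ravel()
    d = np.asarray(decisions).ravel().astype(bool)
    points = []
    for threshold in thresholds:
        predicted_visible = m > threshold
        hit_rate = float(predicted_visible[d].mean())
        false_alarm_rate = float(predicted_visible[~d].mean())
        points.append((false_alarm_rate, hit_rate))
    return points

# Toy data in which larger metric values coincide with marked pixels.
toy_metric = [0.5, 1.5, 2.5, 3.5]
toy_decisions = [0, 0, 1, 1]
curve = roc_points(toy_metric, toy_decisions, thresholds=[0, 1, 2, 3, 4])
```

Sweeping the threshold from low to high moves the operating point from (1, 1) toward (0, 0). Only the ordering of the metric values matters, which is why no explicit function from metric output to marking probability is needed.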

The same kind of ROC curve can be used to describe the reliability of the data as well. We treated the mean image distortion map as another ``metric'' that is used to predict individual subjects' responses, and calculated an ROC curve for the mean distortion map. This ROC curve reflects how well the mean distortion map predicts individual subjects' data, and therefore reflects the reliability of the image distortion data. It will be used as a reference for how well a theoretical metric predicts the perceptual data.

Results for halftone images

Images with halftone errors and JPEG errors are analyzed separately, because they represent different types of distortions. The halftone distortions generally are a random texture noise, while the JPEG distortions generally include blurring and blocking artifacts. First we look at images with halftone distortions.


  
Figure 4: ROC curves of different metrics, for halftone images.

In Figure 4, we plotted the ROC curves for RMS, CIELAB $\Delta E_{94}$, and S-CIELAB $\Delta E_{94}$, using hit rates and false alarm rates computed from data on the halftone image pairs, at many threshold levels for each metric. A metric that accurately predicts the subjective decisions should have hit rates as large as possible, and false alarm rates as small as possible, i.e. an ROC curve that bows away from the diagonal line as much as possible. The ROC curve labeled Data is generated using the mean image distortion map as the predictor. This marks the best a metric can do in terms of prediction accuracy.

From Figure 4, we can see that for halftone images, the RMS error measure on device RGB values can predict the data to some degree (dashed line), in that the hit rates are always larger than the false alarm rates. For the particular display we used in the experiment, the RMS error calculated on uncalibrated RGB values is somewhat consistent with the data.

The RMS metric is not a perceptual metric. One would expect that it does not predict perceptual distortions at all. So why did it work to some degree in this instance? Part of the reason lies in the fact that the device-dependent RGB values sent to the display frame buffer are related to the actual intensity of light emitted from a CRT display by (approximately) a power function. One of the early observations about perception of light is that the perceived brightness of a light signal is related to its physical luminance by a non-linear function (Stevens' power law). This perceptual non-linear function is also approximately a power function of the form:

\begin{displaymath}B = aI^{0.4} \qquad (4)
\end{displaymath}

where B represents perceived brightness, I represents the physical intensity of the light, and a is a scale factor [25].

For the CRT display used in our experiment, the intensities of the RGB phosphor emissions are linearly related to the RGB frame buffer values raised to a power of 2.5. The value 2.5 is called the gamma value of the display. Due to the non-linear relation between intensity and brightness, the RGB frame buffer values were actually linearly related to perceptual brightness for the particular CRT display we used. The device RMS measure had coincidentally taken into account the non-linear nature of brightness perception in this case, and therefore it predicted the perceptual data to some degree.
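The cancellation of the two exponents can be written out in one line. The symbols v (a frame buffer value) and k (the display's proportionality constant) are introduced here only for illustration, and any black-level offset of the display is neglected:

\begin{displaymath}B = a I^{0.4} = a\,(k\,v^{2.5})^{0.4} = a\,k^{0.4}\,v^{2.5 \times 0.4} = a\,k^{0.4}\,v
\end{displaymath}

Under these assumptions the perceived brightness is proportional to the frame buffer value itself, so differences in frame buffer values are approximately proportional to differences in brightness.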

To confirm this reasoning, we recalculated the RMS prediction of image distortions assuming a linear display gamma. If the same images were displayed on another CRT monitor with a linear gamma, different RGB frame buffer values would be necessary to generate the same physical image seen by the subjects. We can calculate the RMS error between the original images and the halftoned images on this hypothetical display, and look at how well it predicts the data. In Figure 5, the ROC curves for the linear RMS error measure and the device-dependent RMS metric are plotted together. For the linear RMS measure, the hit rates are not higher than the false alarm rates at any threshold level, indicating that it does not predict the perceptual data at all.
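The hypothetical linear-gamma calculation can be sketched as follows, assuming frame buffer values normalized to [0, 1] and a pure power-law display with the gamma of 2.5 given in the text. The function name `linear_rms_error_map` and the example pixel values are illustrative.

```python
import numpy as np

def linear_rms_error_map(original_rgb, reproduction_rgb, gamma=2.5):
    """RMS error on linear intensities instead of frame buffer values.

    On a display with linear gamma, the frame buffer values needed to show
    the same physical image are proportional to v ** gamma, where v is the
    (normalized) frame buffer value on the actual display.
    """
    linear_original = np.asarray(original_rgb, dtype=float) ** gamma
    linear_reproduction = np.asarray(reproduction_rgb, dtype=float) ** gamma
    diff = linear_original - linear_reproduction
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# A dark and a bright gray pixel, each raised by 0.1 in the reproduction.
original = np.array([[[0.1, 0.1, 0.1], [0.8, 0.8, 0.8]]])
reproduction = original + 0.1
device_map = np.sqrt(np.sum((original - reproduction) ** 2, axis=-1))
linear_map = linear_rms_error_map(original, reproduction)
```

The device RMS is identical for the two pixels, whereas the linear RMS is more than ten times larger for the bright pixel. The linear measure therefore weights errors in bright regions far more heavily than errors in dark regions.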


  
Figure 5: ROC curves for two RMS error measures assuming different gamma curves. The ROC curves were computed from data on halftone images.

The ROC curves in Figure 5 tell us that using device RMS to measure perceptual image fidelity may or may not work, depending on the characteristics of the display device. The RMS results for this data set are probably the best one can get using a device-dependent RMS metric, because the frame buffer RGB values for the display we used in the experiment are almost exactly linear with perceptual brightness. For other display devices, using the RMS measure will give similar or worse results.

Next we look at CIELAB $\Delta E_{94}$ results. Figure 4 shows the ROC curve for the CIELAB $\Delta E_{94}$ metric (dotted line). As expected, the CIELAB $\Delta E_{94}$ metric is an improvement over the RMS metric. The CIELAB $\Delta E_{94}$ metric not only accounted for the non-linear nature of brightness perception (through a cube root transform from XYZ values), but also accounted for non-uniform perceptual discrimination thresholds at different color directions. In addition, CIELAB $\Delta E_{94}$ is a calibrated metric, calculated from XYZ values, which are device independent. Therefore, when more advanced color metrics are not available, CIELAB $\Delta E_{94}$ should be used instead of RMS.

As a side note, we also calculated standard CIELAB $\Delta E$ predictions for the data. The standard $\Delta E$ predicted the data slightly worse than the $\Delta E_{94}$ color difference metric. This suggests that the CIE94 color difference formula is indeed an improvement over the older formula.

The S-CIELAB $\Delta E_{94}$ metric extended CIELAB to include spatial sensitivity. As shown in Figure 4, its ROC curve (thin solid line) bows further away from the diagonal, indicating a further improvement over CIELAB $\Delta E_{94}$ in accuracy of predicting the perceptual data. This result is consistent with an earlier experimental test of S-CIELAB [26].

Still, the ROC curve for S-CIELAB $\Delta E_{94}$ metric is quite far from the upper limit of how well a metric can do to predict the data. There are many perceptual factors that are not yet included in the S-CIELAB metric. What is the next factor to include in an image fidelity metric to achieve a significant improvement in accuracy of measurement?

What next?

The data analysis so far indicated that non-linearity of brightness perception, non-uniform color discrimination along different color directions, and different spatial sensitivities for different color components all influence perceptual fidelity in real images. Factors that are still not incorporated in S-CIELAB include adaptation, contrast masking, attention, etc. Which of these factors have significant influence on the perceptual fidelity of images over and above what was already included in S-CIELAB?


  
Figure 6: ROC curves for S-CIELAB. The dash-dotted line shows S-CIELAB results when image points with large negative local contrast were removed. The dashed line shows S-CIELAB results for image points with large negative local contrast. The ROC curves were computed from data on halftone images.

Closer inspection of the predicted image distortion maps suggested that most of the regions where the S-CIELAB predictions were poor fell in areas with large negative local contrast. To confirm this observation, we separated image points with negative local contrast from image points with zero or positive local contrast, and looked at the agreement between S-CIELAB predictions and the data for these two groups of image points separately. For each original-reproduction pair, the S-CIELAB spatial filtering was applied first, which resulted in S-CIELAB opponent representations of both the original and the halftoned images. Because most of the image variance is in the luminance channel [27], only the luminance images were used to calculate local contrast. The local contrast of an image was computed as the difference between the image luminance and the local mean luminance, scaled by the local mean luminance [28]. Specifically,

1.
First, a low-pass filtered copy of the luminance image was computed. The low-pass filter was a Gaussian filter with a half width of about 0.5 degree of visual angle and a support of about 1 degree of visual angle. The value at point $(i,j)$ of this low-pass image represents the local mean at that point, denoted $F_{i,j}$.

2.
Second, the contrast at each image location was computed relative to the local mean at that point. Let $L_{i,j}$ represent the luminance value of an image at location $(i,j)$ after the spatial processing step in S-CIELAB. The local contrast $C_{i,j}$ of this image is then defined as:
\begin{displaymath}C_{i,j} = \frac{L_{i,j} - F_{i,j}}{F_{i,j}} \qquad (5)
\end{displaymath}
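The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names are hypothetical, the kernel parameters `sigma_px` and `radius_px` are assumptions (converting the 0.5 degree half width and 1 degree support into pixels depends on viewing distance and display resolution), and edge replication at the image border is also an assumption. Luminance values are taken to be strictly positive so that the division is defined.

```python
import math

def gaussian_kernel(sigma_px, radius_px):
    """1-D Gaussian weights on [-radius_px, radius_px], normalized to sum to 1."""
    weights = [math.exp(-0.5 * (k / sigma_px) ** 2)
               for k in range(-radius_px, radius_px + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(row, kernel):
    """Convolve one row with the kernel, replicating edge values (assumption)."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * row[j]
        out.append(acc)
    return out

def local_mean(lum, kernel):
    """Step 1: separable Gaussian low-pass filter, rows first, then columns."""
    rows = [blur_1d(r, kernel) for r in lum]
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def local_contrast(lum, kernel):
    """Step 2: C[i][j] = (L[i][j] - F[i][j]) / F[i][j], with F the local mean."""
    mean = local_mean(lum, kernel)
    return [[(l - f) / f for l, f in zip(lrow, frow)]
            for lrow, frow in zip(lum, mean)]

# Toy luminance image: one dark point on a bright, uniform surround.
kernel = gaussian_kernel(sigma_px=2.0, radius_px=4)  # hypothetical pixel sizes
lum = [[50.0] * 9 for _ in range(9)]
lum[4][4] = 20.0
contrast = local_contrast(lum, kernel)
print(contrast[4][4] < 0)  # the dark point has negative local contrast
```

In this toy image the dark point lies below its local mean and therefore gets negative contrast, while its bright neighbors lie slightly above their local mean and get small positive contrast.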

After calculating the local luminance contrast image for both the original image and the halftone image, we identified the image locations at which both the original and the halftoned image had negative local contrast. These image locations were excluded from the data analysis. For the remaining image points, we examined the agreement between S-CIELAB $\Delta E_{94}$ predictions and the empirical image distortion maps by plotting an ROC curve using the same method as described in the previous section. This ROC curve is shown in Figure 6, along with the S-CIELAB $\Delta E_{94}$ ROC curve for all image points. The new ROC curve, computed by excluding image points with negative local luminance contrast on both the original and the halftone images and labeled ``S-CIELAB positive'', bows out further than the S-CIELAB curve, indicating better agreement between the metric's predictions and the data. With these points excluded, the S-CIELAB predictions are much closer to the predictions generated using the mean image distortion maps themselves (curve labeled ``Data''). The same graph also shows the S-CIELAB ROC curve for the image points with negative local luminance contrast, labeled ``S-CIELAB negative''. The S-CIELAB predictions for these negative contrast points are poor.
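The grouping and ROC analysis described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the previous section: it assumes the image distortion map has been reduced to a binary marked/unmarked label per image point, that all maps are flattened to lists of equal length, and that hit and false alarm rates are obtained by sweeping a threshold on the metric value. The function names are hypothetical, and each group must contain at least one marked and one unmarked point.

```python
def roc_points(metric, marked, thresholds):
    """Return (false_alarm_rate, hit_rate) pairs, one per threshold.

    metric: predicted error per image point (for example a Delta E value).
    marked: True where observers marked a visible distortion (assumed binary).
    """
    n_marked = sum(marked)
    n_unmarked = len(marked) - n_marked
    points = []
    for t in thresholds:
        hits = sum(1 for m, d in zip(metric, marked) if d and m > t)
        false_alarms = sum(1 for m, d in zip(metric, marked) if not d and m > t)
        points.append((false_alarms / n_unmarked, hits / n_marked))
    return points

def split_by_contrast(metric, marked, c_orig, c_repro):
    """Split points into those where BOTH images have negative local contrast
    and all remaining points. Each group is a (metric, marked) pair of lists."""
    negative, remaining = ([], []), ([], [])
    for m, d, co, cr in zip(metric, marked, c_orig, c_repro):
        group = negative if (co < 0 and cr < 0) else remaining
        group[0].append(m)
        group[1].append(d)
    return negative, remaining

# Toy data for six image points (all values are made up for illustration).
metric = [1.0, 4.0, 2.0, 5.0, 3.0, 0.5]
marked = [False, True, False, True, False, True]
c_orig = [0.2, 0.1, -0.3, 0.0, -0.1, -0.4]
c_repro = [0.1, 0.3, -0.2, 0.1, -0.2, -0.1]

negative, remaining = split_by_contrast(metric, marked, c_orig, c_repro)
thresholds = [0.0, 2.5, 6.0]
print(roc_points(*remaining, thresholds))
print(roc_points(*negative, thresholds))
```

In this made-up example the metric separates marked from unmarked points in the remaining group but not in the negative contrast group, which mirrors the qualitative pattern reported for Figure 6.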

Why does S-CIELAB make much worse predictions for image locations with negative local luminance contrast? This seems to be a limitation of the CIELAB $\Delta E$ metric. The CIELAB $\Delta E$ measure is defined in terms of absolute levels of cone absorption. It is not defined in terms of contrast. When the two colors to be discriminated are on a bright background, the perceptual difference between them is not as visible as when the background is very close to their own level of brightness (the crispening effect [29]). Therefore, the $\Delta E$ metric tends to over-estimate perceptual color differences in image regions with large negative contrast. When we excluded those regions from the analysis, the S-CIELAB $\Delta E_{94}$ metric values correlated with the data much better.


  
Figure 7: ROC curves for CIELAB. The dashed line shows CIELAB results when image points with large negative local contrast were removed. The dash-dotted line shows CIELAB results for image points with large negative local contrast. The ROC curves were computed from data on halftone images.
\begin{figure}
\centerline{ \psfig{figure=figs/rocE94.eps} }
\end{figure}

If the above argument is valid, excluding negative contrast points should improve the CIELAB $\Delta E_{94}$ predictions as well. This is indeed the case. In Figure 7, the dashed line labeled ``CIELAB positive'' represents the performance of the CIELAB $\Delta E_{94}$ metric when negative contrast image points are excluded. The area under this ROC curve is much larger than the area under the CIELAB $\Delta E_{94}$ curve that includes negative contrast points. Its agreement with the empirical image distortion maps is even better than that of the original S-CIELAB $\Delta E_{94}$ metric, and only a little worse than that of the S-CIELAB $\Delta E_{94}$ metric with negative contrast points excluded.
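The comparison of areas under ROC curves can be made concrete with a trapezoidal estimate. The sketch below is an assumption about how such an area could be computed, not a description of the authors' procedure. The function name is hypothetical, ROC points are given as (false alarm rate, hit rate) pairs, and the end points (0, 0) and (1, 1) are added automatically.

```python
def roc_area(points):
    """Trapezoidal area under an ROC curve.

    points: iterable of (false_alarm_rate, hit_rate) pairs in any order.
    The chance diagonal gives 0.5 and a perfect metric gives 1.0.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up ROC points for two hypothetical metrics.
weaker = [(0.2, 0.3), (0.5, 0.6)]
stronger = [(0.2, 0.6), (0.5, 0.9)]
print(roc_area(weaker) < roc_area(stronger))
```

A curve that bows further toward the upper left corner encloses more area, so a single number can summarize which metric agrees better with the marked data.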

From the above analysis, it seems that re-defining the CIELAB $\Delta E$ value on the basis of local contrast will provide significant improvements. This improvement can be implemented as a separate extension to CIELAB that translates negative local contrasts to positive local contrasts, or in terms of a completely new perceptual color space that is based on contrast. Some existing color image fidelity metrics, such as DCTune [23], already work in contrast space, but they do not use the CIELAB $\Delta E$ metric. Evaluation of such metrics using the image distortion data will be helpful in revealing what factors make a metric work, and what factors do not. If possible, a computationally simple extension that is based on CIELAB $\Delta E$ is still desirable, since the CIELAB $\Delta E$ unit is familiar and meaningful to the color reproduction industry.

Results for JPEG images

For this data set, none of the RMS, CIELAB $\Delta E_{94}$, and S-CIELAB metrics made satisfactory predictions of the marked errors in the JPEG-DCT reproductions. Figure 8 shows the ROC curves for the RMS, CIELAB, and S-CIELAB error measures, using hit rates and false alarm rates computed on the JPEG image data.


  
Figure 8: ROC curves of different metrics, for JPEG images.
\begin{figure}
\centerline{ \psfig{figure=figs/rocJPEG.eps} }
\end{figure}

This inconsistency between the metric error measures and the data can be a result of either poor data quality or poor metric predictions.

From close examination of the image distortion maps for the JPEG images and from subjects' feedback on the experimental task, it seems that the JPEG data may not be as reliable as the halftone data. First, many subjects reported difficulty in performing the task for JPEG images, since they saw general blurring on the JPEG images and could not pinpoint particular locations of visible distortion. Second, there seems to be a large difference in the visibility of JPEG distortion between experienced observers and naive observers. Many of the subjects commented that the JPEG distorted images looked just fine except for maybe a tiny bit of blurring, whereas experienced observers were able to identify block artifacts at many locations. Overall, since most of the subjects were naive observers, they clicked on very few places on most of the JPEG-DCT images. The proportion of mark coverage is below 3% for 2 of the 6 JPEG images, and between 17% and 34% for the rest. Therefore, there are good reasons to believe that the image distortion data on the JPEG-DCT images are not reliable, and thus the failure of a metric to predict these data may not indicate an inability of the metric to predict JPEG-DCT distortions in general.

Limitations in the metrics can also contribute to the inconsistency between the metric error measures and the JPEG-DCT image distortion data. Given the nature of JPEG artifacts and the characteristics of the RMS, CIELAB, and S-CIELAB color difference metrics, it can be expected that none of them should do well in predicting visibility of JPEG-DCT distortions. The JPEG-DCT artifacts arise from (1) the coarse quantization of high frequency components, and (2) the block processing structure of the algorithm. In the case of quantization, the errors are typically correlated with lines or edges in the images, and therefore hidden by the effect of orientation selective masking and contrast masking [19,30,31,32]. The RMS, CIELAB, or S-CIELAB metrics do not include effects of contrast masking or orientation selective masking. These metrics should not be expected to make accurate predictions about visibility of JPEG artifacts.
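The two sources of JPEG-DCT artifacts named above can be illustrated with a one-dimensional toy example. This sketch is not the JPEG standard: it uses 1-D blocks of 8 samples instead of 8x8 image blocks, omits color conversion and entropy coding, and the quantization step sizes are hypothetical values chosen to exaggerate the effect. The signal length is assumed to be a multiple of 8.

```python
import math

N = 8  # block length; JPEG uses 8x8 blocks, this toy uses 1-D blocks of 8

def dct(block):
    """Orthonormal DCT-II of one block of N samples."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                               for n, x in enumerate(block)))
    return out

def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    out = []
    for n in range(N):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            total += scale * c * math.cos(math.pi * (n + 0.5) * k / N)
        out.append(total)
    return out

def compress(signal, steps):
    """Quantize each block's DCT coefficients independently.

    steps: one quantization step per frequency; large steps discard detail.
    """
    result = []
    for start in range(0, len(signal), N):
        coeffs = dct(signal[start:start + N])
        quantized = [round(c / q) * q for c, q in zip(coeffs, steps)]
        result.extend(idct(quantized))
    return result

# A smooth ramp spanning two blocks, with very coarse steps for all
# non-constant frequencies (hypothetical values).
ramp = [float(i) for i in range(16)]
coarse = [1.0] + [100.0] * 7
out = compress(ramp, coarse)
print(out[7], out[8])  # the values on either side of the block boundary
```

With these step sizes every block collapses to its mean level, so the smooth ramp becomes two flat segments with a jump at the block boundary. The loss of detail within a block corresponds to source (1), and the discontinuity between independently processed blocks corresponds to source (2).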

Conclusions

Subjects identified visible reproduction errors in a collection of halftone and JPEG-DCT reproductions. The responses were summarized as image distortion maps. Using these maps, three image distortion metrics were evaluated: RMS, CIELAB, and S-CIELAB. For images with halftone distortions, the RMS metric made the least accurate predictions, and the S-CIELAB predictions were the most consistent with the image distortion data.

The RMS metric was calculated on RGB frame buffer values. Depending on the gamma curve of a particular display device, the RMS error value may correspond somewhat to perceptual difference, or may not correspond to perceptual difference at all. The CIELAB metric accounts for the color sensitivity of the human eye but does not include spatial sensitivity mechanisms. It is therefore better than the RMS metric in predicting perceptual distortions, but does not explain all the variance in the data. The S-CIELAB metric incorporates human spatial-color sensitivity. As expected, it provides improved measurement of image distortion visibility over CIELAB $\Delta E$. An interesting observation is that when we excluded image regions of negative local luminance contrast from the data analysis, the predictions of both S-CIELAB and CIELAB $\Delta E$ were much more consistent with the perceptual image distortions. This suggests that defining CIELAB $\Delta E$ in terms of contrast is a good next step for the improvement of image fidelity metrics.
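The dependence of the RGB-based RMS value on the display gamma can be shown with a small numerical sketch. It assumes a simple power-law display model with hypothetical gamma values, gray test colors, and 8-bit frame buffer values. The function names are illustrative, and the display output computed here is relative light intensity, not a perceptual quantity.

```python
import math

def rms_error(rgb_a, rgb_b):
    """Point-by-point RMS difference between two RGB frame buffer triplets."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rgb_a, rgb_b)) / 3.0)

def display_output(value, gamma):
    """Relative light output for a frame buffer value in 0..255,
    assuming a power-law display (an assumption, not a measured curve)."""
    return (value / 255.0) ** gamma

# Two gray pairs with the same frame buffer difference of 10 units.
dark_pair = ((20, 20, 20), (30, 30, 30))
bright_pair = ((220, 220, 220), (230, 230, 230))
print(rms_error(*dark_pair), rms_error(*bright_pair))  # identical RMS values

for gamma in (1.0, 2.2):  # hypothetical display gammas
    dark = display_output(30, gamma) - display_output(20, gamma)
    bright = display_output(230, gamma) - display_output(220, gamma)
    print(gamma, dark, bright)
```

Both pairs have the same RMS error, yet on the gamma 2.2 display the bright pair differs far more in emitted light than the dark pair, while on the linear display the two differences are equal. The same RMS number therefore corresponds to different physical stimuli on different displays, before any perceptual non-linearity is even considered.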

The RMS, CIELAB, and S-CIELAB metrics all failed to predict the image distortion maps measured with JPEG-DCT reproductions in this experiment. Likely causes are the lack of orientation selectivity and masking mechanisms in these metrics, and also the possibly low reliability of the distortion data on JPEG images. Further experimental and theoretical work is needed to better evaluate metric predictions of JPEG-DCT distortions against empirical data.

Acknowledgment

This study was supported by a grant from the Hewlett Packard Company.

Bibliography

1
R. S. Berns, ``Deriving instrumental tolerances from pass-fail and colorimetric data,'' Color Research and Application 21, pp. 459-472, Dec. 1996.

2
A. B. Poirson and B. A. Wandell, ``Appearance of colored patterns: pattern-color separability,'' Journal of the Optical Society of America 10(12), pp. 2458-2470, 1993.

3
A. B. Poirson and B. A. Wandell, ``Pattern-color separable pathways predict sensitivity to simple colored patterns,'' Vision Research 36(4), pp. 515-526, 1996.

4
G. J. C. van der Horst and M. A. Bouman, ``Spatiotemporal chromaticity discrimination,'' Journal of the Optical Society of America 59, pp. 1482-1488, 1969.

5
K. T. Mullen, ``The contrast sensitivity of human colour vision to red-green and blue-yellow chromatic gratings,'' Journal of Physiology 359, pp. 381-400, 1985.

6
E. M. Granger and J. C. Heurtley, ``Visual chromaticity -- modulation transfer function,'' Journal of the Optical Society of America 63, pp. 1173-1174, 1973.

7
N. Sekiguchi, D. R. Williams, and D. H. Brainard, ``Efficiency in detection of isoluminant and isochromatic interference fringes,'' Journal of the Optical Society of America 10, pp. 2118-2133, Oct. 1993.

8
X. Zhang and B. A. Wandell, ``A spatial extension to CIELAB for digital color image reproduction,'' in Society for Information Display Symposium Technical Digest, vol. 27, pp. 731-734, SID, 1996.

9
X. Zhang, E. Setiawan, and B. A. Wandell, ``Image distortion maps,'' in Final Program and Proceedings of the Fifth IS&T/SID Color Imaging Conference. Color Science, Systems and Applications, pp. 120-125, IS&T, SID, (Scottsdale, AZ, USA), Nov. 1997.

10
International Commission on Illumination (CIE), Industrial Colour-Difference Evaluation, Publication CIE 116-95, Bureau Central de la CIE, Vienna, 1995.

11
International Commission on Illumination (CIE) in Recommendations on uniform color spaces, color difference equations, psychometric color terms, Publication CIE 15 (E.-1.3.1), Supplement No.2, Bureau Central de la CIE, Vienna, 1971.

12
D. L. MacAdam, ``Visual sensitivities to color differences in daylight,'' Journal of the Optical Society of America 32, pp. 247-274, 1942.

13
E. M. Granger, ``Is CIEL*a*b* good enough for desktop publishing?,'' technical report, Light Source, 1994.

14
E.-J. Chichilnisky and B. A. Wandell, ``Photoreceptor sensitivity changes explain color appearance shifts induced by large uniform backgrounds in dichoptic matching,'' Vision Research 35, pp. 239-254, Jan. 1995.

15
M. D. Fairchild and P. Lennie, ``Chromatic adaptation to natural and incandescent illuminants,'' Vision Research 32(11), pp. 2077-2085, 1992.

16
K.-H. Bäuml, ``Color appearance: effects of illuminant changes under different surface collections,'' Journal of the Optical Society of America A 11, pp. 531-542, Feb. 1994.

17
K.-H. Bäuml, ``Illuminant changes under different surface collections: examining some principles of color appearance,'' Journal of the Optical Society of America A 12, pp. 261-271, Feb. 1995.

18
D. H. Brainard and B. A. Wandell, ``Asymmetric color matching: how color appearance depends on the illuminant,'' Journal of the Optical Society of America 9(9), pp. 1433-1448, 1992.

19
G. E. Legge and J. M. Foley, ``Contrast masking in human vision,'' Journal of the Optical Society of America 70, pp. 1458-1471, 1980.

20
H. R. Wilson, D. K. McFarlane, and G. C. Phillips, ``Spatial frequency tuning of orientation selective units estimated by oblique masking,'' Vision Research 23, pp. 873-882, 1983.

21
K. K. De Valois and E. Switkes, ``Simultaneous masking interactions between chromatic and luminance gratings,'' Journal of the Optical Society of America 73, pp. 11-18, Jan. 1983.

22
K. Miyata, M. Saito, N. Tsumura, H. Haneishi, and Y. Miyake, ``Eye movement analysis and its application to evaluation of image quality,'' in Final Program and Proceedings of the Fifth IS&T/SID Color Imaging Conference. Color Science, Systems and Applications, pp. 116-119, IS&T, SID, (Scottsdale, AZ, USA), Nov. 1997.

23
A. B. Watson, ``DCT quantization matrices visually optimized for individual images,'' in SPIE Proceedings, 1993.

24
D. M. Green and J. A. Swets, Signal detection theory and psychophysics, R. E. Krieger Pub. Co., Huntington, N.Y., 1974.

25
B. Wandell, Foundations of Vision, Sinauer Press, Sunderland, MA, 1995.

26
X. Zhang, D. A. Silverstein, J. E. Farrell, and B. A. Wandell, ``Color image quality metric S-CIELAB and its application on halftone texture visibility,'' in COMPCON97 Digest of Papers, pp. 44-48, IEEE, 1997.

27
G. Buchsbaum and A. Gottschalk, ``Trichromacy opponent color coding and color transmission in the retina,'' Proceedings of the Royal Society of London (B) 220, pp. 89-113, 1983.

28
E. Peli, ``Contrast in complex images,'' Journal of the Optical Society of America A 7, pp. 2032-2040, Oct. 1990.

29
G. Wyszecki and W. S. Stiles, Color science: concepts and methods, quantitative data and formulae, Wiley, New York, 1982.

30
M. A. Losada and K. T. Mullen, ``The spatial tuning of chromatic mechanisms identified by simultaneous masking,'' Vision Research 34(3), pp. 331-341, 1994.

31
J. O. Limb, ``Distortion criteria of the human viewer,'' IEEE Transactions on Systems, Man, and Cybernetics SMC-9(12), pp. 778-793, 1979.

32
N. Chaddha and T. H. Meng, ``Psycho-visual based distortion measure for monochrome images compression,'' in SPIE Proceedings of Visual Communications and Image Processing, SPIE, Nov. 1993.



Xuemei Zhang
1998-10-07