In our last few columns (1–5), we devised a test for the amount of nonlinearity present in a set of comparative data (for
example, data created by any of the standard methods of calibration for spectroscopic analysis), and then pointed out a
flaw in the method. The concept of a measure of nonlinearity that is independent of the units of the X and Y data is a good one. The flaw is that the nonlinearity measurement depends upon the distribution of the data: uniformly
distributed data will provide one value, normally distributed data will provide a different value, randomly distributed data (that is,
what is commonly found in "real" data sets) will give still another value, and so forth, even if the underlying relationship
between the pairs of values is the same in all cases.
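The dependence on the distribution can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the relationship `y = x + x**2`, the X range of 0 to 1, and the parameters of the Normal sample are all hypothetical, and the quantity `1 - r**2` is a simple stand-in for a dimensionless nonlinearity measure, not the statistic developed in our previous column (5).

```python
import numpy as np

def lack_of_linear_fit(x, y):
    """Stand-in dimensionless measure (an assumption): 1 - r^2 for y versus x."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r**2

def curve(x):
    # Hypothetical underlying relationship, identical for both data sets
    return x + x**2

rng = np.random.default_rng(0)
n = 5000

# Same range (0 to 1), same relationship, different distributions of X
x_uniform = np.linspace(0.0, 1.0, n)
x_normal = np.clip(rng.normal(0.5, 0.15, n), 0.0, 1.0)

m_uniform = lack_of_linear_fit(x_uniform, curve(x_uniform))
m_normal = lack_of_linear_fit(x_normal, curve(x_normal))

print(f"uniform X: {m_uniform:.4f}")
print(f"normal X:  {m_normal:.4f}")
```

Under these assumptions the two values differ even though both data sets lie exactly on the same curve, because the Normal sample concentrates its points in the middle of the range, where the curve departs less from a straight line.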
"Real" data, in fact, might not follow any particular describable distribution at all. Or the data might not be sufficient
to determine what distribution they do follow, if any. But does that matter? At the point we have reached in our discussion,
we already have determined that the data under investigation do indeed show a statistically significant amount of nonlinearity,
and we have developed a method of characterizing that nonlinearity in terms of the coefficients of the linear and quadratic
contributions to the functional form that describes the relationship between the X and Y values.
Our task now is to come up with a way to quantify the amount of nonlinearity the data exhibit, independent of the scale (that
is, units) of either variable, and even independent of the data themselves. The first condition is met by converting the nonlinear
component of the data to a dimensionless number (that is, a statistic), akin to but different from the correlation coefficient,
as shown in our previous column (5).
The second condition can be met simply by ignoring the data itself, once we have reached this point. What we need is a standard
way to express the data so that when the statistic is computed, the standard data expression will give rise to a given value
of the statistic, regardless of the nature of the original data. For this purpose, then, it would suffice to replace the original data with a set of synthetic data with the necessary properties.
What are those properties? The key properties comprise the number of data values, the range of the data values, and their
distribution.
The range of the synthetic data we want to generate should be such that the X-values have the same range as the original data. The reason for this is straightforward: when we apply the empirically derived quadratic
function (found from the regression) to the synthetic X-values to compute the Y-values, those Y-values should fall on the same curve, and bear the same relationship to the X-values, as did the original data.
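As a minimal sketch of this step, the fragment below fits a quadratic by regression, reads off the X range of the original data, and applies the fitted function to new X-values inside that range. The data values, the coefficients used to create them, and the names `x_orig`, `y_orig`, and `fitted_quadratic` are hypothetical, chosen only for illustration; `np.polyfit` is used here as one convenient way to do the regression.

```python
import numpy as np

# Hypothetical "original" data (noise-free, for illustration only)
x_orig = np.array([1.0, 2.5, 3.0, 4.2, 5.5, 7.0, 8.1, 9.0])
y_orig = 0.5 + 2.0 * x_orig + 0.1 * x_orig**2

# Quadratic regression; np.polyfit returns the highest power first
b2, b1, b0 = np.polyfit(x_orig, y_orig, 2)

# The synthetic X-values must span the same range as the original data
x_min, x_max = x_orig.min(), x_orig.max()

def fitted_quadratic(x):
    """Empirically derived quadratic function found from the regression."""
    return b0 + b1 * x + b2 * x**2

# Any X-values within the original range map onto the same fitted curve
x_new = np.array([x_min, 0.5 * (x_min + x_max), x_max])
y_new = fitted_quadratic(x_new)

print(f"range of X: {x_min} to {x_max}")
print("fitted coefficients (b0, b1, b2):", b0, b1, b2)
```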
Choosing the distribution is a little more nebulous. However, a uniform distribution is not only easy to compute, but also
will not go outside the specified range, nor will the range change with the number of samples, as data following other distributions
might (see for example reference 6 or chapter 6 in reference 7, where we discussed the relationship between the range and
the standard deviation for the Normal distribution when the number of data differs, although our discussion was in a different
context). Therefore, in the interest of having the range and the nonlinearity measure be independent of the number of readings,
we should generate data following a uniform distribution.
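The argument can be checked with a short sketch. Two assumptions are made here that go beyond the text: the uniform distribution is realized as evenly spaced X-values (random uniform draws would be another option, but their sample range would fall slightly short of the limits), and the range limits, the quadratic coefficients, and the helper name `synthetic_uniform_data` are hypothetical.

```python
import numpy as np

def synthetic_uniform_data(x_min, x_max, coeffs, n):
    """Evenly spaced X-values over [x_min, x_max], with Y from a quadratic.

    coeffs is (b0, b1, b2) for y = b0 + b1*x + b2*x**2 (hypothetical values).
    """
    b0, b1, b2 = coeffs
    x = np.linspace(x_min, x_max, n)
    y = b0 + b1 * x + b2 * x**2
    return x, y

rng = np.random.default_rng(1)
uniform_range = {}
normal_range = {}

for n in (10, 100, 1000, 10000):
    x_u, _ = synthetic_uniform_data(1.0, 9.0, (0.5, 2.0, 0.1), n)
    x_n = rng.normal(5.0, 1.0, n)  # Normal sample, for comparison only
    uniform_range[n] = x_u.max() - x_u.min()
    normal_range[n] = x_n.max() - x_n.min()
    print(f"n={n:6d}  uniform range={uniform_range[n]:.2f}  "
          f"normal range={normal_range[n]:.2f}")
```

In this sketch the range of the evenly spaced data is the same for every number of samples, while the range of the Normal sample tends to widen as the number of samples increases.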