Treatment of ‘less-than’ measurements

Overview

Measurements reported as below the detection limit are often known as ‘less-than’ measurements. Less-thans are examples of left-censored data. When there are not too many less-thans, the same contaminant time series models can be fitted, provided the likelihood is adjusted accordingly. Further refinements are required to prevent over-fitting if there are many less-thans or if the less-thans are unevenly distributed across the time series. When most of the data are less-thans, a non-parametric test is used to compare levels with assessment criteria.

Adjustments to the likelihood

The likelihood when some measurements are less-thans is straightforward when there is only one measurement each year, because the measurements are then (assumed to be) statistically independent. Let \(y_i\) be the logarithm of the reported concentration in year \(t_i, i = 1...N\), and let \(A\) be {\(i\): \(y_i\) is a non-censored measurement} and \(\bar A\) be {\(i\): \(y_i\) is a less-than}. The likelihood of the data is then:

\[\prod_{i \in A} \frac 1 {\omega_i} \phi \left( \frac {y_i - \text f(t_i)} {\omega _i} \right) \prod_{i \in \bar A} \Phi \left( \frac {y_i - \text f(t_i)} {\omega _i} \right) \]

where \(\phi\) is the density function of a standard normal distribution (with zero mean and unit variance), \(\Phi\) is the corresponding cumulative distribution function, \(\text f(t_i)\) is the expected value of \(y_i\), and \(\omega _i\) is the standard deviation of \(y_i\) given by:

\[\omega _i^2 = \sigma_\text{year}^2 + \sigma_\text{sample}^2 + \sigma_{\text{analytical},i}^2\]

where \(\sigma_\text{year}\), \(\sigma_\text{sample}\), \(\sigma_{\text{analytical},i}\) are the between-year, between-sample and analytical standard deviations respectively. Note that the analytical standard deviations are measurement specific and are based on the uncertainties reported with the data.
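As an illustration, the single-measurement-per-year likelihood above can be evaluated on the log scale as in the following minimal sketch. The function and argument names (`censored_loglik`, `fitted`, `sd_analytical`, and so on) are hypothetical and chosen for this example only; the fitted values \(\text f(t_i)\) and the standard deviations are assumed to be supplied by the caller.

```python
import math


def norm_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def censored_loglik(y, censored, fitted, sd_year, sd_sample, sd_analytical):
    """Log-likelihood with one (log) measurement per year.

    y             -- log concentrations (the reported limit for a less-than)
    censored      -- True where the measurement is a less-than
    fitted        -- values of f(t_i), assumed to be computed elsewhere
    sd_analytical -- measurement-specific analytical standard deviations
    """
    total = 0.0
    for y_i, cens_i, f_i, sd_a in zip(y, censored, fitted, sd_analytical):
        omega = math.sqrt(sd_year ** 2 + sd_sample ** 2 + sd_a ** 2)
        z = (y_i - f_i) / omega
        if cens_i:
            # Less-than: probability that the true value lies below the limit.
            # Note: this underflows to log(0) for extremely negative z.
            total += math.log(norm_cdf(z))
        else:
            # Non-censored: normal density scaled by 1 / omega.
            total += math.log(norm_pdf(z)) - math.log(omega)
    return total
```

Working on the log scale turns the two products into sums, which is numerically more stable for long time series.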

The likelihood is more complicated when there are several measurements in a year, because these measurements are dependent. Extending the previous notation, let \(y_{ij}\) be the logarithm of the \(j\)th reported concentration in year \(t_i\), and let \(A_i\) be {\(j\): \(y_{ij}\) is a non-censored measurement} and \(\bar A_i\) be {\(j\): \(y_{ij}\) is a less-than}. Then the likelihood of the data is

\[\prod_i \int_{-\infty}^\infty \frac 1 {\sigma _\text{year}} \phi \left( \frac {z - \text f(t_i)} {\sigma _\text{year}} \right) \prod_{j \in A_i} \frac 1 {\omega_{ij}} \phi \left( \frac {y_{ij} - z} {\omega_{ij}} \right) \prod_{j \in \bar A_i} \Phi \left( \frac {y_{ij} - z} {\omega_{ij}} \right) \text dz\]

where \(\omega _{ij}\) is the within-year standard deviation of \(y_{ij}\) given by:

\[\omega _{ij}^2 = \sigma_\text{sample}^2 + \sigma_{\text{analytical},ij}^2\]
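The contribution of a single year to this likelihood can be approximated numerically. The sketch below uses a plain trapezoidal rule over \(\text f(t_i) \pm 8 \sigma_\text{year}\); the quadrature scheme, the integration limits, and all names (`year_likelihood`, `half_width`, `n_steps`) are assumptions made for illustration, not a description of how the assessment software evaluates the integral. The sketch treats \(z\) as normally distributed with mean \(\text f(t_i)\) and standard deviation \(\sigma_\text{year}\), so its density includes the normalising factor \(1/\sigma_\text{year}\).

```python
import math


def norm_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def year_likelihood(y, censored, omega, f_t, sd_year,
                    half_width=8.0, n_steps=2000):
    """Likelihood of all measurements in one year, integrating out the
    year effect z.

    y, censored, omega -- per-measurement log concentration, less-than flag
                          and within-year standard deviation omega_ij
    f_t                -- expected value f(t_i) for the year
    """
    def integrand(z):
        # Density of the year effect, including the 1 / sd_year factor.
        value = norm_pdf((z - f_t) / sd_year) / sd_year
        for y_j, cens_j, w_j in zip(y, censored, omega):
            u = (y_j - z) / w_j
            value *= norm_cdf(u) if cens_j else norm_pdf(u) / w_j
        return value

    lower = f_t - half_width * sd_year
    upper = f_t + half_width * sd_year
    step = (upper - lower) / n_steps
    # Trapezoidal rule: half weight on the two end points.
    total = 0.5 * (integrand(lower) + integrand(upper))
    for k in range(1, n_steps):
        total += integrand(lower + k * step)
    return total * step
```

With a single non-censored measurement the integral reduces to a normal density with variance \(\sigma_\text{year}^2 + \omega_{ij}^2\), which gives a simple check on the quadrature. The full likelihood is the product of these contributions over years.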

Refinements

Less-than measurements contain less information about changes in concentration over time than non-censored measurements. Therefore, the form of \(\text{f}(t)\) fitted to the data is based on \(N_+\), the number of years of data with at least one non-censored measurement, rather than \(N\), the total number of years of data (although \(N\) is also considered for short time series). Specifically:

\(N_+ \leq 1\): no model is fitted
\(N_+ = 2\) and \(N = 2\): no model is fitted
\(2 \leq N_+ \leq 4\) and \(N \geq 3\): mean model \(\text f(t) = \mu\)
\(5 \leq N_+ \leq 6\): linear model \(\text f(t) = \mu + \beta t\)
\(N_+ \geq 7\): smooth model \(\text f(t) = \text s(t)\). Smoothers on 2 degrees of freedom (df) are considered when \(7 \leq N_+ \leq 9\), on 2 and 3 df when \(10 \leq N_+ \leq 14\), and on 2, 3, and 4 df when \(N_+ \geq 15\).
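The rules above can be written as a small lookup. This is a sketch only; the function name `choose_model` and the returned labels are hypothetical and chosen for illustration.

```python
def choose_model(n_plus, n):
    """Return (model, smoother_df) for a time series.

    n_plus -- number of years with at least one non-censored measurement
    n      -- total number of years of data
    """
    if n_plus > n:
        raise ValueError("n_plus cannot exceed n")
    if n_plus <= 1 or (n_plus == 2 and n == 2):
        return ("none", [])
    if n_plus <= 4:
        # Here n >= 3 necessarily, because n_plus == 2 with n == 2
        # was handled above and n >= n_plus otherwise.
        return ("mean", [])
    if n_plus <= 6:
        return ("linear", [])
    if n_plus <= 9:
        return ("smooth", [2])
    if n_plus <= 14:
        return ("smooth", [2, 3])
    return ("smooth", [2, 3, 4])
```

For example, a series with ten years of data of which only two contain a non-censored measurement gets the mean model, whereas a two-year series with no less-thans gets no model at all.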

For consistency, \(N_+\) is also used instead of \(N\) in the calculation of AICc and residual degrees of freedom.
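To make the substitution concrete, the sketch below evaluates the standard small-sample form of AICc with \(N_+\) in place of the sample size. The text does not spell out the formula or which parameters are counted, so both the formula used here and the argument `n_params` are assumptions for illustration.

```python
def aicc(log_likelihood, n_params, n_plus):
    """Standard AICc, with n_plus used as the effective sample size.

    n_params -- number of estimated parameters (what is counted is an
                assumption left to the caller)
    """
    denominator = n_plus - n_params - 1
    if denominator <= 0:
        raise ValueError("AICc is undefined when n_plus <= n_params + 1")
    penalty = 2.0 * n_params + 2.0 * n_params * (n_params + 1) / denominator
    return -2.0 * log_likelihood + penalty
```

Because \(N_+ \leq N\), the correction term is at least as large as it would be with \(N\), so more complex models are penalised more heavily when many years contain only less-thans.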

When \(N_+\) is relatively small compared to \(N\), the model fits can become environmentally implausible, particularly if there are changes in the limit of detection over time, or if a linear or smooth model is fitted and the years at the start and end of the time series only have less-than measurements. To protect against this behaviour, three additional constraints are placed on the time series.

The time series is truncated from the left (i.e. early years are omitted) until \(N_+ \geq N/2\). For example, if there are ten years of data (each with a single measurement) and the measurements in years 6, 7, and 9 are non-censored, then the time series assessed comprises the data from years 5 through 10.
If a linear or smooth model is fitted (i.e. \(N_+ \geq 5\)), then the first year of the time series is taken to be the first year with a non-censored measurement (i.e. all earlier years, which only contain less-thans, are omitted). For example, if there are ten years of data and the measurements in years 3, 4, 6, 8, 9, and 10 are non-censored, then the time series assessed comprises the data from years 3 through 10.
If a linear or smooth model is fitted (i.e. \(N_+ \geq 5\)), and the measurements in the most recent year(s) of the time series are all less-thans, then the expected concentration in the most recent year(s) is assumed to be constant. Specifically, if \(t_\text{last}\) is the last year with a non-censored measurement, then \(\text f(t)\) is adjusted to:

\[\text f(t) = \begin{cases} \mu + \beta t, & \text{if } t < t_\text{last} \\ \mu + \beta t_\text{last}, & \text{if } t \geq t_\text{last} \end{cases}\]

for the linear model and similarly for the smooth model.
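The three constraints can be sketched as follows, assuming the data have been reduced to one flag per year indicating whether that year has at least one non-censored measurement. All names (`truncate_start`, `assessed_years`, `linear_with_flat_tail`) are hypothetical and the code is an illustration of the rules as stated, not the assessment software itself.

```python
def truncate_start(has_uncensored):
    """Index of the first year kept after left truncation, i.e. after
    dropping early years until N_+ >= N / 2."""
    n = len(has_uncensored)
    n_plus = sum(has_uncensored)
    start = 0
    while start < n and 2 * n_plus < n - start:
        if has_uncensored[start]:
            n_plus -= 1  # a dropped year may itself be non-censored
        start += 1
    return start


def assessed_years(years, has_uncensored):
    """Years retained after applying the first two constraints."""
    start = truncate_start(has_uncensored)
    n_plus = sum(has_uncensored[start:])
    if n_plus >= 5:
        # Linear or smooth model: start at the first non-censored year.
        while start < len(years) and not has_uncensored[start]:
            start += 1
    return years[start:]


def linear_with_flat_tail(t, intercept, slope, t_last):
    """Linear model held constant from t_last (the last year with a
    non-censored measurement) onwards."""
    return intercept + slope * min(t, t_last)


years = list(range(1, 11))
flags_a = [year in {6, 7, 9} for year in years]
flags_b = [year in {3, 4, 6, 8, 9, 10} for year in years]
print(assessed_years(years, flags_a))
print(assessed_years(years, flags_b))
```

The two calls at the end correspond to the two worked examples in the text: the first series is truncated to start in year 5 and the second in year 3.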

Non-parametric assessment of environmental status

If the length of the truncated time series is 2 years or less, then there are too few years to fit a parametric model and make a formal assessment of environmental status. However, if the original time series has more than five years of data, a one-sided sign test is used instead to provide a non-parametric test of status. The median log concentration measurement each year is first calculated (with less-thans treated as if they were non-censored measurements) and then back-transformed to the concentration scale. These can be thought of as annual contaminant indices. The indices in the last twenty years (the same period used to assess recent trends for the summary maps) are then used to test the null hypothesis H₀: median concentration \(\geq\) AC against the alternative H₁: median concentration < AC, where AC is the assessment criterion¹.
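A minimal sketch of the index calculation and the one-sided sign test is given below. The names `annual_indices` and `sign_test_below` are hypothetical. Dropping indices exactly equal to the AC is a common sign-test convention assumed here rather than something stated in the text, and selecting the last twenty years is left to the caller.

```python
import math
from statistics import median


def annual_indices(log_conc_by_year):
    """Median log concentration per year, back-transformed to the
    concentration scale. Less-thans are treated as ordinary values."""
    return {year: math.exp(median(values))
            for year, values in log_conc_by_year.items()}


def sign_test_below(indices, ac):
    """One-sided sign test p-value for H0: median >= AC against
    H1: median < AC."""
    below = sum(1 for x in indices if x < ac)
    above = sum(1 for x in indices if x > ac)
    n = below + above  # ties with the AC are dropped (assumption)
    if n == 0:
        return 1.0
    # P(X >= below) for X ~ Binomial(n, 1/2).
    tail = sum(math.comb(n, k) for k in range(below, n + 1))
    return tail / 2 ** n
```

For example, if all six annual indices lie below the AC, the p-value is \(1/2^6 \approx 0.016\), so the null hypothesis would be rejected at the 5% level.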

¹ This approach might lack power, particularly for longer time series where there are non-censored measurements at the start of the time series, but all recent measurements are less-thans. In such cases, a better approach might be to model how the probability that the annual index is below the AC changes with year and to use the upper one-sided 95% confidence limit on the fitted value in the final monitoring year to assess status. This is a topic for future development.