Assessment methodology for imposex

Overview

Ideally, imposex data are submitted as individual measurements; for example, as the Vas Deferens Sequence (VDS) of each female snail. This provides information about variation between individuals, and allows efficient statistical models to be fitted to assess trends and status. However, sometimes the data are submitted as an annual index; for example, as the Vas Deferens Sequence Index (VDSI), the arithmetic mean of the individual VDS measurements. A more ad-hoc modelling approach is then all that is possible. This help file describes the methodology for assessing time series of individual VDS measurements. Other help files describe the approach when VDS data are submitted as annual indices, or as a mixture of individual measurements and annual indices.

For some species, imposex stage or intersex stage are reported rather than VDS, again either as individuals or annual indices1. These data are assessed using the same methods as for VDS.

Changes to the methodology since the 2014 assessment can be found here.


Proportional odds model

The individual VDS measurements are modelled with a proportional odds model (McCullagh & Nelder, 1989). Let \(y_{ij}\) be the VDS measurement of the \(j\)th female snail in year \(t_i, i = 1...N\), with \(y_{ij} \in \left \{ 0, ..., K \right \}\) where \(K\) is the highest possible VDS class2. It is assumed that

\[\text {logit} \left( \text {Prob} (y_{ij} \le k) \right) = f(t_i) + \theta_k\]

for \(0 \le k \le K-1\), with \(\text {Prob} (y_{ij} \le K) = 1\). Here, \(f(t)\) is a function that describes how imposex levels change over time (year). Various forms of \(f(t)\) are considered and these are discussed in the next section. The \(\theta_k\) are cut points that measure the odds of being in a particular VDS class or below. Since the classes are ordered, the cut points are subject to the constraints \(\theta_0 < \theta_1 < ... < \theta_{K-1}\)3.

The model is fitted by maximum likelihood. However, there are rarely sufficient data in a single time series to estimate the cut points precisely, so the cut points are first estimated from a saturated model fitted to multiple time series (described later) and are then assumed fixed and known4. The only parameters estimated when fitting the proportional odds model to a single time series are thus the parameters of \(f(t)\). Parameter standard errors are estimated from the Hessian matrix5.

McCullagh P & Nelder JA, 1989. Generalized Linear Models (second edition). Chapman & Hall, London.


Form of \(f(t)\)

Several different candidate forms of \(f(t)\) are considered, depending on the length of the time series. As well as linear logistic and smooth trends, change-point models are considered. These are motivated by the patterns seen in many time series, where there are steep declines in VDS levels starting in the mid 2000s. These changes coincide with the introduction of EC Regulation 782/2003, which implemented the provisions of the International Maritime Organisation’s Antifouling Systems Convention (IMO, 2001) prohibiting application of TBT surface coatings to all vessels by 2003, and the global ban on TBT which came into force in September 2008. The steep declines usually cannot be described adequately by linear logisitic models, or even smoothers. However, change-point models provide a way of capturing the steep decline with relatively few parameters. The years 2004, 2005, 2006, 2007 and 2008 are regarded as potential change-years, since the environmental response to the TBT measures is likely to have started in this period.

Intuitively, the complexity of the candidate forms of \(f(t)\) should be based on the number of years of data, \(N\). For example, with 8 years of data, one might consider a linear model \(f(t) = \mu + \beta t\) and a smoother \(f(t) = s(t)\) on 2 degrees of freedom (df) (analogous to the models fitted to contaminant time series). However, this runs into numerical difficulties when a time series starts with a series of years in which all VDS measurements equal the maximum value \(K\), or ends with a series of years in which all VDS measurements equal 0, as the amount of information in the data for estimating \(f(t)\) is then reduced6. Instead, the candidate forms of \(f(t)\) are based on \(N_\text{mid}\), an approximate measure of the number of years of data that contain information about changes in VDS levels. Loosely, \(N_\text{mid}\) is the number of years in the ‘middle’ of the timeseries, where intermediate VDS levels are observed. Formally, \(N_\text{mid}\) is defined as follows. Let \(I_i\) = 1 if all the VDS measurements in year \(t_i\) equal \(K\), -1 if all the VDS measurements in year \(t_i\) equal 0, and 0 otherwise. Let

\[i_1 = \begin{cases} 1, & \text{if } I_1 < 1 \\ N, & \text{if } I_i = 1 \forall i \\ \text {min} \left \{ i: I_{i+1} < 1 \right \}, & \text{otherwise} \end{cases}\]

Similarly, let

\[i_2 = \begin{cases} N, & \text{if } I_N > -1 \\ 1, & \text{if } I_i = -1 \forall i \\ \text {max} \left \{ i: I_{i-1} > -1 \right \}, & \text{otherwise} \end{cases}\]

Then \(\{ t_i, i = i_1, ..., i_2 \}\) are the ‘middle’ years of the time series and \(N_\text{mid} = i_2 - i_1 + 1\).

The linear and smooth candidate forms of \(f(t)\) are then

\(N \leq 2\)
no model is fitted
\(N \ge 3\) and \(N_\text{mid} = 1\)
mean model \(\text f(t) = \mu\)
The VDS measurements in the entire time series either all equal \(K\) or all equal 0, so there is no trend.
\(N \ge 3\) and \(N_\text{mid} = 2, 3 \text{ or } 4\)
linear model \(\text f(t) = \mu + \beta t\)
\(N_\text{mid} \geq 5\)
linear model \(\text f(t) = \mu + \beta t\) and smooth model \(\text f(t) = \text s(t)\)
Smoothers on 2 degrees of freedom (df) are considered when \(5 \leq N_\text{mid} \leq 7\), on 2 and 3 df when \(8 \leq N_\text{mid} \leq 10\) and on 2, 3, and 4 df when \(N_\text{mid} \geq 11\).

Change-point models are also considered provided that the time series starts before 2008 and that \(N \ge 3\) and \(N_\text{mid} > 1\). Each change point model is of the form

\[f(t) = \begin{cases} \mu, & \text{if } t < t_\text{change} \\ \mu + g(t), & \text{if } t \ge t_\text{change} \\ \end{cases}\]

where \(t_\text{change}\) is the change year and \(g(t_\text{change}) = 0\) to ensure \(f(t)\) is continuous. Let \(N_\text{mid}^*\) be the number of ‘middle’ years from \(t_\text{change}\) onwards; i.e. \(|\{t_i : t_i \ge t_\text{change} \text { and } i \ge = i_1 \} |\). Then, similar to above, the form of \(g(t)\) depends on \(N_\text{mid}^*\).

\(N_\text{mid}^* = 2, 3 \text{ or } 4\)
linear change-point model \(\text g(t) = \beta (t - t_\text{change})\)
\(N_\text{mid}^* \geq 5\)
linear change-point model \(\text g(t) = \beta (t - t_\text{change})\) and smooth change-point model \(\text g(t) = \text s(t)\), with \(s(t_\text{change})\) = 0
Smoothers on 2 degrees of freedom (df) are considered when \(5 \leq N_\text{mid}^* \leq 7\), on 2 and 3 df when \(8 \leq N_\text{mid}^* \leq 10\) and on 2, 3, and 4 df when \(N_\text{mid}^* \geq 11\).

The change-point models are fitted for each change-year \(t_\text{change}\) = 2004, 2005, 2006, 2007, 2008 in turn, provided that \(t_1 < t_\text{change}\); i.e. the time series started before the change-year.

All the candidate models are fitted by maximum likelihood, with the final model chosen using AICc7. For some time series, there are many candidate models and there is a danger of over-fitting the data. However, this is mitigated somewhat by the fact that the models have been tailored to patterns of change seen in so many time series. It is also preferable to overfit rather than underfit for the purposes of assessing environmental status. Linear or smooth models often overpredict VDS levels in the final monitoring year, so if these are the only models considered, status will appear to be poorer than it should be.


Estimating the cut-points

The cut-points \(\theta_k, k = 0, ..., K-1\) determine the probability of being in each VDS class given the underlying level of TBT contamination (represented by \(f(t)\) above). The cut-points can be thought of as measuring the eco-toxicological response of a species to TBT contamination and might reasonably be assumed to be constant over a wide area. The cut-points can therefore be estimated with good precision by fitting a ‘full’ model to the data from many time series collected over a wide area.

Suppose that, for a particular species and area, there are VDS time series at \(M\) stations. With some abuse of notation, let \(y_{mij}\) be the VDS measurement of the \(j\)th female snail in year \(t_{mi}\) from station \(m\). Then the full model is

\[\text {logit} \left( \text {Prob} (y_{mij} \le k) \right) = \mu _{mi} + \theta_k\]

where \(\mu _{mi}\) represents the underlying level of TBT contamination in year \(t_{mi}\) at station \(m\). The species area combinations used for estimating the cut-points are here.


Estimating the mean VDS class

The mean VDS class in year \(t\), denoted \(v(t)\), is

\[v(t) = \sum_{k = 0}^K k \text { Prob} \left ( Y_t = k \right )\]

where \(Y_t\) is a random variable describing the VDS class of individual snails in year \(t\). The probabilities are expressed in terms of \(f(t)\) and the cut-points through the relationships

\[\begin{align} \text { Prob} \left ( Y_t = k \right ) &= \text { Prob} \left ( Y_t \le k \right ) - \text { Prob} \left ( Y_t \le k-1 \right ) \\ &= \frac {\text {exp} (f(t) + \theta_k)} {1 + \text {exp} (f(t) + \theta_k)} - \frac {\text {exp} (f(t) + \theta_{k-1})} {1 + \text {exp} (f(t) + \theta_{k-1})} \end{align}\]

with \(\text {Prob} (Y_t \le K) = 1\) as before. The mean VDS class \(v(t)\) is then estimated by plugging the estimates of \(f(t)\) and the cut-points into these formulae.

Approximate confidence intervals on \(v(t)\) are obtained by simulating the distribution of the estimates of \(f(t)\) and the cut-points and hence the distribution of \(\hat {v(t)}\)8. In particular, an upper one-sided 95% confidence limit on \(v(t)\) is the 95% ordered value of the simulated distribution of \(\hat {v(t)}\).