On the exact distribution of maximally selected rank statistics

https://doi.org/10.1016/S0167-9473(02)00225-6

Abstract

The construction of simple classification rules is a frequent problem in medical research. Maximally selected rank statistics allow the evaluation of cutpoints, which provide the classification of observations into two groups by a continuous or ordinal predictor variable. The computation of the exact distribution of a maximally selected rank statistic is discussed and a new lower bound of the distribution is derived based on an extension of an algorithm for the exact distribution of a linear rank statistic. Therefore, the test based on the upper bound of the P-value is of level α. For small to moderate sample sizes the lower bound of the exact distribution is a substantial improvement compared to approximations based on an improved Bonferroni inequality or based on the asymptotic Gaussian process. The lower bound of the distribution is compared to the exact distribution by means of a simulation study and the proposal is illustrated by three clinical studies.

Introduction

In clinical research, an investigator often assumes that some prognostic factor X allows patients to be classified into a risk group and a normal group with respect to a response variable Y. The functional relationship between the continuous or ordinal variable X and the response variable Y is unknown. We consider continuous or ordinal response variables, which may be censored. As a simple model, we assume that an unknown cutpoint μ in X determines two groups of observations: one group consists of all observations with X less than or equal to μ, the other of all observations with X greater than μ. Choosing the cutpoint μ that minimizes the P-value of a two-sample test between the two groups leads to an inflated false-positive rate. Therefore, it is necessary to test whether there is any difference between the groups at all before estimating the cutpoint (Lausen and Schumacher, 1992, 1996). In this paper we focus on a test of the null hypothesis that the event X ≤ μ has no influence on the distribution of the response variable Y:

$$H_0: P(Y \le y \mid X \le \mu) = P(Y \le y \mid X > \mu) \quad \forall\, y, \mu \in \mathbb{R}.$$

Maximally selected rank statistics, i.e. the maximum of the empirical process of the absolute values of standardized two-sample linear rank statistics, were established by Lausen and Schumacher (1992) for testing H0. Several approximations of the limiting distribution of a maximally selected rank statistic are known (Lausen and Schumacher, 1992, 1996; Lausen et al., 1994).

Subgroup analysis or experimental costs, e.g. in DNA microarray studies, often result in small or moderate sample sizes where the use of approximations of the limiting distribution may be questionable. The evaluation of cutpoints in small samples is of special interest for P-value adjusted recursive partitioning (Lausen et al., 1994), where the number of observations in a node decreases as the tree branching increases. We therefore derive an upper bound of the P-value of a maximally selected rank statistic from the exact conditional distribution of a linear rank statistic, which can be determined by the shift algorithm proposed by Streitberg and Röhmel (1986, 1987). The use of an upper bound of the P-value ensures that the test procedure is of level α, which may not hold if the P-value is based on the limiting distribution.
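To make the role of an upper bound concrete, the following minimal Python sketch enumerates all orderings of a small score vector and compares the exact tail probability of the maximum with the plain Bonferroni sum of the per-cutpoint tail probabilities. It only illustrates why an upper bound on the P-value keeps a test at level α: it uses the simple Bonferroni inequality, not the bound derived in the paper. The function names, the standard permutation moments used for standardization, and the brute-force enumeration (feasible only for very small N) are assumptions made for this example.

```python
from itertools import permutations
from math import sqrt


def standardized_statistics(a, m_values):
    """Standardized statistics S_m for one ordering `a` of the scores."""
    n = len(a)
    mean_a = sum(a) / n
    ss = sum((s - mean_a) ** 2 for s in a)
    stats = []
    for m in m_values:
        t_m = sum(a[:m])                          # T_m: sum of the first m scores
        e_m = m * mean_a                          # permutation expectation of T_m
        v_m = m * (n - m) / (n * (n - 1)) * ss    # permutation variance of T_m
        stats.append((t_m - e_m) / sqrt(v_m))
    return stats


def exact_and_bonferroni(scores, m_values, b):
    """Return (exact P(max_m |S_m| > b), Bonferroni upper bound) under H0."""
    m_values = list(m_values)
    hits_max = 0
    hits_per_m = [0] * len(m_values)
    n_orderings = 0
    for ordering in permutations(scores):         # all orderings equally likely under H0
        n_orderings += 1
        abs_stats = [abs(s) for s in standardized_statistics(ordering, m_values)]
        if max(abs_stats) > b:
            hits_max += 1
        for j, s in enumerate(abs_stats):
            if s > b:
                hits_per_m[j] += 1
    exact = hits_max / n_orderings
    bonferroni = min(1.0, sum(h / n_orderings for h in hits_per_m))
    return exact, bonferroni
```

Calling `exact_and_bonferroni([1, 2, 3, 4, 5, 6], range(1, 6), 1.8)` returns a pair whose first entry never exceeds the second, so rejecting when the bound is at most α cannot reject more often than the exact test would.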

The paper is organized as follows. In Section 2, we introduce notation for maximally selected statistics and give approximations of the asymptotic null distribution. The determination of the exact conditional distribution of a two-sample linear rank statistic by the shift algorithm is outlined in detail in Section 3. We derive an upper bound of the P-value of a maximally selected rank statistic for small samples in Section 4. Furthermore, we compare our proposal with approximations using an asymptotic distribution and report results of a Monte Carlo study for different sample sizes and rank statistics in Section 6. Finally, in Section 7 we illustrate our proposal with data from three clinical studies.

Section snippets

Maximally selected rank statistics

We consider the situation of N independent observations $(Y_1, X_1), \ldots, (Y_N, X_N)$. Let $R = (R_1, \ldots, R_N)$ denote the rank vector of $Y_1, \ldots, Y_N$ and let $a = a(R) = (a_1(R), \ldots, a_N(R))$ denote the score vector depending on R. The choice of the scores depends on the scale of the response variable Y. For the sake of simplicity, we restrict our considerations to response variables with continuous distribution function and positive, integer-valued scores. The incorporation of real or rational scores is given in Section 3 and we

Permutation distribution of rank statistics

Without loss of generality, we assume that the sample $Y_1, \ldots, Y_N$ is ordered with respect to $X_1, \ldots, X_N$. Under the null hypothesis $H_0$ it follows that $T_\mu$ depends on X only through the sample size $m_\mu$. Let I denote the index set of all possible sample sizes $m_1 \le m_\mu \le m_2$ of the group with $X_i \le \mu$ induced by the cutpoints μ and the limits $(\varepsilon_1, \varepsilon_2)$:

$$m_1 = N F_{NX}^{-1}(\varepsilon_1) \quad \text{and} \quad m_2 = N F_{NX}^{-1}(\varepsilon_2).$$

It is therefore sufficient to deal with the statistics

$$T_m = T_m(a) = \sum_{i=1}^{m} a_i, \quad m \in I,$$

under the hypothesis that $Y_i$, $i = 1, \ldots, N$, are iid. The maximum of the
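For positive integer scores, the exact conditional distribution of T_m can be obtained by counting, for each attainable sum, how many subsets of m scores produce it. The following Python sketch organizes this count as a dynamic-programming recursion in the spirit of the shift algorithm mentioned above. It is an illustrative assumption about how such a computation can be arranged, not a transcription of the algorithm of Streitberg and Röhmel, and the function name `exact_distribution` is hypothetical.

```python
from fractions import Fraction
from math import comb


def exact_distribution(scores, m):
    """Return {t: P(T_m = t)} when every m-subset of the scores is equally likely.

    Assumes non-negative integer scores (the text restricts itself to
    positive, integer-valued scores).
    """
    total = sum(scores)
    # counts[k][s] = number of k-subsets of the scores processed so far with sum s
    counts = [[0] * (total + 1) for _ in range(m + 1)]
    counts[0][0] = 1
    for a in scores:
        # Descending k: counts[k - 1] still excludes the current score,
        # so each score enters a subset at most once.
        for k in range(m, 0, -1):
            prev, row = counts[k - 1], counts[k]
            for s in range(total - a + 1):
                if prev[s]:
                    row[s + a] += prev[s]   # shift by a, then add
    n_subsets = comb(len(scores), m)
    return {t: Fraction(c, n_subsets) for t, c in enumerate(counts[m]) if c}
```

For example, `exact_distribution([1, 2, 3, 4, 5], 2)` gives the null distribution of the Wilcoxon rank sum for a group of size 2 out of 5 observations. Exact fractions are used so that tail probabilities can be summed without rounding error.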

A new lower bound for the distribution of maximally selected rank statistics

We are interested in the computation of the distribution of the statistic

$$M(a, X, \mu, \varepsilon_1, \varepsilon_2) = M(a, I) = \max_{m \in I} |S_m(a)|$$

under $H_0$. Here, our aim is to compute $P_{H_0}(M(a, I) > b)$ for $b \in \mathbb{R}^+$.
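As a rough illustration of how M(a, I) is evaluated on data, the Python sketch below orders the responses by X, assigns Wilcoxon scores, and maximizes the absolute standardized statistic over a range of group sizes. Several things here are assumptions for the example: the function name `max_selected_statistic`, passing the limits `m1` and `m2` directly instead of deriving them from (ε1, ε2), the absence of ties in X and Y, and the use of the standard permutation mean and variance of a linear rank statistic, which are assumed to correspond to the expressions (2) and (3) that this excerpt does not reproduce.

```python
from math import sqrt


def max_selected_statistic(x, y, m1, m2):
    """Return (M, m) with M = max over m1 <= m <= m2 of |S_m| and the maximizing m.

    Requires 1 <= m1 <= m2 <= len(x) - 1 and no ties in x or y.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])      # order observations by X
    y_sorted = [y[i] for i in order]
    # Wilcoxon scores: a_i is the rank of the i-th response
    rank_of = {v: r for r, v in enumerate(sorted(y_sorted), start=1)}
    a = [rank_of[v] for v in y_sorted]

    mean_a = sum(a) / n
    ss = sum((s - mean_a) ** 2 for s in a)

    best, best_m = -1.0, None
    t = sum(a[:m1 - 1])                               # running sum, becomes T_m below
    for m in range(m1, m2 + 1):
        t += a[m - 1]
        e_m = m * mean_a                              # permutation expectation of T_m
        v_m = m * (n - m) / (n * (n - 1)) * ss        # permutation variance of T_m
        s_m = abs(t - e_m) / sqrt(v_m)
        if s_m > best:
            best, best_m = s_m, m
    return best, best_m
```

The returned m identifies the estimated cutpoint as the m-th smallest value of X. The value M on its own is not a valid test statistic for a single two-sample test, because it has already been maximized over all candidate cutpoints.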

In the following, let $m \in I$. $M(a, I)$ is defined as a function of the standardized rank statistics $S_m$. Therefore, we need a representation of the possible values of the standardized statistics. As outlined in Section 2, expectation (2) and variance (3) of $T_m$ depend on the choice of the scores a, the sample size N and m. The

Generalizations to tied and censored data

In the presence of ties, the mid-scores, i.e. the averages of the scores of the tied observations, are mapped into integers as described by formula (10) in Section 3. Expectation (2) and variance (3) of the two-sample linear rank statistic (1) are then conditional on the integer-valued score vector.

If the observations are right censored, we observe bivariate responses (Z1,δ1),…,(ZN,δN), where δi=1 denotes an observed death for observation number i, δi=0 denotes a censored observation and Zi is

A Monte Carlo study

To compare the approximation of the null distribution by our lower bound (12) with the exact distribution, we performed a Monte Carlo study for a small and a medium number of cutpoints. The maximally selected rank statistics used are the Wilcoxon-score statistic with a(i) = i, the median-score statistic with a(i) = χ{i ≤ N/2}, and the log-rank statistic (no censoring) with integer-valued log-rank scores as defined in Section 5. The results of the median-score statistic allow an assessment of the
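One way such a simulation can be set up is to draw random orderings of the score vector, which are equally likely under H0, and record how often the maximally selected statistic exceeds a threshold b. The Python sketch below follows that idea. It is not necessarily the design used by the authors: the function names, the default number of replications, the fixed seed, and the standard permutation moments used for standardization are all assumptions made for illustration.

```python
import random
from math import sqrt


def max_abs_standardized(a, m1, m2):
    """Max over m1 <= m <= m2 of |S_m| for the score sequence a (ordered by X)."""
    n = len(a)
    mean_a = sum(a) / n
    ss = sum((s - mean_a) ** 2 for s in a)
    t = sum(a[:m1 - 1])                               # running sum, becomes T_m below
    best = 0.0
    for m in range(m1, m2 + 1):
        t += a[m - 1]
        var = m * (n - m) / (n * (n - 1)) * ss        # permutation variance of T_m
        best = max(best, abs(t - m * mean_a) / sqrt(var))
    return best


def monte_carlo_tail(scores, m1, m2, b, n_perm=10000, seed=1):
    """Estimate P(M > b) under H0 from n_perm random orderings of the scores."""
    rng = random.Random(seed)                         # fixed seed for reproducibility
    a = list(scores)                                  # copy, the input is not modified
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(a)
        if max_abs_standardized(a, m1, m2) > b:
            exceed += 1
    return exceed / n_perm
```

Wilcoxon scores correspond to `scores = list(range(1, N + 1))`; a 0/1 vector can be passed in the same way to mimic median scores. The estimate carries Monte Carlo error of order 1/sqrt(n_perm), so it approximates the exact distribution and does not bound it.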

Examples

For all three examples, we report the value of the maximally selected log-rank statistic for the log-rank scores (13) and the integer-valued version as given by formula (10).

Discussion

In this paper we derive a lower bound for the exact distribution of maximally selected rank statistics. Our result allows conservative P-values to be computed for small sample sizes, which are frequent in various fields of application. The result holds for ties and censored observations. Our proposal is an important addition to the established asymptotic theory of maximally selected rank statistics (Lausen and Schumacher, 1992, 1996; Lausen et al., 2002). We compare our lower

    Support from Deutsche Forschungsgemeinschaft, Grant SFB 539-A4/C1 is gratefully acknowledged.
