On the exact distribution of maximally selected rank statistics
Introduction
In clinical research, an investigator often assumes that some prognostic factor X allows patients to be classified into a risk group and a normal group with respect to a response variable Y. The functional relationship between the continuous or ordinal variable X and the response variable Y is unknown. We consider continuous or ordinal response variables which may be censored. As a simple model we assume that an unknown cutpoint in X determines two groups of observations. One group consists of all observations where X is less than or equal to the unknown cutpoint μ. The other group consists of all observations where X is greater than μ. Choosing the cutpoint μ that minimizes the P-value of a two-sample test between the two groups leads to an inflated type I error rate. Therefore, it is necessary to test whether there is a difference between the groups at all before estimating the cutpoint (Lausen and Schumacher, 1992, 1996). In this paper we focus on a test for the null hypothesis that the event X⩽μ has no influence on the distribution of the response variable Y:
H0: P(Y⩽y | X⩽μ) = P(Y⩽y | X>μ) for all y and μ.
Maximally selected rank statistics, i.e. the maximum of the empirical process of the absolute values of standardized two-sample linear rank statistics, were established by Lausen and Schumacher (1992) for testing H0. Several approximations of the limiting distribution of a maximally selected rank statistic are known (Lausen and Schumacher, 1992, 1996; Lausen et al., 1994).
Subgroup analysis or experimental costs, e.g. in DNA microarray studies, often result in small or moderate sample sizes where the use of approximations of the limiting distribution may be questionable. The evaluation of cutpoints in small samples is of special interest for P-value adjusted recursive partitioning (Lausen et al., 1994), where the number of observations in a node decreases as the tree branching increases. We therefore derive an upper bound of the P-value of a maximally selected rank statistic from the exact conditional distribution of a linear rank statistic, which can be determined by the shift algorithm proposed by Streitberg and Röhmel (1986, 1987). The use of an upper bound of the P-value ensures that the test procedure is of level α, which may not hold if the P-value is based on the limiting distribution.
The paper is organized as follows. In Section 2, we introduce notation for maximally selected statistics and give approximations of the asymptotic null distribution. The determination of the exact conditional distribution of a two-sample linear rank statistic by the shift algorithm is outlined in detail in Section 3. We derive an upper bound of the P-value of a maximally selected rank statistic for small samples in Section 4. Furthermore, we compare our proposal with approximations using an asymptotic distribution and report results of a Monte Carlo study for different sample sizes and rank statistics in Section 6. Finally, in Section 7 we illustrate our proposal with data from three clinical studies.
Maximally selected rank statistics
We consider the situation of N independent observations (Y1,X1),…,(YN,XN). Let R=(R1,…,RN) denote the rank vector of Y1,…,YN and let a=a(R)=(a1(R),…,aN(R)) denote the score vector depending on R. The choice of the scores depends on the scale of the response variable Y. For the sake of simplicity, we restrict our considerations to response variables with continuous distribution function and positive, integer-valued scores. The incorporation of real or rational scores is given in Section 3 and we
Permutation distribution of rank statistics
Without loss of generality, we assume that the sample Y1,…,YN is ordered with respect to X1,…,XN. Under the null hypothesis H0 it follows that Tμ depends on X only through the sample size mμ. Let I denote the index set of all possible sample sizes m1⩽mμ⩽m2 of the group with Xi⩽μ induced by the cutpoints μ and the limits (ε1,ε2). It is therefore sufficient to deal with the statistics Tm, m∈I, under the hypothesis that Y1,…,YN are iid. The maximum of the
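The exact conditional distribution of a two-sample linear rank statistic with integer-valued scores can be illustrated with a small dynamic-programming sketch in the spirit of the shift algorithm. It counts, for a fixed group size m, how many subsets of m scores attain each possible sum. The function name `exact_distribution` and its interface are assumptions made for this illustration and are not taken from Streitberg and Röhmel or from the paper; scores are assumed to be non-negative integers.

```python
from math import comb


def exact_distribution(scores, m):
    """Return {t: P(T_m = t)} where T_m is the sum of m of the given scores,
    all subsets of size m being equally likely (permutation null)."""
    total = sum(scores)
    # counts[k][t] = number of subsets of size k whose scores sum to t
    counts = [[0] * (total + 1) for _ in range(m + 1)]
    counts[0][0] = 1
    for a in scores:
        # Update larger subset sizes first so that each score is used at most once.
        for k in range(m, 0, -1):
            row, prev = counts[k], counts[k - 1]
            for t in range(a, total + 1):
                row[t] += prev[t - a]  # "shift" the size-(k-1) counts by a
    n_subsets = comb(len(scores), m)
    return {t: c / n_subsets for t, c in enumerate(counts[m]) if c}


if __name__ == "__main__":
    # Wilcoxon scores for N = 4 and a first group of size m = 2
    for t, p in sorted(exact_distribution([1, 2, 3, 4], 2).items()):
        print(t, round(p, 4))
```

The work grows with the number of scores, the group size and the total score sum, which is why integer-valued scores matter for this kind of computation.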
A new lower bound for the distribution of maximally selected rank statistics
We are interested in the computation of the distribution of the statistic M(a,I) under H0. Here, our aim is to compute PH0(M(a,I)>b) for a given value b.
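To make the idea of bounding PH0(M(a,I)>b) from exact marginal distributions concrete, the following sketch sums the exact two-sided tail probabilities of the standardized statistics over all group sizes in I. This is a plain Bonferroni-type (Boole) bound, which is cruder than the bound derived in the paper and is shown only as an illustration. The function names are hypothetical, the scores are assumed to be non-negative integers that are not all equal, and the mean and variance used are the standard moments of a sum of m scores drawn without replacement.

```python
from math import comb, sqrt


def subset_sum_counts(scores, m_max):
    """counts[k][t] = number of subsets of size k (k <= m_max) with score sum t."""
    total = sum(scores)
    counts = [[0] * (total + 1) for _ in range(m_max + 1)]
    counts[0][0] = 1
    for a in scores:
        for k in range(m_max, 0, -1):  # descending k: each score used at most once
            for t in range(a, total + 1):
                counts[k][t] += counts[k - 1][t - a]
    return counts


def bonferroni_bound(scores, m1, m2, b):
    """Upper bound for P(max_m |S_m| > b), m = m1..m2, under the permutation null."""
    n = len(scores)
    mean_a = sum(scores) / n
    ss = sum((a - mean_a) ** 2 for a in scores)
    counts = subset_sum_counts(scores, m2)
    bound = 0.0
    for m in range(m1, m2 + 1):
        expect = m * mean_a
        sd = sqrt(m * (n - m) * ss / (n * (n - 1)))
        # exact number of subsets whose standardized statistic exceeds b in absolute value
        tail = sum(c for t, c in enumerate(counts[m]) if abs(t - expect) / sd > b)
        bound += tail / comb(n, m)
    return min(1.0, bound)


if __name__ == "__main__":
    print(bonferroni_bound(list(range(1, 11)), 3, 7, 2.5))
```

Because the statistics for neighbouring group sizes are strongly correlated, this simple sum can be very conservative when I contains many cutpoints.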
The statistic M(a,I) is defined as a function of the standardized rank statistics Sm. Therefore, we need a representation of the possible values of the standardized statistics. As outlined in Section 2, expectation (2) and variance (3) of Tm depend on the choice of the scores a, the sample size N and m. The
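As a rough illustration of how the standardized statistics and their maximum can be obtained from data, the sketch below uses Wilcoxon scores and the standard permutation moments of a sum of m scores drawn without replacement, E(Tm) = m·ā and Var(Tm) = m(N−m)/(N(N−1))·Σ(ai−ā)². The paper's formulas (2) and (3) are not reproduced in this excerpt, so these forms are an assumption. The function names are hypothetical, ties in x and y are assumed to be absent, and m1 and m2 must satisfy 1⩽m1⩽m2⩽N−1.

```python
import math


def wilcoxon_scores(y):
    """Ranks 1..N of the responses (assumes no ties)."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    ranks = [0] * len(y)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def maximally_selected(x, y, m1, m2):
    """Return (max_m |S_m|, cutpoint in x attaining the maximum) for m = m1..m2."""
    n = len(x)
    a = wilcoxon_scores(y)
    by_x = sorted(range(n), key=lambda i: x[i])  # order observations by X
    mean_a = sum(a) / n
    ss = sum((ai - mean_a) ** 2 for ai in a)
    t = sum(a[i] for i in by_x[: m1 - 1])  # running sum of scores in the lower group
    best_stat, best_cut = -1.0, None
    for m in range(m1, m2 + 1):
        t += a[by_x[m - 1]]
        var = m * (n - m) * ss / (n * (n - 1))
        s = abs(t - m * mean_a) / math.sqrt(var)
        if s > best_stat:
            best_stat, best_cut = s, x[by_x[m - 1]]
    return best_stat, best_cut


if __name__ == "__main__":
    x = [0.3, 1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.4]
    y = [2.1, 1.9, 2.4, 2.2, 5.3, 5.9, 5.1, 6.0]
    print(maximally_selected(x, y, 2, 6))
```

The returned cutpoint is the largest X value in the lower group, so the groups are X⩽cutpoint and X>cutpoint.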
Generalizations to tied and censored data
In the presence of ties, the mid-scores, i.e. the averages of the scores of the tied observations, are mapped into integers as described by formula (10) in Section 3. Expectation (2) and variance (3) of the two-sample linear rank statistic (1) are then conditional on the integer-valued score vector.
If the observations are right censored, we observe bivariate responses (Z1,δ1),…,(ZN,δN), where δi=1 denotes an observed death for observation number i, δi=0 denotes a censored observation and Zi is
A Monte Carlo study
To compare the approximation of the null distribution by our lower bound (12) with the exact distribution, we performed a Monte Carlo study for a small and a medium number of cutpoints. The maximally selected rank statistics used are: the Wilcoxon-score statistic with a(i)=i, the median-score statistic with a(i)=χ{i⩽N/2} and the log-rank statistic (no censoring) with integer-valued log-rank scores as defined in Section 5. The results of the median-score statistic allow an assessment of the
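A Monte Carlo estimate of the null distribution of a maximally selected rank statistic can be sketched by repeatedly permuting the score vector and recomputing the maximum. The code below is only an illustration of that idea and not the design of the study reported here; the function names, the number of replications and the seed are assumptions, and the moments are the standard ones for sampling m scores without replacement.

```python
import math
import random


def max_abs_standardized(a, m1, m2):
    """max over m = m1..m2 of |T_m - E(T_m)| / sd(T_m), T_m = a[0] + ... + a[m-1]."""
    n = len(a)
    mean_a = sum(a) / n
    ss = sum((ai - mean_a) ** 2 for ai in a)
    t = sum(a[: m1 - 1])
    best = 0.0
    for m in range(m1, m2 + 1):
        t += a[m - 1]
        var = m * (n - m) * ss / (n * (n - 1))
        best = max(best, abs(t - m * mean_a) / math.sqrt(var))
    return best


def monte_carlo_pvalue(scores, m1, m2, b, n_rep=2000, seed=1):
    """Estimate P(M > b) under H0 by random permutations of the scores."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    a = list(scores)
    exceed = 0
    for _ in range(n_rep):
        rng.shuffle(a)
        if max_abs_standardized(a, m1, m2) > b:
            exceed += 1
    return exceed / n_rep


if __name__ == "__main__":
    wilcoxon = list(range(1, 21))                              # a(i) = i, N = 20
    median = [1 if i <= 10 else 0 for i in range(1, 21)]      # a(i) = 1 if i <= N/2
    print(monte_carlo_pvalue(wilcoxon, 3, 17, 2.5))
    print(monte_carlo_pvalue(median, 3, 17, 2.5))
```

Such an estimate carries simulation error, so it serves as a reference for comparison rather than as a guaranteed level-α procedure.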
Examples
For all three examples, we report the value of the maximally selected log-rank statistic for the log-rank scores (13) and the integer-valued version as given by formula (10).
Discussion
In this paper we derive a lower bound for the exact distribution of maximally selected rank statistics. Our result allows conservative P-values to be computed for small sample sizes, which are frequent in various fields of application. The result holds for ties and censored observations. Our proposal is an important addition to the established asymptotic theory of maximally selected rank statistics (Lausen and Schumacher, 1992, 1996; Lausen et al., 2002). We compare our lower
References (27)
- et al. An application of changepoint methods in studying the effect of age on survival in breast cancer. Comput. Statist. Data Anal. (1999)
- et al. A note on change point estimation in dose-response trials. Comput. Statist. Data Anal. (2001)
- et al. A cautionary note on segmenting a cyclical covariate by minimum P-value search. Comput. Statist. Data Anal. (2001)
- et al. Predictors of antiarrhythmic drug efficacy in patients with malignant ventricular tachyarrhythmias. Amer. Heart J. (1987)
- et al. A comparison of unconditional and conditional solutions to the maximum likelihood estimation of a change-point. Comput. Statist. Data Anal. (2000)
- et al. Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Comput. Statist. Data Anal. (1996)
- et al. Distinct types of diffuse large B-cell lymphoma identified by expression profiling. Nature (2000)
- Numerical computation of multivariate normal probabilities. J. Comput. Graphical Statist. (1992)
- et al. Theory of Rank Tests (1999)
- Maximally selected chi square statistics for small samples. Biometrics (1982)
- Testing a sequence of observations for a shift in location. J. Amer. Statist. Assoc.
- On the problem of using ‘optimal’ cutpoints in the assessment of quantitative prognostic factors. Onkologie
- On exact rank tests in R. R News
- 1
Support from Deutsche Forschungsgemeinschaft, Grant SFB 539-A4/C1 is gratefully acknowledged.