5th International Conference on Multiple Comparison Procedures

MCP 2007 Vienna
Vienna, Austria | July 8-11, 2007


The conference will be held from July 9 to July 11. On July 8 several pre-conference courses are offered.
Accepted Talks:

Across and Down in Large SNP Studies: the MAX test of Freidlin and Zheng vs SAS PROC CASECONTROL
Dana Aeschliman; Marie-Pierre Dube
Statistical genetics research group, Montreal Heart Institute
SAS PROC CASECONTROL offers the user 3 statistical tests for assessing the association of a SNP and a binary phenotype: the allele, genotype and trend tests. Three important models of genotype-phenotype association are the recessive, additive and dominant genetic models. In a large SNP study, one is faced with both "across" and "down" aspects of the multiple testing problem. The MAX test of Freidlin et al. (Freidlin et al., 2002, Zheng and Gastwirth 2006) builds on the ideas of Armitage (1955), Sasieni (1997), and Slager and Schaid (2001) and offers a way of testing for recessive, additive, and dominant models while producing one P-value for each SNP. In this report, we compare the power of the MAX test to each of the 3 tests in SAS PROC CASECONTROL. We show that the MAX test compares very favorably. We developed a program in R to simulate genetic data sets of varying complexity. We provide two SAS MACROs that use only BASE SAS. One encodes the MAX test. The second MACRO acts as a wrapper for the first and encodes a step-down resampling algorithm, Westfall and Young's (1993) Algorithm 2.8, resulting in p-values which are corrected for the correlation between test statistics. We comment on the notion of subset pivotality as applied to this situation and discuss the treatment of missing values.

References: Zheng, G. and Gastwirth, J. (2006) On estimation of the variance in Cochran-Armitage trend tests for genetic association using case-control studies. Statistics in Medicine; 25(18): 3150-3159. Freidlin, B. et al. (2002) Trend Tests for Case-Control Studies of Genetic Markers: Power, Sample Size and Robustness. Human Heredity; 53, 3. Slager, S.L. and Schaid, D.J. (2001) Case-Control Studies of Genetic Markers: Power and Sample Size Approximations for Armitage’s Test for Trend. Human Heredity; 52, 3. Sasieni, P.D. (1997) From genotypes to genes: Doubling the sample size. Biometrics; 53: 1253-1261. Armitage, P. (1955) Tests for linear trends in proportions and frequencies. Biometrics; 11: 375-386. Westfall, P.H. and Young, S.S. (1993) Resampling-Based Multiple Testing. John Wiley and Sons, Inc.
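
As a rough illustration of the MAX statistic discussed in this abstract, the following Python sketch computes Cochran-Armitage trend statistics under recessive, additive and dominant scores and takes the largest absolute value. It is not the authors' SAS macros: the function names, the genotype ordering (0, 1 or 2 copies of the risk allele), the example counts and the plain label-permutation p-value (one SNP, no step-down adjustment across SNPs) are all assumptions made for illustration.

```python
import numpy as np

def catt_z(cases, controls, scores):
    """Cochran-Armitage trend Z statistic for a 2x3 genotype table."""
    r = np.asarray(cases, dtype=float)     # case counts per genotype
    s = np.asarray(controls, dtype=float)  # control counts per genotype
    x = np.asarray(scores, dtype=float)    # genotype scores
    n = r + s
    R, S, N = r.sum(), s.sum(), n.sum()
    t = np.sum(x * (S * r - R * s))
    var = R * S / N * (N * np.sum(x**2 * n) - np.sum(x * n) ** 2)
    return t / np.sqrt(var)  # assumes var > 0 (no degenerate table)

def max_stat(cases, controls):
    """Largest |Z| over recessive (0,0,1), additive (0,.5,1), dominant (0,1,1)."""
    return max(abs(catt_z(cases, controls, (0.0, c, 1.0))) for c in (0.0, 0.5, 1.0))

def max_perm_pvalue(cases, controls, n_perm=2000, seed=0):
    """Permutation p-value for MAX obtained by shuffling case/control labels."""
    rng = np.random.default_rng(seed)
    cases = np.asarray(cases, dtype=int)
    controls = np.asarray(controls, dtype=int)
    # One genotype code per subject: cases first, then controls.
    geno = np.repeat([0, 1, 2, 0, 1, 2], np.concatenate([cases, controls]))
    n_cases = int(cases.sum())
    observed = max_stat(cases, controls)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(geno)
        r = np.bincount(perm[:n_cases], minlength=3)
        s = np.bincount(perm[n_cases:], minlength=3)
        hits += max_stat(r, s) >= observed
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    cases, controls = (30, 50, 40), (50, 50, 20)  # made-up counts
    print(max_stat(cases, controls), max_perm_pvalue(cases, controls))
```

Because the three trend statistics are correlated, MAX is not standard normal under the null, which is why the sketch falls back on resampling for its p-value.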

Multiple Testing Procedures for Hierarchically Related Hypotheses
Przemyslaw Biecek
Institute of Mathematics and Computer Science, Wroclaw University of Technology
In some genomic studies, the considered hypotheses are in a hierarchical relation. For example in Gene Set Functional Enrichment Analysis (GSFEA), we are confronted with a problem of testing thousands of hypotheses which correspond to different biological terms. Since the biological terms are hierarchically related, the corresponding hypotheses are also related. If biological term f(i) is more specific than biological term f(j), then the rejection of hypothesis H_0(i) associated with the term f(i) implies the rejection of hypothesis H_0(j) associated with the term f(j) (the relationship between biological attributes is defined by the Gene Ontology Biological Process (GO-BP) hierarchical taxonomy [1]).

In this case, in addition to correcting for the number of hypotheses, we want to guarantee that testing outcomes are coherent with the relation among biological functions. Popular multiple testing procedures (e.g. step-up, step-down or single-step) do not guarantee this coherence. Moreover, methods designed for testing under hierarchical relations (see [2]) do not provide control of the FDR and cannot be easily applied in the context of GSFEA.

We propose a novel approach which incorporates knowledge about the relation among hypotheses. We consider an issue of testing a set of null hypotheses with a given hierarchical relation among them. The relation, represented by a directed acyclic graph (DAG), determines all possible outcomes of testing. It also leads to the two natural testing procedures (the follow up and the follow down) presented in this paper. For these procedures, we derive formulas for significance levels that provide a strong control of the three most popular error rates (FWER, PFER and FDR). We also present a simulation study for the proposed testing procedures, discuss their strengths and weaknesses and point out some applications.

[1] Harris, M. A., et al. (2004)
"The Gene Ontology (GO) database and informatics resource."
Nucleic Acids Res. 32(Database issue): D258–D261. doi: 10.1093/nar/gkh036

[2] Finner, H., Strassburger, K. (2002)
"The partitioning principle: A powerful tool in multiple decision theory."
The Annals of Statistics, Vol. 30, No. 4, 1194–1213

Multi-treatment optimal response-adaptive designs for continuous responses
Atanu Biswas; Saumen Mandal
Indian Statistical Institute, Kolkata
Optimal response-adaptive designs for phase III clinical trials are attracting growing interest. Most of the available designs are not derived from any optimality consideration. An optimal design for binary responses is given by Rosenberger et al. (2001) and an optimal design for continuous responses is provided by Biswas and Mandal (2004). Recently, Zhang and Rosenberger (2006) provided another design for normal responses. The present paper deals with some shortcomings of the earlier works and then extends the present approach to more than two treatments. The proposed methods are illustrated using some real data.

A Procedure to Multiple Comparisons of Diagnostic Systems
Ana Cristina Braga; Lino A. Costa, Pedro N. Oliveira
University of Minho
In this work, a method for the comparison of two diagnostic systems based on ROC curves is presented. ROC curve analysis is often used as a statistical tool for the evaluation of diagnostic systems. For a given test, the compromise between the False Positive Rate (FPR) and the True Positive Rate (TPR) can be presented graphically through a ROC curve. However, in general, the comparison of ROC curves is not straightforward, in particular when they cross each other. A similar difficulty is also observed in the multi-objective optimization field, where sets of solutions defining fronts must be compared in a multi-dimensional space. Thus, the proposed methodology is based on a procedure used to compare the performance of distinct multi-objective optimization algorithms. Traditionally, methods based on the area under the ROC curves are not sensitive to the existence of crossing points between the curves. The new approach can deal with this situation and also allows the comparison of partial portions of ROC curves according to particular values of sensitivity and specificity that are of practical interest. For illustration purposes, real data from a Portuguese hospital were considered.

A general principle for shortening closed test procedures with applications
Werner Brannath; Frank Bretz
Medical University of Vienna
The closure principle is a general, simple and powerful method for constructing multiple test procedures controlling the familywise error rate in the strong sense. In spite of its generality and simplicity, the closure principle has the disadvantage that the number of individual tests required for its completion increases exponentially with the number of null hypotheses of primary interest. Hence, multiple test procedures based on the closure principle can require large computational efforts and may become infeasible for a large number of hypotheses and/or for computationally intensive hypothesis tests, such as permutation or bootstrap tests.

Shortcut procedures have been proposed in the past, which substantially reduce the number of operations. In this presentation we provide a general principle for shortening closed tests. This principle provides a unified approach that covers many known shortcut procedures from the literature. As one application among others we derive a shortcut procedure for flexible two stage closed tests for which no shortcuts have been available yet.
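
To make the exponential cost and the idea of a shortcut concrete, the Python sketch below runs a full closed test with Bonferroni local tests on every intersection hypothesis and compares it with the Holm step-down procedure, the classical shortcut for this particular choice of local test. It is only an illustration under those assumptions and does not reproduce the general principle of the talk; the function names and p-values are invented.

```python
from itertools import combinations

def closed_bonferroni(pvals, alpha=0.05):
    """Full closed test: H_i is rejected only if every intersection
    hypothesis containing i is rejected by its Bonferroni local test."""
    m = len(pvals)
    rejected = [True] * m
    n_tests = 0
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            n_tests += 1
            if min(pvals[i] for i in subset) > alpha / size:
                # This intersection is retained, so none of its members
                # can be rejected by the closed procedure.
                for i in subset:
                    rejected[i] = False
    return rejected, n_tests

def holm(pvals, alpha=0.05):
    """Holm step-down shortcut: at most m comparisons."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for step, i in enumerate(order):
        if pvals[i] > alpha / (m - step):
            break
        rejected[i] = True
    return rejected

if __name__ == "__main__":
    p = [0.001, 0.012, 0.03, 0.2]  # made-up p-values
    print(closed_bonferroni(p), holm(p))
```

With m hypotheses the full closure evaluates 2^m - 1 intersection tests, whereas the shortcut needs at most m, yet both return the same decisions here.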

Powerful short-cuts for gatekeeping procedures
Frank Bretz; Gerhard Hommel, Willi Maurer
Novartis Pharma AG
We present a general testing principle for a class of multiple testing problems based on weighted hypotheses. Under moderate conditions, this principle leads to powerful consonant multiple testing procedures. Furthermore, short-cut versions can be derived, which simplify substantially the implementation and interpretation of the related test procedures. It is shown that many well-known multiple test procedures turn out to be special cases of this general principle. Important examples include gatekeeping procedures, which are often applied in clinical trials when primary and secondary objectives are investigated, and multiple test procedures based on hypotheses which are completely ordered by importance. We illustrate the methodology with two real clinical studies.

Adjusting p-values of a stepwise generalized linear model
Chiara Brombin; Finos L., Salmaso L.
University of Padova
Stepwise variable selection methods are frequently used to determine the predictors of an outcome in a generalized linear model (glm). Despite their widespread use, it is well known that the tests on the explained deviance of the selected model are biased. This arises from the fact that the ordinary test statistics upon which these methods are based were intended for testing pre-specified hypotheses, whereas the tested model is selected through a data-steered procedure. In this work we define and discuss a simple nonparametric procedure which corrects the p-value of the selected model of any stepwise selection method for glm. We also prove that this procedure falls in the class of weighted nonparametric combining functions defined by Pesarin [1] and extended in Finos and Salmaso [2]. The unbiasedness and consistency of the method are also proved. A simulation study also shows the validity of this procedure. Theoretical differences with previous works in the same field (Grachanovsky and Pinsker, [3]; Harshman and Lundy, [4]) are also provided. Free code for R and Matlab is available and an application to a real dataset is presented.

[1] Pesarin, F. (2001). Multivariate Permutation tests: with application in Biostatistics. John Wiley & Sons, Chichester-New York.
[2] L. Finos, L. Salmaso (2006). Weighted methods controlling the multiplicity when the number of variables is much higher than the number of observations. Journal of Nonparametric Statistics 18, 2, 245–261.
[3] E. Grachanovsky, I. Pinsker (1995). Conditional p-values for the F-statistic in a forward selection procedure. Computational Statistics & Data Analysis 20, 239-263.
[4] R. A. Harshman, M. E. Lundy (2006). A randomization method of obtaining valid p-values for model changes selected “post hoc”. http://publish.uwo.ca/~harshman/imps2006.pdf

Multiple Testing Procedures with Incomplete Data for Rank-based Tests of Ordered Alternatives.
Paul Cabilio; Jianan Peng
Acadia University
Page (1963) and Jonckheere (1954) introduced tests for ordered alternatives in blocked experiments. Specifically, in the model with n blocks and t treatments, the aim is to test the hypothesis of no treatment effect against a specified ordered treatment effect with at least one strict inequality. Page proposed a statistic which can be expressed as the sum of Spearman correlations between each block and the criterion ranking chosen to be (1,2,...,t), while Jonckheere proposed a statistic which is based on Kendall's tau correlation. These tests were extended in Alvo and Cabilio (1995) to the situation where only k(i) treatment responses are observed in block i. For such incomplete blocks, the resulting extended Page statistic L* differs from the one in the complete case in that the complete rank of a response in each block is replaced by a weight times a score which is either the incomplete rank of the response or the average rank (k(i)+1)/2, depending on whether or not the treatment is ranked in that block. If the null hypothesis is rejected, it is of interest to construct test procedures to identify which inequalities in the alternative are strict, and in so doing maintain the experimentwise error rate at a pre-assigned level. Our approach in this case is to modify one or more procedures that have been developed for detecting ordered means in the context of ANOVA (Nashimoto and Wright, 2005). The form of the extended Page statistic makes it possible to apply a general step-down testing procedure for multiple comparisons such as that proposed in Marcus, Peritz, and Gabriel (1976) for normal-based tests. Specifically, we define a partition of the integers 1 to t into h sets of consecutive integers. For each set of integers in the partition we define an extended Page test statistic to test the sub-alternative hypothesis of ordered effects of treatments indexed by such integers.
The intersection of such hypotheses over the partition can then be tested by the sum of such statistics. The procedure is to test all such hypotheses over all possible partitions. This approach may also be used for the extended Jonckheere statistic.
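
For orientation, the classical Page statistic for complete blocks can be computed as in the Python sketch below. It covers only the complete-data case with no ties within a block; the extended statistic L* for incomplete blocks and the step-down procedure of the abstract are not attempted. The function name and data are hypothetical, and the normal approximation uses the standard null mean n t (t+1)^2 / 4 and variance n t^2 (t+1)^2 (t-1) / 144.

```python
import numpy as np

def page_statistic(data):
    """Page's L for an (n blocks) x (t treatments) array, with columns in
    the hypothesised increasing order. Returns (L, z); assumes no ties."""
    x = np.asarray(data, dtype=float)
    n, t = x.shape
    # Rank the responses within each block (1 = smallest).
    ranks = x.argsort(axis=1).argsort(axis=1) + 1
    weights = np.arange(1, t + 1)  # criterion ranking (1, 2, ..., t)
    L = float((ranks * weights).sum())
    mean = n * t * (t + 1) ** 2 / 4
    var = n * t**2 * (t + 1) ** 2 * (t - 1) / 144
    return L, (L - mean) / np.sqrt(var)

if __name__ == "__main__":
    blocks = [[1.0, 2.0, 3.0, 4.0],
              [0.1, 0.5, 0.9, 1.3],
              [5.0, 6.0, 7.0, 8.0]]  # made-up, perfectly ordered data
    print(page_statistic(blocks))
```

Large values of L, equivalently large z, support the ordered alternative.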

A leave-p-out based estimation of the proportion of null hypotheses in multiple testing problems
Alain Celisse
UMR 518 AgroParisTech / INRA MIA
A large part of the literature has been devoted to multiple testing problems since the introduction of the False Discovery Rate (FDR) by Benjamini and Hochberg (1995). In this seminal paper, the authors provide a procedure that enables control of the FDR at a pre-specified level. However, an improvement of the method in terms of power is possible thanks to the introduction of an estimate of the unknown proportion of true null hypotheses: pi0. We propose an estimator of this proportion that relies on both density estimation by means of irregular histograms and exact leave-p-out cross-validation.
We first estimate the density of p-values from a collection of irregular histograms, among which we select the best estimator in terms of minimization of the quadratic risk. The estimate of pi0 is deduced as the height of the largest column of the selected histogram. An estimator of the risk is obtained by use of leave-p-out cross-validation. We present a closed formula for this risk estimator and an automatic choice of the parameter p in the leave-p-out, which consists in minimizing the mean square error (MSE) of the leave-p-out risk estimator.
In addition, recent papers have pointed out that the use of two-sided statistics in one-sided tests entails p-values corresponding to false positives close to 1. Whereas most of the existing estimators do not take this phenomenon into account, leading to systematic overestimation, our estimator of the proportion remains accurate in such situations.
Finally, we compare our procedure with existing ones in simulations, showing as well how problematic false positives near 1 may be. The proposed estimator appears more accurate, for instance in terms of variability, and yields better FDR estimates.
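
The way an estimate of pi0 can buy power is sketched below in Python: the Benjamini-Hochberg step-up rule is run at level alpha / pi0 instead of alpha. The histogram-based leave-p-out estimator of the abstract is not implemented; as a stand-in, the sketch uses the simple Storey-type estimator #{p > lambda} / (m (1 - lambda)). The function names, the choice lambda = 0.5 and the p-values are assumptions for illustration.

```python
import numpy as np

def bh_reject(pvals, alpha=0.05, pi0=1.0):
    """Benjamini-Hochberg step-up rule; pi0 < 1 gives the plug-in variant."""
    if pi0 <= 0:
        raise ValueError("pi0 must be positive")
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / (m * pi0)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index meeting its threshold
        reject[order[: k + 1]] = True
    return reject

def storey_pi0(pvals, lam=0.5):
    """Stand-in pi0 estimate (not the leave-p-out histogram estimator)."""
    p = np.asarray(pvals, dtype=float)
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))

if __name__ == "__main__":
    p = [0.001, 0.015, 0.029, 0.045, 0.6]  # made-up p-values
    print(bh_reject(p).sum(), bh_reject(p, pi0=storey_pi0(p)).sum())
```

With so few p-values the pi0 estimate is of course very crude; the example is only meant to show the mechanics of the plug-in step.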

Multiple Testing in Change-Point Problem with Application to Safety Signal Detection
Jie Chen
Merck Research Laboratories
Detection of a change point usually requires testing multiple null hypotheses. In this talk we focus on the inference of a change in the ratio of two time-ordered Poisson stochastic processes, by developing multiple testing procedures which offer the control of some error rates. Possible extensions of the procedures to multiple change-points are explored. The procedures are illustrated using a real data example for drug safety signal detection and a simulation study.

On the Probability of Correct Selection for Large k Populations, with Application to Microarray Data
Xinping Cui; Jason Wilson
University of California, Riverside
One frontier of modern statistical research is the “multiple comparison problem” (MCP) arising from data sets with large k (>1000) populations, e.g. microarrays and neuroimaging data. In this talk we demonstrate an alternative to hypothesis testing. It is an extension of the Probability of Correct Selection (PCS) concept. The idea is to select the top t out of k populations and estimate the probability that the selection is correct, according to specified selection criteria. We propose “d-best” and “G-best” selection criteria that are suitable for large k problems and illustrate the application of the proposed method on two microarray data sets. Results show that our method is a powerful method for the purpose of selecting the “top t best” out of k populations.

A semi-parametric approach for mixture models: Application to local FDR estimation
Jean-Jacques Daudin; A. Bar-Hen, L. Pierre, S. Robin
AgroParisTech / INRA
In the context of multiple testing, the estimation of the false discovery rate (FDR) or local FDR can be stated in the mixture model context. We propose a procedure to estimate a two-component mixture model where one component is known. The unknown part is estimated with a weighted kernel function, whose weights are defined in an adaptive way. We prove the convergence and uniqueness of our estimation procedure. We use this procedure to estimate the posterior population probabilities and the local FDR.

Key words: FDR, Mixture model, Multiple testing procedure, Semi-parametric density estimation.

Asymptotic improvements of the Benjamini-Hochberg method for FDR control based on an asymptotically optimal rejection curve
Thorsten Dickhaus; Helmut Finner, Markus Roters
German Diabetes Center, Leibniz-Institute at the Heinrich–Heine-University Düsse
Due to current applications with a large number $n$ of hypotheses, asymptotic control ($n \to \infty$) of the false discovery rate (FDR) has become a major topic in the field of multiple comparisons. In general, the original linear step-up (LSU) procedure proposed in Benjamini & Hochberg (1995) does not exhaust the pre-specified FDR level, which gives hope for improvements with respect to power.

Based on some heuristic considerations, we present a new rejection curve and implement this curve into several stepwise multiple test procedures for asymptotic FDR control. It will be shown that the new tests asymptotically exhaust the full FDR level under some extreme parameter configurations. This optimality leads to an asymptotic gain of power in comparison with the LSU procedure.

For the finite case, we discuss adjustments both of the curve and of the procedures in order to provide strict FDR control.


Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57, 289-300.

Benjamini, Y. & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 4, 1165-1188.

Finner, H., Dickhaus, T. & Roters, M. (2007). On the false discovery rate and an asymptotically optimal rejection curve. Submitted for publication.

Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Stat. 30, 1, 239-257.

Comparison of Methods for Estimating Relative Potencies in Multiple Bioassay Problems
Gemechis Dilba
Institute of Biostatistics, Leibniz University of Hannover, Germany
Relative potency estimations in both multiple parallel-line and slope-ratio assays involve construction of simultaneous confidence intervals for ratios of linear combinations of general linear model parameters. The key problem here is that of determining the multiplicity-adjusted percentage point of a multivariate t-distribution whose correlation matrix R depends on the unknown ratio parameters. Several methods have been proposed in the literature on how to deal with R. Among others, conservative methods based on probability inequalities (e.g., Boole's and Sidak inequalities) and a method based on an estimate of R are used. In this talk, we explore and compare the various methods (including the delta approach) in a more comprehensive manner with respect to their simultaneous coverage probabilities via Monte Carlo simulations. The methods will also be evaluated in terms of confidence interval width through application to data on a multiple parallel-line assay.
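
The two inequality-based adjustments mentioned above can be sketched in a few lines of Python. The sketch computes the per-comparison levels implied by Boole's (Bonferroni) and Sidak's inequalities for m simultaneous intervals and, as a large-sample simplification, converts them to two-sided standard normal quantiles. It deliberately ignores the correlation matrix R and uses the normal in place of the t-distribution, so it is only a stand-in for the percentage points discussed in the abstract; the function names are hypothetical.

```python
from statistics import NormalDist

def per_comparison_levels(alpha, m):
    """Per-comparison levels for m simultaneous two-sided intervals."""
    bonferroni = alpha / m                 # Boole's inequality
    sidak = 1 - (1 - alpha) ** (1 / m)     # Sidak's inequality
    return bonferroni, sidak

def two_sided_normal_quantile(level):
    """Critical point c with P(|Z| > c) = level for standard normal Z."""
    return NormalDist().inv_cdf(1 - level / 2)

if __name__ == "__main__":
    bonf, sidak = per_comparison_levels(0.05, 4)
    print(two_sided_normal_quantile(bonf), two_sided_normal_quantile(sidak))
```

The Sidak level is slightly larger than the Bonferroni level, so its critical point is slightly smaller and the resulting intervals slightly narrower.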

Adaptive model-based designs in clinical drug development
Vlad Dragalin
Wyeth Research
The objective of a clinical trial may be either to target the maximum tolerated dose or minimum effective dose, or to find the therapeutic range, or to determine the optimal safe dose to be recommended for confirmation, or to confirm efficacy over control in a Phase III clinical trial. This clinical goal is usually determined by the clinicians from the pharmaceutical industry, practicing physicians, key opinion leaders in the field, and the regulatory agency. Once agreement has been reached on the objective, it is the statistician's responsibility to provide the appropriate design and statistical inferential structure required to achieve that goal. There are plenty of designs available on the statistician's shelf. The greatest challenge is their implementation. We exemplify this in three case studies.

Some insights into FDR and k-FWER in terms of average power and overall rejection rate
Meng Du
Department of Statistics, University of Toronto
This paper provides some insights into the false discovery rate (FDR) and the k-familywise error rate (k-FWER), through comparing, in terms of the average power, an FDR controlling procedure by Benjamini and Hochberg (1995) and a k-FWER controlling procedure by Lehmann and Romano (2005). A further look at the overall rejection rate, the probability of obtaining at least one single discovery, explains the behavior patterns of the average powers of these two procedures that control different types of error rates.

Keywords: average power, false discovery rate, k-familywise error rate, large-scale multiple testing, overall rejection rate.
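
For readers less familiar with k-FWER control, the Python sketch below implements a step-down procedure with the critical values usually attributed to Lehmann and Romano (2005): k alpha / m for the k smallest p-values and k alpha / (m + k - i) for the i-th smallest when i > k. The abstract does not say which variant of their procedure was studied, so this choice, the function name and the example p-values are assumptions.

```python
def kfwer_stepdown(pvals, k=1, alpha=0.05):
    """Step-down k-FWER procedure; k = 1 reduces to Holm's procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            crit = k * alpha / m
        else:
            crit = k * alpha / (m + k - rank)
        if pvals[i] > crit:
            break  # stop at the first p-value above its critical value
        rejected[i] = True
    return rejected

if __name__ == "__main__":
    p = [0.001, 0.012, 0.03, 0.2]  # made-up p-values
    print(kfwer_stepdown(p, k=1), kfwer_stepdown(p, k=2))
```

Tolerating up to k - 1 false rejections relaxes the critical values, so the procedure with k = 2 rejects at least as many hypotheses as the one with k = 1.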

Sequentially rejective test procedures for partially ordered sets of hypotheses
David Edwards; Jesper Madsen
Novo Nordisk
A popular method to control multiplicity in confirmatory clinical trials is to use a hierarchical (sequentially rejective) test procedure, based on an a priori ordering of the hypotheses. The talk describes a simple generalization of this approach in which the hypotheses are partially ordered. It is convenient to display the partial ordering as a directed acyclic graph (DAG). To obtain strong FWE control, certain intersection hypotheses must be inserted into the DAG. The resulting DAG is called partially closed. The purpose of the approach is to enable the construction of inference strategies for confirmatory clinical trials that more closely reflect the trial objectives.
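
The starting point of this abstract, the hierarchical procedure for a fully ordered list of hypotheses, can be sketched as follows in Python. Each hypothesis is tested at the full level alpha, in the pre-specified order, and testing stops at the first non-rejection. The generalization to partially ordered hypotheses and the insertion of intersection hypotheses into the DAG are not attempted; the function name and p-values are hypothetical.

```python
def fixed_sequence(pvals_in_order, alpha=0.05):
    """Hierarchical (fixed-sequence) test for a priori ordered hypotheses."""
    decisions = []
    stopped = False
    for p in pvals_in_order:
        if not stopped and p <= alpha:
            decisions.append(True)
        else:
            stopped = True  # no later hypothesis may be rejected
            decisions.append(False)
    return decisions

if __name__ == "__main__":
    # The last p-value is small but comes after a non-rejection.
    print(fixed_sequence([0.01, 0.04, 0.2, 0.001]))
```

The example shows the price of the ordering: a hypothesis with a very small p-value is not rejected once an earlier hypothesis in the sequence has been retained.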

Homogeneity of stages in adaptive designs
Andreas Faldum
IMBEI - Universitätsklinikum Mainz
Adaptive designs result in great flexibility in clinical trials and guarantee full control of type I error. Despite increasing interest, such designs are only hesitantly implemented in pharmaceutical trials. One possible reason is concern of the regulatory authorities. In a reflection paper on methodological issues in confirmatory clinical trials with flexible design and analysis plan [EMEA 06], the European Medicines Agency (EMEA) requests methods to assure comparable results of interim and end analysis. The authors point out that it might be difficult to interpret the conclusions from a trial if it is suspected that the observed discrepancies of stages are a consequence of dissemination of the interim results. EMEA states that the simple rejection of the global null hypothesis across all stages is not sufficient to establish a convincing treatment effect. In order to avoid jeopardizing the success of a trial by differing results of the stages, the probability of such discrepancies should be taken into account when planning a trial.
In this talk we concentrate on two-stage adaptive designs. Boundaries for discrepant effect estimates of stages are given dependent on the p value of the first stage and the adaptive design selected. By choosing an appropriate adaptive design a rejection of the null hypothesis despite a relevantly reduced effect estimate in the second stage can be prevented. On the other hand, rejection of the null hypothesis with treatment effect estimates increasing relevantly over stages cannot reasonably be avoided. However, the probability of rejecting the null hypothesis with homogeneous effect estimates of both stages can be predetermined. The results can help to find an adaptive design, which prevents a relevant decrease of effect estimates in case of a significant trial success and reduces the probability of a random relevant increase in the effect estimate. The underlying analyses can be used as a basis for discussion with the regulatory authorities. The considerations proposed here will be clarified by examples.

EMEA (2006). Reflection Paper on Methodological Issues in Confirmatory Clinical Trials with Flexible Design and Analysis Plan. CHMP/EWP/2459/02, end of consultation Sept 2006, http://www.emea.eu.int/pdfs/human/ewp/245902en.pdf.

FDR-control: Assumptions, a unifying proof, least favorable configurations and FDR-bounds
Helmut Finner; Thorsten Dickhaus, Markus Roters
German Diabetes Center, Leibniz-Institute at the Heinrich–Heine-University Düsse
We consider multiple test procedures in terms of p-values based on a fixed rejection curve or a critical value function and study their FDR behavior.
First, we introduce a series of assumptions concerning the underlying distributions and the structure of possible multiple test procedures.
Then we give a short and unifying proof of FDR control for procedures (step-up, step-down, step-up-down) based on Simes' critical values for independent p-values and for a special class of dependent p-values considered in Benjamini and Yekutieli (2001), Sarkar (2002) and Finner, Dickhaus and Roters (2007).
Moreover, we derive upper bounds for the FDR for non step-up procedures which can be calculated with respect to Dirac-uniform configurations.
Finally, it will be shown that Dirac-uniform configurations are asymptotically least favorable for certain step-up-down procedures when the number of hypotheses tends to infinity.


Benjamini, Y. and Yekutieli, D. (2001).
The control of the false discovery rate in multiple testing under dependency.
The Annals of Statistics 29, 1165-1188.

Finner, H., Dickhaus, T. and Roters, M. (2007).
Dependency and false discovery rate: Asymptotics.
The Annals of Statistics, to appear.

Finner, H., Dickhaus, T. and Roters, M. (2007).
On the false discovery rate and an asymptotically optimal rejection curve.
Submitted for publication.

Sarkar, S. K. (2002)
Some results on false discovery rate in stepwise multiple testing procedures.
The Annals of Statistics 30, 239-257.

Non-negative matrix factorization and sequential testing
Paul Fogel; S. Stanley Young, NISS (possibly speaker)
Consultant, Paris
The “omic” sciences (transcriptomics, proteomics, metabolomics) all have data sets with n much lower than p, leading to serious multiple testing problems. On the other hand, the coordination of biological action implies that there will be important correlation structures in these data sets. There is a need to take advantage of these correlations in any statistical analysis. We use non-negative matrix factorization to organize the predictors into sets. We allocate alpha over the sets and then test sequentially within each set. The within-set testing is sequential, so there is no need for multiple testing adjustment. We use simulations to demonstrate the increased power of our methods. We demonstrate our methods with a real data set using a SAS JMP script.

Exploring changes in treatment effects across design stages in adaptive trials
Tim Friede; Robin Henderson
University of Warwick, Warwick Medical School
The recently published draft of a CHMP reflection paper on flexible designs highlights a controversial issue regarding the interpretation of adaptive trials when the treatment effect estimates differ across design stages (CHMP, 2006). In Section 4.2.1 it states “… the applicant must pre-plan methods to ensure that results from different stages of the trial can be justifiably combined. In this respect, studies with adaptive designs need at least the same careful investigation of heterogeneity and justification to combine the results of different stages as is usually required for the combination of individual trials in a metaanalysis.” This suggests that a test for heterogeneity should be preplanned and in the event of a significant result the policy should be to discard observations subsequent to the interim analysis that induced changes in the treatment. In this presentation we investigate the error rates of this procedure. Furthermore, we present an alternative testing strategy which is based on change point methods to detect calendar time effects (Friede and Henderson, 2003; Friede et al., 2006). In a simulation study we demonstrate that our procedure performs favourably compared to the procedure suggested by the guideline. Tools that help to explore changes in treatment effects will be discussed.

Committee for Medicinal Products for Human Use (2006) Reflection paper on methodological issues in confirmatory clinical trials with flexible design and analysis plan. London, 23 March 2006, Doc. Ref. CHMP/EWP/2459/02.

Friede T, Henderson R (2003) Intervention effects in observational studies with an application in total hip replacements. Statistics in Medicine 22: 3725-3737.

Friede T, Henderson R, Kao CF (2006) A note on testing for intervention effects on binary responses. Methods of Information in Medicine 45: 435-440.

On estimates of R-values in selection problems
Andreas Futschik
University of Vienna
In the context of selection, quantities analogous to p-values (called R-values) have been introduced by J. Hsu (1984). They may be interpreted as a measure of evidence for rejecting (i.e. not selecting) a population. As in multiple hypothesis testing when p-values are corrected for multiplicity, these R-values can be quite conservative in high dimensional settings unless the parameters are close to the least favorable configuration. We propose estimates of R-values that are less conservative and investigate their behavior. They also lead to selection rules for high dimensional problems.

False discovery proportion control under dependence
Yongchao Ge
Mount Sinai School of Medicine, New York
In datasets involving the problem of multiple testing, we are interested in statistical inference on a) the total number $m_1$ of false null hypotheses, and b) the random variable false discovery proportion (FDP): the ratio of the total number of false positives to the total number of positives. The expectation of the FDP is the false discovery rate defined by Benjamini and Hochberg (1995). We describe a general algorithm to construct an upper prediction band for the FDPs and a lower confidence bound for $m_1$ simultaneously. This algorithm has three features: i) resampling to incorporate the dependence information among the test statistics to improve power, ii) an appropriate normalization of the ordered test statistics or the numbers of false positives, and iii) carefully chosen rejection regions. Two interesting choices for normalizations are standard normalization and quantile normalization. The former choice generalizes the maxZ procedure (Ge et al 05, Meinshausen and Rice 06) from independent to dependent data, and the latter improves the work by Meinshausen 06. The properties of these two choices of normalizations combined with other normalizations are compared with simulated data and microarray data.

Resampling-Based Empirical Bayes Multiple Testing Procedure for Controlling the False Discovery Rate with Applications to Genomics
Houston Gilbert; Sandrine Dudoit, Mark J. van der Laan
University of California, Berkeley
We propose resampling-based empirical Bayes multiple testing procedures (MTP) for controlling a broad class of Type I error rates, defined as tail probabilities and expected values for arbitrary functions of the numbers of false positives and true positives [3, 4]. Such error rates include, in particular, the popular false discovery rate (FDR), defined as the expected proportion of Type I errors among the rejected hypotheses. The approach involves specifying the following: (i) a joint null distribution (or estimator thereof) for vectors of null test statistics; (ii) a distribution for random guessed sets of true null hypotheses. A working model for generating pairs of random variables from distributions (i) and (ii) is a common marginal non-parametric mixture distribution for the test statistics. By randomly sampling null test statistics and guessed sets of true null hypotheses, one obtains a distribution for a guessed specific function of the numbers of false positives and true positives, for any given vector of cut-offs for the test statistics. Cut-offs can then be chosen to control tail probabilities and expected values of this distribution at a user-supplied level.

We wish to stress the generality of the proposed resampling-based empirical Bayes approach: (i) it controls tail probability and expected value error rates for a broad class of functions of the numbers of false positives and true positives; (ii) unlike most MTPs controlling the proportion of false positives, it is based on a test statistics joint null distribution and provides Type I error control in testing problems involving general data generating distributions with arbitrary dependence structures among variables; (iii) it can be applied to any distribution pair for the null test statistics and guessed sets of true null hypotheses, i.e., the common marginal non-parametric mixture model is only one among many reasonable working models that does not assume independence of the test statistics.

Simulation study results indicate that resampling-based empirical Bayes MTPs compare favorably in terms of both Type I error control and power to competing FDR-controlling procedures, such as those of Benjamini and Hochberg (1995) [1] and Storey (2002) [5]. The proposed MTPs are also applied to DNA microarray-based genetic mapping and gene expression studies in Saccharomyces cerevisiae [2].

1.) Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 1995.

2.) R.B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci., 2005.

3.) S. Dudoit and M.J. van der Laan. Multiple Testing Procedures and Applications to Genomics. Springer, 2007. (In preparation).

4.) S. Dudoit, H.N. Gilbert and M.J. van der Laan. Resampling-based empirical Bayes multiple testing procedure for controlling the false discovery rate. Technical report, Division of Biostatistics, University of California, Berkeley, 2007. (In preparation).

5.) J.D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 2002.

Comparing treatment combinations with the corresponding monotherapies in clinical trials
Ekkehard Glimm; Norbert Benda
Novartis Pharma AG
The intention of many clinical trials is to show superiority of a treatment over two others. For example, a combination therapy may be compared to the corresponding monotherapies. In such a trial two drugs are administered simultaneously. A beneficial effect might arise from a synergistic effect of the monotherapies. Even in the presence of an antagonistic effect, however, a simple superiority of the combination drug might be sufficient, e.g. as a way to overcome dose limitations of the monotherapies.

The standard confirmatory statistical test consists of two tests at level $\alpha$ and rejection if both of them are significant. This approach was called the min test by Laska and Meissner (1989), who showed that it is uniformly most powerful in a certain class of monotone tests. However, while it exhausts the $\alpha$-level if the difference between monotherapy effects approaches infinity, it is very conservative in the practically more relevant situation of similar monotherapy effects. Sarkar et al. (1995) have shown that it is possible to construct tests that are uniformly more powerful than this approach if the notion of monotonicity is abandoned.

In this talk, we will present alternatives to the tests suggested by Sarkar et al., some of which are also uniformly more powerful than the min test, and others which simply have a different power profile (e.g. are advantageous for small or large effect differences).

Simulations and asymptotic considerations will be used to investigate where and how much power is gained depending on the constellation of the therapeutic effects. Finally, the concept of monotonicity and its practical implications will be discussed.
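
The min test described in this abstract (two tests at level $\alpha$, rejection only if both are significant) can be sketched in a few lines. This is a minimal illustration, not the authors' proposed alternatives; the function name, the one-sided level 0.025 and the example p-values are assumptions made for the sketch.

```python
def min_test(p_vs_mono_a, p_vs_mono_b, alpha=0.025):
    """Min test sketch: declare the combination superior only if it beats
    BOTH monotherapies, i.e. the larger of the two p-values is <= alpha.
    Argument names and the default level are illustrative assumptions."""
    return max(p_vs_mono_a, p_vs_mono_b) <= alpha


# Hypothetical p-values for combination vs. each monotherapy.
both_significant = min_test(0.010, 0.020)   # both comparisons below alpha
one_significant = min_test(0.001, 0.040)    # second comparison fails
```

Because the decision depends only on the larger p-value, a very strong result against one monotherapy cannot compensate for a weak result against the other, which is one way to see why the test is conservative when the monotherapy effects are similar.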

Exact calculations of expected power for the Benjamini-Hochberg procedure
Deborah Glueck; Anis Karimpour-Fard, Lawrence Hunter, Jan Mandel and Keith E. Muller
University of Colorado at Denver and Health Sciences Center
We give exact analytic expressions for the expected power of the Benjamini and Hochberg procedure. We derive bounds for multiple dimensional rejection regions. We make assumptions about the number of hypotheses being tested, which null hypotheses are true, which are false, and the distributions of the test statistics under each null and alternative. This enables us to find the joint cumulative distribution function of the order statistics of the p-values, both under the null, and under the alternative. We thus have order statistics that arise from two sets of real-valued independent, but not necessarily identically distributed random variables. We show that the probability of each rejection region can be expressed as the probability that arbitrary subsets of order statistics fall in disjoint, ordered intervals, and that of the smallest statistics, a certain number come from one set. Finally, we express the joint probability distribution of the number of rejections and the number of false rejections by summing the appropriate probabilities over the rejection regions. The expected power is a simple function of this probability distribution. We give an example power analysis for a multiple comparisons problem in mammography.
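
For readers unfamiliar with the procedure whose power is analysed here, the following is a minimal sketch of the Benjamini-Hochberg step-up rule together with the observed proportion of false nulls rejected. It does not reproduce the exact analytic power expressions of the talk; the function names, the p-values and the set of false nulls are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the BH step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose ordered p-value is below its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])


def proportion_of_false_nulls_rejected(rejected, false_nulls):
    """Fraction of the false null hypotheses that were rejected."""
    return len(set(rejected) & set(false_nulls)) / len(false_nulls)


# Hypothetical p-values; hypotheses 0, 1 and 2 are assumed to be false nulls.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, q=0.05)
power_here = proportion_of_false_nulls_rejected(rejected, [0, 1, 2])
```

Averaging this proportion over the distribution of the p-values gives one common definition of expected power, which is the kind of quantity the abstract computes exactly rather than by simulation.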

Family-wise error on the directed acyclic graph of Gene Ontology
Jelle Goeman; Ulrich Mansmann
Leiden University Medical Center
Methods that test for differential expression of gene groups such as provided by the Gene Ontology database are becoming increasingly popular in the analysis of gene expression data. However, so far methods could not make use of the graph structure of Gene Ontology when adjusting for multiple testing.
We propose a multiple testing method, called the focus level procedure, that preserves the graph structure of Gene Ontology (GO) when testing for association of the expression profiles of GO terms with a response variable. The procedure is constructed as a combination of a Closed Testing procedure with Holm's method. It allows a user to choose a "focus level" in the GO graph, which reflects the level of specificity of terms in which the user is most interested. This choice also determines the level in the GO graph at which the procedure has most power. The procedure strongly keeps the family-wise error rate without any additional assumptions on the joint distribution of the test statistics used. We also present an algorithm to calculate multiplicity-adjusted p-values. Because the focus level procedure preserves the structure of the GO graph, it does not generally preserve the ordering of the raw p-values in the adjusted p-values.

Two-stage designs for proteomic and gene expression studies applying methods differing in costs
Alexandra Goll; Peter Bauer
Section of Medical Statistics - Medical University of Vienna, Austria
In gene expression and proteomic studies we generally deal with large numbers of hypotheses, where noticeable effects exist for only a small fraction of the hypotheses. Due to limited resources, the number of observations per hypothesis in a conventional single-stage design is low, which limits the power. It has been shown that two-stage pilot and integrated designs are a good option to improve the power. In these sequential designs, the first stage is used to screen for the promising hypotheses, which are further investigated in the second stage. Here we investigate more thoroughly this type of two-stage design where the costs per measurement and effect sizes differ between the first and second stage. To compare different designs we assume that the total costs of the experiment are fixed. Both integrated and pilot designs are based on procedures controlling either the family-wise type I error rate (FWE) or the false discovery rate (FDR). Two scenarios are considered. In the first scenario the experimenter may from the beginning have the choice between two methods that differ in costs and effect sizes (a low-cost standard method or a high-cost improved method). In the second scenario different costs per measurement may arise if the same method is applied at both stages but specific experimental devices have to be produced at higher costs per measurement for the selected markers at the second stage. For the first scenario we show that, depending on the cost and the effect size ratios between the methods, it is preferable to apply either the low-cost or the high-cost method at both stages. For the second scenario we show for which cost ratios between stages it is worthwhile to use (optimal) two-stage designs as compared to the single-stage design. Finally we also look at how design misspecifications in the planning phase would change the power of two-stage designs as compared to the single-stage design.

Adaptive Designs with Correlated Test Statistics
Heiko Götte; Andreas Faldum, Gerhard Hommel
Institute of Medical Biostatistics, Epidemiology and Informatics, Johannes Guten
In clinical trials the collected observations are often correlated, for example with clustered data or repeated measurements. When applying adaptive designs, test statistics of different stages are often also correlated in these situations, so that classical adaptive designs for uncorrelated test statistics (for example Bauer/Köhne, 1994) do not seem to be appropriate. Hommel et al. (2005) proposed the Modified Simes test for two-stage adaptive designs with correlated test statistics to handle this issue. For bivariate normally distributed test statistics the significance level can be preserved. Analogously to Shih/Quan (1999) we give the probability of a type one error for the Bauer-Köhne design in the situation of bivariate normally distributed test statistics in an explicit formula. We show that the significance level is inflated for positively correlated test statistics. The decision boundary for the second stage can be modified in a way that the type one error is controlled. The concept is expandable to other adaptive designs. The Modified Simes test is a special case. In order to use these designs the correlation between the test statistics has to be determined. For a repeated measurement setting we show how the correlation can be estimated within the framework of linear mixed models. The power of the Modified Simes test is compared with the power of the Bauer-Köhne design for this situation.

Bauer, P., Köhne, K. (1994). Evaluation of Experiments with Adaptive Interim Analyses. Biometrics, 50:1029-1041.
Hommel G., Lindig V., Faldum A.(2005). Two-stage adaptive designs with correlated test statistics. Journal of Biopharmaceutical Statistics, 15:613-623.
Shih W.J., Quan H. (1999). Planning and analysis of repeated measures at key time-points in clinical trials sponsored by pharmaceutical companies. Statistics in Medicine, 18:961-973.

This talk contains parts of the thesis of Heiko Götte.
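
As background to the Bauer-Köhne design discussed in this abstract, the sketch below shows the classical case of independent stage-wise p-values combined by Fisher's product criterion, without early stopping boundaries. It does not implement the modified decision boundary for correlated test statistics that the talk proposes. For independent uniform p-values, $P(p_1 p_2 \le c) = c(1 - \ln c)$, so the critical value solves $c(1 - \ln c) = \alpha$; the function names and example p-values are assumptions.

```python
import math


def fisher_product_critical_value(alpha, tol=1e-12):
    """Solve c * (1 - ln c) = alpha for c in (0, alpha) by bisection.
    The left-hand side is increasing in c on (0, 1)."""
    lo, hi = 0.0, alpha
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * (1 - math.log(mid)) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def product_test_reject(p1, p2, alpha=0.05):
    """Reject if the product of the two stage-wise p-values is small enough.
    Valid at level alpha only under independence (an assumption here)."""
    return p1 * p2 <= fisher_product_critical_value(alpha)


c_alpha = fisher_product_critical_value(0.05)   # approximately 0.0087
decision_a = product_test_reject(0.10, 0.05)    # product 0.005
decision_b = product_test_reject(0.20, 0.10)    # product 0.02
```

With positively correlated stage-wise statistics the product falls below this fixed boundary more often than under independence, which is the inflation the abstract addresses by modifying the second-stage boundary.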

A Bayesian screening method for determining if adverse events reported in a clinical trial are likely to be related to treatment
A Lawrence Gould
Merck Research Laboratories
Many different adverse events usually are reported in large-scale clinical trials. Most of the events will not have been identified a priori. Current analysis practice often applies Fisher's exact test to the usually relatively small event counts, with a conclusion of “safety” if the finding does not reach statistical significance. This practice has serious disadvantages: lack of significance does not mean lack of risk, the various tests are not adjusted for multiplicity, and the data determine which hypotheses are tested. This presentation describes a new approach that does not test hypotheses, is self-adjusting for multiplicity, and has well-defined diagnostic properties. The approach is a screening approach that uses Bayesian model selection techniques to determine for each adverse event the likelihood that the occurrence is treatment-related. The approach directly incorporates clinical judgment by having the criteria for treatment relation determined by the investigator(s). The method is developed for outcomes that arise from binomial distributions (relatively small trials) and for outcomes that arise from Poisson distributions (relatively large trials). The calculations are illustrated with trial outcomes.

Simultaneous confidence regions corresponding to Holm's stepdown multiple testing procedure
Olivier Guilbaud
AstraZeneca R&D, Sweden
The problem of finding simultaneous confidence regions corresponding to multiple testing procedures (MTPs) is of considerable practical importance. Such confidence regions provide more information than the mere rejections/acceptances of null hypotheses that can be made by MTPs. I will show how one can construct simultaneous confidence regions for a finite number of quantities of interest that correspond to Holm's (1979) step-down multiple-testing procedure. Holm's MTP is an important and widely used generalization of the Bonferroni MTP. Like the Bonferroni and Holm MTPs, the proposed confidence regions are quite flexible and generally valid. They are based on marginal confidence regions for the quantities of interest, and the only essential assumption for their validity is that the marginal confidence regions are valid. The estimated quantities, as well as the marginal confidence regions, can be of any kind or dimension. The proposed simultaneous confidence regions are of particular interest when one aims at confidence statements that will "show" that quantities belong to target regions of interest.
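
Holm's step-down procedure, to which the confidence regions of this talk correspond, can be sketched as follows. The sketch covers only the testing step, not the construction of the confidence regions; the function name and the p-values are illustrative assumptions.

```python
def holm(p_values, alpha=0.05):
    """Indices rejected by Holm's step-down procedure at level alpha.
    The smallest p-value is compared with alpha/m, the next with
    alpha/(m-1), and so on; testing stops at the first non-rejection."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = []
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            rejected.append(idx)
        else:
            break
    return sorted(rejected)


# Hypothetical p-values for four hypotheses.
p = [0.005, 0.015, 0.04, 0.03]
holm_rejected = holm(p, alpha=0.05)
bonferroni_rejected = [i for i, pv in enumerate(p) if pv <= 0.05 / len(p)]
```

In this example Holm's procedure rejects one hypothesis more than the Bonferroni rule at the same level, illustrating why confidence regions matched to Holm's procedure are of interest.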

Simultaneous Inference for Ratios
David Hare; John Spurrier
University of Louisiana at Monroe
Consider a general linear model with $p$-dimensional parameter vector $\beta$ and i.i.d. normal errors. Let $K_1, \ldots, K_k$, and $L$ be linearly independent vectors of constants such that $L^T\beta \neq 0$. We describe exact simultaneous tests for hypotheses that $K_i^T\beta / L^T\beta$ equal specified constants using one-sided and two-sided alternatives, and describe exact simultaneous confidence intervals for these ratios. In the case where the confidence set is a single bounded contiguous set, we describe what we claim are the best possible conservative simultaneous confidence intervals for these ratios - best in that they form the minimum $k$-dimensional hypercube enclosing the exact simultaneous confidence set. We show that in the case of $k = 2$, this “box” is defined by the minimum and maximum values for the two ratios in the simultaneous confidence set and that these values are obtained via one of two sources: either from the solutions to each of four systems of equations or at points along the boundary of the simultaneous confidence set where the correlation between two t variables is zero. We then verify that these intervals are narrower than those previously presented in the literature.

Screening for Partial Conjunction Hypotheses
Ruth Heller; Yoav Benjamini
Tel-Aviv University
We consider the problem of testing the partial conjunction null, which asks whether fewer than $u$ out of $n$ null hypotheses are false. It offers an in-between approach to the testing of the global null that all $n$ hypotheses are null, and the full conjunction null that not all of the $n$ hypotheses are false. We address the problem of testing many partial conjunction hypotheses simultaneously, a problem that arises when combining maps of p-values. We suggest powerful test statistics that are valid under dependence between the test statistics as well as under independence. We suggest controlling the false discovery rate (FDR) on the p-values for testing the partial conjunction hypotheses, and we prove that the BH FDR controlling procedure remains valid under various dependency structures. We apply the method to examples from microarray analysis and functional Magnetic Resonance Imaging (fMRI), two application areas where the need for partial conjunction analysis has been identified.
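
To make the partial conjunction null concrete, the sketch below computes one simple p-value for it: a Bonferroni-type bound based on the $u$-th smallest of the $n$ p-values. The abstract does not state which test statistics the authors use, so this particular choice, the function name and the example values are assumptions made purely for illustration.

```python
def partial_conjunction_pvalue(p_values, u):
    """Bonferroni-type p-value for H0: fewer than u of the n nulls are false.
    Uses (n - u + 1) times the u-th smallest p-value, capped at 1.
    This specific statistic is an illustrative assumption."""
    n = len(p_values)
    if not 1 <= u <= n:
        raise ValueError("u must be between 1 and n")
    ordered = sorted(p_values)
    return min(1.0, (n - u + 1) * ordered[u - 1])


# Hypothetical p-values for the same feature in n = 4 studies.
p = [0.001, 0.004, 0.03, 0.2]
p_global = partial_conjunction_pvalue(p, u=1)     # global null
p_at_least_2 = partial_conjunction_pvalue(p, u=2)
p_all_4 = partial_conjunction_pvalue(p, u=4)      # full conjunction null
```

Setting $u = 1$ recovers the Bonferroni test of the global null and $u = n$ reduces to the largest p-value, which shows how the partial conjunction sits between the two extremes described in the abstract.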

A unifying approach to non-inferiority, equivalence and superiority tests
Chihiro Hirotsu
Meisei University
Two multiple decision approaches are proposed for unifying the non-inferiority, equivalence and superiority tests in a comparative clinical trial of a new drug against an active control. One is a confidence set method with confidence coefficient 0.95 that improves on the consumer’s and producer’s risks of the usual naïve confidence interval approach. It requires the region to include 0 as well as to clear the non-inferiority margin, so that a trial with a somewhat large number of subjects intended to prove non-inferiority of a drug which is actually inferior should be unsuccessful.
The other is a closed testing procedure that combines the one- and two-sided tests by applying the partitioning principle and justifies the switching procedure unifying the non-inferiority, equivalence and superiority tests. In particular, regarding non-inferiority, the proposed method simultaneously justifies the old Japanese Statistical Guideline (one-sided 0.05 test) and the International Guideline (two-sided 0.05 test). The method is particularly attractive in that it grades the strength of the evidence for the relative efficacy of the test drug against a control at five levels according to the achievement of the clinical trial.
Key words: Bio-equivalence, closed testing procedure, confidence set, non-inferiority, partitioning principle, superiority.

Neglect of Multiplicity in Hypothesis Testing of Correlation Matrices
Burt Holland
Temple University
Many social science journals publish articles with correlation matrices accompanied by tests of significance that ignore multiplicity. A highly cited article in Psychological Methods recommended use of an MCP when testing correlations but promoted MCP procedures that are inapplicable to correlations. We discuss viable options for handling this problem.

Multiple comparisons for ratios to the grand mean
Ludwig A. Hothorn; G. Dilba
Leibniz Uni Hannover
Multiple comparison of differences to the grand mean is a well-known approach, commonly used in quality control; see the recent textbook on ANOM (analysis of means) by Nelson et al. (2005). Alternatively, we discuss multiple comparisons for ratios to the grand mean: multiple tests and simultaneous confidence intervals. The simultaneous confidence intervals generalize Fieller intervals by plugging the estimated correlations into the multivariate t distribution with an arbitrary correlation matrix. A related R program will be provided using the mvtnorm package by Hothorn et al. (2001).

The advantage of dimensionless confidence intervals will be demonstrated by examples for comparing several mutants or different varieties for multiple endpoints.

Hothorn T et al. (2001) On multivariate t and Gauss probabilities. R News 1 (2): 27-29.
Nelson PR et al. (2005) The analysis of means. SIAM.

To model or not to model
Jason Hsu; Violeta Calian, Dongmei Li
The Ohio State University
Re-sampling techniques are often used to estimate null distributions of test statistics in multiple testing. In the comparison of gene expression levels and in multiple endpoint problems, re-sampling is often used to take into account correlations among the observations. We describe how each of the re-sampling techniques (permutation of raw data, post-pivot of re-sampled test statistics, and re-sampling of pre-pivoted observations) has its own requirement of knowledge of the joint distribution of the test statistics for validity. Modeling is useful toward validating a re-sampling multiple testing technique. To the extent pre-pivot re-sampling is valid, for small samples it has some advantage of smoothness and stability of estimated null distributions.

Simultaneous confidence intervals by iteratively adjusted alpha for relative effects in the one-way layout
Thomas Jaki; Martin J. Wolfsegger
Lancaster University
A bootstrap-based method to construct $1-\alpha$ simultaneous confidence intervals for relative effects in the one-way layout is presented. This procedure takes the stochastic correlation between the test statistics into account and results in narrower simultaneous confidence intervals than the application of the Bonferroni correction. Instead of using the bootstrap distribution of a maximum statistic, the coverage of the confidence intervals for the individual comparisons is adjusted iteratively until the overall confidence level is reached. Empirical coverage and power estimates of the introduced procedure for many-to-one comparisons are presented and compared with asymptotic procedures based on the multivariate normal distribution.

Distribution Theory with Two Correlated Chi-Square Variables
Anwar H Joarder
King Fahd University of Petroleum & Minerals
Ratios of two independent chi-square variables are widely used in statistical tests of hypotheses. This paper introduces a new bivariate chi-square distribution where the variables are not necessarily independent. Moments of the product and ratio of two correlated chi-square variables are outlined. Distributions of the sum and product of two correlated chi-squares are also derived.

AMS Mathematics Subject Classification: 60E05, 60E10, 62E15

Key Words and Phrases: Chi-square distribution, Wishart distribution, product moments, Bivariate distribution, Correlation

On Multiple Treatment Effects in Adaptive Clinical Trials for Longitudinal Count data
Vandna Jowaheer; Brajendra C. Sutradhar
University of Mauritius
In longitudinal adaptive clinical trials it is an important research problem to compare more than two treatments for the purpose of treating the maximum number of patients with the best possible treatment. Recently, in the context of longitudinal adaptive clinical trials for count responses, Sutradhar and Jowaheer (2006) [SJ (2006)] introduced a simple longitudinal play-the-winner (SLPW) design for the treatment selection for an incoming patient and discussed a weighted generalized quasilikelihood (WGQL) approach for consistent and efficient estimation of the regression effects including the treatment effects. Their study, however, was confined to the comparison of two treatments. In this paper, we generalize their SLPW design for the two-treatment case to the multiple-treatment case. For the estimation of the treatment effects we provide a conditional WGQL (CWGQL) as well as an unconditional WGQL approach. Both approaches provide consistent and efficient estimates for the treatment effects, the CWGQL being simpler but slightly unstable as compared to the unconditional WGQL approach, where we use the limiting weights for the treatment selection. A normality-based asymptotic test for testing the equality of the treatment effects is also outlined.

Sequential genome-wide association studies for pharmacovigilance
Patrick Kelly
University of Reading, UK
Pharmacovigilance, the monitoring of adverse events, is an integral part of the clinical evaluation of a new drug. Until recently, attempts to relate the incidence of adverse events to putative causes have been restricted to the evaluation of simple demographic and environmental factors. The advent of large-scale genotyping, however, provides an opportunity to look for associations between adverse events and genetic markers, such as single nucleotide polymorphisms (SNPs). It is envisaged that a very large number of SNPs, possibly over 500,000, will be used in pharmacovigilance in an attempt to identify any genetic difference between patients who have experienced an adverse event and those who have not.

This paper presents a sequential genome-wide association test for analysing pharmacovigilance data as adverse events arise, allowing evidence-based decision-making at the earliest opportunity. This gives us the capability of quickly establishing whether there is a group of patients at high risk of an adverse event based upon their DNA. The method uses permutations and simulations in order to obtain valid hypothesis tests which are adjusted for both linkage disequilibrium and multiple testing. Permutations are used to calculate p-values because the asymptotic properties of the test statistic are unlikely to hold due to linkage disequilibrium. Simulations are used to find the required nominal significance level in order to satisfy some overall type I error rate. The simulations provide a simple and easy approach for obtaining a correction for the multiple testing without having to determine how the repeated tests are correlated.

Effects of dependence in high-dimensional multiple testing problems
Kyung In Kim; Mark A. van de Wiel
Eindhoven University of Technology
We consider effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems. Recent simulation studies considered only simple correlation structures among variables, which were hardly inspired by real data features. Our aim is to describe dependence as a network and systematically study effects of several network features like sparsity and correlation strength. We discuss a new method for efficient guided simulation of dependent data that satisfy the imposed network constraints. We use constrained random correlation matrices and perform extensive simulations under nested conditional independence structures. We check the robustness against dependence of several popular FDR procedures such as Benjamini-Hochberg FDR, Storey’s q-value, SAM and other resampling-based FDR procedures. False Non-discovery Rates and estimates of the number of null hypotheses are computed from those methods and compared. Our simulation studies show that popular methods such as SAM and the q-value seem to overestimate the nominal FDR significance level under dependence conditions. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be most robust and remains conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable.

A unified approach to proof of concept and dose estimation for categorical responses
Bernhard Klingenberg
Williams College
This talk proposes unifying dose-response modeling and target dose estimation into a single framework for the benefit of a more comprehensive and powerful analysis. Bretz, Pinheiro and Branson (Biometrics, 2006) recently implemented a similar idea for independent normal data by using optimal contrasts as a selection criterion among various candidate dose-response models. We suggest a framework in which, from a comprehensive set of candidate models, those that best pick up the dose-response signal are chosen. To decide which models, if any, significantly pick up the signal, we construct the permutation distribution of the maximum penalized deviance over the candidate set. This allows us to find critical values and multiplicity-adjusted p-values, controlling the error rate of declaring spurious signals as significant. A thorough evaluation and comparison of our approach to popular multiple contrast tests reveals that its power is as good or better in detecting a dose-response signal under a variety of situations, with several additional benefits: it incorporates model uncertainty in proof of concept decisions and target dose estimation, yields confidence intervals for target dose estimates, allows for adjustments due to covariates and extends to more complicated data structures. We illustrate our method with the analysis of a Phase II clinical trial.

On the use of conventional tests in flexible, multiple test designs
Franz Koenig; Peter Bauer, Werner Brannath
Medical University of Vienna
Flexible designs based on the closure principle offer a large amount of flexibility in clinical trials with control of the type I error rate. This allows the combination of trials from different clinical phases of a drug development process. Flexible designs have been criticized because they may lead to different weights for the patients from the different stages when reassessing sample sizes. Analyzing the data in a conventional way avoids such unequal weighting but may inflate the multiple type I error rate. In cases where the conditional type I error rate of the new design (and conventional analysis) is below the conditional type I error rate of the initial design, the conventional analysis may be done without inflating the type I error rate. This method will be used to explore switching between conventional designs for typical examples.

Gate-keeping testing without tears
David Li; Devan Mehrotra
Merck Research Labs
In a clinical trial, there are one or two primary endpoints, and a few secondary endpoints. When at least one primary endpoint achieves statistical significance, there is considerable interest in using results for the secondary endpoints to enhance characterization of the treatment effect. Because multiple endpoints are involved, regulators may require that the trial-wise type-I error rate be controlled at a pre-set level. This requirement can be achieved by using “gate keeping” methods. However, existing methods suffer from logical oddities such as allowing results for secondary endpoint(s) to impact the likelihood of success for the primary endpoint(s). We propose a novel and easy-to-implement gate-keeping procedure that is devoid of such deficiencies. Simulation results and real data examples are used to illustrate efficiency gains of our method relative to existing methods.

Exact simultaneous confidence bands for multiple linear regression over an ellipsoidal region
Shan Lin; Wei Liu
University of Southampton, S3RI
A simultaneous confidence band provides useful information on the whereabouts of the true regression function. Construction of simultaneous confidence bands has a history going back to Working and Hotelling (1929) and is a hard problem when the predictor space is restricted to some region and the number of regression covariates is more than one. This talk gives the construction of exact one-sided and two-sided simultaneous confidence bands for a multiple linear regression model over an ellipsoidal region that is centered at the point of the means of the predictor variables in the experiment, based on three methods, i.e., the method of Bohrer (1973), the algebraical method and the tubular neighborhood method. It is also of interest to show that these three methods give the same result.

Testing Procedures on Comparisons of Several Treatments with one Control in a Microarray Setting
Dan Lin; Ziv Shkedy, Tomasz Burzykowski, Hinrich W.H. Göhlmann, An De Bondt, Tim Perera,
Center for Statistics, Hasselt University
We discuss a particular situation in a microarray experiment; when two dimensional multiple testing occurs because of comparing several treatments with a control at one hand and testing tens of thousands of genes simultaneously at the other hand. Dunnett’s single step procedure (Dunnett 1995) for testing effective treatments can be used to address one dimensional question of primary interes. Dunnett’s procedure was implemented within resampling-based algorithms such as Significance Analysis of Microarray (SAM, Tusher et al. 2001) and Benjamini and Hochberg False Discovery Rate (FDR, Benjamini and Hochberg 1995). To combine the two-dimensional testing problem into one testing procedure, we proposed an approach to test for m*K (number of genes*number of comparisons between several treatments with the control) tests simultaneously. We compared the performance of SAM and the classical BH-FDR. The method was applied to a microarray experiment with 4 treatment groups (3 microarrays in each group) and 16998 genes. Additionally a simulation study was conducted to investigate the power of the methods proposed and to investigate how to choose the fudge factor in SAM to leverage the genes with small variances.

Keywords: Dunnett’s single step procedure; microarray; multiple testing; Benjamini and Hochberg false discovery rate (BH-FDR); SAM.
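
As an illustration of testing all m*K hypotheses in one family, the following is a minimal sketch of the classical Benjamini and Hochberg step-up rule applied to a pooled genes-by-comparisons matrix of p-values. It assumes the p-values have already been computed; it does not reproduce the resampling-based implementation or the Dunnett adjustment described in the abstract, and the function names and toy numbers are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses BH rejects at level q."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) with p_(r) <= r * q / n.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            cutoff = rank
    rejected = [False] * n
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


def reject_gene_by_comparison(p_matrix, q=0.05):
    """Pool an m x K matrix of p-values (genes x comparisons) into one BH family."""
    k = len(p_matrix[0])
    flat = [p for row in p_matrix for p in row]
    flags = benjamini_hochberg(flat, q)
    # Reshape the flat decisions back into one row per gene.
    return [flags[g * k:(g + 1) * k] for g in range(len(p_matrix))]


# Toy input (made-up numbers): 3 genes, 2 treatment-vs-control comparisons each.
p_matrix = [[0.001, 0.20], [0.012, 0.03], [0.60, 0.75]]
decisions = reject_gene_by_comparison(p_matrix, q=0.05)
print(decisions)
```

Pooling treats every gene-by-comparison pair as one hypothesis, so the FDR statement refers to the whole m*K family rather than to genes or comparisons separately.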

A New Hypothesis to Test Minimal Fold Changes of Gene Expression Levels
Jen-pei Liu; Chen-Tuo Liao, Jia-Yan Dai
Division of Biometry, Department of Agronomy, National Taiwan University
Current approaches to identifying differentially expressed genes are based either on the fold changes or on the traditional hypotheses of equality. However, the fold changes do not take into consideration the variation in estimation of the average expression. In addition, the use of fold changes is not in the frame of hypothesis testing, and hence the probability associated with errors in identifying differentially expressed genes cannot be quantified and evaluated. On the other hand, the traditional hypothesis of equality fails to take into consideration the magnitudes of the biologically meaningful fold changes that truly differentiate the expression levels of genes between groups. Because of the large number of genes tested and the small number of samples available for microarray experiments, the false positive rate for differentially expressed genes is quite high and requires further adjustments such as the Bonferroni method, the false discovery rate, or the use of an arbitrary cutoff for the p-values. None of these adjustments has any biological justification. Hence, we propose to formulate the hypothesis of identifying the differentially expressed genes as an interval hypothesis by considering both the minimal biologically meaningful fold changes and statistical significance simultaneously. Based on the interval hypothesis, a two one-sided tests procedure is proposed with a method for sample size determination. A simulation study is conducted to empirically compare the type I error rate and power of the traditional hypothesis among the two-sample t-test, the two-sample t-test with Bonferroni adjustment, the fold-change rule, the combination of the two-sample t-test and the fold-change rule, and the proposed two one-sided tests procedure under various combinations of fold changes, variability and sample sizes.
Simulation results show that the proposed two one-sided tests procedure based on the interval hypothesis not only controls the type I error rate at the nominal level but also provides sufficient power to detect differentially expressed genes. Numeric data from public domains illustrate the proposed methods.

Key words: Interval hypothesis, Type I error, Power, Fold change
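
One plausible reading of the interval hypothesis is H0: |mu_x - mu_y| <= delta against H1: |mu_x - mu_y| > delta on the log-expression scale, with delta the minimal biologically meaningful (log) fold change. The sketch below is a hedged illustration of a two one-sided tests rule under that reading. The exact statistics and critical values of the proposed procedure are not given in the abstract, so the Welch-type standard error and the normal quantile (used here in place of a t quantile) are assumptions, as are the function name and the toy data.

```python
import math
from statistics import NormalDist, mean, variance


def interval_hypothesis_test(x, y, delta, alpha=0.05):
    """Reject H0: |mu_x - mu_y| <= delta when either one-sided test is significant."""
    diff = mean(x) - mean(y)
    # Unpooled standard error of the difference in means (assumption).
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    # Assumption: large-sample normal critical value instead of a t quantile.
    crit = NormalDist().inv_cdf(1 - alpha)
    t_upper = (diff - delta) / se  # evidence that the difference exceeds +delta
    t_lower = (diff + delta) / se  # evidence that the difference is below -delta
    return t_upper > crit or t_lower < -crit


# Made-up log2 expression values; delta = 1 corresponds to a two-fold change.
control = [1.0, 1.2, 0.9, 1.1]
large_change = [3.1, 2.9, 3.3, 3.0]
small_change = [1.4, 1.6, 1.5, 1.7]
print(interval_hypothesis_test(large_change, control, delta=1.0))
print(interval_hypothesis_test(small_change, control, delta=1.0))
```

In the second call the difference in means is clearly nonzero but smaller than delta, so a test of equality could flag the gene while the interval hypothesis does not.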

Minimum area confidence set optimality for confidence bands in simple linear regression
Wei Liu; A. J. Hayter
S3RI and School of Maths
The average width of a simultaneous confidence band has been used by several authors (e.g. Naiman, 1983, 1984, Piegorsch, 1985a) as a criterion for the comparison of different confidence bands. In this paper, the area of the confidence set corresponding to a confidence band is used as a new criterion. For simple linear regression, comparisons have been carried out under this new criterion between hyperbolic bands, two-segment bands, and three-segment bands, which include constant width bands as special cases. It is found that if one requires a confidence band over the whole range of the covariate, then the best confidence band is given by the Working & Hotelling hyperbolic band. Furthermore, if one needs a confidence band over a finite interval of the covariate, then a restricted hyperbolic band can again be recommended, although a three-segment band may be very slightly superior in certain cases.
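
For readers unfamiliar with the hyperbolic band, the following minimal sketch computes the standard Working and Hotelling band for simple linear regression over the whole covariate range, fit(x) +/- sqrt(2 F) * s * sqrt(1/n + (x - xbar)^2 / Sxx). It illustrates the band only, not the area criterion studied in the talk. The helper names and data are hypothetical, and the F quantile uses a closed form that holds only when the numerator degrees of freedom equal 2.

```python
import math


def f_upper_quantile_2df(alpha, nu):
    """Upper-alpha quantile of F(2, nu); closed form valid only for numerator df = 2."""
    return (nu / 2.0) * (alpha ** (-2.0 / nu) - 1.0)


def working_hotelling_band(xs, ys, alpha=0.05):
    """Return a function x -> (lower, upper) for the hyperbolic band."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))  # residual standard deviation
    w = math.sqrt(2.0 * f_upper_quantile_2df(alpha, n - 2))

    def band(x):
        fit = intercept + slope * x
        half = w * s * math.sqrt(1.0 / n + (x - x_bar) ** 2 / sxx)
        return fit - half, fit + half

    return band


# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
band = working_hotelling_band(xs, ys)
print(band(3.0), band(10.0))
```

The band is narrowest at the mean of the covariate and widens hyperbolically away from it, which is why restricting the covariate range changes which band shape is preferable.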

A Bayesian Spatial Mixture Model for FMRI Analysis
Brent Logan; Maya P. Geliazkova, Daniel B. Rowe, Prakash W. Laud
Medical College of Wisconsin
One common objective of fMRI studies is to identify voxels or points in the brain, which are activated by a neurocognitive task. This is an important multiple comparisons problem, since typically inference (often using z- or t- tests) is performed on each of thousands or hundreds of thousands of voxels. The false discovery rate has been studied for use in this problem by several authors. Finite mixture models have also been proposed to address the multiplicity issue, where voxels are classified according to being activated or not activated by the cognitive task. Links between the false discovery rate and mixture models have been shown in the literature. One limitation to these procedures is that activation is typically expected to occur in clusters of neighboring voxels rather than in isolated single voxels; methods which do not account for this may have lower sensitivity to activation. We propose a Bayesian spatial mixture model to address these issues. Each voxel has an unknown or latent activation status, denoted by a binary activation variable. The spatial model for the binary activation indicators is induced by a latent Gaussian spatial process (a conditional autoregressive, or CAR, model), thresholded to produce the binary activation, analogous to a spatial probit model. An efficient Gibbs sampling algorithm is developed to implement the model, yielding posterior probabilities of activation for each voxel, conditional on the observed data. We apply this method to a real fMRI study, and compare its performance in simulation with other methods proposed for fMRI analysis.

Multiplicity-corrected, nonparametric tolerance regions for cardiac ECG features
Gheorghe Luta; S. Stanley Young, Alex Dmitrienko
National Institute of Statistical Sciences, USA
Electrocardiograms are used to evaluate possible effects on the heart induced by drug candidates. These waveforms are quite complex, and many numerical features of these waveforms are extracted for statistical evaluation. In addition, various covariates (heart rate, gender, age, etc.) also need to be taken into account. The multiplicity of questions under consideration must be addressed. Our idea is to combine two statistical methodologies, nonparametric tolerance regions and resampling-based multiple testing correction. We will review electrocardiograms and their standard numerical characteristics, and place this work into the framework of drug evaluation clinical trials. Using real data, we will show how nonparametric tolerance regions can be used with resampling multiplicity adjustments. The product of this strategy will be tolerance regions that adapt to the shape of the observed distributions and control of the family-wise error rate over the clinical trial.

Adaptive Design in Dose Ranging Studies Based on Both Efficacy and Safety Responses
Olga Marchenko; Prof. R. Keener, University of Michigan, Ann Arbor
i3 Statprobe, Inc
Traditionally, most designs for Phase I studies gather safety information, aiming to determine the maximum tolerated dose (MTD). Phase II designs then evaluate the efficacy of doses in the range of (assumed) acceptable toxicity. It is highly desirable for many reasons to base the dose selection on efficacy and safety responses simultaneously. Recently, several designs for dose selection have been proposed that are based on both efficacy and safety (e.g., Thall and Cook (2004), Fedorov and Dragalin (2006), Zhang et al. (2006)). While a majority of designs identify an appropriate, safe and efficacious dose or doses with some precision, few of them gain sufficient information on all doses in the range studied. In this talk, I will show how a flexible, adaptive, model-based design proposed by V. Fedorov and V. Dragalin can be implemented and changed as appropriate by studying simulations similar to three case studies with different desirable responses from several therapeutic areas.

Estimation in Adaptive Group Sequential Design
Cyrus Mehta; Werner Brannath, Martin Posch
Cytel Inc.
This paper proposes two methods for computing confidence intervals with exact or conservative coverage following a group sequential test in which an adaptive design change is made one or more times over the course of the trial. The key idea, due to Müller and Schäfer (2001), is that by preserving the null conditional rejection probability of the remainder of the trial at the time of each adaptive change, the overall type I error, taken unconditionally over all possible design modifications, is also preserved. This idea is further extended by considering the dual tests of repeated confidence intervals (Jennison and Turnbull, 1989) and of stage-wise adjusted confidence intervals (Tsiatis, Rosner and Mehta, 1984). The method extends to the computation of median unbiased point estimates.
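
The conditional rejection probability idea can be sketched for the simplest special case: a one-sided fixed-sample z-test with a single interim look, rather than the general group sequential setting of the paper. Under the null hypothesis the final statistic is sqrt(t) * z1 + sqrt(1 - t) * Z2 with Z2 standard normal, so the conditional rejection probability at information fraction t is 1 - Phi((c - sqrt(t) * z1) / sqrt(1 - t)). The function names and the design values below are assumptions made for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_rejection_probability(z_interim, info_fraction, alpha=0.025):
    """Null probability of final rejection given the interim z-statistic."""
    crit = STD_NORMAL.inv_cdf(1 - alpha)  # final critical value c of the original design
    num = crit - math.sqrt(info_fraction) * z_interim
    return 1 - STD_NORMAL.cdf(num / math.sqrt(1 - info_fraction))


def redesigned_stage_rejects(z_new_stage, crp):
    """After a design change, test the new-stage data at conditional level crp."""
    return z_new_stage >= STD_NORMAL.inv_cdf(1 - crp)


crp_neutral = conditional_rejection_probability(0.0, 0.5)
crp_promising = conditional_rejection_probability(2.0, 0.5)
print(crp_neutral, crp_promising)

# Averaging the conditional level over the null distribution of the interim
# statistic should recover the unconditional level alpha (simple Riemann sum).
step = 0.01
grid = [-8 + step * i for i in range(1601)]
average_level = sum(
    STD_NORMAL.pdf(z) * conditional_rejection_probability(z, 0.5) * step for z in grid
)
print(average_level)
```

Because any redesigned remainder of the trial is tested at the conditional level, the average over interim outcomes stays at alpha regardless of how the remainder is modified.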

Estimating the interesting part of a dose-effect curve: When is a Bayesian adaptive design useful?
Frank Miller
AstraZeneca, Södertälje, Sweden
We consider the design for dose-finding trials in phase IIB of drug development. We propose that “estimating the interesting part of the dose-effect curve” is an important objective of such trials. This objective will be made more concrete and formulated in statistical terms in the talk. Having defined the objective, we can apply optimal design theory to derive efficient designs. Due to our objective, we use a customized optimality criterion and not a common optimality criterion like D-optimality. We specify both an optimal fixed design (without adaptation) and a two-stage Bayesian adaptive design. The efficiencies of these two designs are compared for several situations. We describe typical situations where you can gain efficiency from using an adaptive design but also situations where it might be better with a fixed design. Briefly, we discuss modifications of the considered adaptive design and potential advantages of these.

The multiple confidence procedure and its applications
Tetsuhisa Miwa
National Institute for Agro-Environmental Sciences
In 1973 Takeuchi proposed a multiple confidence procedure for multiple decision problems in his book “Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis” (in Japanese). This procedure is based on the partition of the parameter space. Therefore it is closely related to the recent development of the partitioning principles. In our talk we first review the basic concepts of Takeuchi's multiple confidence procedure. Then we discuss some applications and show the usefulness of the procedure.

Estimating the proportion of true null hypotheses with the method of moments
Jose Maria Muino; P. Krajewski
Instytut Genetyki Roslin PAN
In order to construct the critical region for the test statistic in a multiple hypothesis testing situation, it is necessary to obtain some information about the distribution of the test statistic under the null hypothesis and under the alternative, and to use this information in an optimal way to assess which tests can be declared significant. We propose to obtain this information, in the form of the moments of these distributions and the proportion of true null hypotheses ($\pi_0$), with the method of moments. As a particular case, we study the properties of the estimator of $\pi_0$ when the test statistic is the mean value, and we construct a new asymptotically unbiased (as the number of tests goes to infinity) estimator. Some numerical simulations are performed to compare the proposed method with others.
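
To make the moment-matching idea concrete, here is a minimal sketch under a deliberately simple model that is an assumption of this example, not the estimator of the talk: each test statistic is N(0, 1) under the null and N(mu, 1) under a single common alternative. Then the first two moments are m1 = (1 - pi0) * mu and m2 = 1 + (1 - pi0) * mu^2, which gives 1 - pi0 = m1^2 / (m2 - 1).

```python
import random


def pi0_method_of_moments(stats):
    """Moment estimate of pi0 under the two-component unit-variance normal model."""
    n = len(stats)
    m1 = sum(stats) / n
    m2 = sum(s * s for s in stats) / n
    excess = m2 - 1.0  # second moment beyond what the null alone explains
    if excess <= 0 or m1 == 0:
        return 1.0  # no evidence of any alternatives
    pi1 = m1 * m1 / excess
    # Clip to the valid range, since sample moments are noisy.
    return min(1.0, max(0.0, 1.0 - pi1))


# Simulated statistics: 80% true nulls, alternatives centered at mu = 3.
rng = random.Random(0)
true_pi0, mu, n_tests = 0.8, 3.0, 20000
stats = [rng.gauss(0.0 if rng.random() < true_pi0 else mu, 1.0) for _ in range(n_tests)]
estimate = pi0_method_of_moments(stats)
print(estimate)
```

The estimate is only as good as the assumed model; with heterogeneous alternative means, more moments or a different parameterization would be needed.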

Hans-Helge Müller; Nina Timmesfeld
Institute of Medical Biometry and Epidemiology, Philipps-University of Marburg
Consider the statistical monitoring of a clinical trial comparing two treatments where the confirmatory analysis is based on a carefully planned group sequential design. We consider the Brownian motion model with the drift parameter reflecting the treatment difference. Suppose that during the course of the trial a change of the group sequential design becomes advisable, but that the effect size parameter measuring the treatment difference can be retained unchanged.
In order to control the type I error rate, it is necessary and sufficient to redesign the trial on the basis of the Conditional Rejection Probability (CRP) principle proposed by Müller and Schäfer (2004). In addition to decision making in a hypothesis testing paradigm, estimation of the effect size parameter with a confidence set is an important issue at the end of the trial.
Following a group sequential trial, the simple fixed sample confidence intervals are inadequate. Methods for the construction of confidence intervals reflecting early stopping for both significance and futility have been proposed, e.g. the confidence intervals of Tsiatis et al. (1984).
Starting with a valid concept of estimation of confidence sets in group sequential testing, this contribution shows how to construct confidence sets following a modified design using the flexible CRP approach. The application in clinical trials is illustrated for a survival study using the method of Tsiatis et al. The method of transformation is discussed regarding the choice of group sequential confidence sets.

Müller HH, Schäfer H. A general statistical principle for changing a design any time during the course of a trial. Statistics in Medicine 2004; 23: 2497-2508.
Tsiatis AA, Rosner GL, Mehta CR. Exact confidence intervals following a group sequential test. Biometrics 1984; 40:797-803.

On the conservatism of the multivariate Tukey-Kramer procedure
Takahiro Nishiyama; Takashi Seo
Tokyo University of Science
We consider conservative simultaneous confidence intervals for pairwise comparisons among mean vectors in multivariate normal distributions. The multivariate Tukey-Kramer procedure, the multivariate version of the Tukey-Kramer procedure, is presented. An affirmative proof of the multivariate version of the generalized Tukey conjecture on the conservativeness of the simultaneous confidence intervals for pairwise comparisons of four mean vectors is also presented.
Further, an upper bound for the conservativeness of the multivariate Tukey-Kramer procedure is given in the case of four mean vectors. Finally, numerical results from Monte Carlo simulations are given.

Scheffé-type multiple comparison procedure in order restricted randomized designs
Omer Ozturk; Steve MacEachern
The Ohio State University
Ozturk and MacEachern (2004) introduced a new design, the order restricted randomized design (ORRD), for the contrast parameters in a linear model. This new design uses a restricted randomization scheme that relies on subjective judgment ranking of the experimental units based on their inherent heterogeneity (or homogeneity). The process of judgment ranking creates a positive correlation structure among within-set units, and the restricted randomization on these ranked units translates this positive correlation into a negative one when estimating a contrast. Hence, the design serves as a variance reduction technique for treatment contrasts.

In this talk, we first develop a test for the generalized linear hypothesis based on an ORRD and discuss how this test can be used to test the treatment effects. We then develop a Scheffé-type multiple comparison procedure for all possible contrasts of the treatment effects. We show that the coefficients of contrasts depend on the design matrix and the underlying covariance structure of the judgment ranked observations. A simulation study shows that the multiple comparison procedure is robust against a wide range of underlying distributions.

Stepwise confidence intervals for monotone dose-response studies
Jianan Peng; Chu-In Charles Lee, Karolyn Davis
Acadia University
In dose-response studies, one of the most important issues is the identification of the minimum effective dose (MED), where the MED is defined as the lowest dose such that the mean response is better than the mean response of a zero-dose control by a clinically significant difference. Usually the dose-response curves are monotonic. Various authors have proposed step-down test procedures based on contrasts among the sample means to find the MED. In this paper, we improve Marcus and Peritz's method (1976, Journal of the Royal Statistical Society, Series B, Vol 38, 157-165) and combine it with Hsu and Berger's DR method (1999, Journal of the American Statistical Association, Vol 94, 468-482) to construct the lower confidence bound for the difference between the mean response of any non-zero dose level and that of the control under the monotonicity assumption, in order to identify the MED. The proposed method is illustrated by numerical examples, and simulation studies on power comparisons are presented.

Detecting differential expression in microarray data: Outperforming the Optimal Discovery Procedure
Alexander Ploner; Elena Perelman, Stefano Calza, Yudi Pawitan
Karolinska Institutet
The identification of differentially expressed genes among the tens of thousands of sequences measured by modern microarrays presents an obvious and serious multiplicity problem. The central role of gene expression data in molecular biology has stimulated much research in addressing this issue over the last decade; an important result of that research is the Optimal Discovery Procedure (ODP) proposed by John Storey, which generalizes the likelihood ratio test statistic of the Neyman-Pearson lemma for multiple parallel hypotheses, and which can be shown to be optimal in the sense that for any fixed number of false positive results, ODP will identify the maximum number of true positives [1].

However, the optimality result derived in [1] assumes exact knowledge of a large number of nuisance parameters that have to be estimated for any realistic application. In our talk, we will demonstrate that the practical implementation of ODP described in [2] is less powerful than a variant of the local false discovery rate we have proposed recently, which uses the distribution of the same nuisance parameters to weight conventional t-statistics [3]. We also show how a combination of the ODP test statistic with our weighting scheme can even further improve the power to detect differentially expressed genes at controlled levels of false discovery.

[1] Storey JD: The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing. UW Biostatistics Working Paper Series 2005, Working Paper 259.
[2] Storey JD, Dai JY, Leek JT: The Optimal Discovery Procedure for Large-Scale Significance Testing, with Applications to Comparative Microarray Experiments. UW Biostatistics Working Paper Series 2005, Working Paper 260.
[3] Ploner A, Calza S, Gusnanto A, Pawitan Y: Multidimensional local false discovery rate for microarray studies. Bioinformatics 2006, 22(5):556–565.

Repeated significance tests controlling the False Discovery Rate
Martin Posch; Sonja Zehetmayer, Peter Bauer
Medical University of Vienna
When testing a single hypothesis repeatedly at several interim analyses, adjusted significance levels have to be applied at each interim look to control the overall Type I Error rate. There is a rich literature on such group sequential trials investigating the choice and computation of adjusted critical values. Surprisingly, if a large number of hypotheses are tested controlling the False Discovery Rate (a frequently used error criterion for large scale multiple testing problems), we can show that under quite general conditions no adjustment of the critical value for multiple interim looks is necessary. This holds asymptotically (for a large number of hypotheses) under all scenarios but the global null hypothesis where all null hypotheses are true. Similar results are given for a procedure controlling the per-comparison error rate.

Involving biological information for weighing statistical error under multiple testing
Anat Reiner-Benaim
Stanford University
Given a multiple testing problem, each hypothesis may be associated with some prior information, which is related to the structure of the data and its scientific basis. This information may be unique to each hypothesis, and therefore, when estimating the overall statistical error, treating the hypotheses as having the same null distributions may lead to biased results. Using the prior information to weight the null hypotheses can improve the error estimate and may offer a less conservative controlling procedure.
The emphasis of the talk will be on the use of biological data as prior information. For instance, the machinery of genetic regulation is subject to probabilistic factors. Regulation happens when a transcription factor binds to a site on the gene. Since the match level between the two is not perfect and can vary within a wide range, it can be incorporated into the error estimation as hypothesis weights.
The effect of the weights on the error estimate will be presented, given the method of computing the weights, the pattern of the weight structure and the type of error controlled. Two approaches to controlling the False Discovery Rate (FDR) with weights are compared: empirical Bayes per-hypothesis FDR estimation, and weighting the p-values to control the overall FDR.

Two new adaptive multiple testing procedures.
Etienne Roquain; Gilles Blanchard
MIG- INRA Jouy-en-Josas
The proportion $\pi_0$ of true null hypotheses is a quantity that often appears explicitly in the FDR control bounds. Recent research effort has focussed on finding ways to estimate this quantity and incorporate it in a meaningful way in a multiple testing procedure, leading to so-called "adaptive" procedures.
We present here two new adaptive step-up multiple testing procedures:

- The first procedure that we present is a one-stage step-up procedure. We prove that it has a correct (and strong) FDR control given that the test statistics are independent. If the set of rejections is not too large (typically less than 50%), this procedure is less conservative than the so-called "two-stage linear step-up procedure" of Benjamini, Krieger and Yekutieli (2006). Moreover, preliminary simulations show that this new procedure seems to still have a correct FDR control when the test statistics are positively correlated.

- The second procedure that we present is a two-stage step-up procedure. We prove that it has a correct (and strong) FDR control in the "distribution free" context. Because the techniques used in the distribution free context are inevitably less precise, this new adaptive procedure is more conservative than those built under independence. However, it will be relevant if we expect a "large" proportion of rejected hypotheses (typically more than 50%).

Procedures Controlling Generalized False Discovery Rate
Sanat Sarkar; Wenge Guo
Temple University
Procedures controlling error rates measuring at least k false rejections, instead of at least one, can potentially increase the ability of a procedure to detect false null hypotheses in situations where one is willing to tolerate a few false rejections and seeks to control k or more of them. The k-FWER, which is the probability of at least k false rejections and generalizes the usual familywise error rate (FWER), is such an error rate; it was recently introduced in the literature, and procedures controlling it have been proposed. An alternative and less conservative notion of error rate, the k-FDR, which is the expected proportion of k or more false rejections among all rejections and generalizes the usual notion of false discovery rate (FDR), will be introduced in this talk. Procedures with k-FDR control that dominate the Benjamini-Hochberg stepup FDR procedure and its stepdown analog under independence or positive dependence, and the Benjamini-Yekutieli stepup FDR procedure under any form of dependence, will be presented.
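
The two error measures can be made concrete with a small sketch that evaluates them for a single realized set of decisions. It assumes the reading that the k-FDR loss equals V/R when the number of false rejections V is at least k and zero otherwise; the k-FWER and k-FDR themselves are the expectations of these quantities over repeated experiments. Function names and the toy decisions are hypothetical.

```python
def count_false_rejections(rejected, is_true_null):
    """Number V of rejected hypotheses whose null is actually true."""
    return sum(1 for r, h0 in zip(rejected, is_true_null) if r and h0)


def k_fwer_event(rejected, is_true_null, k=1):
    """Indicator of the event 'at least k false rejections'."""
    return count_false_rejections(rejected, is_true_null) >= k


def k_fdp(rejected, is_true_null, k=1):
    """Proportion of false rejections, counted only when there are at least k of them."""
    v = count_false_rejections(rejected, is_true_null)
    r_total = sum(1 for r in rejected if r)
    if r_total == 0 or v < k:
        return 0.0
    return v / r_total


# Toy decisions: four rejections, one of which is a true null.
rejected = [True, True, True, True, False]
is_true_null = [True, False, False, False, True]
print(k_fdp(rejected, is_true_null, k=1), k_fdp(rejected, is_true_null, k=2))
```

With k = 1 the loss is the usual false discovery proportion; with k = 2 the single false rejection is tolerated and the loss is zero, which is the sense in which the k-version is less conservative.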

André Scherag; Helmut Schäfer, Hans-Helge Müller
Institute of Medical Biometry and Epidemiology, Philipps-University of Marburg
Genome-wide association studies have been suggested to unravel the genetic etiology of complex human diseases [1]. Typically, these studies employ a multi-stage plan to increase cost-efficiency. A large panel of markers is examined in a subsample of subjects, and the most promising markers will also be genotyped in the remaining subjects.
Until now, all proposed designs require adherence to formal statistical rules, which may not always meet the practical necessities of ongoing genetic research. In practice, investigators may, for example, wish to base the genetic marker selection on criteria other than formal statistical thresholds.
In this talk we describe an algorithm that enables various design modifications at any time during the course of the study. Using the Conditional Rejection Probability approach [2], the family-wise type I error rate is strongly controlled. The algorithm can deal with an extremely large number of hypothesis tests while requiring very limited computational resources. This algorithm is evaluated by simulations. Furthermore, we present a real data application.

[1] Freimer NB, Sabatti C. Human genetics: variants in common diseases. Nature. 2007 Feb 22;445(7130):828-30.

[2] Müller HH, Schäfer H. A general statistical principle for changing a design any time during the course of a trial. Stat Med. 2004 Aug 30;23(16):2497-508.

A test procedure for random degeneration of paired rank lists
Michael G. Schimek; Peter Hall, Eva Budinska
Medizinische Universität Graz
Let us assume two assessors (e.g. laboratories), at least one of which ranks N distinct objects according to the extent to which a particular attribute is present. The ranking is from 1 to N, without ties. In particular we are interested in the following situations: (i) The second assessor assigns each object to the one or the other of two categories (0-1-decision assuming a certain proportion of ones). (ii) The second assessor also ranks the objects from 1 to N. An indicator variable takes I_j=1 if the ranking given by the second assessor to the object ranked j by the first is not distant more than m, say, from j, and zero otherwise. For both situations our goal is to determine how far into the two rankings one can go before the differences between them degenerate into noise. This allows us to identify a sequence of objects that is characterized by a high degree of assignment conformity.
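
The indicator sequence for situation (ii) above can be sketched as follows. The input convention (a list whose position j holds the second assessor's rank for the object that the first assessor ranked j) and the example rankings are assumptions made for illustration; the sketch covers only the construction of I_j, not the inference on the point of degeneration.

```python
def conformity_indicators(rank_by_second, m):
    """I_j = 1 if the second assessor's rank is within m of the first assessor's rank j."""
    return [
        1 if abs(rank - j) <= m else 0
        for j, rank in enumerate(rank_by_second, start=1)
    ]


# Made-up example with N = 10 objects, listed in the first assessor's order.
second_ranks = [2, 1, 3, 6, 4, 9, 5, 10, 7, 8]
indicators = conformity_indicators(second_ranks, m=1)
print(indicators)
```

In this toy example the ones are concentrated at the top of the list and then give way to zeros, which is the pattern the procedure looks for when locating the point where agreement degenerates into noise.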

For the estimation of the point of degeneration into noise we assume independent Bernoulli random variables. Under the condition of a general decrease of p_j for increasing j, a formal inference model is developed based on moderate deviation arguments implicit in the work of Donoho et al. (1995, JRSS, Ser. B 57, 301-369). This idealized model is translated into an algorithm that allows adjustment for irregular rankings (i.e. handling of quite different rankings of some objects) typically occurring in real data. A regularization parameter needs to be specified to account for the closeness of the assessors' rankings and the degree of randomness in the assignments. Our approach can be generalized to the case of more than two assessors.

The class of problems we try to solve has various bioinformatics applications, for instance in the meta-analysis of gene expression studies and in the identification of microRNA targets in protein coding genes.

Comparing multiple tests for separating populations
Juliet Shaffer
University of California at Berkeley
Most studies for comparing multiple test procedures for finding differences among populations concentrate on the number of true and false differences that are significant, the former as a measure of power, the latter or a combination of both in various forms as a measure of error. For researchers, the configuration of results, e.g. the extent to which they divide populations into nonoverlapping classes, may be as important as or more important than the actual numbers. Results that lead to separations of populations into groups, when accurate, are especially useful. The talk will discuss some new measures of such separability and compare different multiple testing methods on these measures.

An Exact Test for Umbrella Ordered Alternatives of Location Parameters: the Exponential Distribution Case
Parminder Singh
Guru Nanak Dev University, Amritsar
A new procedure is proposed for testing the null hypothesis $H_0: \theta_1 = \cdots = \theta_k$ against the umbrella ordered alternative $H_1: \theta_1 \le \cdots \le \theta_h \ge \cdots \ge \theta_k$ with at least one strict inequality, where $\theta_i$ is the location parameter of the $i$th two-parameter exponential distribution, $i = 1, \ldots, k$. Exact critical constants are computed using a recursive integration algorithm. Tables containing these critical constants are provided to facilitate the implementation of the proposed test procedure. Simultaneous confidence intervals for certain contrasts of the location parameters are derived by inverting the proposed test statistic. A simulation study shows that, in comparison to existing tests, the new test statistic is more powerful in detecting umbrella type alternatives when the samples are derived from exponential distributions. As an extension, the use of the critical constants for comparing Pareto distribution parameters is discussed.

Multiple hypothesis testing to establish whether treatment is
Aldo Solari; Salmaso Luigi, Pesarin Fortunato
Department of Chemical Process Engineering, University of Padova, Italy
Experiments are often carried out to establish whether treatment is “better” than control with respect to a multivariate response variable, sometimes referred to as multiple endpoints. However, in order to develop suitable tests, we have to specify the notion of “better”. To formulate the problem, let X and Y denote the k-variate responses associated with control and treatment, respectively. We may be interested in testing H0: “X and Y are equal in distribution” against H1: “X is stochastically smaller than Y and not H0”, where the definition of 'stochastically smaller' is given in [1]. If a test rejects H0, then it does not necessarily follow that there is evidence to support H1, unless the latter is the complement of the null hypothesis [2]. Hence we must suppose that “X is stochastically smaller than Y” is known a priori, i.e. either H0 or H1 is true. Under this assumption, we prove that testing H0 against H1 is equivalent to the union-intersection (UI, [3]) testing formulation based on marginal distributions. However, this is not the only possible formulation for the treatment to be preferred to the control. It may be appropriate to show that the former is not inferior, i.e. not much worse, on any of the endpoints and is superior on at least one endpoint, resulting in an intersection-union (IU, [4]) combination of IU and UI testing problems [5].
For both formulations of “better”, we propose a multiple testing procedure based on combining dependent permutation tests [6], and an application is presented.

[1] Marshall, A. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York.
[2] Silvapulle, M.J. and Sen, P.K. (2005). Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. Wiley, New Jersey.
[3] Roy, S. (1953). On a heuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics, 24:220-238.
[4] Berger, R.L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24:295-300.
[5] Röhmel, J., Gerlinger, C., Benda, N. and Läuter, J. (2006). On Testing Simultaneously Non-Inferiority in Two Multiple Primary Endpoints and Superiority in at Least One of Them. Biometrical Journal, 48:916-933.
[6] Pesarin, F. (2001). Multivariate Permutation Tests with Applications in Biostatistics. Wiley, Chichester.

Flexible group-sequential designs for clinical trials with treatment selection
Nigel Stallard; Tim Friede
Warwick Medical School, University of Warwick, UK
Most statistical methodology for phase III clinical trials focuses on the comparison of a single experimental treatment with a control treatment. Recently, however, there has been increasing interest in methods for trials that combine the definitive analysis associated with phase III clinical trials with the treatment selection element of a phase II clinical trial.

A group-sequential design for clinical trials that involve treatment selection was proposed by Stallard and Todd (Statistics in Medicine, 22, 689-703, 2003). In this design, the best of a number of experimental treatments is selected on the basis of data observed at the first of a series of interim analyses. This experimental treatment then continues together with the control treatment to be assessed in one or more further analyses. The method was extended by Kelly, Stallard and Todd (Journal of Biopharmaceutical Statistics, 15, 641-658, 2005) to allow more than one experimental treatment to continue beyond the first interim analysis. This design controls the type I error rate under the global null hypothesis, but may not control error rates under individual null hypotheses if the treatments selected are not the best performing.

In some cases, for example when additional safety data are available, the restriction that the best performing treatments continue may be unreasonable. This talk will describe an extension of the approach of Stallard and Todd that controls the type I error rates under individual null hypotheses whilst allowing the experimental treatments that continue at each stage to be chosen in any way.

Compatible simultaneous lower confidence bounds for the Holm procedure and other closed Bonferroni based tests
Klaus Strassburger; Frank Bretz
German Diabetes Center, Leibniz-Institute at the Heinrich–Heine-University Düsse
In this contribution we present simultaneous confidence intervals that are compatible with a certain class of one-sided closed test procedures using weighted Bonferroni tests for each intersection hypothesis. The class of multiple test procedures covered in this talk includes gatekeeping procedures based on Bonferroni adjustments, fixed sequence procedures, the simple weighted or unweighted Bonferroni procedure by Holm and the fallback procedure. These procedures belong to a class of shortcut procedures, which are easy to implement. It will be shown that the corresponding confidence bounds have a straightforward representation. For the step-down procedure of Holm we illustrate the construction of compatible confidence bounds with a numerical example. The resulting bounds will be compared with those of the classical single-step procedure. Assets and drawbacks will be discussed.

Multiple treatment comparison based on a non-linear binary dynamic model
Brajendra Sutradhar; Vandna Jowaheer
Memorial University of Newfoundland, Canada
When an individual patient receives one of multiple treatments and provides repeated binary responses over a short period of time, efficient comparison of the treatment effects requires taking the longitudinal correlations of the binary responses into account. In this talk, we use a non-linear binary dynamic model that allows the full range for correlations and estimate the regression effects, including the treatment effects, by using the GQL (generalized quasi-likelihood) approach, which provides consistent as well as efficient estimates. We then demonstrate how to test the treatment effects based on the asymptotic distributions of their estimators.

A Weighted Hochberg Procedure
Ajit Tamhane; Lingyun Liu
Northwestern University
It is often of interest to differentially weight the hypotheses in terms of their importance. Let $H_1,\ldots,H_n$ be $n \geq 2$ null hypotheses with prespecified positive weights $w_1,\ldots,w_n$ which add up to 1, and with p-values $p_1,\ldots,p_n$, respectively. It is desired to test them, taking into account their weights, while controlling the type I familywise error rate (FWER) at a designated level $\alpha$. The well-known weighted Bonferroni (WBF) test rejects any $H_i$ with $p_i \leq w_i\alpha$. Weighted Holm (WHM) and weighted Simes (WSM) procedures for this problem were proposed by Holm (1979), Hochberg and Liberman (1994) and Benjamini and Hochberg (1997); however, a weighted Hochberg (WHC) procedure is lacking. Benjamini and Hochberg proposed the following step-down WHM procedure: Let $p_{(1)} \leq \cdots \leq p_{(n)}$ be the ordered p-values, and let $H_{(1)}, \ldots, H_{(n)}$ and $w_{(1)}, \ldots, w_{(n)}$ be the corresponding hypotheses and weights, respectively. Then reject $H_{(i)}$ iff $p_{(j)} \leq [w_{(j)}/\sum_{k=j}^n w_{(k)}]\alpha$ for $j=1, \ldots, i$; otherwise accept all remaining hypotheses. They also proposed the following WSM test: Reject $H_0= \bigcap_{i=1}^n H_i$ iff
\[ p_{(i)} \leq \frac{\sum_{k=1}^i w_{(k)}}{\sum_{k=1}^n w_{(k)}}\alpha \]
for some $i=1, \ldots, n$. We consider the following WHC procedure that uses the same critical constants as WHM given above, but operates in the step-up manner: Accept $H_{(i)}$ iff $p_{(j)} > [w_{(j)}/\sum_{k=j}^n w_{(k)}]\alpha$ for $j=n, \ldots, i$; otherwise reject all remaining hypotheses. We show that this procedure is not closed in general in the sense of Marcus, Peritz and Gabriel (1976) under the WSM test for subset intersection hypotheses except when the weights are equal. In the course of this demonstration we fill the gap in the incomplete closure proof given by Hochberg (1988) for the equal weights case. Also, a direct proof based on finding a lower bound on the probability of accepting all true hypotheses (see, e.g., Liu 1996) fails for unequal weights. However, simulation studies indicate that WHC does control FWER in the limited number of cases that we have studied. We propose a conservative version of WHC using the critical matrix approach of Liu (1996) and compare its conservatism with WHC in the simulation study.
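
The step-down WHM and step-up WHC rules stated above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example p-values and weights are assumptions, not taken from the talk, and, as the abstract notes, FWER control of WHC for unequal weights is supported by simulation only.

```python
def _critical_constants(p, w, alpha):
    """Return (order, consts): indices sorted by p-value and the constants
    c_(j) = alpha * w_(j) / sum_{k >= j} w_(k) shared by WHM and WHC."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    consts = [0.0] * len(p)
    tail = 0.0
    # Accumulate the tail sums of the ordered weights from the largest p-value down.
    for pos in range(len(p) - 1, -1, -1):
        tail += w[order[pos]]
        consts[pos] = alpha * w[order[pos]] / tail
    return order, consts


def weighted_holm(p, w, alpha=0.05):
    """Step-down: reject from the smallest p-value until one exceeds its constant."""
    order, c = _critical_constants(p, w, alpha)
    reject = [False] * len(p)
    for pos, i in enumerate(order):
        if p[i] > c[pos]:
            break
        reject[i] = True
    return reject


def weighted_hochberg(p, w, alpha=0.05):
    """Step-up: starting from the largest p-value, find the first one at or below
    its constant and reject it together with all smaller p-values."""
    order, c = _critical_constants(p, w, alpha)
    reject = [False] * len(p)
    for pos in range(len(p) - 1, -1, -1):
        if p[order[pos]] <= c[pos]:
            for q in range(pos + 1):
                reject[order[q]] = True
            break
    return reject


# Hypothetical example: three hypotheses, the first weighted most heavily.
p_values = [0.01, 0.04, 0.03]
weights = [0.5, 0.25, 0.25]
print(weighted_holm(p_values, weights))      # step-down decision
print(weighted_hochberg(p_values, weights))  # step-up decision
```

Because both procedures use the same constants, every hypothesis rejected by the step-down version is also rejected by the step-up version; the two differ only when a larger p-value meets its constant while a smaller one does not.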

Unbiased estimation after modification of a group sequential design
Nina Timmesfeld; Helmut Schäfer, Hans-Helge Müller
Institute of Medical Biometry and Epidemiology, Philipps-University Marburg
It is well known that classical group-sequential designs perform well in terms of expected sample size for various effect sizes, while the type I and type II error rates are controlled. For ethical and economic reasons such a design is chosen in many clinical trials. Even when a study has been planned carefully, a design change may become reasonable. The design can be changed with control of the type I error rate by the method of Müller and Schäfer (2004) at any time during the course of the trial.

At the end of a study, additional inference is required, such as confidence bounds and estimates for the effect size. In the case of group sequential designs an unbiased estimator can be obtained by the method of Liu and Hall (1999).

In this talk, we will present a method to modify this estimator to keep the unbiasedness after design modifications, in particular after modification of the sample size.

Müller HH, Schäfer H. A general statistical principle for changing a design any time during the course of a trial. Statistics in Medicine 2004; 23:2497–2508.

Liu A, Hall W. Unbiased estimation following a group sequential test. Biometrika 1999; 86:71–78.

Sample size calculation for microarray data analysis using normal mixture model
Masaru Ushijima
Japanese Foundation for Cancer Research
Sample size calculation is an important procedure when designing a microarray study, especially for medical research. This paper concerns sample size calculation in the identification of differentially expressed genes between two patient groups. We use a mixture model, involving differentially expressed and non-differentially expressed genes.
To calculate the sample size, parameters to be given are as follows: (1) the number of differentially expressed genes, (2) the distribution of the true differences, (3) Type I error rate (e.g. FDR, FWER), (4) statistical power (e.g. sensitivity). We propose a sample size calculation method using FDR, family-wise power proposed by Tsai et al. (Bioinformatics, 2005, 21:1502-8), and a normal mixture model. The sample sizes for two-sample t-test are computed for several settings and the simulation studies are performed.

A new method to identify significant endpoints in a closed test setting
Carlos Vallarino; Joe Romano, Michael Wolf, Dick Bittman
Takeda Pharmaceuticals NA
We present a new multiple testing procedure that has a maximin property under the normal assumption. The new method alters the rejection region of the simple sum test to make it consonant, i.e. to guarantee that rejection of the intersection hypothesis, in a closed test setting, implies the significance of at least one endpoint. Consonance is a desirable property which increases the ability to reject false individual null hypotheses. Designed to perform well when testing related endpoints, the new procedure is applied to PROactive, a cardiovascular (CV)-outcome trial of patients with type 2 diabetes and CV-disease history. Had the PROactive trial considered its two main endpoints as co-primary, the new method shows how efficacy for one key endpoint could have been established.

Controversy? What controversy? - An attempt to structure the debate on adaptive designs
Marc Vandemeulebroecke
Novartis Pharma AG
From their beginnings, concepts for consecutive analyses of accumulating data have evoked lively debate. Classical sequential analysis has been provocatively criticized, and group sequential approaches have been controversially discussed. More recently, the merits and pitfalls of adaptive designs have been passionately debated.

Starting from striking examples, we will try in this talk to dissect the debate. We identify what we consider the main discussion points, sketch their scope, and ponder their relative importance. We propose to standardize the terminology and make it more precise. We hope that this can contribute to the creation of a frame of reference for the current controversy on adaptive designs.

FDR control for discrete test statistics
Anja Victor; Scheuer C, Cologne J, Hommel G
Institute of medical biometry, epidemiology and informatics, University Mainz, G
In genetic association studies considering, e.g., Single Nucleotide Polymorphisms (SNPs), one deals with categorical data, and dependencies between SNPs may occur (because of linkage disequilibrium, LD). Additionally, genetic association studies cover many different study situations, ranging from genome-wide scans to the examination of just a few selected candidate loci. The proportion of true null hypotheses varies greatly between these situations, which influences FDR control.
We will focus on multiple testing procedures that take the categorical structure of the SNP data into account. The most popular FWER control procedure for discrete data is Tarone’s procedure (Tarone 1990). However, Tarone’s procedure is not monotone in the α-level. Therefore Hommel & Krummenauer published an improvement (Hommel & Krummenauer 1998). Recently Gilbert (Gilbert 2005) transferred Tarone’s procedure to FDR control by means of the explorative Simes procedure (Simes 1986, Benjamini & Hochberg 1995). However, in Gilbert’s procedure the boundary for the p-values finally attained by the Simes procedure can be higher than the boundary used in the preceding selection of hypotheses for the “Tarone subset”, so that small p-values outside the “Tarone subset” may not be rejected while larger ones inside it are.
We discuss how the Hommel & Krummenauer procedure can be extended to FDR control and how Gilbert’s procedure can be improved. Additionally, we examine the advantages of using test procedures adapted to discrete test statistics in genetic association studies. To this end, we compare Gilbert’s FDR-controlling procedure with the Hommel & Krummenauer procedure and additionally with classical FWER-controlling procedures and the classical FDR-controlling procedure. Results suggest that an increase in power from exploiting the discrete nature of the data can only be achieved when the number of subjects is small. The superiority of FDR control is more prominent if a larger proportion of null hypotheses are false.

Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B 57, 289–300.
Gilbert, P.B. (2005) A modified false discovery rate multiple-comparisons procedure for discrete data, applied to human immunodeficiency virus genetics. Applied Statistics 54, 143–158.
Hommel, G. and Krummenauer, F. (1998) Improvements and modifications of Tarone’s multiple test procedure for discrete data. Biometrics 54, 673–681.
Simes, R.J. (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751–754.
Tarone, R.E. (1990) A modified Bonferroni method for discrete data. Biometrics 46, 515–522.
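
For readers unfamiliar with the Tarone subset mentioned in the abstract above, the following is a minimal Python sketch of Tarone's modified Bonferroni rule as it is commonly described. It assumes that the minimum attainable p-value of each discrete test is supplied by the caller; the function name, argument names and example numbers are illustrative assumptions, not part of the talk.

```python
def tarone(pvalues, min_pvalues, alpha=0.05):
    """Tarone-type modified Bonferroni procedure for discrete tests.

    pvalues[i]     -- observed p-value of hypothesis i
    min_pvalues[i] -- smallest p-value the discrete test of hypothesis i
                      can attain (assumed to be known to the caller)
    Returns (K, reject), where K is the effective Bonferroni divisor.
    """
    n = len(pvalues)
    if n == 0:
        return 0, []
    for k in range(1, n + 1):
        # Hypotheses that could still be rejected at level alpha / k.
        subset = [i for i in range(n) if min_pvalues[i] <= alpha / k]
        # K is the smallest k whose subset holds at most k hypotheses;
        # the loop always stops by k = n.
        if len(subset) <= k:
            break
    reject = [False] * n
    for i in subset:
        if pvalues[i] <= alpha / k:
            reject[i] = True
    return k, reject


# Hypothetical example with five discrete tests.
p_obs = [0.02, 0.03, 0.04, 0.3, 0.6]
p_min = [0.001, 0.002, 0.03, 0.2, 0.5]
print(tarone(p_obs, p_min, alpha=0.05))
```

In this example only two tests can attain a p-value of 0.025 or less, so the divisor is 2 rather than 5, and the first hypothesis is rejected although it would not be under the plain Bonferroni threshold of 0.01.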

An Application of the Closed Testing Principle to Enhance One-Sided Confidence Regions for a Multivariate Location Parameter
Michael Vock
University of Bern, Institute of Mathematical Statistics
If a one-sided test for a multivariate location parameter is inverted, the resulting confidence region may have an unpleasant shape. In particular, if the null and alternative hypotheses are both composite and complementary, the confidence region usually does not resemble the alternative parameter region in shape, but rather a reflected version of the null parameter region.

We illustrate this effect and show one possibility of obtaining confidence regions for the location parameter that are smaller and have a more suitable shape for the type of problems investigated. This method is based on the closed testing principle applied to a family of nested hypotheses.

Proportion of true null hypotheses in non high-dimensional multiple testing problems: procedures and comparison
Mario Walther; Claudia Hemmelmann, Rüdiger Vollandt
Institute of medical statistics, computer science and documentation, Friedrisch-
When testing multiple hypotheses simultaneously, a quantity of interest is the proportion of true null hypotheses. Knowledge about this proportion can improve the power of different multiple test procedures which control the generalized family-wise error rate, the false discovery rate or the false discovery proportion. For instance, in stepwise procedures the critical values with which the p-values have to be compared can be increased if an upper bound of the proportion of true null hypotheses is known.
Many authors have been concerned with establishing methods for estimating the proportion of true null hypotheses. Most of the procedures introduced are based on several thousand p-values, which are often assumed to be independent. These procedures work very well; however, problems arise when the dimension of the multiple testing problem is only in the few hundreds and the data are correlated. The latter is the case, for example, in EEG, proteomic or fMRI data.
Within this framework we pose the question of what constitutes a "good" estimate of the proportion of true null hypotheses. We therefore introduce several criteria to evaluate the efficiency of the estimates. One criterion will be the probability that a certain estimation method overestimates the proportion of true null hypotheses. Another criterion will be whether the confidence interval of the proportion of true null hypotheses is contained in a range of pre-specified accuracy.
In this talk, we will explain methods for estimating the proportion of true null hypotheses which are also suitable for non-high-dimensional multiple testing problems with correlated p-values. Furthermore, we will evaluate and compare the quality of the estimators with respect to the introduced criteria in a simulation study.
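
To make the quantity being estimated concrete, here is a minimal Python sketch of one widely used estimator of the proportion of true null hypotheses, the threshold-based estimator usually attributed to Storey. The abstract does not say which estimators are compared, so this choice is an assumption made for illustration only, as are the function name and the tuning value 0.5. It relies on p-values of true nulls being roughly uniform, which is the kind of assumption the abstract describes as problematic for a few hundred correlated tests.

```python
def storey_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true nulls as
    #{p_i > lam} / (m * (1 - lam)), truncated at 1.

    lam is a tuning parameter in [0, 1); p-values above lam are assumed
    to come almost entirely from true null hypotheses.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(pvalues)
    if m == 0:
        raise ValueError("at least one p-value is required")
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


# Hypothetical example: eight p-values, two of them above 0.5.
print(storey_pi0([0.01, 0.02, 0.03, 0.04, 0.6, 0.9, 0.2, 0.1]))
```

An estimate below 1 can then be used to relax critical values, as described in the abstract; an overestimate keeps the procedure conservative, which is presumably why the probability of overestimation is listed there as a criterion.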

Sample size re-estimation and hypotheses tests for trials with multiple treatment arms
Jixian Wang; Franz Koenig
Novartis Pharma AG
Sample size re-estimation (SSRE) provides a useful tool to change a design during the conduct of a study when an interim look reveals that the original sample size is inadequate. For trials comparing an active treatment with a control, a common way to control the type I error is to construct an asymptotically normally distributed weighted test statistic combining the information before and after the interim look.
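
The two-arm weighted statistic described above can be illustrated with a short Python sketch. It assumes one common construction, in which the weights are fixed in advance from the originally planned stage sizes and are not updated after the sample size is changed; the function and parameter names are hypothetical and not taken from the talk.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(z1, z2, n1_planned, n2_planned):
    """Combine stage-wise z-statistics with prespecified weights.

    z1, z2      -- standardized statistics computed separately from the data
                   before and after the interim look
    n1_planned,
    n2_planned  -- originally planned stage sizes; they fix the weights, so
                   the squared weights sum to 1 whatever sample size is
                   actually used in the second stage
    """
    total = n1_planned + n2_planned
    w1 = sqrt(n1_planned / total)
    w2 = sqrt(n2_planned / total)
    return w1 * z1 + w2 * z2


def one_sided_p(z):
    """Upper-tail p-value of a standard normal statistic."""
    return 1.0 - NormalDist().cdf(z)


# Hypothetical example: equal planned stage sizes, second stage enlarged
# after the interim look (the enlargement does not enter the weights).
z = weighted_z(z1=1.2, z2=1.9, n1_planned=50, n2_planned=50)
print(z, one_sided_p(z))
```

Because the weights do not depend on the interim data, the combined statistic remains standard normal under the null hypothesis when the stage-wise statistics are independent and standard normal, which is what allows the type I error to be controlled after the sample size has changed.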

We consider sample size re-estimation methods for comparing multiple active treatments with a control, where we allow the change of sample size for one arm to depend on the interim information across all arms. We propose several ways to construct weighted statistics combining the information before and after SSRE as well as related test procedures to control the overall type I error. When the change of sample size is proportional across all treatment arms, it is possible to construct statistics so that the Dunnett test can be used as if there was no SSRE. For arbitrary SSREs, we propose other procedures including a closed test based on weighted statistics with marginally standard normal distribution and a test using a multivariate generalization of weighted test statistics in combination with the closure principle. A practical example is used to illustrate the proposed approaches. The properties of the procedures are evaluated by simulations.

Resampling-Based Control of the False Discovery Rate under Dependence
Michael Wolf; Joseph Romano, Azeem Shaikh
University of Zurich
This paper considers the problem of testing s null hypotheses simultaneously while controlling the false discovery rate (FDR). The FDR is defined to be the expected value of the fraction of rejections that are false rejections (with the fraction understood to be 0 in the case of no rejections). Benjamini and Hochberg (1995) provide a method for controlling the FDR based on p-values for each of the null hypotheses under the assumption that the p-values are independent. Subsequent research has since shown that this procedure is valid under weaker assumptions on the joint distribution of the p-values. Related procedures that are valid under no assumptions on the joint distribution of the p-values have also been developed. None of these procedures, however, incorporate information about the dependence structure of the test statistics. This paper develops methods for control of the FDR under weak assumptions that incorporate such information and, by doing so, are better able to detect false null hypotheses. We illustrate this property via a simulation study and an empirical application to the evaluation of hedge funds.
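
As a point of reference for the p-value-based method of Benjamini and Hochberg (1995) mentioned above, here is a minimal Python sketch of their step-up rule. It is the baseline procedure only and does not use the dependence structure that the talk sets out to exploit; the function name and example values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.

    Finds the largest rank k with p_(k) <= k * q / s, where s is the number
    of hypotheses, and rejects the k hypotheses with the smallest p-values.
    """
    s = len(pvalues)
    order = sorted(range(s), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / s:
            k = rank  # keep the largest rank that meets its threshold
    reject = [False] * s
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical example with four hypotheses.
print(benjamini_hochberg([0.001, 0.02, 0.03, 0.5], q=0.05))
```

The step-up character matters: with four p-values all equal to 0.04 and q = 0.05, none meets the first threshold of 0.0125, yet all four are rejected because the largest one meets the last threshold of 0.05.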

On Identification of Inferior Treatments Using the Newman-Keuls Type Procedure
Samuel Wu; Weizhen Wang, David Annis
University of Florida
We are concerned with selecting a subset of treatments such that the probability of including all best treatments exceeds a prespecified level. In this paper, we provide a stochastic ordering of the Studentized range statistics under a balanced one-way ANOVA model. Based on this result we show that, when restricted to multiple comparisons with the best, the Newman-Keuls type procedure strongly controls the experimentwise error rate for a sequence of null hypotheses regarding the number of largest treatment means.

Knowledge-based approach to handling multiple testing in functional genomics studies
Adam Zagdanski; Przemyslaw Biecek, Rafal Kustra
University of Toronto, Canada and Wroclaw University of Technology, Poland
We propose a novel method for the multiple testing problem inherent in functional genomics studies. One novelty of the method is that it directly incorporates prior knowledge about gene annotations to adjust the p-values. We describe a general methodology for performing knowledge-based multiple testing adjustment and focus on an application of this approach in Gene Set Functional Enrichment Analysis (GSFEA). We apply and evaluate our method using a database of known Protein-Protein Interactions to perform large-scale gene function prediction. In this study the Gene Ontology Biological Process (GO-BP) taxonomy is employed as the knowledge-base standard for describing gene functions. An extensive simulation study is carried out to investigate the behaviour of the proposed adjustment procedure under different scenarios. Empirical analysis, based on both real and simulated data, reveals that our approach yields an improvement in a number of performance criteria, including the empirical False Discovery Rate (FDR). We derive theoretical connections between our method and the stratified False Discovery Rate approach proposed by [1], and also describe similarities to the weighted p-value FDR control introduced recently by [2]. Finally, we show how our method can be adapted to other multiple hypothesis problems where some form of prior information about the relationships among tests is available.

[1] L. Sun, R.V. Craiu, A.D. Paterson, S.B. Bull (2006) “Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies”. Genet Epidemiol. 2006 Sep, 30(6):519-30.

[2] Ch.R. Genovese, K. Roeder and L. Wasserman (2006) “False discovery control with p-value weighting”. Biometrika 2006, 93(3):509-524.

Multi-stage designs controlling the False Discovery or the Family Wise Error Rate
Sonja Zehetmayer; Peter Bauer, Martin Posch
Section of Medical Statistics, Medical University of Vienna, Austria
When a large number of hypotheses are investigated, conventional single-stage designs may lack power due to low sample sizes for the individual hypotheses. We propose multi-stage designs in which, at each interim analysis, 'promising' hypotheses are selected for investigation in further stages. Given a fixed overall number of observations, this allows more observations to be spent on promising hypotheses than in single-stage designs, where the observations are equally distributed among all hypotheses considered. We propose multi-stage procedures controlling either the Family Wise Error Rate (FWE) or the False Discovery Rate (FDR) and derive optimal stopping boundaries and sample size allocations (across stages) to maximize the power of the procedure.
Optimized two-stage designs lead to a considerable increase in power compared to the classical single-stage design. We show that going from two to three stages additionally leads to a distinctive increase in power. Adding a fourth stage leads to a further improvement, which is, however, less pronounced. Surprisingly, we found only small differences in power between optimized integrated designs, where the data of all stages is used in the final test statistics, and optimized pilot designs where only the data from the final stage is used for testing. However, the integrated design controlling the FDR appeared to be more robust against misspecifications in the planning phase. Additionally, we found that with increasing number of stages the drop in power when controlling the FWE instead of the more liberal FDR becomes negligible.
Our investigations show that the crucial point is not the choice of the error rate or the type of design (integrated or pilot), but the sequential nature of the trial where non-promising hypotheses are dropped in early phases of the experiment so that test decisions among the selected hypotheses can be based on considerably larger sample sizes compared to the classical single-stage design.

Adaptive seamless designs for subpopulation selection based on time to event endpoints
Emmanuel Zuber; Werner Brannath, Michael Branson, Frank Bretz, Paul Gallo, Martin Posch, Amy Rac
Novartis Pharma AG, Basel, Switzerland
A targeted therapy might primarily benefit a sub-population of patients. Thus, the ability to select a sensitive patient population may be crucial for the development of such a therapy. Traditionally, one would need to start with a hypothesis generating phase II study to identify a sub-population. The specific sensitivity of that sub-population would have to be confirmed independently in a second phase II study, before a phase III study could be run in the selected target population. A formal claim of efficacy would be based on the phase III data only.

A more efficient approach is presented using an adaptive phase II/III seamless design, to combine into a single two-stage study the selection of either the full or the sub-population, with the proof of efficacy.
From a separate concomitant exploratory study, a sub-population is to be identified independently before the end of stage 1 of the combined phase II/III study. At the end of stage 1, Bayesian tools are used to confirm the hypothesis of a more sensitive sub-population. One may then decide at this step to adapt the conduct of the trial by limiting to that sub-population the further recruitment into stage 2, and by choosing the hypothesis testing strategy. Thus, the independent confirmation of the sub-population is more reliable, being made on the same clinical endpoint and in the same setting as the final phase III demonstration of efficacy. The latter is efficiently based on the combined data from stage 1 and 2, in the selected population, with an adapted testing strategy.

The use of the adaptive design methodology with a time to event endpoint relies on the asymptotic independent increment property of the logrank test statistics. The overall type I error rate is controlled thanks to the concomitant use of adaptive design methodology and of the closed testing principle for the testing in the different populations. The use of Bayesian decision tools such as predictive powers and a posterior distribution of treatment effect does not affect the overall type I error rate. It allows the uncertainty of interim data and external information to be accounted for, in a statistical manner, in the adaptation decision making.

Simulations are necessary for the design of such a complex study, to determine sample size and to assess its operating characteristics as a function of the Bayesian decision rules, and of the unknown prevalence of the sub-population. Properties of treatment effect estimates and the preservation of trial integrity after its adaptation are also studied by simulations, compared to more conventional group sequential designs.


© MCP 2007 | Martin Posch & Franz König | info@mcp-conference.org | Section of Medical Statistics
Medical University of Vienna | Spitalgasse 23 | A-1090 Vienna | ++43 / (0)1 / 40400 / 7488 | updated: 10. April, 2021