5th International Conference on Multiple Comparison Procedures

MCP 2007 Vienna
Vienna, Austria | July 8-11, 2007


The conference will be held from July 9 to July 11. On July 8, several pre-conference courses will be offered.

Multiple inference in medical research – an experience
Peter Bauer
Medical University of Vienna
The history of applications of multiple comparison procedures in medical research over the last four decades is given from a personal perspective, and some milestones are sketched. The reaction to new statistical methodology in the medical literature is investigated by comparing how multiplicity issues were handled in medical journal articles twenty years ago and how they are handled today. Some of the multiplicity issues in the rapidly developing area of gene expression and gene association studies are discussed; these studies have provoked new concepts and have helped to establish multiplicity as an important point to consider in the scientific community. Some arguments a consulting medical statistician may expect from his or her clients are sketched, followed by some pragmatic concluding comments.

Invited Talks:

Multi-stage gatekeeping procedures with clinical trial applications
Alex Dmitrienko; Ajit Tamhane
Eli Lilly and Company
This talk introduces a general approach to constructing gatekeeping procedures for multiple testing problems arising in clinical trials with hierarchically ordered objectives (primary/secondary endpoints, dose-control comparisons, etc.). The approach is applied to set up gatekeeping procedures based on popular multiple tests (Holm, fallback and Hochberg tests), resampling and parametric tests. The resulting procedures have a straightforward multi-stage structure that facilitates the implementation of gatekeeping procedures and communication of the results to non-statisticians. One can also account for logical restrictions among multiple analyses and improve the power of individual tests by eliminating comparisons that are no longer clinically meaningful. The general approach is illustrated using clinical trial examples.

Multiple Testing Procedures with Applications to Genomics
Sandrine Dudoit; Mark J. van der Laan
University of California
In this two-part presentation, we will provide an overview of a general methodology for multiple hypothesis testing and applications to a range of large-scale testing problems in biomedical and genomic research.

Specifically, we will describe resampling-based single-step and stepwise multiple testing procedures for controlling a broad class of Type I error rates, defined as tail probabilities and expected values for arbitrary functions of the numbers of Type I errors and rejected hypotheses (e.g., generalized family-wise error rate, tail probability for the proportion of false positives among the rejected hypotheses, false discovery rate).
Unlike existing approaches, the procedures are based on a joint null distribution for the test statistics and provide Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics. The multiple testing results are reported in terms of rejection regions, parameter confidence regions, and adjusted $p$-values.
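In the notation used later in the abstract, with $V$ the number of Type I errors and $R$ the number of rejected hypotheses, the three error rates listed above can be written compactly (a standard formulation of these quantities, not quoted from the talk):

```latex
\mathrm{gFWER}(k) = \Pr(V > k), \qquad
\mathrm{TPPFP}(q) = \Pr\!\left(\tfrac{V}{R} > q\right), \qquad
\mathrm{FDR} = E\!\left[\tfrac{V}{R}\right],
\quad \text{with } \tfrac{V}{R} := 0 \text{ when } R = 0.
```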

A key ingredient of our proposed procedures is the null distribution used in place of the unknown joint distribution of the test statistics. We provide a general characterization for a proper test statistics null distribution, which leads to the explicit construction of two main types of test statistics null distributions. The first null distribution is the asymptotic distribution of a vector of null shift and scale-transformed test statistics, based on user-supplied upper bounds for the means and variances of the test statistics for the true null hypotheses.
The most recent proposal defines the null distribution as the asymptotic distribution of a vector of null quantile-transformed test statistics, based on user-supplied marginal test statistics null distributions.

We will discuss joint resampling-based empirical Bayes procedures for controlling generalized tail probability and expected value error rates.
The approach involves specifying:
(i) a null distribution for vectors of null test statistics and
(ii) a distribution for random guessed sets of true null hypotheses.

By randomly sampling null test statistics and guessed sets of true null hypotheses, one obtains a distribution for an arbitrary guessed function of the numbers of false positives and rejected hypotheses, for any given vector of cut-offs for the test statistics. Cut-offs can then be chosen to control tail probabilities and expected values for this distribution at a user-supplied level.

Due to their generality and flexibility, our proposed multiple testing procedures are well-suited to address high-dimensional testing problems arising in different areas of application of statistics. We will conclude with an overview of applications in biomedical and genomic research, including: the identification of differentially expressed and co-expressed genes in high-throughput gene expression experiments, such as microarray experiments; tests of association between gene expression measures and biological annotation metadata (e.g., Gene Ontology); sequence analysis; the genetic mapping of complex traits using single nucleotide polymorphisms.

Our forthcoming book provides a detailed account of the theoretical foundations of our multiple testing methodology and discusses its software implementation in an R package available from Bioconductor (www.bioconductor.org) and its applications in biomedical and genomic research (Dudoit and van der Laan, 2007).

Aesthetics and power in multiple testing – a contradiction?
Gerhard Hommel
IMBEI, University of Mainz, Germany
It seems desirable that a multiple testing procedure be as powerful as possible, given a criterion for type I error control. However, there are important additional aspects to consider: 1. the pattern of decisions should be logical; 2. the decisions should be plausible, e.g., on scientific grounds or within a clinical trial; and 3. the decisions should be taken in such a way that they can also be communicated to non-statisticians. Moreover, there are aspects of aesthetics that can be considered relevant. Aesthetics is certainly to some extent subjective, but many people would agree that the closure test or the Bonferroni-Holm procedure (say) has an aesthetic component.
I will consider in my talk different concepts related to multiple testing procedures and discuss them under the aspects above. In particular, the following issues are discussed:
· Coherence and consonance;
· Monotonicity of decisions (dependent on p-values);
· Exchangeability;
· Criteria for control of type I errors;
· Concepts of power.
To illustrate the ideas, I will consider the “fallback procedure” (Wiens, 2003; Wiens and Dmitrienko, 2005) as an example and discuss some properties of this procedure.

Wiens, B.L. (2003). A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharm. Stat. 2, 211-215.
Wiens, B.L. and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family of hypotheses. J. Biopharm. Stat. 15, 929-942.
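The fallback procedure discussed above has a simple recursive description: each hypothesis in the fixed sequence receives a share $w_i \alpha$ of the overall level, and the level of any rejected hypothesis is carried forward to the next one. A minimal sketch (function name and interface are ours, not from the references):

```python
def fallback(pvals, weights, alpha=0.05):
    """Fallback procedure (Wiens, 2003): hypotheses are tested in a fixed
    order; hypothesis i gets its own share w_i * alpha of the level, and
    the level of any rejected hypothesis is carried forward."""
    assert abs(sum(weights) - 1.0) < 1e-9
    reject, carry = [], 0.0
    for p, w in zip(pvals, weights):
        level = carry + w * alpha          # own share plus carried-over alpha
        if p <= level:
            reject.append(True)
            carry = level                  # the full level passes on
        else:
            reject.append(False)
            carry = 0.0                    # an unused share is lost
    return reject
```

Note how, unlike the fixed-sequence test, a failed hypothesis does not stop the procedure: later hypotheses can still be tested at their own (smaller) shares of alpha.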

Modified Weighted Simes Tests in Group Sequential Designs
Willi Maurer
Novartis Pharma AG
Multiple hypothesis testing problems in which the joint distribution of the test statistics is not fully known and different weights or priorities are assigned to the hypotheses will be discussed. If there is no fully hierarchical ordering among the hypotheses, a general approach consists of gatekeeping procedures, usually based essentially on the Bonferroni inequality with unequal type I error probability allocation, applied to closed testing procedures. An improvement in power can be achieved by using the Simes inequality. Such scenarios arise, e.g., in trials investigating the effectiveness of cardiovascular and diabetes treatments in preventing or delaying cardiovascular events. The primary endpoints considered in these cases are often compound variables summarizing events in time, like fatal/nonfatal MI and stroke, revascularization, hospitalization for unstable angina, etc. So-called ‘hard’ and ‘soft’ endpoints are built from subsets of these events, where a ‘hard’ endpoint can be comprised of a subset of the events constituting a ‘soft’ endpoint. Such endpoints are usually highly correlated among themselves but may have unknown correlation with further endpoints like time to progression of diabetes. An additional complication is that such trials are usually of long duration, so interim decisions have to be taken in a group sequential or adaptive design setting. In this context, we will discuss issues that arise with respect to the validity of the Simes inequality under unequal allocation of the type I error probability and show that results for bivariate test statistics in the two-sided case can be extended to the one-sided case if the rejection region is slightly altered. The problem of applying Hochberg-type testing strategies in a group sequential setting is discussed, and various options regarding the allocation of spending functions in the arising repeated closed test situation are compared.
Much of the work and newer results presented have been done together with Werner Brannath, Frank Bretz and Sanat Sarkar.
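The weighted Simes test at the heart of the abstract can be sketched compactly; the code below assumes a setting in which the weighted Simes inequality is valid (which, as the abstract notes, is exactly the delicate point for unequal weights and one-sided tests), and the function name and interface are ours:

```python
def weighted_simes(pvals, weights, alpha=0.05):
    """Weighted Simes test of an intersection (global null) hypothesis.
    Sort the p-values ascending, carrying their weights along, and reject
    if p_(i) <= alpha * (cumulative weight up to i) for some i.
    The unweighted Simes test is the special case w_i = 1/m."""
    pairs = sorted(zip(pvals, weights))    # sort by p-value
    cum_w = 0.0
    for p, w in pairs:
        cum_w += w
        if p <= alpha * cum_w:
            return True
    return False
```

With equal weights this reduces to the classical Simes test; with unequal weights a small p-value carrying a large weight can trigger rejection where Bonferroni with the same weights would not.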

On Consequences of One-Sided Alternative Hypotheses for the Null Hypothesis
Joachim Röhmel
O’Neill (1997) defines the primary endpoint as ‘a clinical endpoint that provides evidence sufficient to fully characterize clinically the effect of a treatment in a manner that would support a regulatory claim for the treatment’. In a clinical trial comparing an experimental treatment with a reference, the situation may occur that two (or even more) primary endpoints are necessary to describe the experimental treatment’s benefit. Sometimes effects on important secondary endpoints will influence the judgement of the experimental treatment’s value. For these situations, multiple testing procedures have been developed to control the rate of false claims of superiority or non-inferiority. Often multiple testing procedures focus on the aim that the a priori target is met for at least one primary endpoint, and little attention is then given to those endpoints which failed to demonstrate the proposed effect. When taking the above definition of a primary endpoint seriously, however, assurance is needed that none of the “non-significant” endpoints is inferior. Therefore, the focus of interest in a situation with multiple primary endpoints should be the more specific minimal target of demonstrating superiority in one of them, given that non-inferiority is observed in the remaining ones. Several proposals exist in the literature for dealing with this or similar problems, but they prove insufficient or inadequate on closer inspection (e.g. Bloch et al. (2001, 2006) or Tamhane and Logan (2002, 2004)). In the talk I will focus on the case of two primary endpoints; many aspects, however, can be transferred easily to the general case. I propose a hierarchical three-step procedure, in which non-inferiority in both variables must be proven in the first step, superiority has to be shown by a bivariate test (e.g. Holm (1979), O’Brien (1984), Hochberg (1988), a bootstrap test (Wang (1998)), or Läuter (1996)) in the second step, and superiority in at least one variable has to be verified in the third step by a corresponding univariate test. Among the above-mentioned bivariate superiority tests, Läuter’s SS test and the Holm procedure are preferable because they have been proven to control the type I error strictly, irrespective of the correlation structure among the primary variables and the sample size. A simulation study reveals that the power of the bivariate test depends to a considerable degree on the correlation and on the magnitude of the expected effects of the two primary endpoints. This part of the talk is based on joint work with Christoph Gerlinger (Schering), Norbert Benda (Novartis), and Jürgen Läuter (University of Magdeburg). I also explore consequences for setting up null hypotheses in situations where similar problems with directional alternative hypotheses might arise, for example in stratified clinical trials.

O’Neill, R.T. (1997). Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Contr. Clin. Trials 18, 550-556.
Bloch, D.A., Lai, T.L. and Tubert-Bitter, P. (2001) One-sided tests in clinical trials with multiple endpoints. Biometrics 57, 1039-1047.
Bloch, D.A., Lai, T.L., Su, Z. and Tubert-Bitter, P. A combined superiority and non-inferiority approach to multiple endpoints in clinical trials. Statistics in Medicine (in press). DOI: 10.1002/sim.2611
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800-802.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65-70.
Läuter, J. (1996). Exact t and F tests for analyzing clinical trials with multiple endpoints. Biometrics 52, 964-970.
O'Brien, P.C. (1984). Procedures for comparing samples with multiple endpoints. Biometrics 40, 1079-1087.
Tamhane, A.C. and Logan, B.R. (2002) Accurate critical constants for the one-sided approximate likelihood ratio test for a normal mean vector when the covariance matrix is estimated. Biometrics 58, 650-656.
Tamhane, A.C. and Logan, B.R. (2004). A superiority-equivalence approach to one-sided tests on multiple endpoints in clinical trials. Biometrika 91, 715-727.
Wang, S.-J.(1998). A closed procedure based on Follmann's test for the analysis of multiple endpoints. Communications in Statistics Theory and Methods 27, 2461-2480.

Control of Generalized Error Rates in Multiple Testing
Joseph P. Romano
Stanford University
Consider the problem of testing s hypotheses simultaneously. The usual approach restricts attention to procedures that control the probability of even one false rejection, the familywise error rate (FWER). If s is large, one might be willing to tolerate more than one false rejection, thereby increasing the ability of the procedure to correctly reject false null hypotheses. One possibility is to replace control of the FWER by control of the probability of k or more false rejections, which is called the k-FWER. We derive both single-step and stepdown procedures that control the k-FWER in finite samples or asymptotically, depending on the situation. We also consider the false discovery proportion (FDP) defined as the number of false rejections divided by the total number of rejections (and defined to be 0 if there are no rejections). The false discovery rate proposed by Benjamini and Hochberg controls E(FDP). Here, the goal is to construct methods which satisfy, for a given $\gamma$ and $\alpha$, $P\{FDP >\gamma\} \leq \alpha$, at least asymptotically. In contrast to Bonferroni type methods, we construct methods (using resampling) that implicitly take into account the dependence structure of the individual test statistics in order to further increase the ability to detect false null hypotheses. This feature is also shared by related work of van der Laan, Dudoit and Pollard, but our methodology is quite different. Simulations demonstrate improved performance over currently available methods. (This talk is based on joint work with Michael Wolf of the University of Zurich.)
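As a point of reference for the stepdown procedures mentioned above, one standard Bonferroni-type k-FWER procedure (the Lehmann-Romano stepdown, valid under arbitrary dependence) can be sketched as follows; the resampling methods of the talk improve on this baseline by exploiting the joint distribution of the test statistics. Function name and interface are ours:

```python
def kfwer_stepdown(pvals, k=1, alpha=0.05):
    """Holm-type stepdown controlling the k-FWER (probability of k or
    more false rejections), with critical values
        alpha_i = k*alpha / s             for i <= k,
        alpha_i = k*alpha / (s + k - i)   for i  > k,
    where s is the number of hypotheses.  k = 1 recovers the ordinary
    Holm procedure."""
    s = len(pvals)
    order = sorted(range(s), key=lambda i: pvals[i])
    reject = [False] * s
    for step, idx in enumerate(order, start=1):
        crit = k * alpha / (s if step <= k else s + k - step)
        if pvals[idx] > crit:
            break                          # stepdown stops at first failure
        reject[idx] = True
    return reject
```

Relaxing from k = 1 to larger k enlarges every critical value, which is the trade-off described in the abstract: more true rejections at the price of tolerating up to k - 1 false ones.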

Stepwise Testing of Multiple Dose Groups Against a Control With Ordered Endpoints
James Francis Troendle
Hierarchical gatekeeper methods exist for testing multiple dose clinical trials with multiple endpoints. This paper considers the case of ordered endpoints where an endpoint will only be tested at a given dose if all higher endpoints were found significant at that dose. Existing stepwise procedures based on the Bonferroni procedure are compared to new methods that incorporate correlation either through an assumption of Gaussian distribution or through resampling. The methods are compared by simulation for power and control of the familywise error.
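One simple Bonferroni-based member of the family compared in the paper can be sketched as follows (a sketch of the general testing logic under our own simplifying assumptions, not necessarily the paper's procedure): endpoints are tested in their fixed order within each dose, and alpha is split across doses by plain Bonferroni.

```python
def ordered_endpoint_test(pmat, alpha=0.05):
    """Fixed-sequence testing of ordered endpoints within each dose,
    with a Bonferroni split of alpha across the doses.  pmat[d][e] is
    the p-value for endpoint e at dose d; endpoint e is tested at dose
    d only if all endpoints before it were rejected at that dose."""
    level = alpha / len(pmat)              # Bonferroni over doses
    reject = []
    for row in pmat:
        row_rej = []
        for p in row:
            if p <= level:
                row_rej.append(True)
            else:
                break                      # the chain stops at the first failure
        row_rej += [False] * (len(row) - len(row_rej))
        reject.append(row_rej)
    return reject
```

The methods of the paper that incorporate correlation (Gaussian or resampling-based) would replace the fixed Bonferroni level with larger, dependence-aware critical values.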

Prospective Strategies and Challenges to Adaptively Designing Genomic Biomarker Targeted Trials within Conventional Trials
Sue-Jane Wang
Center for Drug Evaluation and Research
In recent decades, translational research has increasingly aimed to uncover genomic biomarker signatures with the potential to differentiate the therapeutic effect between patients predicted to have good versus poor signatures. The advent of genomic biomarkers has gradually brought the awareness that phenotypically homogeneous patients may be heterogeneous at the genomic level. In this talk, I will present adaptive designs that prospectively account for genomically heterogeneous patient subpopulations and the performance characteristics of diagnostic tests. The statistical concept of biomarker qualification and the efficiency of pharmacogenomic clinical trials in view of personalized medicine will be exemplified. Aside from alpha allocation strategies and adaptive multiple hypothesis testing, challenges to adaptively designing pharmacogenomic targeted trials within conventional clinical trials will be elucidated via typical examples.

Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method
Peter H. Westfall; Randall D. Tobias
Texas Tech University
Powerful improvements are possible for multiple testing procedures when the hypotheses are logically related. Closed testing with alpha-exhaustive tests provides a unifying framework for developing such procedures, but can be computationally difficult and can be “nonmonotonic in p-values”. Royen (1989) introduced a “truncated” closed testing method for the case of all pairwise comparisons in the ANOVA that is monotonic in p-values. Shaffer (1986) developed a similar truncated procedure for more general comparisons, but using Bonferroni tests rather than alpha-exhaustive tests, and Westfall (1997) extended Shaffer's method to allow alpha-exhaustive tests. This paper extends Royen's method to general contrasts and proves that it is equivalent to the extended Shaffer procedure. For a large number k of contrasts, the method generally requires evaluation of O(2^{k}) critical values corresponding to subset intersection hypotheses and is computationally infeasible. The set of intersections is represented using a tree structure, and a branch-and-bound algorithm is used to search the tree and reduce the O(2^{k}) complexity by obtaining conservative “covering sets” that retain control of the familywise type I error rate (FWE). The procedure becomes less conservative the deeper the tree search, but computation time increases. In some cases with logical relations, even the more conservative covering sets provide much more power than standard methods. The method is general, computable, and often much more powerful than commonly used methods for multiple testing of general contrasts, as shown by applications to pairwise comparisons and response surfaces. In particular, with response surface tests, the method is computable with a complete tree search even when k is large, and can make many more “discoveries” than the standard FDR-controlling method.
The extended Shaffer-Royen method has recently been implemented in the SAS/STAT procedure PROC GLIMMIX; syntax and output will be shown.


Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method, to appear in Journal of the American Statistical Association.

Ranks of true positives in large scale genetics experiments
Russell D. Wolfinger; Dmitri Zaykin; Lev Zhivotovsky; Wendy Czika; Susan Shao
In the context of a large collection of statistical genetics tests in which the number of true positives (TPs) is small, we study the distribution of the ranks of TPs among the false positives (FPs). We investigate the relative efficiency of ranking measures and how many "best" results need to be screened to cover TPs with high probability, using a few different ways of assessing significance and adjusting for multiple testing. This way of looking at the problem can aid in optimally following up on initial significant findings and in planning of future large scale experiments.

A whole-genome association scan is a prominent example, where the number of tests, L, is now commonly in the hundreds of thousands. With modern high-throughput genotyping capabilities, L can be large simply from the number of measured genetic markers, which are usually single nucleotide polymorphisms (SNPs). L can then grow exponentially by considering tests of haplotypes constructed from all possible pairs of SNPs, all triplets, etc. We simulate markers that are in linkage disequilibrium, that is, have some correlation structure, typically blocked. The measure of association of the genetic markers with a binary or quantitative trait of interest is usually some kind of p-value, perhaps weighted towards effect size. Multiple testing methods investigated include Sidak and no adjustment at all.
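For illustration, the Sidak adjustment mentioned above and a toy version of the ranking question studied in the abstract can be sketched in a few lines (all numbers and names below are ours, chosen only to make the sketch concrete):

```python
import random

def sidak(p, L):
    """Sidak-adjusted p-value for one of L independent tests."""
    return 1.0 - (1.0 - p) ** L

# Toy version of the ranking question: where does a single true positive
# land among L - 1 null (uniform) p-values?
random.seed(1)
L = 1000
nulls = [random.random() for _ in range(L - 1)]
tp = 1e-4                              # p-value of the lone true positive
rank = 1 + sum(p < tp for p in nulls)  # rank of the TP among all L tests
```

Repeating the last three lines many times gives the distribution of the TP's rank, which is what determines how many "best" results must be screened to cover the TPs with high probability.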

Bayesian adjusted inference for selected parameters
Daniel Yekutieli
Tel Aviv University
Benjamini and Yekutieli suggested viewing FDR adjusted inference as marginal inference for selected parameters. I will explain this approach, focusing on its weaknesses. I will then argue that overlooking the problem of selective inference is an abuse of Bayesian methodology, and introduce “valid” Bayesian inference for selected parameters. I will then show that these methods are straightforward generalizations of Bayesian FDR. To make the discussion clearer I will demonstrate use of the Bayesian adjustments on microarray data.


© MCP 2007 | Martin Posch & Franz König | info@mcp-conference.org | Section of Medical Statistics
Medical University of Vienna | Spitalgasse 23 | A-1090 Vienna | ++43 / (0)1 / 40400 / 7488