Clinical Trials
Fred Hutchinson Cancer Research Center and University of Washington School of Medicine, Seattle, USA
Correspondence: Elihu Estey, Fred Hutchinson Cancer Research Center and University of Washington School of Medicine, Seattle, USA. E-mail:eestey{at}u.washington.edu
Key words: MLL, proteins, leukemia.
While statistically significant, I doubt many patients would consider this result medically significant. Hence, the targeted improvement does not reflect clinical reality. The choice of a false positive rate of 0.05 but a false negative rate of 0.20 implies a preference for more protection against a false positive than a false negative result. This is quite sensible when satisfactory treatment exists for the disease in question, and hence, replacement of this standard with a falsely positive new therapy is particularly undesirable. However, because there is no satisfactory treatment for most patients with AML, the medical risk of a false positive is much less. Indeed, the near universal choice of p=0.05 and power=80%, regardless of the disease in question, ignores the reality that diseases vary considerably in curability.
Consequently, phase III AML trials should perhaps seek more clinically meaningful improvements and permit higher p values. Although this formulation would result in loss of power to detect relatively small advances, I question whether leukemia therapeutics advance in such small increments. In particular, it would appear that quantum therapeutic advances are not infrequent, as with all-trans retinoic acid (ATRA) and arsenic trioxide (ATO) for APL, 2-chlorodeoxyadenosine for hairy cell leukemia, high-dose ara-C and likely gemtuzumab ozogamicin for CBF AML, and imatinib for chronic myeloid leukemia (CML). Even if there were value in retaining sufficient power to detect small advances, the added value may not justify the necessary sample sizes, which prevent expeditious completion of trials and simultaneous investigation of a large number of new therapies.
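The trade-off between targeted improvement, p value, and sample size can be made concrete with a rough calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions; the response rates (30%, 40%, 50%) and the function name `patients_per_arm` are hypothetical illustrations and do not come from any trial discussed here.

```python
import math
from statistics import NormalDist


def patients_per_arm(p_control, p_new, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided test of two proportions.

    Normal approximation; an illustrative assumption, not the method of
    any specific trial.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the false positive rate
    z_beta = z.inv_cdf(power)            # critical value for the false negative rate
    variance = p_control * (1 - p_control) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_control) ** 2)


if __name__ == "__main__":
    # Conventional design: small targeted improvement, p = 0.05, power = 80%.
    print(patients_per_arm(0.30, 0.40))
    # Alternative: larger targeted improvement and a higher permitted p value.
    print(patients_per_arm(0.30, 0.50, alpha=0.10))
```

Under these assumed rates, doubling the targeted improvement and relaxing the p value reduces the required sample size several-fold, which is the sense in which smaller trials could allow more therapies to be investigated at once.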
Table 1. Definitions.
The dependence of p value on trial design is such that, in a case in which the final planned analysis yields a p value of 0.051, but in which subsequently obtained data strengthen the evidence in favor of a difference, these data cannot be used since they were not obtained as part of the planned experiment.
The Bayesian approach provides flexibility and, in particular, encourages interim analyses. The approach begins with parameters, such as the probability of CR or, when comparing two treatments, the probability that the relative risk of survival is greater than 1.0. These parameters (denoted here by θ) are random quantities, with probability distributions describing one's uncertainty about them. One begins with a prior distribution, p(θ), that characterizes the uncertainty about θ before observing any data. The second Bayesian quantity is the likelihood, L(data | θ), which describes the probability of observing any specified data given any value of θ; examples of likelihoods are the binomial distribution for binary events and the normal (bell-shaped) distribution for continuous variables. Bayes's theorem multiplies the prior by the likelihood of observing the data given the parameter to arrive at a posterior distribution of θ, which describes uncertainty about θ after observing the data (Figure 1). In contrast to p value-based methods, Bayesian inference is not affected by the experimental design since data only enter inferences through the likelihood function. Consequently, when making decisions or inferences based on accruing data, Bayes's theorem may be repeatedly applied, with the posterior at each stage becoming the prior for the next stage. The probability distributions in this sequence become increasingly informative about θ as the data accumulate. This process, known as Bayesian learning (Figure 1), is especially useful in sequential data monitoring during a clinical trial. The current posterior probability distribution may be used to modify doses, unbalance a randomization in favor of a treatment with relatively superior performance, or terminate a trial early due to either superiority of a treatment or futility. The Bayesian approach's flexibility can be appreciated by contrasting its ability to incorporate data obtained subsequent to trial completion with the p value approach's inability to do this, as noted above.
Figure 1. Bayesian probability distributions using a trial of a new therapy in relapsed acute myeloid leukemia as an example. The values on the horizontal axis are different probabilities of complete remission (CR). The values on the vertical axis represent the weight assigned to each CR probability. Prior to treatment, although the average CR rate is thought to be 20%, some credence is assigned to each probability of CR (prior probability distribution, dotted line). After observing 5/10 CRs (first posterior probability distribution, dashed line), the average CR rate is close to 50% and no credence is given to CR rates less than 10% or greater than 90%, reflecting the impact of the observed data on the prior. Thus, the posteriors become successively more informative as the data accumulate, and shift to reflect the overall average behavior of the data. After observing 7 CRs in the next 30 patients (total 12 CRs in 40 patients), the average CR rate is approximately 30% and no credence is given to a CR rate greater than 60% (second posterior probability distribution, solid line). Computing the proportion of the area under the curve that is to the right of a CR rate of 0.4 gives the current probability that the CR rate is greater than 0.4. This probability can be used to make treatment decisions.
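The sequence of updates described in Figure 1 can be sketched with a beta prior and binomial data, for which Bayes's theorem reduces to adding responses and non-responses to the prior's two parameters. The prior Beta(0.2, 0.8) below is an assumption chosen only because its mean is 20% and it carries little weight; the figure's actual prior is not stated. The function names are likewise illustrative.

```python
import random


def update(dist, responses, patients):
    """Posterior Beta(a, b) after observing `responses` CRs in `patients`."""
    a, b = dist
    return (a + responses, b + patients - responses)


def mean(dist):
    """Mean of a Beta(a, b) distribution."""
    a, b = dist
    return a / (a + b)


def prob_greater(dist, cutoff, draws=100_000, seed=0):
    """Monte Carlo estimate of P(CR rate > cutoff) under Beta(a, b)."""
    rng = random.Random(seed)
    a, b = dist
    return sum(rng.betavariate(a, b) > cutoff for _ in range(draws)) / draws


PRIOR = (0.2, 0.8)               # assumed weak prior with mean CR rate of 20%
first = update(PRIOR, 5, 10)     # 5 CRs in the first 10 patients
second = update(first, 7, 30)    # 7 CRs in the next 30 patients

if __name__ == "__main__":
    for label, dist in [("prior", PRIOR), ("first", first), ("second", second)]:
        print(label, round(mean(dist), 3))
    print("P(CR rate > 0.4):", prob_greater(second, 0.4))
```

Updating in two steps gives the same posterior as a single update with 12 CRs in 40 patients, which is why the posterior at one stage can serve as the prior for the next.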
A trial adaptively randomizing patients over age 50 with untreated AML among idarubicin + ara-C (IA, the standard), troxacitabine + ara-C (TA), and troxacitabine + idarubicin (TI) illustrates the process.12 The first 15 patients were randomized fairly among the three arms. As each patient after the 15th entered the trial, we computed the posterior probability that the CR rate with IA was greater than or equal to 10% better than that with TA or TI. If this probability was less than 0.15, accrual to IA was suspended. If, in contrast, the posterior probability was greater than 0.85 that the CR rate with TA or TI was greater than or equal to 10% worse than with IA, accrual to that arm (TA or TI) was suspended. Depending on results in arms that remained open, a closed arm could re-open. A maximum of 75 patients were to be randomized. The TI arm closed and remained closed after the first 5 patients failed to respond, while the TA arm closed and remained closed after the CR rate was 3/11, at which time the CR rate in the IA arm was 10/18. If the 34 patients who had been entered on the trial when both TA and TI arms were closed had been randomized fairly, 11 patients would have received each of TA, TI, and IA. With adaptive randomization, only 16, rather than 22, patients received the inferior TA or TI arms, probably corresponding with how patients visualize clinical practice.
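The monitoring quantity in this design, the posterior probability that one arm's CR rate is at least 10 percentage points better than another's, can be approximated by sampling from each arm's posterior. The sketch below assumes independent uniform Beta(1, 1) priors for each arm, which is an illustrative choice and not necessarily the prior used in the trial; consequently its output need not reproduce the trial's actual stopping decisions.

```python
import random


def prob_better_by(margin, std, new, prior=(1.0, 1.0), draws=50_000, seed=1):
    """Estimate P(rate_std - rate_new >= margin | data) by Monte Carlo.

    `std` and `new` are (responses, patients) for the two arms.
    The uniform prior is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    a, b = prior
    hits = 0
    for _ in range(draws):
        rate_std = rng.betavariate(a + std[0], b + std[1] - std[0])
        rate_new = rng.betavariate(a + new[0], b + new[1] - new[0])
        hits += (rate_std - rate_new) >= margin
    return hits / draws


if __name__ == "__main__":
    ia = (10, 18)   # CRs / patients on IA when the TA arm closed
    ta = (3, 11)    # CRs / patients on TA at that time
    # Probability that TA is at least 10% worse than IA; the design
    # suspends accrual to TA when this exceeds 0.85.
    print(prob_better_by(0.10, ia, ta))
```

Recomputing this probability as each patient's outcome becomes known is what allows an arm to close, or re-open, as the data accumulate.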
The possibility certainly exists that stopping arms so early might lead to a false negative conclusion. Beginning adaptive randomization only once 15 to 20 patients have been randomized equally among the various treatment arms reduces the problem. At any rate, it is critical to examine how the design performs under various clinical scenarios, that is, to examine its operating characteristics (OC). OC include the probabilities that the design will correctly select a truly superior treatment or incorrectly select a truly inferior treatment, as well as the median number of patients treated on each arm. If clinicians feel the OC are unsatisfactory, the parameters above, such as the criterion probabilities of 0.15 or 0.85 or the number of patients to be fairly randomized, are changed until desirable OC are obtained. Table 2 summarizes 1,000 computer simulations for two scenarios in the IA versus TA versus TI trial. In the first, the true CR rates with TA, IA, and TI are 50%, 40%, and 30%, respectively; hence, the correct conclusion is that TA is superior. As parameterized above, the probability was 80% that the design would reach the correct conclusion, corresponding to a power of 80%. In contrast, if the true CR rates were 30%, 40%, and 30% with TA, IA, and TI, respectively, the probability that the design would correctly select IA as superior was only 10%. Hence, in this case, the design provided much more protection against a false negative than a false positive. The false positive rate could have been decreased by eliminating the requirement that, with high probability, TA or TI be at least 10% worse than IA before either of these arms would close. However, this would have also increased the false negative rate, contrary to the desire of the clinical investigators to maintain a low false negative rate.
Table 2. Operating characteristics for IA vs. TA vs. TI trial.
The Simon 2-stage design (S2S) unrealistically assumes that treated patients have homogeneous prognoses. Certainly, in AML this is unlikely to be the case.16 Hence, reliance on the S2S risks declaring drugs inactive when they might have been found active had a better prognostic group been treated. Conducting separate phase II trials in distinct prognostic groups is time consuming and does not allow information gained in one prognostic group to affect the trial in a second prognostic group.
A method that accounts for treatment-prognostic subgroup interactions has been proposed, specifically using data from the trial to estimate the degree to which the results in the different subgroups can be combined.17 There are two levels of prior probability distributions (hierarchical Bayes). The first is the usual probability of response to a drug in each of, for example, two prognostic groups. The second quantifies prior belief that the response in one prognostic group can inform the probability of response in the other. As usual, these priors are updated as the trial proceeds.
Consider a hypothetical trial of a new drug in relapsed AML. Actual historical data indicated a response rate of 21% in 169 patients. This rate was 11% (118 patients) if initial CR duration was less than one year but 43% (51 patients) if initial CR duration was greater than one year. The goal was to increase the response rate to 31% (absolute increase of 20%) in the worse prognostic group and to 58% (absolute increase of 15%) in the better prognostic group. Since the historical data suggest that 69% of patients will be in the worse group, the overall targeted improvement is [0.20 × 0.69] + [0.15 × 0.31] = 0.18. Thus, an S2S design would set 0.21 as p0 and 0.21 + 0.18 = 0.39 as p1. Setting the nominal false positive and false negative rates at 0.10, the S2S would treat 22 patients in a first stage, and the trial would stop if fewer than five responses occurred. If more than four responses occurred, an additional 21 patients would be enrolled and the drug declared a success if responses were seen in more than 12/43 patients. Thus, to make the proposed design (hereafter, STI because it examines subgroup-treatment interactions) comparable to the S2S, we specify that STI will also take its first look after 22 patients have been evaluated and will also set its false negative rate at 0.10.
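The arithmetic behind this S2S design can be checked with exact binomial probabilities. The sketch below takes the stage sizes and cutoffs from the text (stop if 4 or fewer responses in 22; declare success if more than 12 in 43) and computes the probability of stopping early and of declaring success for a given true response rate. The function names are illustrative, and the calculation assumes a single homogeneous response rate, which is exactly the S2S assumption criticized above.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k responses in n patients at response rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_stage_oc(p, n1=22, r1=4, n=43, r=12):
    """Return (P(stop after stage 1), P(declare success)) at true rate p."""
    n2 = n - n1
    stop_early = sum(binom_pmf(k, n1, p) for k in range(r1 + 1))
    success = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        need = max(r + 1 - x1, 0)   # further responses needed in stage 2
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(need, n2 + 1))
        success += binom_pmf(x1, n1, p) * tail
    return stop_early, success


# Overall targeted improvement, weighted by the historical subgroup proportions.
target = 0.20 * 0.69 + 0.15 * 0.31

if __name__ == "__main__":
    print("targeted improvement:", round(target, 4))
    print("at p0 = 0.21:", two_stage_oc(0.21))   # second value: false positive rate
    print("at p1 = 0.39:", two_stage_oc(0.39))   # second value: power
```

Evaluating the same function at subgroup-specific rates (for example, 0.11 or 0.58) gives a sense of how differently the single pooled rule behaves depending on which patients happen to be enrolled.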
Table 3 compares the operating characteristics of the STI and S2S designs. In Table 3A, the new drug achieves its goal in the better but not the worse group. Because the S2S does not consider interactions between prognostic subgroups and treatment, it has the same probability (0.75) of rejecting the drug in both groups. In contrast, the STI is less likely to reject the drug in the better group and more likely to reject it in the worse group. Furthermore, 52% of the patients treated with STI will be in the better group versus only 29% with S2S. Table 3B illustrates that, in the case in which the desired improvement occurs in the worse but not the better group, STI is more likely to accept and reject the drug in the appropriate subgroups. Although conducting separate Simon 2-stage designed trials in better and worse subgroups corrects this problem, the S2S's inability to allow results in one subgroup to affect the conduct of the trial in the other subgroup continues to result in a smaller proportion of patients belonging to the group where treatment seems more effective relative to historical data.
Table 3. Comparative operating characteristics of STI and Simon 2-stage (S2S) designs.
Such selection designs are often criticized as under-powered phase III trials. Examination of selection designs' operating characteristics indicates that, in a scenario where three drugs have the same true response rate and the fourth provides an absolute 20% improvement, the probability of correctly selecting the fourth drug (that is, the probability that it will not stop early plus the probability that it will have the highest response rate at the end of the trial) is only about 60%. This of course contrasts with the aforementioned 80% power typical of randomized trials involving, for example, a new drug versus a standard. However, the 80% figure is purely nominal, ignoring the process used to select the new drug. Assume that four new therapies were available for comparison with a standard, and that, because pre-clinical rationale cannot substitute for clinical data in the selection process, each was equally likely to be useful clinically. It follows that the probability of correctly selecting the best drug was 25%. This 25% is ignored in the computation of 80% power; if it were not, the power of the trial would be 25% × 80% = 20%.
Thus, the selection design's 60% probability of correct selection should be viewed, not in relation to 80% power, but in relation to the 25% probability of correct selection that would obtain in the absence of the selection design. Recognizing these issues, the Medical Research Council-sponsored trials in AML in the United Kingdom are employing selection designs rather than more conventional phase III designs.
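The comparison can be written out as a short worked calculation. The notation below is introduced only for illustration; it assumes, as in the text, that exactly one of four candidate drugs is truly better and that each is equally likely to be chosen for a conventional two-arm trial.

```latex
\begin{align*}
P(\text{success} \mid \text{conventional phase III})
  &= \underbrace{P(\text{best drug chosen for the trial})}_{1/4 = 0.25}
     \times \underbrace{P(\text{trial detects it})}_{\text{nominal power} = 0.80}
   = 0.20, \\
P(\text{success} \mid \text{selection design})
  &\approx 0.60 .
\end{align*}
```

On this accounting, the selection design is roughly three times as likely to end with the truly superior drug identified, even though its nominal power appears lower.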
Received for publication May 14, 2009. Accepted for publication May 21, 2009.