Options for Summarizing Medical Test Performance in the Absence of a “Gold Standard”

1

Options for Summarizing Medical Test Performance in the Absence of a “Gold Standard”
Prepared for: The Agency for Healthcare Research and Quality (AHRQ)
Training Modules for Medical Test Reviews Methods Guide
www.ahrq.gov

2

Learning Objectives
Recognize settings where the reference standard may be imperfect (i.e., no “gold standard”).
Describe sources of potential bias resulting from the use of an imperfect reference standard when estimating the sensitivity and specificity of a medical test.
Understand the options for analyzing data, their advantages and justification, and potential assumptions.
Trikalinos TA, Balion TA. Options for summarizing medical test performance in the absence of a “gold standard.” In: Methods guide for medical test reviews. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

3

Introduction: Classical Paradigm

4

Introduction: Reference Standard Issues
Sometimes the “true” status is directly observable (e.g., for tests predicting short-term mortality after a procedure).
More commonly, the “true” status is based on a reference standard (test), which is considered to be a “gold standard” if it actually reflects the “true” status.
“Reference standard bias” arises when the reference test does not mirror the truth well.
The further the reference test deviates from the truth, the less accurate the estimate of the index test’s performance.
An “imperfect reference standard” is a reference standard test that misclassifies “true” status at a rate that cannot be ignored.

5

Imperfect Reference Standard Scenario (1 of 2)
The simplest case is an index test and a reference standard that give dichotomous results (e.g., positive or negative for disease).
Both the index and reference tests can err by not reflecting the true status.
The example in the following slide shows true 2-by-2 table probabilities in relation to the eight combinations of index and reference test results.
These eight probabilities (α1, β1, γ1, δ1, α2, β2, γ2, and δ2) need to be estimated from the accuracy data.
The “perfect” reference standard is the “gold standard.”

6

Imperfect Reference Standard Scenario (2 of 2)

7

Imperfect Reference Standard Bias (1 of 2)

8

Imperfect Reference Standard Bias (2 of 2)
“Naïve” estimates underestimate the true values when the index and reference test results are independent among those with and among those without the condition of interest (“conditional independence”).
Figure legend: solid red line = true sensitivity; dashed red line = true specificity; solid black line = naïve sensitivity; dashed black line = naïve specificity.
Abbreviations: Se_index = index test sensitivity; Sp_index = index test specificity; P = disease prevalence.
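
The direction of this bias can be checked numerically. The following is a minimal Python sketch, assuming conditional independence of the two tests; the function name `naive_estimates` and all input values are illustrative assumptions, not figures from the guide.

```python
def naive_estimates(se_index, sp_index, se_ref, sp_ref, prevalence):
    """Expected "naive" sensitivity and specificity of the index test when it is
    scored against an imperfect reference test.

    Assumes the two tests err independently given the true status
    (conditional independence).
    """
    p = prevalence
    # Probability that both tests are positive, summed over true status.
    both_pos = p * se_index * se_ref + (1 - p) * (1 - sp_index) * (1 - sp_ref)
    # Probability that both tests are negative, summed over true status.
    both_neg = p * (1 - se_index) * (1 - se_ref) + (1 - p) * sp_index * sp_ref
    # Marginal probability of a positive reference result.
    ref_pos = p * se_ref + (1 - p) * (1 - sp_ref)
    naive_se = both_pos / ref_pos
    naive_sp = both_neg / (1 - ref_pos)
    return naive_se, naive_sp


# Hypothetical inputs: the index test truly has Se = Sp = 0.90.
for prevalence in (0.1, 0.3, 0.5):
    se, sp = naive_estimates(0.90, 0.90, se_ref=0.80, sp_ref=0.95,
                             prevalence=prevalence)
    print(f"P={prevalence:.1f}  naive Se={se:.3f}  naive Sp={sp:.3f}")
```

With these inputs both naïve values fall below the true 0.90 at every prevalence, and setting the reference sensitivity and specificity to 1.0 returns the true values.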

9

Reference Standard Validity
Only rarely are we absolutely sure that the reference standard is a perfect reflection of the truth.
Often, we are comfortable with overlooking small or moderate misclassifications by the reference standard.
Hard-and-fast rules for judging the (in)adequacy of the reference standard do not exist.
Consult content experts on a case-by-case basis to make judgments.
There are three settings in which one might question the validity of the reference standard:
1. The reference method yields different measurements over time or across settings.
2. The condition of interest is variably defined.
3. The new method is an improved version of a usually applied test.

10

Imperfect Reference Standard: Setting 1
Situation: The reference method yields different measurements over time or across settings.
Example: Diagnosis of obstructive sleep apnea typically requires a high Apnea-Hypopnea Index (AHI; an objective measurement) and the presence of suggestive symptoms and signs.
Problem: There is large night-to-night variability in measured AHI and substantial between-rater and between-laboratory variability.

11

Imperfect Reference Standard: Setting 2
Situation: The condition of interest is variably defined.
Example: The disease, such as psoriatic arthritis, is complex.
Problem: There is no single symptom, sign, or measurement that suffices to make a diagnosis of the disease with certainty. Instead, a set of diagnostic criteria (symptoms, signs, imaging results, and laboratory measures) is used to identify the disease, and these criteria will unavoidably be applied differently across studies.

12

Imperfect Reference Standard: Setting 3
Situation: The new method is an improved version of a usually applied test.
Example: Measurement of parathyroid hormone (PTH).
Problem: Older measurement methodologies are being replaced by newer, more specific ones. Measurements with the new and old methodologies do not agree very well. It is incorrect to use the older method as the reference standard.

13

Analytic Options for a Systematic Review
Analytic options 1 and 2 below are preferred when possible to summarize data from two fallible tests; option 3 is also suitable.
1. Forgo the classical paradigm, which focuses on test accuracy; assess the ability of the index test to predict patient outcomes (using the index test as a predictive instrument).
2. Forgo the classical paradigm; assess agreement of the index and reference test results, that is, treat index and reference tests as two alternative measurement methods.
3. Using the classical paradigm, calculate “naïve” estimates of the index test’s sensitivity and specificity, but qualify study findings to avoid misinterpretation.
4. Mathematically adjust the “naïve” estimates of the index test’s sensitivity and specificity to account for the imperfect reference standard.

14

Analysis Option 1: Focus on Prediction of Patient Outcomes
Forgo the classical paradigm, which compares the index test to a reference standard (test “accuracy”).
Accuracy estimates are neither informative nor readily interpretable when the reference standard is “imperfect.”
Instead, assess the ability of the index test to predict patient outcomes such as history, future clinical events, and response to therapy.
This option follows a well-known paradigm in systematic reviews for evaluating prognostic tests (more information is available in Module 12).

15

Analysis Option 2: Focus on the Agreement of Index and Reference Tests (1 of 2)
Forgo the classical paradigm (test “accuracy”).
Instead, assess agreement (concordance) of the index and reference test results.
Simply treat the index and reference tests as two alternative measurement methods.
How to do this depends on whether the results are categorical or continuous.
For categorical test results:
Cohen’s kappa statistic is a measure of categorical agreement that accounts for agreement by chance.
Meta-analyses of kappa statistics are not common in the medical literature; they will need to be explained and interpreted in detail.
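
As an illustration of how chance agreement enters the kappa statistic, here is a small Python sketch that computes Cohen's kappa from a square cross-classification of index and reference results. The function name `cohens_kappa` and the example counts are hypothetical and chosen only for illustration.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] is the number of subjects placed in category i by the index
    test and in category j by the reference test.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of subjects on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical 2-by-2 counts: rows = index test (+, -), columns = reference (+, -).
example = [[40, 10],
           [5, 45]]
print(f"kappa = {cohens_kappa(example):.2f}")
```

In this made-up table the two tests agree on 85 percent of subjects, half of which would be expected by chance alone, so kappa is well below the raw agreement.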

16

Analysis Option 2: Focus on the Agreement of Index and Reference Tests (2 of 2)
When there are continuous test results and individual data points are available, the researcher can:
Directly compare measurements between tests.
Pool data from all available studies and:
(1) Perform a regression of one test on the other that accounts for measurement error.
(2) Conduct a Bland-Altman analysis (difference vs. the average of the two test results).
When there are continuous test results but individual data points are not available, the researcher can:
Summarize study-level information from (1) or (2) above.
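
A Bland-Altman analysis reduces to a few summary quantities: the mean difference between paired measurements (the bias) and the 95-percent limits of agreement. The sketch below assumes the common convention of bias ± 1.96 sample standard deviations of the differences; the function name `bland_altman` and the paired values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(index_values, reference_values):
    """Return the bias, the 95% limits of agreement, and the points to plot.

    Each plotted point is (average of the two measurements, their difference).
    """
    if len(index_values) != len(reference_values):
        raise ValueError("measurements must be paired")
    diffs = [x - y for x, y in zip(index_values, reference_values)]
    averages = [(x + y) / 2 for x, y in zip(index_values, reference_values)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return {
        "bias": bias,
        "lower_limit": bias - 1.96 * sd,
        "upper_limit": bias + 1.96 * sd,
        "points": list(zip(averages, diffs)),
    }


# Hypothetical paired measurements from two methods on five subjects.
index_test = [12, 18, 33, 7, 25]
reference_test = [10, 20, 30, 8, 30]
result = bland_altman(index_test, reference_test)
print(f"bias = {result['bias']:.2f}, limits of agreement = "
      f"({result['lower_limit']:.2f}, {result['upper_limit']:.2f})")
```

Wide limits of agreement relative to clinically meaningful differences would suggest that the two methods are not interchangeable.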

17

Analysis Option 3: Calculate “Naïve” Estimates and Discuss Bias
Calculate “naïve” estimates of the index test sensitivity (Se) and specificity (Sp), ignoring imperfection of the reference standard but making qualitative judgments on the direction of bias of these “naïve” estimates.
If the index and reference tests are independent within strata of disease (conditional independence):
Naïve estimates of index test Se and Sp are biased downward (underestimated).
If the index and reference tests are correlated within strata of disease, naïve estimates of Se and Sp can be:
Overestimates when the tests agree more than by chance.
Underestimates when the tests disagree more than by chance.
Problem: The researcher cannot assume conditional independence without justification; external data are needed.
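
One way to see how conditional dependence changes the direction of bias is to add a covariance term to the joint probabilities within each disease stratum. The Python sketch below uses this covariance parameterization as an assumption of the example; the function name, the parameter names `cov_pos` and `cov_neg`, and all numeric values are hypothetical.

```python
def naive_estimates_dependent(se_index, sp_index, se_ref, sp_ref, prevalence,
                              cov_pos=0.0, cov_neg=0.0):
    """Expected naive Se and Sp of the index test against an imperfect reference.

    cov_pos and cov_neg are the covariances between the two tests' results
    among people with and without the condition. Zero covariance corresponds
    to conditional independence; positive values mean the tests tend to make
    the same errors.
    """
    p = prevalence
    both_pos = (p * (se_index * se_ref + cov_pos)
                + (1 - p) * ((1 - sp_index) * (1 - sp_ref) + cov_neg))
    both_neg = (p * ((1 - se_index) * (1 - se_ref) + cov_pos)
                + (1 - p) * (sp_index * sp_ref + cov_neg))
    ref_pos = p * se_ref + (1 - p) * (1 - sp_ref)
    return both_pos / ref_pos, both_neg / (1 - ref_pos)


# Hypothetical scenario: true index Se = 0.80 and Sp = 0.90.
true_values = dict(se_index=0.80, sp_index=0.90, se_ref=0.80, sp_ref=0.90,
                   prevalence=0.30)
scenarios = {
    "independent errors": dict(cov_pos=0.0, cov_neg=0.0),
    "shared errors": dict(cov_pos=0.12, cov_neg=0.07),
    "opposing errors": dict(cov_pos=-0.03, cov_neg=-0.005),
}
for label, cov in scenarios.items():
    se, sp = naive_estimates_dependent(**true_values, **cov)
    print(f"{label:>18}: naive Se={se:.3f}, naive Sp={sp:.3f}")
```

In this constructed scenario the naïve values sit below the truth under independence, rise above it when the two tests share errors, and drop further when their errors oppose each other.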

18

Analysis Option 3: Example
The prostate-specific antigen (PSA) test is used to detect prostate cancer.
Numerous methods have been developed to test PSA levels.
These tests are prone to false-negative misclassification:
PSA levels are not elevated in up to 15 percent of prostate cancer cases.
Obesity can reduce serum PSA.
Obesity will likely affect all PSA-detection methods, old and new (“conditional dependence”).
Conditional dependence of PSA tests results in overestimation of the accuracy of a new (index) test.
When the index test is compared to a non-PSA reference (e.g., a prostate biopsy), there is no conditional dependence; misclassification results in underestimation.

19

Analysis Option 4: Mathematically Adjust “Naïve” Estimates
Mathematically adjust (correct) the “naïve” estimates of the index test sensitivity and specificity to account for the imperfect reference standard.
Data from 2 × 2 tables are not enough; additional information is needed from the literature.
The task is easiest if conditional independence can be assumed and either:
The sensitivity and specificity of the imperfect reference test are known from other studies.
The specificities of both the index test and the imperfect reference standard are known from other studies, but the sensitivities are unknown.
Bayesian inference can incorporate information from other studies as prior distributions rather than as fixed values; it yields estimates of sensitivity, specificity, and disease prevalence.
Alternative sets of assumptions are possible.
Problem: Model mis-specification can result in biased estimates.
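
For the simplest case above (conditional independence, with the reference test's sensitivity and specificity taken as known), the adjustment can be written in closed form by solving the equations for the expected 2 × 2 cell probabilities. The Python sketch below shows that algebra; it is not presented as the guide's own procedure. The function name and the cell counts are hypothetical, and the counts were constructed to be consistent with a true index sensitivity of 0.90 and specificity of 0.85.

```python
def adjust_for_imperfect_reference(a, b, c, d, se_ref, sp_ref):
    """Correct naive index-test estimates for a reference test with known error.

    Cell counts: a = index+/reference+, b = index+/reference-,
                 c = index-/reference+, d = index-/reference-.
    Assumes conditional independence and that se_ref and sp_ref are known.
    Returns (corrected Se, corrected Sp, corrected prevalence).
    """
    n = a + b + c + d
    if se_ref + sp_ref <= 1:
        raise ValueError("reference test must be informative (Se + Sp > 1)")
    # The denominators below are zero when the corrected prevalence is 0 or 1.
    prevalence = ((a + c) / n + sp_ref - 1) / (se_ref + sp_ref - 1)
    se_index = ((a + b) * sp_ref - b) / (n * (sp_ref - 1) + (a + c))
    sp_index = ((c + d) * se_ref - c) / (n * se_ref - (a + c))
    return se_index, sp_index, prevalence


# Hypothetical counts; reference Se = 0.80 and Sp = 0.95 are assumed known.
a, b, c, d = 1770, 1230, 430, 4570
naive_se = a / (a + c)
naive_sp = d / (b + d)
se, sp, prev = adjust_for_imperfect_reference(a, b, c, d, se_ref=0.80, sp_ref=0.95)
print(f"naive:     Se={naive_se:.3f}, Sp={naive_sp:.3f}")
print(f"corrected: Se={se:.3f}, Sp={sp:.3f}, prevalence={prev:.3f}")
```

The corrected values depend entirely on the assumed reference accuracy and on conditional independence, so a sensitivity analysis over plausible values of `se_ref` and `sp_ref` would be a natural companion to any such adjustment.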

20

Example: Performing a Systematic Review on Obstructive Sleep Apnea
Obstructive sleep apnea (OSA) is characterized by sleep disturbances secondary to upper airway obstruction.
OSA has a prevalence of 2 to 4 percent in middle-aged adults.
It is associated with daytime somnolence, cardiovascular morbidity, diabetes, and other adverse outcomes.
Treatment includes continuous positive airway pressure.
A systematic review on the diagnosis of OSA in the home setting used:
Portable monitors as the index diagnostic test.
Facility-based polysomnography as the reference standard.
The reviewers first attempted analysis option 3, then moved on to analysis option 2.
Trikalinos TA, Ip S, Raman G, et al. Home diagnosis of obstructive sleep apnea-hypopnea syndrome. Technology Assessment. Available at www.cms.gov/Medicare/Coverage/DeterminationProcess/downloads/id48TA.pdf.

21

Systematic Review Example: Choice of Reference Standard and Cutoff
There is no “perfect” or accepted reference standard for obstructive sleep apnea (OSA).
A diagnosis of OSA is based on suggestive signs and symptoms and objective assessment of breathing patterns during sleep with facility-based polysomnography (PSG).
PSG quantifies the Apnea-Hypopnea Index (AHI). Portable monitors can also measure AHI.
A high AHI (usually ≥15 events per hour of sleep) is suggestive of OSA; alternative cutoffs range from 5 to 40 events per hour.
The main analysis in the systematic review used a cutoff of AHI ≥15, but cutoffs of 10 and 20 were also analyzed (there were too few data to analyze other cutoffs).

22

Systematic Review Example: Analysis Option 3 (Naïve Estimates)
The reviewers calculated “naïve” estimates of the sensitivity (Se) and specificity (Sp) of the Apnea-Hypopnea Index by comparing portable monitors with polysomnography and qualified the results.
“Naïve” estimates of sensitivity and specificity were displayed in the receiver operating characteristic (ROC) space.
The estimates suggested high Se and Sp.
However, there was considerable variability in the measurements.
It was not possible to deduce whether the “naïve” estimates overestimate or underestimate the “true” Se and Sp.

23

Systematic Review Example: Analysis Option 2 (Pooled Data Analysis)
The reviewers also used Bland-Altman analysis (continuous data with individual data points available) to describe concordance between the Apnea-Hypopnea Index (AHI) measured by portable monitors (“index” test) and by polysomnography (“reference” test), asking whether the two tests are interchangeable.
They found better agreement for lower AHI levels.
Figure legend: dashed line = line of perfect agreement; broad limits = suboptimal agreement.

24

Systematic Review Example: Analysis Option 2 (Study-Specific Results)
The reviewers summarized Bland-Altman plots across studies.
The mean difference in the two measurements of the Apnea-Hypopnea Index (mean bias) and the 95-percent limits of agreement are shown for each study.
The 95-percent limits of agreement are very wide in most studies, suggesting great variability in the measurements with the two methods.

25

Systematic Review Example: Conclusions and a Recommendation
Measurements of the Apnea-Hypopnea Index (AHI) with the two methods generally agree on which patients have 15 or fewer events per hour of sleep (low AHI).
The methods disagree on the exact measurement among people who have higher AHIs on average.
The reviewers identified a gap in the literature.
The reviewers recommended undertaking studies that perform clinical validation of portable monitors, i.e., their ability to predict patients’ history, risk propensity, or clinical profile (analysis option 1).

26

Overall Recommendations
When multiple reference standard tests, or multiple cutoffs for the same reference test, are available:
Justify the choice of test and/or cutoff, or
Consider analyzing multiple options.
Decide on the most appropriate analysis options to synthesize test performance.
The four analysis options presented in this module are largely complementary approaches and are not mutually exclusive.
Analysis options 1, 2, and 3 are recommended.
Analysis option 4 requires expert statistical help.
There are no empirical data on the merits and pitfalls of the mathematical adjustments in option 4 for an imperfect reference standard.


Last Updated: 8th March 2018
