Criteria for Assessing Internal Validity of Individual Studies
The Methods Work Group for the US Preventive Services Task Force (USPSTF) developed a set of criteria by which the internal validity of individual studies could be evaluated. At its September 1999 meeting, the USPSTF accepted the criteria relating to internal validity and the associated definitions of quality categories.
This appendix describes the criteria relating to internal validity and the procedures that topic teams follow in making these judgments for all updates and new assessments.
All topic teams use initial "filters" to select studies for review that deal most directly with the question at issue and that are applicable to the population at issue. Thus, studies of any design that use outdated technology or that use technology that is not feasible for primary care practice may be filtered out before the abstraction stage, depending on the topic and the decisions of the topic team. The teams justify such exclusion decisions if there could be reasonable disagreement about this step. The criteria below are meant for those studies that pass this initial filter.
Presented below is a set of minimal criteria for each study design, followed by a general definition of three categories ("good," "fair," and "poor") based on those criteria. These specifications are intended as general guidelines rather than rigid rules; individual exceptions can be made when they are explicitly explained and justified. In general, a "good" study is one that meets all criteria well. A "fair" study is one that does not meet (or does not clearly meet) at least one criterion but has no known "fatal flaw." A "poor" study has at least one fatal flaw.
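As a rough illustration of how the three categories relate, the general rule above can be sketched in Python. The function name `rate_study`, the criterion names, and the use of `None` for "unclear" are assumptions made for this sketch and are not part of the USPSTF procedures. Because the criteria are guidelines rather than rigid rules, an actual rating would still rest on the topic team's judgment.

```python
from typing import Dict, Optional


def rate_study(criteria: Dict[str, Optional[bool]], has_fatal_flaw: bool) -> str:
    """Return "good", "fair", or "poor" under the general definitions.

    criteria maps a (hypothetical) criterion name to True (met well),
    False (not met), or None (unclear whether it is met).
    """
    if has_fatal_flaw:
        return "poor"  # at least one fatal flaw
    if all(met is True for met in criteria.values()):
        return "good"  # every criterion is met well
    return "fair"  # at least one criterion unmet or unclear, no fatal flaw
```

For example, `rate_study({"comprehensive search": True, "standard appraisal": None}, has_fatal_flaw=False)` returns `"fair"`, because one criterion is unclear and no fatal flaw is present.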
Systematic Reviews
- Comprehensiveness of sources considered/search strategy used.
- Standard appraisal of included studies.
- Validity of conclusions.
- Recency and relevance are especially important for systematic reviews.
Definition of ratings based on above criteria:
Good: Recent, relevant review with comprehensive sources and search strategies; explicit and relevant selection criteria; standard appraisal of included studies; and valid conclusions.
Fair: Recent, relevant review that is not clearly biased but lacks comprehensive sources and search strategies.
Poor: Outdated, irrelevant, or biased review without systematic search for studies, explicit selection criteria, or standard appraisal of studies.
Case-Control Studies
- Accurate ascertainment of cases.
- Nonbiased selection of cases/controls with exclusion criteria applied equally to both.
- Response rate.
- Diagnostic testing procedures applied equally to each group.
- Measurement of exposure accurate and applied equally to each group.
- Appropriate attention to potential confounding variables.
Definition of ratings based on criteria above:
Good: Appropriate ascertainment of cases and nonbiased selection of case and control participants; exclusion criteria applied equally to cases and controls; response rate equal to or greater than 80 percent; diagnostic procedures and measurements accurate and applied equally to cases and controls; and appropriate attention to confounding variables.
Fair: Recent, relevant, without major apparent selection or diagnostic work-up bias but with response rate less than 80 percent or attention to some but not all important confounding variables.
Poor: Major selection or diagnostic work-up biases, response rates less than 50 percent, or inattention to confounding variables.
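The response-rate thresholds in the case-control ratings above can be sketched as a small helper. This is only an illustration: the name `response_rate_band` is hypothetical, and treating rates from 50 up to (but not including) 80 percent as "fair" is an interpretation of the text, which gives "less than 80 percent" for fair and "less than 50 percent" for poor. The result reflects the response-rate criterion alone; the overall rating also depends on the other criteria.

```python
def response_rate_band(response_rate_percent: float) -> str:
    """Rating band suggested by the response rate alone (hypothetical helper)."""
    if not 0 <= response_rate_percent <= 100:
        raise ValueError("response rate must be between 0 and 100 percent")
    if response_rate_percent >= 80:
        return "good"  # equal to or greater than 80 percent
    if response_rate_percent < 50:
        return "poor"  # less than 50 percent
    return "fair"  # assumed band: at least 50 but less than 80 percent
```

A study with a 65 percent response rate would fall in the "fair" band on this criterion, regardless of how well it meets the others.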
Randomized Controlled Trials and Cohort Studies
- Initial assembly of comparable groups:
  - For RCTs: adequate randomization, including allocation concealment, and whether potential confounders were distributed equally among groups.
  - For cohort studies: consideration of potential confounders with either restriction or measurement for adjustment in the analysis; consideration of inception cohorts.
- Maintenance of comparable groups (includes attrition, cross-overs, adherence, contamination).
- Important differential loss to follow-up or overall high loss to follow-up.
- Measurements: equal, reliable, and valid (includes masking of outcome assessment).
- Clear definition of interventions.
- All important outcomes considered.
- Analysis: adjustment for potential confounders for cohort studies, or intention-to-treat analysis for RCTs.
Definition of ratings based on above criteria:
Good: Meets all criteria: Comparable groups are assembled initially and maintained throughout the study (follow-up at least 80 percent); reliable and valid measurement instruments are used and applied equally to the groups; interventions are spelled out clearly; all important outcomes are considered; and confounders receive appropriate attention in the analysis. In addition, for RCTs, intention-to-treat analysis is used.
Fair: Studies will be graded "fair" if any or all of the following problems occur, without the fatal flaws noted in the "poor" category below: Generally comparable groups are assembled initially but some question remains whether some (although not major) differences occurred with follow-up; measurement instruments are acceptable (although not the best) and generally applied equally; some but not all important outcomes are considered; and some but not all potential confounders are accounted for. Intention-to-treat analysis is done for RCTs.
Poor: Studies will be graded "poor" if any of the following fatal flaws exists: Groups assembled initially are not close to being comparable or are not maintained throughout the study; unreliable or invalid measurement instruments are used or are not applied equally among groups (including not masking outcome assessment); and key confounders are given little or no attention. For RCTs, intention-to-treat analysis is lacking.
Diagnostic Accuracy Studies
- Screening test relevant, available for primary care, adequately described.
- Study uses a credible reference standard, performed regardless of test results.
- Reference standard interpreted independently of screening test.
- Handles indeterminate results in a reasonable manner.
- Spectrum of patients included in study.
- Sample size.
- Administration of reliable screening test.
Definition of ratings based on above criteria:
Good: Evaluates relevant available screening test; uses a credible reference standard; interprets reference standard independently of screening test; assesses reliability of test; has few indeterminate results or handles them in a reasonable manner; includes a large number (more than 100) of broad-spectrum patients with and without disease.
Fair: Evaluates relevant available screening test; uses reasonable although not best standard; interprets reference standard independently of screening test; has moderate sample size (50 to 100 subjects) and a "medium" spectrum of patients.
Poor: Has a fatal flaw such as the following: uses inappropriate reference standard; screening test improperly administered; biased ascertainment of reference standard; very small sample size or very narrowly selected spectrum of patients.
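The sample-size figures in the diagnostic accuracy ratings can be expressed as a short sketch. The helper name `sample_size_band` is hypothetical. The text states numeric ranges only for "good" (more than 100) and "fair" (50 to 100); it calls a "very small" sample a fatal flaw without giving a cutoff, so the sketch returns `None` below 50 rather than assuming one.

```python
from typing import Optional


def sample_size_band(n_subjects: int) -> Optional[str]:
    """Rating band suggested by sample size alone (hypothetical helper)."""
    if n_subjects < 0:
        raise ValueError("number of subjects cannot be negative")
    if n_subjects > 100:
        return "good"  # large number: more than 100 subjects
    if n_subjects >= 50:
        return "fair"  # moderate sample size: 50 to 100 subjects
    return None  # no numeric cutoff is given for "very small"; left to judgment
```

Sample size is only one of the listed criteria, so a study with more than 100 subjects could still be rated "fair" or "poor" on other grounds, such as an inappropriate reference standard.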