Appendix VI. Criteria for Assessing Internal Validity of Individual Studies

The USPSTF Methods Workgroup developed a set of criteria by which the internal validity of individual studies could be evaluated. The USPSTF accepted the criteria, and the associated definitions of quality categories, at its September 1999 meeting.

This appendix describes the criteria relating to internal validity and the procedures that topic teams follow for all updates and new assessments in making these judgments.

All topic teams use initial exclusion criteria to select studies for review that deal most directly with the question at issue and that are applicable to the population at issue. Thus, studies of any design that use outdated technology or technology that is not feasible for primary care practice may be filtered out before the abstraction stage, depending on the topic and the decisions of the topic team. The team justifies such exclusion decisions if there could be reasonable disagreement about this step. These criteria are meant for those studies that pass this initial filter.

Presented below is a set of minimal criteria for each study design and a general definition of three categories ("good," "fair," and "poor") based on those criteria. These specifications are not meant to be rigid rules but rather are intended to be general guidelines. Recognizing that the methodology of systematic reviews is continuously evolving, the USPSTF allows the Evidence-based Practice Center (EPC) to use newer methods of assessing the quality of individual studies.

In general, a "good" study is one that meets all criteria well. A "fair" study is one that does not meet (or it is not clear that it meets) at least one criterion but has no known "fatal flaw." "Poor" studies have at least one fatal flaw.
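The general rule above can be read as a small decision procedure. The following is a minimal sketch in Python, not part of the USPSTF criteria; the names `Status` and `overall_rating` are hypothetical, and the sketch assumes the reviewer has already judged each criterion and decided whether any shortcoming amounts to a fatal flaw. Because the criteria are guidelines rather than rigid rules, such a function could at most support a reviewer's judgment.

```python
from enum import Enum


class Status(Enum):
    """Reviewer's judgment on one criterion (hypothetical labels)."""
    MET = "met"
    NOT_MET = "not met"
    UNCLEAR = "unclear"


def overall_rating(criteria, fatal_flaw=False):
    """Map per-criterion judgments to "good", "fair", or "poor".

    criteria: dict of criterion name -> Status
    fatal_flaw: True if the reviewer identified at least one fatal flaw
    """
    if fatal_flaw:
        # At least one fatal flaw makes the study "poor".
        return "poor"
    if all(status is Status.MET for status in criteria.values()):
        # Every criterion is met.
        return "good"
    # At least one criterion is unmet or unclear, with no known fatal flaw.
    return "fair"
```

For example, a study with one criterion marked `Status.UNCLEAR` and no fatal flaw would be rated "fair" under this sketch.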

Systematic Reviews


  • Comprehensiveness of sources considered/search strategy used
  • Standard appraisal of included studies
  • Validity of conclusions
  • Recency and relevance (especially important for systematic reviews)

Definition of ratings based on above criteria:

Good: Recent, relevant review with comprehensive sources and search strategies; explicit and relevant selection criteria; standard appraisal of included studies; and valid conclusions.

Fair: Recent, relevant review that is not clearly biased but lacks comprehensive sources and search strategies.

Poor: Outdated, irrelevant, or biased review without systematic search for studies, explicit selection criteria, or standard appraisal of studies.

Case-Control Studies


  • Accurate ascertainment of cases
  • Nonbiased selection of cases/controls, with exclusion criteria applied equally to both
  • Response rate
  • Diagnostic testing procedures applied equally to each group
  • Measurement of exposure accurate and applied equally to each group
  • Appropriate attention to potential confounding variables

Definition of ratings based on above criteria:

Good: Appropriate ascertainment of cases and nonbiased selection of case and control participants; exclusion criteria applied equally to cases and controls; response rate equal to or greater than 80%; accurate diagnostic procedures and measurements applied equally to cases and controls; and appropriate attention to confounding variables.

Fair: Recent, relevant, and without major apparent selection or diagnostic workup bias, but response rate less than 80% or attention to some but not all important confounding variables.

Poor: Major selection or diagnostic workup bias, response rate less than 50%, or inattention to confounding variables.
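The response-rate thresholds in these definitions (80% and 50%) lend themselves to a short illustration. The sketch below is a simplification under stated assumptions: the function name `rate_case_control` and the labels "all", "some", and "none" are hypothetical, and the remaining criteria (ascertainment of cases, equal application of exclusion criteria and measurements, recency and relevance) are assumed to be satisfied.

```python
def rate_case_control(response_rate, major_bias, confounders_addressed):
    """Simplified reading of the case-control rating definitions.

    response_rate: fraction between 0 and 1
    major_bias: True if major selection or diagnostic workup bias is present
    confounders_addressed: "all", "some", or "none" (hypothetical labels)
    """
    if not 0.0 <= response_rate <= 1.0:
        raise ValueError("response_rate must be between 0 and 1")
    if confounders_addressed not in ("all", "some", "none"):
        raise ValueError("confounders_addressed must be 'all', 'some', or 'none'")

    # Poor: major bias, response rate below 50%, or inattention to confounders.
    if major_bias or response_rate < 0.50 or confounders_addressed == "none":
        return "poor"
    # Good: response rate of at least 80% and appropriate attention to confounders.
    if response_rate >= 0.80 and confounders_addressed == "all":
        return "good"
    # Fair: response rate below 80%, or only some confounders addressed.
    return "fair"
```

Under this reading, a response rate from 50% up to but not including 80% places an otherwise sound study in the "fair" category.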

RCTs and Cohort Studies


  • Initial assembly of comparable groups:
    • For RCTs: adequate randomization, including allocation concealment and whether potential confounders were distributed equally among groups
    • For cohort studies: consideration of potential confounders, with either restriction or measurement for adjustment in the analysis; consideration of inception cohorts
  • Maintenance of comparable groups (includes attrition, cross-overs, adherence, contamination)
  • Important differential loss to followup or overall high loss to followup
  • Measurements: equal, reliable, and valid (includes masking of outcome assessment)
  • Clear definition of interventions
  • All important outcomes considered
  • Analysis: adjustment for potential confounders for cohort studies or intention-to-treat analysis for RCTs

Definition of ratings based on above criteria:

Good: Meets all criteria: comparable groups are assembled initially and maintained throughout the study (followup greater than or equal to 80%); reliable and valid measurement instruments are used and applied equally to all groups; interventions are spelled out clearly; all important outcomes are considered; and appropriate attention is given to confounders in analysis. In addition, intention-to-treat analysis is used for RCTs.

Fair: Studies are graded "fair" if any or all of the following problems occur, without the fatal flaws noted in the "poor" category below: generally comparable groups are assembled initially, but some question remains whether some (although not major) differences occurred with followup; measurement instruments are acceptable (although not the best) and generally applied equally; some but not all important outcomes are considered; and some but not all potential confounders are accounted for. Intention-to-treat analysis is used for RCTs.

Poor: Studies are graded "poor" if any of the following fatal flaws exists: groups assembled initially are not close to being comparable or are not maintained throughout the study; unreliable or invalid measurement instruments are used or not applied equally among groups (including not masking outcome assessment); or key confounders are given little or no attention. Intention-to-treat analysis is lacking for RCTs.

Diagnostic Accuracy Studies


  • Screening test relevant, available for primary care, and adequately described
  • Credible reference standard, performed regardless of test results
  • Reference standard interpreted independently of screening test
  • Indeterminate results handled in a reasonable manner
  • Spectrum of patients included in study
  • Sample size
  • Reliable screening test

Definition of ratings based on above criteria:

Good: Evaluates relevant available screening test; uses a credible reference standard; interprets reference standard independently of screening test; assesses reliability of test; has few indeterminate results or handles them in a reasonable manner; includes large number (greater than 100) of broad-spectrum patients with and without disease.

Fair: Evaluates relevant available screening test; uses reasonable although not best reference standard; interprets reference standard independently of screening test; has moderate sample size (50 to 100 subjects) and a "medium" spectrum of patients.

Poor: Has a fatal flaw, such as: uses inappropriate reference standard; improperly administers screening test; has biased ascertainment of reference standard; or has very small sample size or very narrowly selected spectrum of patients.
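Of the diagnostic accuracy criteria, only sample size is given numeric ranges. The sketch below illustrates those ranges alone; the function name `sample_size_band` is hypothetical, and because the definitions do not say how small a "very small" sample is, the sketch returns `None` below 50 subjects rather than guess a cutoff. Sample size is one criterion among several and does not determine the overall rating by itself.

```python
def sample_size_band(n_subjects):
    """Return the rating band whose stated sample-size range contains n_subjects.

    Returns "good" for more than 100 subjects, "fair" for 50 to 100 subjects,
    and None below 50, where the definitions give no numeric threshold.
    """
    if n_subjects < 0:
        raise ValueError("n_subjects must be non-negative")
    if n_subjects > 100:
        return "good"
    if n_subjects >= 50:
        return "fair"
    # "Very small" is not quantified in the definitions; leave it to the reviewer.
    return None
```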

Current as of: July 2017

Internet Citation: Appendix VI. Criteria for Assessing Internal Validity of Individual Studies. U.S. Preventive Services Task Force. July 2017.

