Section 4: Evidence Report Development
4.1 Literature Retrieval and Data Abstraction for Topic Reviews
After literature searches are conducted, the team of evidence reviewers uses a set of a priori inclusion/exclusion criteria, as appropriate to each key question, to define whether identified literature is relevant to the review. These criteria are applied twice: first at the title (or title and abstract) review stage, and a second time at the article review stage. This two-stage process is designed to be efficient, to minimize errors, and to be transparent and reproducible.
Titles and abstracts are reviewed by broadly applying the inclusion criteria for the review. When in doubt at the title/abstract review phase as to whether an article might meet the inclusion criteria, reviewers should err on the side of inclusion so that the article is retrieved and can be reviewed at the article stage. All citations are coded with at least an excluded or included code, which is managed in a database and used to guide the further literature review steps. This database is the source of the final tables documenting the review process.
Full-text articles are retrieved for all citations included at the title/abstract stage and are reviewed by a member of the review team, using inclusion/exclusion criteria for relevance and for quality. Included articles receive codes to indicate the key question(s) for which they meet criteria, and excluded articles are coded with a reason for exclusion. The recorded reason may be either the primary reason or the first reason encountered in reviewing the article; thus, the distribution of reasons for exclusion does not necessarily represent the state of the excluded literature. Similarly, not all of the reasons for exclusion of an individual article may be listed in the final exclusion table. Before they are abstracted, articles are reviewed to ensure that they meet minimal design-specific U.S. Preventive Services Task Force (USPSTF) quality criteria.
The abstract and article review process generally involves a team of reviewers and is conducted using established research methods in order to minimize reviewer drift as well as inter-rater review and coding differences.
4.1.1 Procedures for Abstract and Article Review
Abstracts undergo "dual-review" in that either all abstracts are reviewed separately and reconciled, or at least abstracts excluded by the primary reviewer are re-reviewed by another reviewer to ensure that all appropriate studies are included. Any studies excluded by the first reviewer but included by the second reviewer are included in the next phase.
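The inclusion rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the use of `None` for "not re-reviewed" are assumptions made for the example, not part of any USPSTF tool, and the sketch covers only the variant in which abstracts excluded by the primary reviewer are re-reviewed.

```python
from typing import Optional


def advances_to_article_review(first_includes: bool,
                               second_includes: Optional[bool] = None) -> bool:
    """Decide whether an abstract moves on to full-text (article) review.

    Hypothetical helper: an abstract advances if the first reviewer included
    it, or if the first reviewer excluded it and the second reviewer
    included it on re-review.
    """
    if first_includes:
        return True
    if second_includes is None:
        # The procedure requires excluded abstracts to be re-reviewed.
        raise ValueError("excluded abstract has not been re-reviewed")
    return second_includes


print(advances_to_article_review(True))          # included by first reviewer
print(advances_to_article_review(False, True))   # rescued by second reviewer
print(advances_to_article_review(False, False))  # excluded by both
```

The rule is deliberately asymmetric: a single "include" is enough to advance an abstract, which is consistent with erring on the side of inclusion at the abstract stage.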
When the volume of abstracts is very high because searches within a particular literature (e.g., alcohol misuse) are necessarily non-specific, reviewers may use a sampling scheme for quality assurance, as follows. For each key question, all of the searches (ML, CCRCT, PsycINFO) are considered as one search. Reviewers dual-review a set number (1000) of the most recent abstracts, drawn so that they proportionally represent the key databases searched for that key question, and then dual-review a random subset of the remainder. Specifically, all abstracts resulting from the CCRCT search are dual-reviewed; abstracts from the other database searches are dual-reviewed proportionally to bring the total to 1000; and a random subset of the remainder, to equal about 20-25% of the total number of articles, is then dual-reviewed. When a sampling approach to dual review is used, inter-rater reliability is calculated using the kappa statistic.
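As an illustration of the inter-rater reliability calculation, the sketch below computes Cohen's kappa for two reviewers' include/exclude codes. The text does not say which kappa variant is used; Cohen's kappa is assumed here because it is the usual choice for two raters, and the function name and the example counts are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the agreement expected by chance from each rater's marginal rates.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical sample of 100 dual-reviewed abstracts.
reviewer_1 = ["include"] * 20 + ["exclude"] * 80
reviewer_2 = (["include"] * 15 + ["exclude"] * 5      # reviewer 1 included
              + ["include"] * 5 + ["exclude"] * 75)   # reviewer 1 excluded
print(round(cohens_kappa(reviewer_1, reviewer_2), 4))  # 0.6875
```

In this made-up sample the reviewers agree on 90% of abstracts, but because most abstracts are excluded, chance agreement is already 68%, so kappa is noticeably lower than raw agreement.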
4.1.2 Database of Abstracts
For each systematic review, the review team establishes a database of all articles located through searches and from other sources (i.e., both those included and those eventually excluded from the final set of articles reviewed). Information captured in the database includes the source of the citation (e.g., search source, outside source), whether the abstract was included or excluded, the key question(s) associated with each included abstract, whether the article was excluded (with reasons for exclusion) or included in the review, and other coding approaches developed to support the specific review. For example, a hierarchical approach to answering a question may be proposed at the work plan stage, specifying that reviewers will consider a type of study design or a clinical setting only if research data are too sparse for the preferred type of study. While reviewing abstracts and articles, these can be coded to allow easy retrieval during the conduct of the review, if warranted.
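One way to picture the information captured for each citation is as a simple record. The sketch below is an assumption-laden illustration, not the schema of any actual review database: every field name, the example records, and the summary function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CitationRecord:
    citation_id: str
    source: str                                # e.g., "search" or "outside source"
    abstract_included: bool                    # decision at title/abstract stage
    key_questions: List[str] = field(default_factory=list)
    article_included: Optional[bool] = None    # None until full text is reviewed
    exclusion_reason: Optional[str] = None     # recorded for excluded articles
    extra_codes: List[str] = field(default_factory=list)  # review-specific coding


def summarize_review_flow(records: List[CitationRecord]) -> Dict[str, int]:
    """Count citations at each stage, as a final flow table would report."""
    reviewed = [r for r in records if r.abstract_included]
    return {
        "citations_identified": len(records),
        "abstracts_excluded": len(records) - len(reviewed),
        "articles_reviewed": len(reviewed),
        "articles_included": sum(r.article_included is True for r in reviewed),
        "articles_excluded": sum(r.article_included is False for r in reviewed),
    }


records = [
    CitationRecord("c1", "search", True, ["KQ1"], True),
    CitationRecord("c2", "search", True, ["KQ1", "KQ2"], False, "poor quality"),
    CitationRecord("c3", "outside source", False),
]
print(summarize_review_flow(records))
```

The `extra_codes` field stands in for the review-specific coding mentioned above, such as flagging a secondary study design so those articles can be retrieved later if evidence from the preferred design proves too sparse.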
4.1.3 Documenting Search Results
Search terms used for each key question, along with the yield associated with each term, are documented in a table or set of tables; these appear in the summary of the literature search (early in any topic review project) and in the final evidence synthesis. Follow-up searches to capture newly published data are conducted periodically as the project progresses; the frequency of these searches depends on the individual topic. A final search is conducted close to the time of completion of the draft evidence report, with the exact timing determined by the evidence review team. The final documentation of the search should indicate the most recent time point searched.
To the extent that it suits the review rationale and is feasible, search dates for different key questions should conform to one another.
4.1.4 Data Abstraction Approaches
- Use of forms: Data may be abstracted in forms developed or adapted for the review, or directly into evidence tables.
- Minimal elements to abstract: Although the Task Force has no standard or generic abstraction form, the following broad categories are always abstracted from included articles: key question, study design, study participant description, details of the intervention or screening test being studied, study results with emphasis on health outcomes where appropriate, and individual study quality information, including specific threats to validity. Information relevant to generalizability is consistently abstracted. Each team uses these general categories, and other categories if indicated, to develop an abstraction form specific to the topic at hand. For example, source of funding may be an important variable to abstract for some topics.
- Abstraction of included articles: The evidence review teams abstract only those articles that, after review of the entire article, both meet criteria for quality and focus on the key question at hand. Abstractions are conducted by trained team members, and evidence review teams may, but do not routinely, double abstract all included articles. Key articles are always read and checked by more than one team member. All reviewers are trained in the topic, the analytic framework and key questions, and the use of the abstraction instrument. Initial reliability checks are done for quality control.
- Other quality assurance methods: It is desirable to have more than one evidence review team member check data accuracy for key data elements, including data included in a summary table, a meta-analysis, or in calculations supporting a balance sheet/outcomes table.
By means of its explicit analytic framework and key questions, the Task Force indicates what issues it must examine to make its recommendation. By setting inclusion and exclusion criteria for the searches for each key question, the Task Force indicates what evidence it will consider admissible. The critical aspect used to determine whether an individual study is admissible is its internal and external validity with respect to the key question posed. This initial examination of the "quality" (i.e., internal and external validity) of individual studies is conducted with established criteria (go to Appendix VII and Appendix VIII) by the evidence review team or USPSTF topic work group. If questions arise in the course of this process, Task Force members are asked to review the articles in question. Studies with fatal flaws (i.e., with "poor" internal or external validity) are not admissible for further consideration. Likewise, studies of interventions that require training or equipment not feasible in even high-quality primary care would be judged to have poor external validity for the key questions posed by the Task Force, and would not be admissible evidence.
Once the admissible evidence has been found, and the internal and external validity of individual studies has been assessed, the Task Force must consider the level of evidence that the studies provide to answer the key questions. The Task Force's process for determining the level of evidence for a key question involves answering the following 6 critical appraisal questions about the admissible evidence. The Task Force uses these same 6 critical appraisal questions to determine the overall certainty of the evidence of net benefit for the entire preventive service, including all key questions in the analytic framework. (Go to Section 5 for a description of the Task Force's methods for judging the cumulative evidence and arriving at a recommendation.)
4.2.1 Critical Appraisal Questions
4.2.2 Levels of Critical Appraisal
The evidence review process involves assessing the validity and reliability of admissible evidence at 3 levels:
- The individual study;
- The key question (i.e., linkage in the analytic framework); and
- The entire preventive service.
For individual studies, questions 1-3 and 6 are assessed. That is, a single study will be categorized as to study design and whether internal and external validity are "good," "fair," or "poor" to answer the key question. For the key question and entire preventive service levels, all 6 questions must be considered.
For the individual study level, the evidence review team finds admissible evidence and then categorizes the internal validity (i.e., quality—Appendix VII) of each study into "good", "fair", and "poor" categories. For critical or borderline studies, the Task Force leads (and sometimes the entire Task Force) will also consider the individual studies. The EPC also provides the Task Force with descriptions of factors entering into the determination of external validity (i.e., applicability or generalizability—Appendix VIII), as well as descriptions of each study's research design and the number and description of studies relevant to each key question.
For the key question level, the Task Force, using information about the evidence supplied by the EPC, assesses the level of evidence across each key question using all 6 critical appraisal questions. The body of evidence is often categorized as to the highest level of applicable evidence available. The Task Force categorizes the evidence across each key question into one of 3 categories: "convincing," "adequate," or "inadequate."
For the preventive service, the entire body of evidence in the entire analytic framework is synthesized by the Task Force into categories of "certainty" of the overall evidence: high, moderate, and low. Again, the Task Force uses all 6 critical appraisal questions for this determination. (Go to Appendix IV regarding topic workgroup procedures for assessing certainty.)
4.3 Assessing Evidence at the Individual Study Level
4.3.1 Critical Appraisal
All individual articles are critically appraised to determine the validity and reliability of the evidence they provide. This assessment is conducted primarily by the topic team (usually led by the EPC or by AHRQ team leaders), with input from Task Force members for critically important or borderline articles. The assessments of internal validity (i.e., "quality") and external validity (i.e., applicability or generalizability) are based on explicit criteria, given in Appendix VII and Appendix VIII.
4.3.2 Internal Validity
The Task Force recognizes that research design is an important component of the validity of the information in a study, for the purpose of answering a key question. Although RCTs cannot answer all key questions, they are ideal for questions of the benefits or harms of various interventions. Thus, for these questions, the current Task Force endorses a slightly revised version of the "hierarchy of research design" used by the second Task Force:
I: Properly powered and conducted randomized controlled trial (RCT); well-conducted systematic review or meta-analysis of homogeneous RCTs
II-1: Well-designed controlled trial without randomization
II-2: Well-designed cohort or case-control analytic study
II-3: Multiple time series with or without the intervention; dramatic results from uncontrolled experiments
III: Opinions of respected authorities, based on clinical experience; descriptive studies or case reports; reports of expert committees
In assessing individual studies, all are classified first according to this design code, with additional designations added for other or unconventional designs.
Although research design is an important component of the information provided by an individual study, the Task Force also recognizes that not all studies within a research design have equal internal validity ("quality"). To assess more carefully the internal validity of individual studies within research designs, the Task Force adopted design-specific criteria for assessing the internal validity of individual studies.
These criteria, given in Appendix VII, provide general guidelines for categorizing studies into one of three internal validity categories: "good," "fair," and "poor." These specifications are not meant to be rigid rules; individual exceptions, when explicitly explained and justified, can be made. In general, a "good" study is one that meets all design-specific criteria. A "fair" study is one that does not meet (or does not clearly meet) at least one specified criterion, but has no known "fatal flaw." "Poor" studies have at least one fatal flaw. A fatal flaw is a deficit in design or implementation of the study that calls into serious question the validity of its results for the key question being addressed.
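The general rule for the three categories can be sketched as a small function. This is an illustration under stated assumptions: the criterion names are invented for the example (the actual design-specific criteria are in Appendix VII), and the sketch ignores the explicitly justified exceptions that the text allows.

```python
from typing import Dict, Optional


def rate_internal_validity(criteria_met: Dict[str, Optional[bool]],
                           has_fatal_flaw: bool) -> str:
    """Return "good", "fair", or "poor" following the general rule.

    criteria_met maps each design-specific criterion to True (met),
    False (not met), or None (not clearly met). Hypothetical helper.
    """
    if not criteria_met:
        raise ValueError("at least one criterion must be assessed")
    if has_fatal_flaw:
        return "poor"   # at least one fatal flaw
    if all(met is True for met in criteria_met.values()):
        return "good"   # meets all design-specific criteria
    return "fair"       # at least one criterion unmet or unclear, no fatal flaw


# Hypothetical criteria for a randomized trial.
trial = {
    "adequate randomization": True,
    "comparable groups maintained": True,
    "outcome assessment masked": None,  # not clearly reported
}
print(rate_internal_validity(trial, has_fatal_flaw=False))  # fair
```

Note that the fatal-flaw check comes first: a study with a fatal flaw is "poor" regardless of how many other criteria it meets.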
The Task Force views the level of evidence, whether for an individual study, a key question/linkage, or an entire preventive service, as independent of the magnitude of effect. Thus, a study (or a number of studies) could be classified as "good" even if it (they) found no effect of the preventive service.
4.3.3 External Validity (Generalizability) and Applicability
It is necessary not only to assess the external validity (generalizability) of the individual studies that contribute to answering the key questions, but also to assess the body of evidence in order to judge its applicability to the population or populations that are the target for the clinical preventive service, to the settings in which the service will be implemented, and to the providers who will deliver the service. In this document, the term "external validity" will be used when discussing assessment of individual studies, and the term "applicability" will be used when discussing the assessment of linkages across key questions and the overall body of evidence, even though the external validity of individual studies is a key element of the applicability judgment. The summative judgment about applicability is more than the sum of the assessment of each of the parts.
For the USPSTF, the study-level assessment of external validity and the assessment of applicability are done separately.
A description of the overall conceptual approach for both components is provided below. Appendix VIII gives detailed information on criteria and process.
4.3.3.1 Assessment of the External Validity of a Study
Judgments about the external validity ("generalizability") of a study pertinent to a preventive intervention address three main questions:
- Considering the subjects in the study, to what degree do the study's results measure the likely clinical results in the asymptomatic people who are the recipients of the preventive service in the United States?
- Considering the setting in which the study was done, to what degree do the study's results measure the likely clinical results in the United States primary care situation? and
- Considering the providers who were a part of the study, to what degree do the study's results measure the likely clinical results in providers who would deliver the service in the United States primary care setting?
The participants in a study may differ from people receiving primary care in many ways. Such differences may include gender, ethnicity, age, co-morbidities, and other personal characteristics. Some of these differences have a small potential to affect the study's results and/or the outcomes of an intervention. Other differences have the potential to cause large differences between the study's results and what would be reasonably anticipated to occur in asymptomatic individuals or people who are the target of the preventive intervention.
The choice of the study population may affect the magnitude of the benefit observed in the study through exclusion/inclusion criteria that limit the study to people most likely to benefit; other study features may impact the risk level of the subjects recruited to the study. The absolute benefit from a service is often greater for people at increased risk than for people at lower risk.
Because of the presence of certain research design elements, adherence is likely to be greater in research studies than in the usual primary care practice. This may lead to overestimation of the benefit of the intervention when delivered to people who are less selected (i.e., who more closely resemble the general population), and who are not subject to the special study procedures.
When assessing the external validity of a study, factors related to the study situation must be compared with the situation in U.S. primary care settings.
The choice of study setting may lead to an over- or under-estimate of the benefits and harms of the intervention as they would be expected to occur in U.S. primary care settings. For example, results of a study in which items essential for the service to have benefit are provided at no cost to patients may not be attainable when the item must be paid for. Results obtained in a trial situation that ensures immediate access to care if a problem or complication occurs may not be obtainable in a usual care situation, where the same safeguards cannot be ensured, and where as a result the risks of the intervention are greater.
When assessing the external validity of a study, factors related to the experience of providers in the study should be considered in comparison with the experience of providers likely to be encountered in primary care in the U.S. Studies may involve providers selected for their experience or their high skill level. Providers involved in studies may undergo special training that affects their performance of the intervention. For these and other reasons, the effect of the intervention may be overestimated or the harms underestimated compared with the likely experience of unselected providers in the primary care setting.
4.3.3.2 Criteria and Process
The criteria used to rate the external validity of individual studies according to the population, the situation, and the setting are described in detail in Appendix VIII. As with internal validity, this assessment of external validity is usually conducted initially by the EPC or AHRQ topic team leader, with input from Task Force members for critically important or borderline studies. This assessment is then used to give each study a rating using the same 3-tiered grading scheme as for internal validity: good, fair, and poor.
The underlying question answered in grading the external validity of a study as good, fair, or poor is:
If the study had been done with the typical U.S. primary care population, situation, and providers, what is the likelihood that the results would be different in a clinically important way?
4.4 Applicability of the Body of Evidence to the Target Population/Situation/Setting
USPSTF members assess the applicability of the body of evidence to populations/situations/settings as one of the components of the overall process of making recommendations.
Judgment about applicability considers the populations, situations, and providers in each study, but it also involves synthesis of the evidence from the individual studies across the key questions, and for the overall body of evidence.
The overall goal of the assessment is to judge whether there are likely to be clinically important differences between the results observed in studies as a whole and the results expected when the intervention is implemented in the U.S. primary care populations/situations/providers.
The following questions are addressed:
- Can an inference be made from the evidence that the intervention has any effectiveness for the U.S. primary care populations/situations/providers?
- Is the magnitude of benefit observed in individual studies that comprise the body of evidence likely to be the same for the U.S. primary care populations/situations/providers?
- Are the harms observed in individual studies that comprise the body of evidence likely to be the same for the U.S. primary care populations/situations/providers?
- What is the relationship between benefits and harms derived from the evidence likely to be for the U.S. primary care population/situation/providers?
- Are the time and effort required to provide the interventions that comprise the body of evidence attainable for U.S. primary care situations/providers?
- Can people in U.S. primary care populations/situations be expected reasonably to partake of the interventions that comprise the body of evidence considering their time, effort, and cost?
- Is the extrapolation of data from the body of evidence to large populations of asymptomatic people biologically plausible?
4.4.1 Relative Importance of Efficacy/Effectiveness
The USPSTF seeks to make recommendations based on projections of what would be expected from widespread implementation of the preventive service within the actual world of U.S. medical practice. For this reason, the Task Force considers carefully the applicability to medical practice of "efficacy" studies, which measure the effects of the preventive care service under ideal circumstances. However, the USPSTF ultimately seeks to base its recommendations on "effectiveness," which is what results could be expected with widespread implementation under usual practice circumstances.
Questions arise about whether the USPSTF recommendations consider effectiveness in usual practice or in ideal/excellent practice. The "situation" for practices varies widely within the U.S. Some practices have greater support and more resources than others. The Task Force attempts to make recommendations for all of these practice "situations," and may specify what resources are required for implementation.
4.4.2 Definition of Primary Care
To further specify the situation that is the object of its concern, the Task Force has adopted the Institute of Medicine's definition of primary care:
Primary care is the provision of integrated, accessible health care services by clinicians who are accountable for addressing a large majority of personal health care needs, developing a sustained partnership with patients, and practicing in the context of family and community. This definition acknowledges the importance of the patient-clinician relationship as facilitated and augmented by teams and integrated delivery systems. (7)
4.4.3 Primary Care Interventions Addressed by the USPSTF
The USPSTF considers interventions that are delivered in primary care settings or are judged to be feasible for delivery in primary care. To be feasible in primary care, the intervention could target patients seeking care in primary care settings, and the skills to deliver the intervention are or could be present in clinicians and/or related staff in the primary care setting, or the intervention could generally be ordered/initiated by a primary care clinician.
4.5 Other Issues in Assessing Evidence at Individual Study Level
4.5.1 Dealing with Secondary and Aggregate Endpoints
The Task Force adopted a policy of critically appraising all of the endpoints (outcomes) of trials in a similar manner, following the 6 critical appraisal questions listed earlier (Section 4.2.1). In its review, the Task Force takes note of the biological plausibility of a study's finding, the supporting evidence, and whether an outcome is a primary or secondary one. Similarly, the Task Force examines composite (aggregate) outcomes carefully. It generally asks 3 questions of these outcomes: (1) are the component outcomes of similar importance to patients? (2) did the more or less important outcomes occur with similar frequency? and (3) are the component outcomes likely to have similar relative risk reduction (RRR)?
4.5.2 Ecologic Evidence
Because biases may be present in ecologic data, the Task Force is careful in its use of this type of evidence. The Task Force rarely accepts ecologic evidence alone as sufficient to recommend a preventive service. Because this evidence is widely accepted by others, the Task Force developed a policy for when it uses ecologic evidence, and how this evidence is critically appraised.
By ecologic evidence we mean data that are not at the individual level but rather relate to the average exposure and average outcome within a population. The "ecologic fallacy" is the erroneous conclusion that there is an association when exposure occurs in some members of a population and an outcome in other members. In addition, ecologic data sets often do not include other potential confounding factors; thus, one cannot directly assess the ability of these potential confounders to explain apparent associations. Finally, some ecologic studies use data collected in ways that are not accurate or reliable.
Ecologic studies usually make comparisons of outcomes in exposed and unexposed populations in one of two ways: (1) between different populations, some exposed and some not, at one point in time (i.e., cross-sectional ecologic study); or (2) within a single population with changing exposure status over time (i.e., time series ecologic study). In either case the potential for making the ecologic fallacy is a major concern.
As it is not possible to completely avoid the potential for making the ecologic fallacy in these studies, the USPSTF does not usually accept ecologic evidence alone as adequate to establish the causal association of a preventive service and a health outcome. In some unusual situations (e.g., cervical cancer screening) ecologic evidence may play the primary role in the Task Force's evidence review, but this is rare.
More frequently, ecologic evidence is considered by the Task Force in the following situations:
- For background, for an understanding of the context in which the preventive service is being considered;
- When well-known ecologic data are being used as evidence by others to justify either recommending or not recommending the service the Task Force is considering;
- Where other evidence is inadequate but the Task Force thinks that good ecologic evidence could add important information;
- When there are reports of dramatic results of ecologic studies.
In the situations above, the Task Force critically appraises ecologic studies. High quality ecologic evidence meets the following criteria:
- The exposures, outcomes, and potential confounders are measured accurately and reliably.
- Other potential explanations and potential confounders are considered and adjusted for.
- The populations and interventions being compared are comparable.
- The populations and interventions are relevant to a primary care population.
- Multiple ecologic studies are present that are consistent/coherent.
4.5.3 Mortality as Outcome: All-cause Versus Disease-specific Mortality
When a condition is a common cause of mortality, all-cause mortality, instead of cause-specific mortality, is a desirable health outcome measure. Few preventive interventions attain the high standard set by use of this outcome. Any discrepancy between the effect of the preventive intervention on all-cause mortality and its effect on disease-specific mortality is important to recognize and explore. A discrepancy may arise (1) when there is real benefit of the preventive intervention for a targeted condition, or (2) because of methodologic issues that are inherent in the study of all-cause mortality.
4.5.3.1 Real Benefit for the Targeted Condition
Three situations can produce this kind of discrepancy. First, when a preventive intervention increases deaths from causes other than the one targeted by the intervention, all-cause mortality may not be decreased even when cause-specific mortality due to the targeted condition is decreased. This indicates a potential harm of the intervention for a condition other than the one targeted.
Second, when the condition targeted by the preventive intervention is rare and/or the effect of the intervention on cause-specific mortality due to the targeted condition is small, the effect on all-cause mortality may be very small or even non-existent.
Third, when the preventive intervention is applied in a population with strong competing causes of mortality, the effect of the preventive intervention on all-cause mortality may be very small or even non-existent even though the intervention decreases cause-specific mortality due to the targeted condition. For example, preventing death due to hip fracture by implementing an intervention to decrease falls in 85-year-old women may not decrease all-cause mortality over reasonable time frames for a study because the force of mortality is so large at this age.
4.5.3.2 Methodologic Issues
Methodologic issues can arise because of difficulties in the assignment of cause of death based on records available to or used by a study. In the absence of detail about the circumstances of death, death may be attributed to a chronic condition known to exist at the time of death but without any true contribution to death. Coding conventions for death certificates also result in deaths from some causes being routinely attributed to chronic conditions present at death. For example, it is conventional to assign people with a mention of cancer on the death certificate to cancer as primary cause of death. The result of these methodologic issues is a biased estimate of cause-specific mortality, which may not reflect the true effect an intervention has on death from the targeted condition.
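A short worked calculation shows why a real cause-specific benefit can be nearly invisible in all-cause mortality when the targeted condition accounts for a small share of deaths. All numbers below are hypothetical, and the sketch assumes the intervention has no effect on deaths from other causes and that prevented deaths are not replaced by competing causes during the study period.

```python
def mortality_effects(all_cause_risk: float,
                      cause_specific_risk: float,
                      cause_specific_rrr: float) -> dict:
    """Translate a cause-specific relative risk reduction into its
    effect on all-cause mortality (hypothetical helper).

    Risks are cumulative probabilities of death over the study period
    in the control group.
    """
    if not 0 < cause_specific_risk <= all_cause_risk <= 1:
        raise ValueError("need 0 < cause-specific risk <= all-cause risk <= 1")
    deaths_prevented = cause_specific_risk * cause_specific_rrr
    return {
        "all_cause_arr": deaths_prevented,                   # absolute reduction
        "all_cause_rrr": deaths_prevented / all_cause_risk,  # relative reduction
    }


# Hypothetical: 10% all-cause mortality, 0.5% from the targeted condition,
# and a 20% relative reduction in deaths from the targeted condition.
effects = mortality_effects(0.10, 0.005, 0.20)
print(f"all-cause RRR: {effects['all_cause_rrr']:.1%}")  # all-cause RRR: 1.0%
```

Under these assumptions a 20% reduction in cause-specific mortality corresponds to only about a 1% relative reduction in all-cause mortality, an effect that most trials would be too small to detect.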
As indicated above, studies that provide data on all-cause and cause-specific mortality may have low statistical power to detect even large or moderate effects of the preventive intervention on all-cause mortality. This is especially true when the disease targeted by the screening test is not common.
When data are available, the Task Force considers data on both all-cause and cause-specific mortality in making its recommendations, taking into account the real and methodologic contributions to potential discrepancies between apparent and true effect.
4.5.4 Subgroup Analyses
The Task Force is interested in targeting its recommendations to those populations or situations in which there would be maximal benefit for the harms and costs involved. Thus, it often takes into consideration subgroup analyses of large studies. It attaches varying levels of credibility to those analyses, however, depending on such factors as: the size of the subgroup; whether randomization occurred within subgroups; whether a statistical test for interaction was done; whether the results of multiple subgroup analyses are consistent within themselves; whether the subgroup analyses were pre-specified; and whether the results are biologically plausible.
4.5.5 Relative Versus Absolute Risk Reduction
The Task Force is interested in reducing risk both for populations and for individuals. For this reason it takes into account both relative (RRR) and absolute risk reduction (ARR) from intervention studies. It generally prioritizes ARR over RRR. That is, it is less impressed with a large RRR in situations of low ARR; it remains interested in an intervention with a low RRR if its ARR is high.
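The distinction between the two measures can be made concrete with a short calculation. The function name and the event rates below are hypothetical and chosen only to illustrate the contrast described above.

```python
def risk_reductions(control_risk: float, treated_risk: float):
    """Return (ARR, RRR) for an intervention study.

    ARR = control risk - treated risk
    RRR = ARR / control risk
    """
    if not 0 < control_risk <= 1 or not 0 <= treated_risk <= 1:
        raise ValueError("risks must be probabilities, with control_risk > 0")
    arr = control_risk - treated_risk
    rrr = arr / control_risk
    return arr, rrr


# Hypothetical scenario A: rare outcome, large relative effect.
arr_a, rrr_a = risk_reductions(control_risk=0.002, treated_risk=0.001)
# Hypothetical scenario B: common outcome, modest relative effect.
arr_b, rrr_b = risk_reductions(control_risk=0.20, treated_risk=0.18)

print(f"A: RRR {rrr_a:.0%}, ARR {arr_a:.1%}")  # A: RRR 50%, ARR 0.1%
print(f"B: RRR {rrr_b:.0%}, ARR {arr_b:.1%}")  # B: RRR 10%, ARR 2.0%
```

Scenario A has the more striking RRR, but scenario B prevents twenty times as many events per person treated, which is the kind of case in which a low RRR remains of interest.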
4.6 Incorporating Other Systematic Reviews in USPSTF Reviews
Existing systematic reviews or meta-analyses that meet quality and relevance criteria can be incorporated into topic reviews done for the USPSTF. Quality criteria for reporting meta-analyses are specified by the QUOROM and MOOSE guidelines, published, respectively, in The Lancet and the Journal of the American Medical Association (JAMA) (8, 9). The USPSTF has specified its criteria for critically appraising systematic reviews (go to Appendix VII and Appendix VIII). Relevance is considered at two levels: at the general level of the review or analysis question, and at a more specific level. At the general level, the question would be: "Is the review or meta-analysis relevant to one or more of the USPSTF key questions for this review?" The more specific question would be: "Did the review include the desired study designs and relevant population(s), settings, exposure/intervention(s), comparator(s), and outcome(s)?" Recency of the review is also a consideration, and can determine whether a review that meets quality and relevance criteria is recent enough not to require any bridging searches. Finally, existing reviews can be used in several ways in a USPSTF review: (1) to answer one or more key questions wholly or in part; (2) to substitute for conducting a systematic search for a specific time period for a specific key question; or (3) as a source document for cross-checking the results of systematic searches.
4.7 Use of Observational Designs in Questions of the Effectiveness/Efficacy of Interventions
The Task Force prefers large, well-conducted RCTs to determine the benefits and harms of preventive services. In many situations, however, such studies have not been or are not likely to be done. When these studies can be done, and other evidence is insufficient to determine benefits and/or harms, the Task Force advocates that large, well-conducted RCTs be done. It notes that small, poorly-conducted RCTs are often of little use.
In some situations, however, the Task Force does use observational evidence to make recommendations. Multiple, large, well-conducted observational studies with consistent results showing a large effect size that does not change markedly with adjustment for multiple potential confounders may be judged sufficient to determine the magnitude of benefits and harms of a preventive service. Also, large, well-conducted observational studies often provide essential additional evidence even in situations where there are adequate RCTs. Ideally, RCTs provide evidence that an intervention can work and observational studies provide better understanding of the populations where the benefits would be greatest.