Methods Commentary: Risk of Bias in Cohort Studies

G. Guyatt, J. Busse

Keeping Terminology Consistent

In our commentary addressing rating of the conduct of randomized controlled trials (RCTs), we argued for the use of the term “risk of bias” rather than “quality”, “methodological quality”, “validity”, or “internal validity” of the studies. We noted that each of these terms may refer to risk of bias: the likelihood that, because of flaws in design and execution, a study's results deviate systematically from the truth (i.e. overestimate or underestimate the true treatment effect). We believe risk of bias is the optimal term not only for RCTs but also for cohort studies.

Assessing Risk of Bias in Cohort Studies

A large number of study designs might be included under “observational studies”. Most, however, can usefully be classified as either cohort or case-control studies. In cohort studies, groups of individuals who are exposed or not exposed to a treatment or potentially harmful agent are followed forward (either prospectively or retrospectively) for occurrence of the outcomes of interest. In case-control studies, the investigators identify individuals who have or have not experienced an outcome of interest and then ascertain whether these individuals were exposed to the potentially causal factor under consideration. These differences in study design require somewhat different assessment criteria; this commentary focuses on cohort studies.

A systematic review identified 194 instruments that have been used to assess the quality of observational studies; in most cases, the focus was on risk of bias [1]. The review appraised these instruments and ultimately recommended two of them: the Downs and Black instrument [2] and the Newcastle-Ottawa Scale [3]. The Cochrane Collaboration recommends the same two instruments [4], in a chapter on non-randomized studies whose authors include an author of that systematic review as well as the lead author of the Newcastle-Ottawa Scale.

Despite the possible conflicts of interest of the Cochrane authors, we agree that the two instruments they have chosen represent the best of what is available for assessing the risk of bias in observational studies. With 27 items, the Downs and Black instrument is time-consuming to use and its results are difficult to summarize; we consider it unwieldy for use in systematic reviews. Thus, we have used the Newcastle-Ottawa Scale as a starting point for developing our suggested form.

The Newcastle-Ottawa Scale includes assessment of the accuracy of measurement of the exposure of interest (this can be challenging, e.g. for total exposure to smoking or for dietary intake); the similarity of the exposed and unexposed cohorts (probably the most important single issue in a cohort study, which needs to be addressed both by drawing the exposed and unexposed from the same population and by a matched or adjusted analysis); the accuracy of outcome assessment; and the extent to which investigators achieved full follow-up. All of these are appropriate.

The Newcastle-Ottawa Scale has, in our view, a number of limitations. The first item addresses the representativeness of the cohort with respect to the community. One can enrol an unrepresentative cohort and still provide an unbiased assessment of the impact of the exposure on the outcome of interest within that cohort. The question that then arises is whether the results would apply to a more representative sample. Thus, the first item addresses an issue of applicability, and does not belong in a risk of bias instrument.

A second item suffers from the same limitation.  This item focuses on the duration of follow-up.  It is possible to obtain an unbiased assessment of the occurrence of an event over a suboptimal follow-up duration.  The results of such a study are likely to have limited usefulness, but they would be accurate.  Thus, duration of follow-up is also an issue of applicability.

The Newcastle-Ottawa Scale also has two important omissions. The first concerns prognostic factors, whose ascertainment can be just as challenging as, and often more challenging than, the ascertainment of exposure. Smoking and diet, mentioned above, may not be the exposures of interest but rather potential confounding factors, and they present the same measurement challenges in that role. Other potential confounders in studies focusing on cardiovascular disease, including blood pressure, exercise, and family history, may all raise significant challenges. A risk of bias instrument should address the fidelity of these measurements.

Co-intervention, the second omission, represents another important aspect of comparability. Once again using cardiovascular disease as an example, exposure to anti-platelet agents, antihypertensive drugs, and lipid-lowering medication is likely to affect outcomes. Thus, the extent to which potential co-intervention is documented in a cohort study should be addressed.

In our view, the structure of the response options in the Newcastle-Ottawa Scale leaves much to be desired. We have taken our enhancement of the response options from the Cochrane risk of bias instrument and applied it to our form for risk of bias in cohort studies. We frame each criterion as a question, with the response options definitely yes (low risk of bias), probably yes, probably no, and definitely no (high risk of bias). In addition, once again following the model of the Cochrane risk of bias instrument, we provide for each item examples of study designs that would lead to low risk of bias and examples that would lead to high risk of bias. In this case, we also provide examples of designs that would lead to a risk of bias between high and low.
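The response structure described above can be illustrated with a minimal Python sketch. The names `Response`, `Item`, and `at_lower_risk`, the wording of the two example questions, and the collapsing of the four options into two groups are all assumptions made for illustration; they are not taken from the form itself.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    """The four response options, ordered from low to high risk of bias."""
    DEFINITELY_YES = 1  # low risk of bias
    PROBABLY_YES = 2
    PROBABLY_NO = 3
    DEFINITELY_NO = 4   # high risk of bias


@dataclass
class Item:
    question: str       # criterion framed as a question (hypothetical wording)
    response: Response  # reviewer's judgment for this criterion


def at_lower_risk(item: Item) -> bool:
    """Assumed convenience rule: treat both 'yes' options as lower risk."""
    return item.response in (Response.DEFINITELY_YES, Response.PROBABLY_YES)


# Hypothetical ratings for one cohort study.
items = [
    Item("Was the exposure measured accurately?", Response.DEFINITELY_YES),
    Item("Were co-interventions similar between groups?", Response.PROBABLY_NO),
]

# Questions on which the study was judged to be at higher risk of bias.
flagged = [item.question for item in items if not at_lower_risk(item)]
```

Keeping all four options in the stored rating, and collapsing them only when a summary is needed, preserves the reviewer's degree of certainty.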

Rating of Risk of Bias Should be Outcome Specific

In our commentary addressing rating of the conduct of randomized controlled trials (RCTs), we noted that, traditionally, systematic review authors have provided a single rating of risk of bias for a particular study. We explained that this tradition is misguided because risk of bias can differ between outcomes. This is equally true for observational studies. Thus, systematic review authors should rate risk of bias separately for each outcome.


  1. Deeks JJ, Dinnes J, D’Amico R, et al. Evaluating non-randomised intervention studies. Health Technol Assess 2003; 7:iii-x, 1-173
  2. Downs S, Black N. The feasibility of creating a checklist for the assessment of methodological quality both of randomized and non-randomized studies of health care interventions: Summation of the conference. Journal of Epidemiology and Community Health 1998; 52:377-384
  3. Wells G, Shea B, O’Connell D, et al. The Newcastle-Ottawa Scale (NOS) for assessing the quality of non randomized studies in meta-analyses
  4. Reeves B, Deeks J, Higgins JP, et al. Including non-randomized studies. In: Higgins J, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions 5.0.1. Chichester, U.K.: John Wiley & Sons, 2008
  5. Akl E, Sun X, Busse J, et al. Specific instructions for estimating unclearly reported blinding status in randomized clinical trials were reliable and valid. Journal of Clinical Epidemiology, submitted. 2011

Appendix: Criteria for Blinding Judgments

Apply the following rules in order:

  1. Explicit statement that a group of interest was blinded -> definitely yes for that group
  2. Explicit statement that a group of interest was not blinded -> definitely no for that group
  3. Explicit statement “investigators were blinded” -> probably yes for health care providers and for data collectors
  4. Explicit description of the trial as “Open label” or “unblinded” -> definitely no for remaining groups
  5. No explicit statement about blinding status of data analysts -> probably no for data analysts
  6. No explicit statement about blinding status of either patients, health care providers, data collectors, or outcome adjudicators, and:
    • Placebo controlled drug trial -> probably yes for those groups
    • Active control drug trial (A vs. B) and mention of “double dummy” or that medications were identical or matched -> probably yes for those groups
    • Active control drug trial (A vs. B) but no mention of “double dummy” or that medications were identical or matched -> probably no for those groups
    • Non drug trial -> probably no for those groups
  7. None of the above applies, and trial described as:
    • “single blinded” -> use best judgment to assign probably yes to 1 group and probably no to remaining groups
    • “double blinded” or “triple blinded” -> probably yes for patients, health care providers, data collectors, and outcome adjudicators and probably no for data analysts
  8. Make sure “blinding” applies to the comparison of interest for LOST-IT: e.g., in a 2×2 factorial RCT (medication A vs. placebo) × (behavioral intervention vs. standard of care) described as “blinded”, blinding applies if the 1st comparison is the one of interest, but does not if the 2nd comparison is the one of interest
  9. Make sure “blinding” applies to the outcome of interest (i.e., outcome chosen as the primary outcome for LOST-IT): e.g., in a trial assessing quality of life and disease progression on radiography, blinding of radiologists assessing x-rays applies to the outcome adjudication of the 2nd but not the 1st outcome
  10. If the primary outcome is a self-reported outcome, and if the patients are definitely not blinded but “physicians making an assessment” are -> definitely no for data collectors (Q18)
  11. If one component of the outcome adjudication process is not blinded -> definitely no for outcome adjudication (Q19): e.g., when a component of the outcome is patient reported and patient is not blinded
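
Rules 1 to 7 above can be sketched as a small Python function. This is one possible reading of the rules under stated assumptions, not the authors' implementation: the `TrialReport` fields, the design labels, and the function name `judge_blinding` are hypothetical, each rule is applied only to groups that no earlier rule has settled, and rule 7 is reached only when the trial design is not recorded. Rules 8 to 11, and the “single blinded” case, require reviewer judgment and are not automated here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

FOUR_GROUPS = ("patients", "health care providers",
               "data collectors", "outcome adjudicators")
ALL_GROUPS = FOUR_GROUPS + ("data analysts",)


@dataclass
class TrialReport:
    """What the trial report states (all field names are hypothetical)."""
    explicitly_blinded: Set[str] = field(default_factory=set)
    explicitly_unblinded: Set[str] = field(default_factory=set)
    investigators_blinded: bool = False
    open_label: bool = False
    # "placebo_drug", "active_matched", "active_unmatched", or "non_drug"
    design: Optional[str] = None
    # "single", "double", or "triple"
    descriptor: Optional[str] = None


def judge_blinding(report: TrialReport) -> Dict[str, Optional[str]]:
    """Apply rules 1-7 in order; None means reviewer judgment is needed."""
    result: Dict[str, Optional[str]] = {g: None for g in ALL_GROUPS}

    def assign(groups, judgment):
        # Earlier rules take precedence: only fill groups still undecided.
        for group in groups:
            if result[group] is None:
                result[group] = judgment

    assign(report.explicitly_blinded, "definitely yes")        # rule 1
    assign(report.explicitly_unblinded, "definitely no")       # rule 2
    if report.investigators_blinded:                           # rule 3
        assign(("health care providers", "data collectors"), "probably yes")
    if report.open_label:                                      # rule 4
        assign(ALL_GROUPS, "definitely no")
    assign(("data analysts",), "probably no")                  # rule 5
    if report.design in ("placebo_drug", "active_matched"):    # rule 6
        assign(FOUR_GROUPS, "probably yes")
    elif report.design in ("active_unmatched", "non_drug"):
        assign(FOUR_GROUPS, "probably no")
    elif report.descriptor in ("double", "triple"):            # rule 7
        assign(FOUR_GROUPS, "probably yes")
    # "single blinded" is left as None for the reviewer to decide.
    return result


# Example: placebo-controlled drug trial that states adjudicators were blinded.
judgments = judge_blinding(
    TrialReport(explicitly_blinded={"outcome adjudicators"}, design="placebo_drug")
)
```

In this example the outcome adjudicators are rated definitely yes by rule 1, the data analysts probably no by rule 5, and the remaining three groups probably yes by rule 6.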