Methodological quality of studies and strength of recommendations

A grading system was used for the strength of the recommendations. This system is simple and easy to apply, and it grades therapeutic and preventive, prognostic, and diagnostic studies in a largely consistent way. It is based on the original ratings of the AHCPR Guidelines (1994) and on the levels of evidence used in systematic (Cochrane) reviews on low back pain.

Strength of recommendations:

1. Therapy and prevention:
Level A: Generally consistent findings provided by (a systematic review of) multiple high quality randomised controlled trials (RCTs).
Level B: Generally consistent findings provided by (a systematic review of) multiple low quality RCTs or non-randomised controlled trials (CCTs).
Level C: One RCT (either high or low quality) or inconsistent findings from (a systematic review of) multiple RCTs or CCTs.
Level D, no evidence: No RCTs or CCTs.
                  
Systematic review: systematic methods of selection and inclusion of studies, methodological quality assessment, data extraction and analysis.

2. Prognosis:
Level A: Generally consistent findings provided by (a systematic review of) multiple high quality prospective cohort studies.
Level B: Generally consistent findings provided by (a systematic review of) multiple low quality prospective cohort studies or other low quality prognostic studies.
Level C: One prognostic study (either high or low quality) or inconsistent findings from (a systematic review of) multiple prognostic studies.
Level D, no evidence: No prognostic studies.
                  
High quality prognostic studies: prospective cohort studies.
Low quality prognostic studies: retrospective cohort studies, follow-up of untreated control patients in an RCT, case series.

3. Diagnosis:
Level A: Generally consistent findings provided by (a systematic review of) multiple high quality diagnostic studies.
Level B: Generally consistent findings provided by (a systematic review of) multiple low quality diagnostic studies.
Level C: One diagnostic study (either high or low quality) or inconsistent findings from (a systematic review of) multiple diagnostic studies.
Level D, no evidence: No diagnostic studies.
                  
High quality diagnostic study: Independent, blinded comparison in an appropriate spectrum of patients, all of whom have undergone both the diagnostic test and the reference standard. (An appropriate spectrum is a cohort of patients who would normally be tested for the target disorder. An inappropriate spectrum compares patients already known to have the target disorder with patients diagnosed with another condition.)

Low quality diagnostic study: Study performed in a set of non-consecutive patients, or confined to a narrow spectrum of study individuals (or both), all of whom have undergone both the diagnostic test and the reference standard; or a study in which the reference standard was not objective, not blinded or not independent; in which positive and negative test results were verified using separate reference standards; which was performed in an inappropriate spectrum of patients; or in which the reference standard was not applied to all study patients.
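The same A-to-D logic recurs across all three domains: Level D for no studies, Level C for a single study or inconsistent findings, and Levels A and B distinguished by study quality. As a minimal sketch of that shared rule (the function name and the handling of mixed-quality sets are assumptions, not part of the grading system itself):

```python
def evidence_level(n_studies: int, n_high_quality: int, consistent: bool) -> str:
    """Assign a level of evidence (A-D) following the grading rules above.

    The caller counts the eligible studies (RCTs/CCTs, cohort studies or
    diagnostic studies, depending on the domain) and judges whether their
    findings are generally consistent. Mixed sets with fewer than two high
    quality studies are treated here as Level B (an assumption).
    """
    if n_studies == 0:
        return "D"                  # no studies: no evidence
    if n_studies == 1 or not consistent:
        return "C"                  # single study, or inconsistent findings
    if n_high_quality >= 2:
        return "A"                  # multiple consistent high quality studies
    return "B"                      # multiple consistent low quality studies
```

For example, three consistent high quality RCTs yield Level A, whereas three mutually inconsistent RCTs yield Level C regardless of their quality.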

The methodological quality of additional studies will only be assessed in areas not yet covered by a systematic review, or for studies from the non-English literature.

The methodological quality of trials is usually assessed using relevant criteria related to the internal validity of trials. High quality trials are less likely to be associated with biased results than low quality trials. Various criteria lists exist, but differences between the lists are subtle.

Quality assessment should ideally be done by at least two reviewers, independently, and blinded with regard to the authors, institution and journal. However, as experts are usually involved in quality assessment, it may often not be feasible to blind the studies. Criteria should be scored as positive, negative or unclear, and it should be clearly defined when criteria are scored positive or negative. Quality assessment should be pilot tested on two or more similar trials that are not included in the systematic review. A consensus method should be used to resolve disagreements, and a third reviewer should be consulted if disagreements persist. If the article does not contain information on the methodological criteria (score 'unclear'), the authors should be contacted for additional information. This also gives authors the opportunity to respond to negative or positive scores.
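The two-reviewer scoring and tie-breaking procedure described above can be sketched as follows; the function and variable names are illustrative, not taken from the guideline:

```python
from typing import Optional

VALID_SCORES = {"positive", "negative", "unclear"}

def resolve(score_a: str, score_b: str, score_c: Optional[str] = None) -> str:
    """Combine two independent reviewers' scores for one criterion.

    A third reviewer (score_c) is consulted only if the first two disagree;
    until consensus or a third score is available, the item stays 'unclear'
    (e.g. pending additional information from the study authors).
    """
    assert score_a in VALID_SCORES and score_b in VALID_SCORES
    if score_a == score_b:
        return score_a              # reviewers agree
    if score_c is not None:
        return score_c              # third reviewer breaks the tie
    return "unclear"                # disagreement not yet resolved
```

In practice the consensus meeting, not the third reviewer, is the first line of resolution; the sketch collapses that step for brevity.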

The following checklists are recommended:

Checklist for methodological quality of therapy / prevention studies

Items:  
1) Adequate method of randomisation,
2) Concealment of treatment allocation,
3) Withdrawal / drop-out rate described and acceptable,
4) Co-interventions avoided or equal,
5) Blinding of patients,
6) Blinding of observer,
7) Blinding of care provider,
8) Intention-to-treat analysis,
9) Compliance,
10) Similarity of baseline characteristics.

Checklist for methodological quality of prognosis (observational) studies

Items:  
1) Adequate selection of study population,
2) Description of in- and exclusion criteria,
3) Description of potential prognostic factors,
4) Prospective study design,
5) Adequate study size (> 100 patient-years),
6) Adequate follow-up (> 12 months),
7) Adequate loss to follow-up (< 20%),
8) Relevant outcome measures,
9) Appropriate statistical analysis.
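Items 5 to 7 of this checklist carry explicit numeric thresholds. A small sketch making those cut-offs operational (the function name and the dictionary keys are assumptions; loss to follow-up is expressed here as a fraction):

```python
def prognosis_adequacy(patient_years: float,
                       followup_months: float,
                       loss_to_followup: float) -> dict:
    """Check the numeric thresholds from checklist items 5-7.

    loss_to_followup is a fraction, e.g. 0.15 for 15%.
    """
    return {
        "adequate_size": patient_years > 100,       # item 5: > 100 patient-years
        "adequate_followup": followup_months > 12,  # item 6: > 12 months
        "adequate_loss": loss_to_followup < 0.20,   # item 7: < 20% lost
    }
```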

Checklist for methodological quality of diagnostic studies

Items:  
1) Was at least one valid reference test used?
2) Was the reference test applied in a standardised manner?
3) Was each patient submitted to at least one valid reference test?
4) Were the interpretations of the index test and reference test performed independently of each other?
5) Was the choice of patients who were assessed by the reference test independent of the results of the index test?
6) When different index tests are compared in the study: were the index tests compared in a valid design?
7) Was the study design prospective?
8) Was a description included regarding missing data?
9) Were data adequately presented in enough detail to calculate test characteristics (sensitivity and specificity)?