Family Practice Advance Access originally published online on May 9, 2007
Family Practice 2007 24(3):252-258; doi:10.1093/fampra/cmm011
Measuring the severity of upper gastrointestinal complaints: does GP assessment correspond with patients' self-assessment?
a Department of General Practice, Research Institute Caphri, Maastricht University, Maastricht, the Netherlands
b Department of Gastroenterology and Hepatology, Radboud University Medical Centre, Nijmegen, the Netherlands
c Department of Health Promotion and Health Education, Research Institute Caphri, Maastricht University, Maastricht, the Netherlands
Correspondence to: GAJ Fransen, Department of General Practice, Caphri Research Institute, Maastricht University, PO Box 616, 6200 MD Maastricht, the Netherlands; Email: gerdinefransen{at}hotmail.com
Received 19 June 2006; Accepted 18 March 2007.
| Abstract |
|---|
Background. Questionnaires are frequently used to measure the severity of gastrointestinal (GI) complaints. These questionnaires can be filled out either by physicians or by patients, but it is not clear whether the resulting scores correspond. This study aimed to investigate the interrater agreement between physician-reported severity and patient-reported severity of the patients' upper GI complaints.
Methods. In a prospective observational study, the severity of eight GI complaints was registered by both patients and GPs independently on a seven-point scale (n = 316) before and after treatment with esomeprazole. Weighted kappa values for the agreement on the severity and simple kappa values for the agreement on the absence or presence of symptoms were calculated.
Results. The weighted kappa values ranged from 0.14 to 0.68 indicating poor to moderate agreement. The agreement on the presence or absence of symptoms was similar. Several systematic differences in scoring were found: the GPs tended to underestimate the severity of belching, nausea, early satiety, vomiting and upper and lower abdominal pain. Furthermore, the treatment effect for belching and lower abdominal pain was more often overestimated, while the treatment effect for nausea was more often underestimated by the GP.
Conclusion. The agreement between GP and patient is low. The differences in scoring should be kept in mind when comparing physician-reported outcomes with patient-reported outcomes.
Keywords. Agreement, dyspepsia, family medicine, gastroenterology, questionnaire.
| Introduction |
|---|
The severity of upper gastrointestinal (GI) complaints is frequently used to evaluate treatment strategies. Because of the lack of more objective clinical outcomes, questionnaires are often used to measure the severity of GI complaints.1 These questionnaires can either be filled out by the physician during the interview (physician-reported outcomes),2 or be used by the patient in a self-administered way (patient-reported outcomes).3 The results of these two methods of assessment of GI complaints are often compared as if there were no difference between the two. However, Bytzer1 stated that physicians tend to be more optimistic about the treatment effect than patients; i.e. the type of assessment might influence the evaluation of effectiveness and subsequently the study outcome. In fact, few studies have investigated the interrater agreement between patient- and physician-administered questionnaires for the severity of upper GI complaints. Therefore, in order to be able to interpret results from studies using questionnaires to measure symptom severity, it is very important to know whether the physician score does correspond with the patient score.
This study aims to examine the interrater agreement between physician and patient concerning the severity of the patients' upper GI complaints, both for baseline and follow-up assessments. Furthermore, we investigated whether patient and GP agreed on the change in severity of the complaints between baseline and follow-up measurement, because it is possible that GP and patient disagree on severity at baseline and at follow-up, but do agree on the decrease or increase of severity during treatment.
| Material and methods |
|---|
Respondents and procedure
This observational study is part of a larger prospective study investigating the effectiveness of proton pump inhibitor therapy with esomeprazole (40 mg) for treatment of primary care patients with upper GI symptoms. Patients who consulted their GP for upper GI symptoms and received an esomeprazole prescription were eligible. Exclusion criteria were the presence of alarm symptoms or the use of a proton pump inhibitor during the month prior to inclusion. Each GP could include up to 10 patients. When a patient was included for the effectiveness study, the GP collected basic demographic variables, information about the prescribed treatment, and scored the severity of the patient's upper GI complaints, all on a printed form (GP CRF).
A subgroup of the GPs participating in the effectiveness study participated in the present study comparing patient- and GP-administered GI symptom questionnaires. These GPs invited their patients to fill out questionnaires about the severity of their complaints themselves. It is possible that these GPs did not invite all eligible patients to participate in the present study, but we could not check this. The patient-administered questionnaires could be filled out at home and returned anonymously to the investigators in a pre-stamped addressed envelope. By means of an identical pre-printed number, the GP CRF and the patient questionnaire could be linked for the purpose of analysis.
Measurements
Data were collected at the baseline and at the follow-up visit. Both GP and patients used the same scale measuring the severity of eight upper GI complaints of the last week on a seven-point Likert scale ranging from 0 (none) to 6 (very severe).4 This simple scale has been proven clinically useful and is used in several ongoing studies. The scale is based on other widely used validated scales and has been validated by Bovenschen et al.4
Statistical analysis
To avoid differences between GP and patient scoring due to changes in time (treatment or natural course), only questionnaires filled out at the same moment in time were compared. Therefore, we excluded patient questionnaires filled out 2 days or more after the consultation. To check whether this exclusion led to selective drop out, included and excluded patients were compared for gender, age and severity of the complaints using a chi-square test and two-sided Student's t-test.
Cohen's kappa is often used for measuring reliability between observers, as it compares the observed agreement with perfect agreement while correcting for chance. Because of the seven-point scale, we calculated the weighted kappa, which means that observations on the diagonal in the 7 × 7 cross table of GP scores versus patient scores (correspondence of scores) are given a higher weight than observations further from the diagonal; the further from the diagonal, the lower the weight of the observation. Standardized weights were calculated with SAS statistical software and we used the default available weight type: the Cicchetti-Allison type. Landis and Koch5 have suggested that kappa values of <0.4 indicate poor agreement, values of 0.4–0.6 moderate agreement and from 0.8 excellent agreement.
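The weighted kappa described above can be sketched in a few lines. This is a minimal illustration, not the SAS procedure used in the study: it assumes equally spaced integer scores from 0 to 6, for which the Cicchetti-Allison weight for cell (i, j) reduces to 1 − |i − j|/(k − 1). The function name `weighted_kappa` and the toy data are hypothetical.

```python
def weighted_kappa(gp_scores, patient_scores, n_categories=7):
    """Weighted kappa with linear (Cicchetti-Allison type) weights.

    Assumes integer scores 0 .. n_categories - 1 (an assumption of this sketch).
    """
    if len(gp_scores) != len(patient_scores) or not gp_scores:
        raise ValueError("need two non-empty score lists of equal length")
    k, n = n_categories, len(gp_scores)

    # k x k cross table: rows are GP scores, columns are patient scores.
    table = [[0] * k for _ in range(k)]
    for gp, patient in zip(gp_scores, patient_scores):
        if not (0 <= gp < k and 0 <= patient < k):
            raise ValueError("score outside the 0 .. k-1 range")
        table[gp][patient] += 1

    row_share = [sum(row) / n for row in table]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = 1 - abs(i - j) / (k - 1)  # 1 on the diagonal, 0 in the far corners
            observed += weight * table[i][j] / n
            expected += weight * row_share[i] * col_share[j]

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters only use the same single score")
    return (observed - expected) / (1 - expected)


# Hypothetical toy data for one symptom, not taken from the study.
gp = [0, 1, 2, 3, 3, 5, 6, 2]
patient = [0, 2, 2, 4, 3, 3, 6, 1]
print(round(weighted_kappa(gp, patient), 2))
```

With linear weights, a one-point disagreement on the seven-point scale still earns 5/6 of the credit of exact agreement, which is why weighted kappa is more forgiving than simple kappa on ordinal scales.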
In addition to reporting the chance-corrected agreement (weighted kappa values) on the seven-point scale, we also present the chance-corrected agreement between patient and GP on the presence or absence of the complaints (simple kappa values). The absence of a symptom is defined by a score of 0 on the seven-point scale. As suggested by Cicchetti and Feinstein,6 the observed agreement should also be reported. Therefore, we investigated how many patients scored exactly the same as their GPs, how many patients indicated more severe and how many patients indicated less severe complaints than their GPs. Furthermore, differences of two points or more in GP and patient scoring were also investigated, because in our opinion a difference of two points or more on a seven-point scale is clinically relevant and provides a good first impression of (dis)agreement between GP and patient.
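The observed-agreement tallies and the presence/absence kappa can be illustrated as follows. This is a sketch under stated assumptions: the function names, dictionary keys and toy scores are hypothetical, and "present" is taken to mean any score above 0, following the definition of absence given above.

```python
def agreement_breakdown(gp_scores, patient_scores):
    """Proportions of patients scoring the same as, higher than or lower than their GP."""
    n = len(gp_scores)
    diffs = [patient - gp for gp, patient in zip(gp_scores, patient_scores)]
    return {
        "same": sum(d == 0 for d in diffs) / n,
        "patient_higher": sum(d > 0 for d in diffs) / n,
        "patient_lower": sum(d < 0 for d in diffs) / n,
        "patient_higher_by_2_or_more": sum(d >= 2 for d in diffs) / n,
        "patient_lower_by_2_or_more": sum(d <= -2 for d in diffs) / n,
    }


def presence_kappa(gp_scores, patient_scores):
    """Simple (unweighted) kappa after reducing each score to absent (0) or present (>0)."""
    gp_present = [score > 0 for score in gp_scores]
    patient_present = [score > 0 for score in patient_scores]
    n = len(gp_present)

    observed = sum(g == p for g, p in zip(gp_present, patient_present)) / n
    gp_yes = sum(gp_present) / n
    patient_yes = sum(patient_present) / n
    # Agreement expected by chance: both say "present" or both say "absent".
    expected = gp_yes * patient_yes + (1 - gp_yes) * (1 - patient_yes)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters give the same answer for everyone")
    return (observed - expected) / (1 - expected)


# Hypothetical toy data for one symptom, not taken from the study.
gp = [0, 2, 3, 0, 5, 1]
patient = [0, 4, 3, 1, 3, 1]
print(agreement_breakdown(gp, patient))
print(round(presence_kappa(gp, patient), 2))
```

A case where the patient scores higher than the GP corresponds to underestimation by the GP, and a case where the patient scores lower corresponds to overestimation.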
When there is a difference between the proportions of underestimation and overestimation, this will result in a difference between the mean GP scores and mean patient scores, indicating systematic differences. This was tested using two-sided Student's t-tests.
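The test for a systematic difference can be sketched as a t statistic on the per-patient score differences. The text does not state whether a paired or an independent-samples t-test was used; the paired form is assumed here because each patient was scored by both the GP and the patient. The function name and toy data are hypothetical, and the sketch stops at the t statistic rather than computing a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(gp_scores, patient_scores):
    """Return (mean difference, t statistic, degrees of freedom) for patient minus GP."""
    if len(gp_scores) != len(patient_scores) or len(gp_scores) < 2:
        raise ValueError("need two score lists of equal length with at least two pairs")
    diffs = [patient - gp for gp, patient in zip(gp_scores, patient_scores)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    mean_diff = mean(diffs)
    t = mean_diff / (sd / sqrt(len(diffs)))
    return mean_diff, t, len(diffs) - 1


# Hypothetical toy data for one symptom, not taken from the study.
gp = [1, 2, 3, 2, 1]
patient = [2, 3, 3, 4, 2]
mean_diff, t, df = paired_t_statistic(gp, patient)
print(mean_diff, round(t, 2), df)
```

A positive mean difference means that patients on average scored the symptom as more severe than their GPs did, that is, the GPs tended to underestimate it.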
These analyses are presented for each symptom at baseline, at follow-up, and for the change in severity over time (baseline score minus follow-up score). All statistical analyses were performed using SAS statistical software (version 8.0, SAS Institute Inc., Cary, NC).
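The change score is simple arithmetic, but a small example shows why it is analysed separately: GP and patient can disagree on the level of severity at both visits and still agree on the change. The helper name and the scores below are hypothetical.

```python
def change_scores(baseline, follow_up):
    """Change in severity per patient: baseline score minus follow-up score."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up lists must have equal length")
    return [b - f for b, f in zip(baseline, follow_up)]


# Hypothetical scores for two patients, not taken from the study.
gp_change = change_scores([2, 3], [1, 1])       # [1, 2]
patient_change = change_scores([4, 3], [3, 2])  # [1, 1]
print(gp_change, patient_change)
```

For the first patient, GP and patient differ by two points at both visits yet record the same one-point improvement. For the second, the GP records a larger improvement than the patient, which is the pattern described as the GP being more optimistic about the treatment effect. On a 0 to 6 scale the change score can range from −6 to 6.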
| Results |
|---|
Response on the questionnaire
In the effectiveness study 2905 patients were included by 346 GPs. Of these GPs, 127 participated in the present study; they together included 1027 patients for the effectiveness study. A total of 477 baseline and 425 follow-up questionnaires were returned. Sixty-nine patients only returned the follow-up questionnaire. Thus, 53% (546 of 1027) of the potentially invited patients returned one or more questionnaires.
Patients who returned one or more questionnaires (responders) had a similar age and gender distribution to non-responders. However, there were some statistically significant differences in the severity scores (GP administered): the responders tended to have more severe complaints of heartburn, regurgitation, belching, lower abdominal pain and early satiety [mean differences, 95% confidence interval (CI): 0.17, 0.02–0.33; 0.23, 0.08–0.38; 0.37, 0.23–0.52; 0.13, 0.04–0.22; 0.28, 0.12–0.44, respectively].
Exclusion because of time-lag between GP scoring and patient scoring
Patient questionnaires filled out more than 2 days after the visit were excluded, resulting in 316 baseline and 269 follow-up questionnaires available for analysis. These patients were treated by 102 GPs. No differences for age or gender were found between included and excluded patients. Patients who filled out the questionnaire after 2 days of treatment had less severe heartburn, regurgitation and belching than patients who filled out the questionnaire at baseline (Table 1). This supports our decision to exclude these patients from analysis: it is likely that the severity of these symptoms has changed over time due to treatment or natural course, and therefore it is important to only take measurements into account that were done at the same moment.
Patient characteristics
The mean age of the included patients was 53 years (SD 16) and 57% were male. Most of them (89%) were treated for 2 weeks with esomeprazole 40 mg, 3% for 3 weeks and 8% for 4 weeks. Figure 1 shows the patients' and the GPs' mean severity scores and indicates that the patients had mild GI complaints at baseline which decreased significantly during treatment.
Measures of agreement
The weighted kappa values at baseline varied from 0.48 to 0.60, indicating moderate agreement (Table 2). At follow-up, the weighted kappa values were slightly higher and varied from 0.57 to 0.68 (Table 3). The simple kappa values for agreement on the presence or absence of complaints were comparable to the weighted kappa values, varying from 0.47 to 0.68, indicating moderate agreement (Tables 2 and 3). For the change in severity, the weighted kappa values varied from 0.14 to 0.57, indicating poor to moderate agreement (Table 4).
These low rates of agreement were further investigated by looking at the proportions of cases with differences of two points or more in scoring (Tables 2–4).
Regarding the overestimation and underestimation, for most symptoms, the severity was approximately as frequently underestimated as overestimated by GPs. For instance for heartburn at baseline (Table 2), 26% of the patients indicated more severe and 29% of the patients indicated less severe complaints than GPs. Also the proportions of cases with differences of two points or more were approximately similar for underestimation and overestimation (e.g. for heartburn at baseline 8% versus 10%). This is reflected in the small differences between the mean GP score and the mean patient score (Fig. 1). However, although these differences tend to be small, some were statistically significant, e.g. for lower abdominal pain and vomiting at baseline. This indicates that there were systematic differences in the scores of these symptoms. Furthermore, concerning the change in severity, Table 4 shows that GPs were more optimistic than patients about the change in severity between baseline and follow-up for belching and lower abdominal pain, but less optimistic about the change in severity of nausea.
| Discussion |
|---|
Studies investigating upper GI symptoms use patient-reported outcomes as well as physician-reported outcomes, but it is not clear whether these outcomes yield similar results. This study had two aims: firstly, to investigate the interrater agreement between GP and patient in measuring the severity of the patients' upper GI complaints before and after treatment, and secondly, to investigate the interrater agreement in measuring the treatment effect (change in severity). Concerning the first aim, the weighted kappa values indicated moderate agreement on the severity of the complaints. This did not improve when looking only at the agreement on the presence or absence of the complaints. Furthermore, several systematic differences in scoring were found: the GPs tended to underestimate the severity of belching, nausea, early satiety, vomiting and upper and lower abdominal pain. Concerning the second aim, the weighted kappas indicated poor to moderate agreement for the change in severity. The treatment effects for belching and lower abdominal pain were more often overestimated, while the treatment effect for nausea was more often underestimated by the GP. In general, in up to 40% of the cases, the GP and patient substantially differed in their scoring.
Few other publications investigated the agreement between patient and physician concerning the severity of upper GI complaints measured on the same scale at the same time. Bytzer1 stated that physicians tend to be more optimistic about the treatment effect than patients, which is in line with our findings. This is also confirmed by a study of Sandmark et al.7 where the investigators rated approximately 75% of patients as completely symptom free after 4 weeks of omeprazole therapy, but only 55% of patients felt that their symptoms had completely resolved.
Two other studies support our findings, although their results were less thorough because they did not use the same measuring scales for physicians and patients at the same time. McColl et al.8 investigated the agreement between physician and patient in gastroesophageal reflux disease and found for the severity of reflux-like symptoms weighted kappa values ranging from 0.17 to 0.53 at baseline and from 0.31 to 0.73 at follow-up. Furthermore, Revicki et al.9 found that patients with more severe complaints according to the physician (based on interviews) also had higher means for severity on the self-reported scale. However, because physicians and patients used different rating scales, no definite conclusions can be drawn about the agreement between physician and patient.
Another study supports our findings: Quan et al.10 validated a diabetes bowel symptom questionnaire. They used the same questionnaire for both physician interviews and patient-reported severity of the GI complaints, but not at the same moment in time. The patients were interviewed only once by the physician, approximately 1 week after filling out the follow-up questionnaire, which makes the occurrence of recall bias likely. According to Quan et al.,10 the interquartile range of the kappa value of all GI items was 0.24–0.64 (median 0.47), when looking only at the presence/absence of symptoms. For the severity of the GI complaints, they found an even poorer agreement, with a median kappa value of 0.14 and an interquartile range from −0.87 to 0.40.
In our study, the GPs and the patients used exactly the same measuring scale. Furthermore, measurements were done at the same moment in time, thus minimizing differences due to treatment or natural course between GP and patient assessment of the symptoms.
Despite our efforts to minimize sources of bias, there may be some limitations. Fifty-three per cent of the patients returned one or more questionnaires, which is acceptable considering that we were not able to send reminders because of the anonymity. In fact, the true response rate might be even higher, since we were not able to check whether all 1027 patients actually received the questionnaires; some GPs might not have handed out questionnaires to all their patients. Nevertheless, because participation of patients was voluntary, there may be some response bias. Patients who returned the questionnaire may on the one hand be more communicative about their complaints, which might have led to higher agreement if they also communicated more elaborately about their symptoms with their GPs. On the other hand, some of these patients might belong to a group of patients regarded as complainers by their GPs, which may lead to less agreement about the severity of their symptoms. Overall, this probably did not have a large influence on agreement.
Furthermore, there seemed to be selection of patients with slightly more severe complaints. Maybe the severity of the complaints triggered these patients more to participate. Nevertheless, the selected patients still had only moderate dyspeptic symptoms, and do not constitute a subgroup of patients with extreme symptoms.
Moreover, agreement results may depend on communication style and skills of individual GPs. Participating GPs were not extensively instructed how to use the symptom scale. It is possible that some GPs scored the severity based on their own clinical impression after history taking, which might have led to low agreement if these GPs were not completely in tune with their patients. But by including a large group of GPs from across the Netherlands, we assumed that GP skills would be representative for the Dutch GPs in general. In order not to overrepresent individual GP characteristics, inclusion was limited to 10 patients per GP. Therefore, our results probably are representative for the Dutch population of GPs and patients with dyspeptic symptoms.
The relatively low levels of agreement raise concerns. In everyday practice, the agreement between GP and patient might be even worse than in our findings because of our patient selection. The investigated population only consisted of patients for whom physicians decided that treatment with a proton pump inhibitor was needed. Furthermore, the patients agreed to participate in a study investigating their upper GI complaints. Therefore, one can safely assume that both GPs and patients participating in this study agreed (at least) on the presence of upper GI complaints and even on the need for treatment, which already implies some level of agreement. In everyday practice, patient and GP will not always agree on the need for treatment, so agreement in the primary care population is expected to be lower. This is supported by a study by our group11 in which the perceptions of patients with unexplained upper GI symptoms were investigated. This study illustrated that these patients felt that their complaints were not taken seriously by their GPs and that they were unsatisfied with their treatment. The low agreement on the severity of complaints is supported by these findings and indicates that the communication about the severity of symptoms needs to be enhanced.
In conclusion, the low agreement between GP and patients indicates that there is room for improvement concerning the communication about upper GI symptom severity between GP and patient, especially in the initial consultation. In general, GPs as frequently overestimated as underestimated the severity of reflux-like symptoms. The severity of belching, nausea, early satiety, vomiting and upper and lower abdominal pain is however more often underestimated by the GP. Compared to patients, GPs were more optimistic about the treatment effects for belching and lower abdominal pain but less optimistic for the change in severity of nausea. This should be kept in mind when comparing physician-reported outcomes with patient-reported outcomes concerning the severity of their upper GI complaints and treatment effects.
| Declaration |
|---|
Funding: None.
Ethical approval: Ethical approval was obtained from the Helsinki declaration.
Conflicts of interest: None.
| Acknowledgments |
|---|
This study was supported by an unrestricted grant from AstraZeneca, the Netherlands. The investigators independently analysed the data and reported the results. The authors wish to thank Corine van Marrewijk and Suhreta Mujakovic for their cooperation and support for this study. Furthermore, we wish to thank all reviewers for their useful comments.
| Notes |
|---|
Fransen GAJ, Janssen MJR, Muris JWM, Mesters I and Knottnerus JA. Measuring the severity of upper gastrointestinal complaints: does GP assessment correspond with patients' self-assessment? Family Practice 2007; 24: 252–258.
| References |
|---|
1 Bytzer P. Assessment of reflux symptom severity: methodological options and their attributes. Gut (2004) 53(suppl 4):iv28–iv34.
2 Westbrook JI, Duggan AE, Duggan JM, Westbrook MT. A 9 year prospective cohort study of endoscoped patients with upper gastrointestinal symptoms. Eur J Epidemiol (2005) 20:619–627.
3 Veldhuyzen van Zanten S, Fedorak RN, Lambert J, Cohen L, Vanjaka A. Absence of symptomatic benefit of lansoprazole, clarithromycin, and amoxicillin triple therapy in eradication of Helicobacter pylori positive, functional (nonulcer) dyspepsia. Am J Gastroenterol (2003) 98:1963–1969.
4 Bovenschen HJ, Janssen MJ, van Oijen MG, Laheij RJ, van Rossum LG, Jansen JB. Evaluation of a Gastrointestinal Symptoms Questionnaire. Dig Dis Sci (2006) 51:1509–1515.
5 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics (1977) 33:159–174.
6 Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol (1990) 43:551–558.
7 Sandmark S, Carlsson R, Fausa O, Lundell L. Omeprazole or ranitidine in the treatment of reflux esophagitis. Results of a double-blind, randomized, Scandinavian multicenter study. Scand J Gastroenterol (1988) 23:625–632.
8 McColl E, Junghard O, Wiklund I, Revicki DA. Assessing symptoms in gastroesophageal reflux disease: how well do clinicians' assessments agree with those of their patients? Am J Gastroenterol (2005) 100:11–18.
9 Revicki DA, Wood M, Wiklund I, Crawley J. Reliability and validity of the Gastrointestinal Symptom Rating Scale in patients with gastroesophageal reflux disease. Qual Life Res (1998) 7:75–83.
10 Quan C, Talley NJ, Cross S, et al. Development and validation of the Diabetes Bowel Symptom Questionnaire. Aliment Pharmacol Ther (2003) 17:1179–1187.
11 Fransen GAJ, Mesters I, Bonten E, Muris JWM. Expectations of patients regarding the management of gastrointestinal problems. Gut (2004) 53(suppl 4):A284.