How best to handle common data entry errors in patient-reported outcome (PRO) datasets?

This FAQ describes three common types of data-entry errors which may occur in PRO datasets and their potential impact:

1. Items are missing in the database, but are recorded on the hard copy questionnaire.

Some questionnaire scoring manuals give specific guidance for missing items. The EORTC QLQ-C30 multi-item scoring algorithms handle this to some extent: a scale score can still be computed as long as at least half of the scale's items are present. This half-mean imputation rule is specified in the EORTC scoring manual.

Likely impact: The missing items reduce precision because of the smaller effective sample size. The extent of the problem depends on how many items are missing in the dataset, and how important the scale containing the missing items is to the study.
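As an illustration of the half-mean rule (not the full EORTC algorithm, which also linearly transforms raw scores to a 0–100 scale), a scale can be scored from the mean of the answered items provided at least half of the items are present; the function name and data layout below are illustrative:

```python
def scale_score(items):
    """Half-mean rule (illustrative): score a multi-item scale only if at
    least half of its items are answered; each missing item then effectively
    takes the mean of the answered items."""
    answered = [x for x in items if x is not None]
    if len(answered) * 2 < len(items):
        return None  # more than half missing: scale score is itself missing
    return sum(answered) / len(answered)

# A 4-item scale with one missing item is still scored
print(scale_score([3, 4, None, 2]))  # mean of answered items: 3.0

# A 4-item scale with three missing items is not
print(scale_score([None, None, None, 2]))  # None
```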


2. Items have been incorrectly coded when checked against the hard copy questionnaire, e.g. an item entered as '4' when the hard copy shows '3'.

The extent of the problem depends on how prevalent it is within the dataset. If errors are rare, or if they are equally prevalent in both arms and occur in a random direction (e.g. as likely to be 3 entered as 4 as 4 entered as 3), then they are very unlikely to affect the statistical results, inferences, or conclusions.


Likely impact: if the errors are random as described above, there should arguably be no impact. If the errors are systematic in nature (e.g. always 3 entered as 4) and equally prevalent in each arm, then each arm's scores will be biased, but the difference between arms will not. The extent depends on how prevalent the problem is.
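The point about systematic errors biasing each arm but not the between-arm difference can be illustrated with a small simulation (the error rate, score distribution, and sample size below are arbitrary):

```python
import random

random.seed(0)

def add_systematic_error(scores, rate):
    """Systematic, one-direction error: recode 3 -> 4 in a fraction of entries."""
    return [4 if s == 3 and random.random() < rate else s for s in scores]

# Two arbitrary arms of 4-point item scores
arm_a = [random.choice([1, 2, 3, 4]) for _ in range(10000)]
arm_b = [random.choice([1, 2, 3, 4]) for _ in range(10000)]

true_diff = sum(arm_a) / len(arm_a) - sum(arm_b) / len(arm_b)

# Apply the same systematic error to both arms
err_a = add_systematic_error(arm_a, 0.1)
err_b = add_systematic_error(arm_b, 0.1)
obs_diff = sum(err_a) / len(err_a) - sum(err_b) / len(err_b)

# Each arm's mean shifts upward, but the between-arm difference barely moves
print(round(true_diff, 3), round(obs_diff, 3))
```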


3. The wrong study visit number has been recorded (e.g. entered month 8 when it should be month 9) when checked against the date of visit.

The extent of the problem depends on how prevalent it is within the dataset. This is potentially a more serious issue than error 2 above, unless the QOL trajectory is stable or has a very gentle slope across the time-points. If there is a peak (e.g. due to acute toxicity), then these types of errors, if common enough, could dilute that peak. The problem is not fully ameliorated even if it occurs equally in each arm, because an accurate description within each arm also matters (albeit secondary to an accurate estimate of the difference between arms, which is the primary interest).


Likely impact: The impact is hard to predict; it depends on how common the errors are, the shape of the QOL trajectory, and which time-points are incorrectly recorded.
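One practical check for this error, sketched below, is to compare the recorded visit month against the month implied by the visit date relative to baseline. The column names, dates, and the 30.44-day average month length are all assumptions for illustration:

```python
from datetime import date

baseline = date(2023, 1, 15)  # hypothetical baseline visit date

# Hypothetical records: P01's month-9 visit was entered as month 8
records = [
    {"patient": "P01", "visit_month": 8, "visit_date": date(2023, 10, 14)},
    {"patient": "P02", "visit_month": 9, "visit_date": date(2023, 10, 16)},
]

def implied_month(visit_date, baseline):
    """Months elapsed since baseline, rounded to the nearest whole month."""
    return round((visit_date - baseline).days / 30.44)

# Flag rows where the recorded visit month disagrees with the visit date
flags = [r for r in records
         if implied_month(r["visit_date"], baseline) != r["visit_month"]]
for r in flags:
    print(f'{r["patient"]}: recorded month {r["visit_month"]}, '
          f'date implies month {implied_month(r["visit_date"], baseline)}')
```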


Some strategies to prevent or minimise the impact of data-entry errors and missing data include:

  • Double data entry is a common QA approach; however, it is a large time investment and may not always be feasible.
  • Running data checks in an analysis program to flag extreme values or entries that fall outside an expected range.
  • When setting up the study database, set range limits for each field (this will prevent entry of out-of-range values, but may not catch other data-entry errors).
  • Consider electronic data capture, which can prevent missing items and reduce other data-entry errors.
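For example, a range check for the first 28 QLQ-C30 items (which take values 1–4; items 29–30 use a 1–7 scale) might look like the following sketch, with hypothetical entries:

```python
# Valid responses for QLQ-C30 items 1-28 ("Not at all" to "Very much")
VALID = set(range(1, 5))

entered = {"q1": 3, "q2": 7, "q3": None, "q4": 2}  # hypothetical entries

# Flag anything outside the expected range, including missing values
problems = {item: value for item, value in entered.items()
            if value not in VALID}
print(problems)  # {'q2': 7, 'q3': None}
```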
