Reliability

Reliability is the characteristic of a test or method that produces consistent results; in other words, a reliable instrument is unlikely to be influenced by external factors.

Validity and reliability are used to assess the rigour of research.

A study's quality relies on its ability to produce results that are easily interpreted. As such, several researchers conducting the same experiment, with the same test and the same group of participants, should obtain similar results.

How is this evaluated?

  1. Internal consistency: a measure of correlation, not causality; the extent to which all the items on a scale measure one construct, i.e. the same latent variable.

    Depending on the type of test, internal consistency may be measured with Cronbach's alpha, the average inter-item correlation, split-half reliability, or the Kuder-Richardson formula (see the sketch after this list).

    Example: Visual Analog Scales (VAS) and Likert scales.

    This VAS for pain is presented to participants without the numerical scale.
  2. Parallel forms: the correlation between two equivalent versions of a test.

    The easiest way to create a parallel form is simply to alter the order of the questions on a questionnaire, which should minimise memory, training, or habituation effects; reliability is then the correlation between scores on the two forms.

  3. Test-retest reliability: the repeatability of a test over time.

    Together, the two administrations assess the stability of the measurement (see the correlation sketch after this list).

    Although these scales require prior cultural adaptation and validation, in longitudinal studies they may be used to infer temporal trends such as the secular rise in IQ test scores (the Flynn effect).

  4. Interrater reliability: the degree of agreement between different people observing or assessing the same thing.

    It can help mitigate observer bias, and is used when data is collected by researchers assigning ratings, scores, or categories.

    In medicine, it is often used when two or more specialists provide a diagnosis based on the same material (pathology samples, tomographs, radiographs, magnetic resonance images).
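
To make these coefficients concrete, here is a minimal sketch in Python; all of the numbers are invented for illustration. Parallel forms and test-retest reliability both reduce to correlating two sets of scores, so a single correlation stands in for both, and Cohen's kappa, used below for interrater agreement, is one common choice among several (Fleiss' kappa and the intraclass correlation coefficient are others).

```python
# A minimal sketch with made-up data; none of these numbers come from a
# real study. Requires numpy, scipy, and scikit-learn.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# 1. Internal consistency: Cronbach's alpha.
#    Rows = participants, columns = items on the same scale.
items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
])
k = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # variance of each item
total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")

# 2./3. Parallel forms and test-retest reliability both reduce to the
#       correlation between two sets of scores (two forms, or two sessions).
session_1 = np.array([12, 18, 9, 15, 20])
session_2 = np.array([13, 17, 10, 14, 21])
r, _ = pearsonr(session_1, session_2)
print(f"Test-retest r: {r:.2f}")

# 4. Interrater reliability: Cohen's kappa, chance-corrected agreement
#    between two raters assigning categories to the same cases.
rater_a = ["benign", "malignant", "benign", "benign", "malignant"]
rater_b = ["benign", "malignant", "malignant", "benign", "malignant"]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```

Cronbach's alpha and Pearson's r both land near 1 when the measurement is highly consistent, while kappa corrects raw percent agreement for the agreement expected by chance alone.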

The most prevalently reported measure is the area under the receiver operating characteristic curve (AUC-ROC), so I'll talk about how to interpret it next week.
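
In the meantime, here is a minimal sketch of how an AUC-ROC is typically computed; the labels and scores below are invented, standing in for, say, ground-truth diagnoses and a rater's confidence scores.

```python
# Made-up data: 1 = disease present, 0 = absent, plus a continuous
# score such as a rater's confidence that the disease is present.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.5]
print(f"AUC-ROC: {roc_auc_score(y_true, y_score):.2f}")
```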
