© 2008 Rafael Perera, Carl Heneghan and Douglas Badenoch
Published by Blackwell Publishing
BMJ Books is an imprint of the BMJ Publishing Group Limited, used under licence
Blackwell Publishing, Inc., 350 Main Street, Malden, Massachusetts 02148-5020, USA
Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK
Blackwell Publishing Asia Pty Ltd, 550 Swanston Street, Carlton, Victoria 3053, Australia
The right of the Author to be identified as the Author of this Work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
First published 2008
1 2008
ISBN: 978-1-4051-6142-8
A catalogue record for this title is available from the British Library and the Library of Congress.
Set in Helvetica Medium 7.75/9.75 by Sparks, Oxford. Printed and bound in Singapore by Markono Print Media Pte Ltd
Commissioning Editor: Mary Banks
Development Editors: Lauren Brindley and Victoria Pittman
Production Controller: Rachel Edwards
For further information on Blackwell Publishing, visit our website:
The publisher’s policy is to use permanent paper from mills that operate a sustainable forestry policy, and which has been manufactured from pulp processed using acid-free and elementary chlorine-free practices. Furthermore, the publisher ensures that the text paper and cover board used have met acceptable environmental accreditation standards.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.
This handbook was compiled by Rafael Perera, Carl Heneghan and Douglas Badenoch. We would like to thank all those people who have had input to our work over the years, particularly Paul Glasziou and Olive Goddard from the Centre for Evidence-Based Medicine. In addition, we thank the people we work with from the Department of Primary Health Care, University of Oxford, whose work we have used to illustrate the statistical principles in this book. We would also like to thank Lara and Katie for their drawings.
Introduction
This ‘toolkit’ is the second in our series and is intended as a summary of the key concepts needed to get started with statistics in healthcare.
Often, people find statistical concepts hard to understand and apply. If this rings true with you, this book should allow you to start using such concepts with confidence for the first time. Once you have understood the principles in this book you should be at the point where you can understand and interpret statistics, and start to deploy them effectively in your own research projects.
The book is laid out in three main sections: the first deals with the basic nuts and bolts of describing, displaying and handling your data, considering which test to use and testing for statistical significance. The second section shows how statistics is used in a range of scientific papers. The final section contains the glossary, a key to the symbols used in statistics and a discussion of the software tools that can make your life using statistics easier.
Occasionally you will see the GO icon on the right. This means the difficult concept being discussed is beyond the scope of this textbook. If you need more information on this point you can either refer to the text cited or discuss the problem with a statistician.
Data: describing and displaying
The type of data we collect determines the methods we use. When we conduct research, data usually come in one of two forms: as numbers (quantitative measurements) or as categories (qualitative attributes).
So, the type of data we record influences what we can say, and how we work it out. This section looks at the different types of data collected and what they mean.
Any measurable factor, characteristic or attribute is a variable.
A variable in our data can be one of two types: categorical or numerical.
Categorical: the variables studied are grouped into categories based on qualitative traits of the data. Thus the data are labelled or sorted into categories.
A special kind of categorical variable is the binary or dichotomous variable: a variable with only two possible values (zero and one) or categories (yes or no, present or absent; e.g. death, occurrence of myocardial infarction, whether or not symptoms have improved).
Numerical: the variables studied take some numerical value based on quantitative traits of the data. Thus the data are sets of numbers.
Numerical variables can be discrete or continuous: you can think of discrete data as counts and continuous data as measurements.
Censored data – sometimes we come across data that can only be measured within certain limits: for instance, troponin levels in myocardial infarction may only be detectable above a certain level and below a fixed upper limit (0.2–180 μg/L).
Summarizing your data
It’s impossible to look at all the raw data and instantly understand it. If you’re going to interpret what your data are telling you, and communicate it to others, you will need to summarize your data in a meaningful way. Typical mathematical summaries include percentages, risks and the mean.
The benefit of mathematical summaries is that they can convey information with just a few numbers; these summaries are known as descriptive statistics.
Summaries that capture the average are known as measures of central tendency, whereas summaries that indicate the spread of the data usually around the average are known as measures of dispersion.
The arithmetic mean (numeric data)
The arithmetic mean is the sum of the data divided by the number of measurements. It is the most common measure of central tendency and represents the average value in a sample.
To calculate the mean, add up all the measurements in a group and then divide by the total number of measurements.
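As a minimal sketch, this calculation can be written in a few lines of Python (the scores below are hypothetical, not taken from the book):

```python
# Hypothetical test scores for a group of eight participants
scores = [4, 5, 5, 6, 7, 7, 7, 8]

# Arithmetic mean: the sum of the measurements divided by how many there are
mean = sum(scores) / len(scores)
print(mean)  # 6.125
```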
The geometric mean
If the data we have sampled are skewed to the right (see p. 7) then we transform the data by taking the natural logarithm (base e = 2.72) of each value in the sample. The arithmetic mean of these transformed values provides a more stable measure of location because the influence of extreme values is smaller. To obtain the average in the same units as the original data – called the geometric mean – we need to back transform the arithmetic mean of the transformed data:

geometric mean = exp((ln x₁ + ln x₂ + … + ln xₙ)/n)
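The log-transform-then-back-transform procedure can be sketched in Python; the right-skewed values below are hypothetical:

```python
import math

# Hypothetical right-skewed sample
data = [1.0, 2.0, 4.0, 8.0]

# Arithmetic mean of the log-transformed values
log_mean = sum(math.log(x) for x in data) / len(data)

# Back transform with exp to return to the original units
geometric_mean = math.exp(log_mean)
print(round(geometric_mean, 4))  # 2.8284
```

Note the geometric mean (about 2.83) sits below the arithmetic mean (3.75), because the extreme value 8.0 has less influence after the transformation.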
The weighted mean
The weighted mean is used when certain values are more important than others: they supply more information. If all weights are equal then the weighted mean is the same as the arithmetic mean (see p. 54 for more).
We attach a weight (wᵢ) to each of our observations (xᵢ):

weighted mean = (w₁x₁ + w₂x₂ + … + wₙxₙ)/(w₁ + w₂ + … + wₙ)
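A small Python sketch with hypothetical observations and weights shows the calculation, and that equal weights reduce it to the arithmetic mean:

```python
# Hypothetical observations x_i with weights w_i
x = [5.0, 7.0, 9.0]
w = [1.0, 2.0, 1.0]  # the middle observation supplies twice the information

weighted_mean = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
print(weighted_mean)  # 7.0

# With equal weights the result is just the arithmetic mean
arithmetic_mean = sum(x) / len(x)
print(arithmetic_mean)  # 7.0
```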
The median and mode
The easiest way to find the median and the mode is to sort the scores in order, from the smallest to the largest:
The median is the value at the midpoint, such that half the values are smaller than the median and half are greater than the median. The mode is the value that appears most frequently in the group. For these test scores the mode is 7. If all values occur with the same frequency then there is no mode. If more than one value occurs with the highest frequency then each of these values is the mode. Data with two modes are known as bimodal.
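Python's statistics module finds both values directly; the sorted scores below are hypothetical, chosen so that the mode is 7 as in the text:

```python
from statistics import median, mode

# Hypothetical test scores, sorted from smallest to largest
scores = [3, 5, 6, 7, 7, 7, 8, 9, 10]

print(median(scores))  # 7  (the middle of the nine sorted values)
print(mode(scores))    # 7  (7 appears three times, more than any other score)
```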
Choosing which one to use: (arithmetic) mean, median or mode?
The following graph shows the mean, median and mode of the test scores. The x-axis shows the scores out of ten. The height of each bar (y-axis) shows the number of participants who achieved that score.
This graph illustrates why the mean, median and mode are all referred to as measures of central tendency. The data values are spread out across the horizontal axis, whilst the mean, median and mode are all clustered towards the centre of the graph.
Of the three measures the mean is the most sensitive measurement, because its value always reflects the contributions of each data value in the group. The median and the mode are less sensitive to outlying data at the extremes of a group. Sometimes it is an advantage to have a measure of central tendency that is less sensitive to changes in the extremes of the data.
For this reason, it is important not to rely solely on the mean. By taking into account the frequency distribution and the median, we can obtain a better understanding of the data, and whether the mean actually depicts the average value. For instance, if there is a big difference between the mean and the median then we know there are some extreme measures (outliers) affecting the mean value.
The shape of the data is approximately the same on both the left-hand and right-hand sides of the graph (symmetrical data). Therefore use the mean (5.9) as the measure of central tendency.
The data are now nonsymmetrical, i.e. the peak is to the right. We call these negatively skewed data and the median (9) is a better measurement of central tendency.
The data are now bimodal, i.e. they have two peaks. In this case there may be two different populations, each with its own central tendency. One mean score is 2.2 and the other is 7.5.
Sometimes there is no central tendency to the data; there are a number of peaks. This could occur when the data have a ‘uniform distribution’, which means that all possible values are equally likely. In such cases a central tendency measure is not particularly useful.
Measures of dispersion: the range
To provide a meaningful summary of the data we need to describe the average or central tendency of our data as well as the spread of the data.
For example, two classes might have the same mean test score while class 2's scores are more scattered; looking at the spread of the data tells us whether the values are close to the mean or far away.
The range is the difference between the largest and the smallest value in the data.
We will look at four ways of understanding how much the individual values vary from one to another: variance, standard deviation, percentiles and standard error of the mean.
The variance
The variance is a measure of how far each value deviates from the arithmetic mean. We cannot simply use the mean of these deviations, as the negatives would cancel out the positives; to overcome this problem we square each deviation and then find the mean of the squared deviations.
σ2 = population variance
s2 = sample variance
To calculate the (sample) variance:
1. Subtract the mean from each value in the data.
2. Square each of these distances and add all of the squares together.
3. Divide the sum of the squares by the number of values in the data minus 1.
Note we have divided by n – 1 instead of n. This is because we nearly always rely on sample data and it can be shown that a better estimate of the population variance is obtained if we divide by n – 1 instead of n.
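The three steps above can be sketched directly in Python (the data are hypothetical):

```python
# Hypothetical sample of test scores
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: subtract the mean from each value in the data
mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# Step 2: square each of these distances and add the squares together
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by the number of values minus 1
sample_variance = sum_of_squares / (len(data) - 1)
print(round(sample_variance, 3))  # 4.571
```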
The standard deviation
The standard deviation is the square root of the variance:

s = √( Σ(xᵢ − x̄)²/(n − 1) )
The standard deviation can be thought of as the average deviation of the values from the mean, and is expressed in the same units as the original data.
Therefore, in class 1 the mean is 5.4 and the standard deviation is 0.93. This is often written as 5.4 ± 0.93, describing a range of values of one SD around the mean.
Assuming the data are from a normal distribution then this range of values one SD away from the mean includes 68.2% of the possible measures, two SDs includes 95.4% and three SDs includes 99.7%.
Dividing the standard deviation by the mean gives us the coefficient of variation. This can be used to express the degree to which a set of data points varies and can be used to compare variance between populations.
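Python's statistics module computes the sample standard deviation (dividing by n − 1) directly, from which the coefficient of variation follows; the data below are hypothetical:

```python
from statistics import mean, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]

sd = stdev(data)      # sample standard deviation (square root of the variance)
cv = sd / mean(data)  # coefficient of variation

print(round(sd, 2))   # 2.14
print(round(cv, 2))   # 0.43
```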
Percentiles
Percentiles provide an estimate of the proportion of data that lies above and below a given value. Thus the first percentile cuts off the lowest 1% of data, the second percentile cuts off the lowest 2% of data and so on. The 25th percentile is also called the first quartile and the 50th percentile is the median (or second quartile).
Percentiles are helpful because we can obtain a measure of spread that is not influenced by outliers. Often data are presented with the interquartile range: between the 25th and 75th percentiles (first and third quartiles).
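The quartiles and the interquartile range can be obtained with statistics.quantiles (Python 3.8+); the values below are hypothetical:

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# n=4 splits the data at the quartiles (default 'exclusive' method)
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

print(q1, q2, q3)  # 3.25 6.5 9.75
print(iqr)         # 6.5
```

Note that q2, the 50th percentile, is simply the median.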
Standard error of the mean
The standard error of the mean (SEM) is the standard deviation of a hypothetical sample of means. The SEM quantifies how accurately the true population mean is known:

SEM = s/√n

where s is the standard deviation of the observations in the sample and n is the sample size.
The smaller the variability (s) and/or the larger the sample, the smaller the SEM will be; a smaller SEM means the estimate of the population mean is more precise.
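The calculation is a one-liner in Python given the sample standard deviation (hypothetical data below):

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample of observations
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# SEM = sample standard deviation divided by the square root of the sample size
sem = stdev(sample) / sqrt(len(sample))
print(round(sem, 3))  # 0.756
```

Quadrupling the sample size would halve the SEM, since n appears under a square root.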