
Business Statistics For Dummies®

Visit www.dummies.com/cheatsheet/businessstatistics to view this book's cheat sheet.

Table of Contents

Introduction

About This Book

Foolish Assumptions

Icons Used in This Book

Beyond the Book

Where to Go from Here

Part I: Getting Started with Business Statistics

Chapter 1: The Art and Science of Business Statistics

Representing the Key Properties of Data

Analyzing data with graphs

Defining properties and relationships with numerical measures

Probability: The Foundation of All Statistical Analysis

Random variables

Probability distributions

Using Sampling Techniques and Sampling Distributions

Statistical Inference: Drawing Conclusions from Data

Confidence intervals

Hypothesis testing

Simple regression analysis

Multiple regression analysis

Forecasting techniques

Chapter 2: Pictures Tell the Story: Graphical Representations of Data

Analyzing the Distribution of Data by Class or Category

Frequency distributions for quantitative data

Frequency distribution for qualitative values

Cumulative frequency distributions

Histograms: Getting a Picture of Frequency Distributions

Checking Out Other Useful Graphs

Line graphs: Showing the values of a data series

Pie charts: Showing the composition of a data set

Scatter plots: Showing the relationship between two variables

Chapter 3: Finding a Happy Medium: Identifying the Center of a Data Set

Looking at Methods for Finding the Mean

Arithmetic mean

Geometric mean

Weighted mean

Getting to the Middle of Things: The Median of a Data Set

Comparing the Mean and Median

Determining the relationship between mean and median

Acknowledging the relative advantages and disadvantages of the mean and median

Discovering the Mode: The Most Frequently Repeated Element

Chapter 4: Searching High and Low: Measuring Variation in a Data Set

Determining Variance and Standard Deviation

Finding the sample variance

Finding the sample standard deviation

Calculating population variance and standard deviation

Finding the Relative Position of Data

Percentiles: Dividing everything into hundredths

Quartiles: Dividing everything into fourths

Interquartile range: Identifying the middle 50 percent

Measuring Relative Variation

Coefficient of variation: The spread of a data set relative to the mean

Comparing the relative risks of two portfolios

Chapter 5: Measuring How Data Sets Are Related to Each Other

Understanding Covariance and Correlation

Sample covariance and correlation

Population covariance and correlation coefficient

Comparing correlation and covariance

Interpreting the Correlation Coefficient

Showing the relationship between two variables

Application: Correlation and the benefits of diversification

Part II: Probability Theory and Probability Distributions

Chapter 6: Probability Theory: Measuring the Likelihood of Events

Working with Sets

Membership

Subset

Union

Intersection

Complement

Betting on Uncertain Outcomes

The sample space: Everything that can happen

Event: One possible outcome

Computing probabilities of events

Looking at Types of Probabilities

Unconditional (marginal) probabilities: When events are independent

Joint probabilities: When two things happen at once

Conditional probabilities: When one event depends on another

Determining independence of events

Following the Rules: Computing Probabilities

Addition rule

Complement rule

Multiplication rule

Chapter 7: Probability Distributions and Random Variables

Defining the Role of the Random Variable

Assigning Probabilities to a Random Variable

Calculating the probability distribution

Visualizing probability distribution with a histogram

Characterizing a Probability Distribution with Moments

Understanding the summation operator (Σ)

Expected value

Variance and standard deviation

Chapter 8: The Binomial, Geometric, and Poisson Distributions

Looking at Two Possibilities with the Binomial Distribution

Checking out the binomial distribution

Computing binomial probabilities

Moments of the binomial distribution

Graphing the binomial distribution

Determining the Probability of the Outcome That Occurs First: Geometric Distribution

Computing geometric probabilities

Moments of the geometric distribution

Graphing the geometric distribution

Keeping the Time: The Poisson Distribution

Computing Poisson probabilities

Graphing the Poisson distribution

Chapter 9: The Uniform and Normal Distributions: So Many Possibilities!

Comparing Discrete and Continuous Distributions

Working with the Uniform Distribution

Graphing the uniform distribution

Discovering moments of the uniform distribution

Computing uniform probabilities

Understanding the Normal Distribution

Graphing the normal distribution

Getting to know the standard normal distribution

Computing standard normal probabilities

Computing normal probabilities other than standard normal

Chapter 10: Sampling Techniques and Distributions

Sampling Techniques: Choosing Data from a Population

Probability sampling

Nonprobability sampling

Sampling Distributions

Portraying sampling distributions graphically

Moments of a sampling distribution

The Central Limit Theorem

Converting the sample mean to a standard normal random variable

Part III: Drawing Conclusions from Samples

Chapter 11: Confidence Intervals and the Student’s t-Distribution

Almost Normal: The Student’s t-Distribution

Properties of the t-distribution

Graphing the t-distribution

Probabilities and the t-table

Point estimates vs. interval estimates

Estimating confidence intervals for the population mean

Chapter 12: Testing Hypotheses about the Population Mean

Applying the Key Steps in Hypothesis Testing for a Single Population Mean

Writing the null hypothesis

Coming up with an alternative hypothesis

Choosing a level of significance

Computing the test statistic

Comparing the critical value(s)

Using the decision rule

Testing Hypotheses about Two Population Means

Writing the null hypothesis for two population means

Defining the alternative hypotheses for two population means

Determining the test statistics for two population means

Working with dependent samples

Chapter 13: Testing Hypotheses about Multiple Population Means

Getting to Know the F-Distribution

Defining an F random variable

Measuring the moments of the F-distribution

Using ANOVA to Test Hypotheses

Writing the null and alternative hypotheses

Choosing the level of significance

Computing the test statistic

Finding the critical values using the F-table

Coming to the decision

Using a spreadsheet

Chapter 14: Testing Hypotheses about the Population Variance

Staying Positive with the Chi-Square Distribution

Representing the chi-square distribution graphically

Defining a chi-square random variable

Checking out the moments of the chi-square distribution

Testing Hypotheses about the Population Variance

Defining what you assume to be true: The null hypothesis

Stating the alternative hypothesis

Choosing the level of significance

Calculating the test statistic

Determining the critical value(s)

Practicing the Goodness of Fit Tests

Comparing a population to the Poisson distribution

Comparing a population to the normal distribution

Testing Hypotheses about the Equality of Two Population Variances

The null hypothesis: Equal variances

The alternative hypothesis: Unequal variances

The test statistic

The critical value(s)

The decision about the equality of two population variances

Part IV: More Advanced Techniques: Regression Analysis and Forecasting

Chapter 15: Simple Regression Analysis

The Fundamental Assumption: Variables Have a Linear Relationship

Defining a linear relationship

Using scatter plots to identify linear relationships

Defining the Population Regression Equation

Estimating the Population Regression Equation

Testing the Estimated Regression Equation

Using the coefficient of determination (R²)

Computing the coefficient of determination

The t-test

Using Statistical Software

Assumptions of Simple Linear Regression

Chapter 16: Multiple Regression Analysis: Two or More Independent Variables

The Fundamental Assumption: Variables Have a Linear Relationship

Estimating a Multiple Regression Equation

Predicting the value of Y

The adjusted coefficient of determination

The F-test: Testing the joint significance of the independent variables

The t-test: Determining the significance of the slope coefficients

Checking for Multicollinearity

Chapter 17: Forecasting Techniques: Looking into the Future

Defining a Time Series

Modeling a Time Series with Regression Analysis

Classifying trends

Estimating the trend

Forecasting a Time Series

Changing with the Seasons: Seasonal Variation

Implementing Smoothing Techniques

Moving averages

Centered moving averages

Exploring Exponential Smoothing

Forecasting with exponential smoothing

Comparing the Forecasts of Different Models

Part V: The Part of Tens

Chapter 18: Ten Common Errors That Arise in Statistical Analysis

Designing Misleading Graphs

Drawing the Wrong Conclusion from a Confidence Interval

Misinterpreting the Results of a Hypothesis Test

Placing Too Much Confidence in the Coefficient of Determination (R²)

Assuming Normality

Thinking Correlation Implies Causality

Drawing Conclusions from a Regression Equation When the Data Do Not Follow the Assumptions

Including Correlated Variables in a Multiple Regression Equation

Placing Too Much Confidence in Forecasts

Using the Wrong Distribution

Chapter 19: Ten Key Categories of Formulas for Business Statistics

Summary Measures of a Population or a Sample

Probability

Discrete Probability Distributions

Continuous Probability Distributions

Sampling Distributions

Confidence Intervals for the Population Mean

Testing Hypotheses about Population Means

Testing Hypotheses about Population Variances

Using Regression Analysis

Forecasting Techniques

About the Author

Cheat Sheet

Connect with Dummies

Introduction

Have you always been scared to death of statistics? You and just about everyone else! The equations are extremely intimidating, and the terminology sounds so . . . boring.

Why, then, is statistics so important? All business disciplines can be analyzed with statistical principles. Statistics makes it possible to analyze real-world problems with actual data, so that we can determine whether a marketing strategy is really working, how much a company should charge for its products, or any of a million other practical questions.

Without a formal framework for analyzing these types of situations, it would be impossible to have any confidence in our results. This is where the science of statistics comes in. Far from being an overbearing collection of equations, it is a logical framework for analyzing practical business problems with real-world data.

This book is designed to show you how to apply statistics to practical situations in a step-by-step manner, so that by the time you’re done, you’ll know as much about statistics as people with far more education in this area!

About This Book

All business degrees require at least some statistics courses, and there’s a good reason for that! All business disciplines are empirical by nature, meaning that they need to analyze actual data to be successful. The purpose of this book is to:

- Give you the principles on which statistical analysis is based

- Provide you with many worked-out examples of these principles so that you can master them

- Improve your understanding of the circumstances in which each statistical technique should be used

As a For Dummies title, this book is organized into modules; you can skip around and learn about various statistical techniques in the order that suits you. In cases where the contents of a chapter are based on previous readings, you are guided back to the original material. Along the way, there are many helpful tips and reminders so that you get the most out of each chapter. I explain each equation in great detail, and all key terms are explained in depth. You will also find a summary of key formulas at the back of the book along with important statistical tables.

This book can’t make you an expert in statistics, but provides you with a way of improving your knowledge very quickly so that you can use statistics in practical settings right away.

Foolish Assumptions

I am willing to make the following assumptions about you as the reader of this book:

- You need to use the techniques in this book in a practical setting and have little or no previous experience with statistics.

OR

- You're a student who feels overwhelmed by a traditional statistics course and needs more background. You can benefit from seeing more examples of the material; statistics is a science that can be learned through practice!

OR

- You're simply interested in improving your knowledge of this field.

In all of these cases, you’re extremely well motivated and can put as much effort into learning statistics as you need. Congratulations! Your reward for reading this book will be a greater understanding of business statistics.

Icons Used in This Book

The following icons are designed to help you use this book quickly and easily. Be sure to keep an eye out for them.

Remember: The Remember icon points to information that's especially important to remember for exam purposes.

Tip: The Tip icon presents information like a memory acronym or some other aid to understanding or remembering material.

Warning: When you see this icon, pay special attention. The information that follows may be somewhat difficult or confusing, and mistakes here can be costly.

Technical Stuff: The Technical Stuff icon indicates detailed information that not every reader will need to read or understand.

Beyond the Book

In addition to the informative, clever, and (if I may say so) well-written material you're reading right now, this product also comes with some access-anywhere goodies on the web. No matter how well you know statistics by the end of this book, a little extra information is always helpful. Check out the free Cheat Sheet at www.dummies.com/cheatsheet/businessstatistics to learn more about describing populations and samples, random variables, probability distributions, hypothesis testing, and more.

Where to Go from Here

When you’ve become more adept at statistical analysis, you may want to learn the capabilities of a spreadsheet program such as Excel. You may also want to tackle a full-blown statistical package, such as SPSS or SAS. These will eliminate a great deal of the computational burden, freeing you to concentrate on the analysis of the results.

You may also be interested in obtaining further education in this area. For example, you may want to pursue a graduate degree, such as an MBA (master of business administration). This is an extremely important credential that will open a large number of doors in the business world. You'll need your statistical skills in order to earn this degree, since statistics is used heavily throughout the curriculum.

If you’re not ready for graduate school, you may simply want to explore some college-level statistics courses at your local university. The most important thing is to continue using your statistical skills, as you’ll only become adept at using them through constant practice.

Part I

Getting Started with Business Statistics


Visit www.dummies.com for great Dummies content online.

In this part…

- Use histograms to provide a visual of the distribution of elements in a data set. A histogram can show which values occur most frequently, the smallest and largest values, and how spread out the values are.

- Create graphs that reflect non-numerical data, such as colors, flavors, brand names, and so on. These graphs are useful where numerical measures are difficult or impossible to compute.

- Identify the center of a data set by using the mean (the average), median (the middle), and mode (the most commonly occurring value). These are known as measures of central tendency.

- Use formulas for computing covariance and correlation for both samples and populations; a scatter plot is used to show the relationship (if there is one) between two variables.

Chapter 1

The Art and Science of Business Statistics

In This Chapter

- Looking at the key properties of data

- Understanding probability's role in business

- Working with sampling techniques and sampling distributions

- Drawing conclusions based on results

This chapter provides a brief introduction to the concepts that are covered throughout the book. I introduce several important techniques that allow you to measure and analyze the statistical properties of real-world variables, such as stock prices, interest rates, corporate profits, and so on.

Statistical analysis is widely used in all business disciplines. For example, marketing researchers analyze consumer spending patterns in order to properly plan new advertising campaigns. Management consultants use statistical measures to determine how efficiently an organization's resources are being used. Manufacturers use quality-control methods to ensure the consistency of the products they're producing. These types of business applications and many others are heavily based on statistical analysis.

Financial institutions use statistics for a wide variety of applications. For example, a pension fund may use statistics to identify the types of securities that it should hold in its investment portfolio. A hedge fund may use statistics to identify profitable trading opportunities. An investment bank may forecast the future state of the economy in order to determine which new assets it should hold in its own portfolio.

Although statistics is a quantitative discipline, the ultimate objective of statistical analysis is to explain real-world events. This means that in addition to the rigorous application of statistical methods, there is always a great deal of room for judgment. As a result, you can think of statistical analysis as both a science and an art; the art comes from choosing the appropriate statistical technique for a given situation and correctly interpreting the results.

Representing the Key Properties of Data

The word data refers to a collection of quantitative (numerical) or qualitative (non-numerical) values. Quantitative data may consist of prices, profits, sales, or any variable that can be measured on a numerical scale. Qualitative data may consist of colors, brand names, geographic locations, and so on. Most of the data encountered in business applications are quantitative.

Technical Stuff: The word data is actually the plural of datum; datum refers to a single value, while data refers to a collection of values.

You can analyze data with graphical techniques or numerical measures. I explore both options in the following sections.

Analyzing data with graphs

Graphs are a visual representation of a data set, making it easy to see patterns and other details. Deciding which type of graph to use depends on the type of data you’re trying to analyze. Here are some of the more common types of graphs used in business statistics:

- Histograms: A histogram shows the distribution of data among different intervals or categories, using a series of vertical bars.

- Line graphs: A line graph shows how a variable changes over time.

- Pie charts: A pie chart shows how data is distributed between different categories, illustrated as a series of slices taken from a pie.

- Scatter plots (scatter diagrams): A scatter plot shows the relationship between two variables as a series of points. The pattern of the points indicates how closely related the two variables are.

Histograms

You can use a histogram with either quantitative or qualitative data. It’s designed to show how a variable is distributed among different categories. For example, suppose that a marketing firm surveys 100 consumers to determine their favorite color. The responses are

Red: 23
Blue: 44
Yellow: 12
Green: 21

The results can be illustrated with a histogram, with each color in a single category. The heights of the bars indicate the number of responses for each color, making it easy to see which colors are the most popular (see Figure 1-1).

Figure 1-1: A histogram for preferred colors.

Based on the histogram, you can see at a glance that blue is the most popular choice, while yellow is the least popular choice.

Line graphs

You can use a line graph with quantitative data. It shows the values of a variable over a given interval of time. For example, Figure 1-2 shows the daily price of gold between April 14, 2013, and June 2, 2013.

Figure 1-2: A line graph of gold prices.

With a line graph, it’s easy to see trends or patterns in a data set. In this example, the price of gold rose steadily throughout late April into mid-May before falling back in late May and then recovering somewhat at the end of the month. These types of graphs may be used by investors to identify which assets are likely to rise in the future based on their past performance.

Pie charts

Use a pie chart with quantitative or qualitative data to show the distribution of the data among different categories. For example, suppose that a chain of coffee shops wants to analyze its sales by coffee style. The styles that the chain sells are French Roast, Breakfast Blend, Brazilian Rainforest, Jamaica Blue Mountain, and Espresso. Figure 1-3 shows the proportion of sales for each style.

Figure 1-3: A pie chart for coffee sales.

The chart shows that Espresso is the chain’s best-selling style, while Jamaica Blue Mountain accounts for the smallest percentage of the chain’s sales.

Scatter plots

A scatter plot is designed to show the relationship between two quantitative variables. For example, Figure 1-4 shows the relationship between a corporation’s sales and profits over the past 20 years.

Each point on the scatter plot represents profit and sales for a single year. The pattern of the points shows that higher levels of sales tend to be matched by higher levels of profits, and vice versa. This is called a positive trend in the data.

Figure 1-4: A scatter plot showing sales and profits.

Defining properties and relationships with numerical measures

A numerical measure is a value that describes a key property of a data set. For example, to determine whether the residents of one city tend to be older than the residents in another city, you can compute and compare the average or mean age of the residents of each city.

Some of the most important properties of interest in a data set are the center of the data and the spread among the observations. I describe these properties in the following sections.

Finding the center of the data

To identify the center of a data set, you use measures that are known as measures of central tendency; the most important of these are the mean, median, and mode.

The mean represents the average value in a data set, while the median represents the midpoint. The median is a value that separates the data into two equal halves; half of the elements in the data set are less than the median, and the remaining half are greater than the median. The mode is the most commonly occurring value in the data set.

The mean is the most widely used measure of central tendency, but it can give deceptive results if the data contain any unusually large or small values, known as outliers. In this case, the median provides a more representative measure of the center of the data. For example, median household income is usually reported by government agencies instead of mean household income. This is because mean household income is inflated by the presence of a small number of extremely wealthy households. As a result, median household income is thought to be a better measure of how standards of living are changing over time.

The mode can be used for either quantitative or qualitative data. For example, it could be used to determine the most common number of years of education among the employees of a firm. It could also be used to determine the most popular flavor sold by a soft drink manufacturer.
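To see how these three measures are computed, here's a minimal sketch in Python using the standard library's statistics module; the data set is invented for illustration:

    import statistics

    # Hypothetical data: years of education for ten employees
    years = [12, 16, 16, 18, 12, 16, 14, 20, 16, 13]

    print(statistics.mean(years))    # arithmetic mean: 15.3
    print(statistics.median(years))  # middle value of the sorted data: 16.0
    print(statistics.mode(years))    # most frequently occurring value: 16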

Measuring the spread of the data

Measures of dispersion identify how spread out a data set is, relative to the center. This provides a way of determining if the members of a data set tend to be very close to each other or if they tend to be widely scattered. Some of the most important measures of dispersion are

- Variance

- Standard deviation

- Percentiles

- Quartiles

- Interquartile range (IQR)

The variance is a measure of the average squared difference between the elements of a data set and the mean. The larger the variance, the more “spread out” the data is. Variance is often used as a measure of risk in business applications; for example, it can be used to show how much uncertainty there is over the returns on a stock.

The standard deviation is the square root of the variance, and is more commonly used than the variance (since the variance is expressed in squared units). For example, the variance of a series of gas prices is measured in squared dollars, which is difficult to interpret. The corresponding standard deviation is measured in dollars, which is much more intuitively clear.

Percentiles divide a data set into 100 equal parts, each consisting of 1 percent of the total. For example, if a student’s score on a standardized exam is in the 80th percentile, then the student outscored 80 percent of the other students who took the exam. A quartile is a special type of percentile; it divides a data set into four equal parts, each consisting of 25 percent of the total. The first quartile is the 25th percentile of a data set, the second quartile is the 50th percentile, and the third quartile is the 75th percentile. The interquartile range identifies the middle 50 percent of the observations in a data set; it equals the difference between the third and the first quartiles.
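As a rough sketch of how these dispersion measures are computed, the following snippet uses the standard library's statistics module (quantiles requires Python 3.8 or later); the returns data are invented:

    import statistics

    # Hypothetical monthly returns on a stock
    returns = [0.02, 0.05, -0.01, 0.03, 0.07, 0.01, 0.04, 0.00]

    print(statistics.variance(returns))  # sample variance (divides by n - 1)
    print(statistics.stdev(returns))     # sample standard deviation, in the same units as the data

    # Quartiles: three cut points that divide the data into four equal parts
    q1, q2, q3 = statistics.quantiles(returns, n=4)
    print(q3 - q1)                       # interquartile range: spread of the middle 50 percent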

Determining the relationship between two variables

For some applications, you need to understand the relationship between two variables. For example, if an investor wants to understand the risk of a portfolio of stocks, it’s essential to properly measure how closely the returns on the stocks track each other. You can determine the relationship between two variables with two measures of association: covariance and correlation.

Covariance is used to measure the tendency for two variables to rise above their means or fall below their means at the same time. For example, suppose that a bioengineering company finds that increasing research and development expenditures typically leads to an increase in the development of new patents. In this case, R&D spending and new patents would have a positive covariance. If the same company finds that rising labor costs typically reduce corporate profits, then labor costs and profits would have a negative covariance. If the company finds that profits are not related to the average daily temperature, then these two variables will have a covariance that is very close to zero.

Correlation is a closely related measure. It's defined as a value between –1 and 1, so interpreting the correlation is easier than interpreting the covariance. For example, a correlation of 0.9 between two variables would indicate a very strong positive relationship, whereas a correlation of 0.2 would indicate a fairly weak but positive relationship. A correlation of –0.8 would indicate a very strong negative relationship; a correlation of –0.3 would indicate a weak negative relationship. A correlation of 0 indicates that the two variables have no linear relationship (although zero correlation alone doesn't guarantee that they're fully independent).
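Here's a small illustration of both measures; statistics.covariance and statistics.correlation require Python 3.10 or later, and the paired data are invented:

    import statistics

    # Hypothetical paired observations: R&D spending (millions of dollars)
    # and new patents granted, over six years
    rd_spending = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
    new_patents = [3, 4, 4, 6, 7, 8]

    print(statistics.covariance(rd_spending, new_patents))   # positive: the two rise together
    print(statistics.correlation(rd_spending, new_patents))  # about 0.98: a very strong positive relationship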

Probability: The Foundation of All Statistical Analysis

Probability theory provides a mathematical framework for measuring uncertainty. This area is important for business applications since all results from the field of statistics are ultimately based on probability theory. Understanding probability theory provides fundamental insights into all the statistical methods used in this book.

Probability is heavily based on the notion of sets. A set is a collection of objects. These objects may be numbers, colors, flavors, and so on. This chapter focuses on sets of numbers that may represent prices, rates of return, and so forth. Several mathematical operations may be applied to sets — union, intersection, and complement, for example.

The union of two sets is a new set that contains all the elements in the original two sets. The intersection of two sets is a set that contains only the elements contained in both of the two original sets (if any). The complement of a set is a set containing elements that are not in the original set. For example, the complement of the set of black cards in a standard deck is the set containing all red cards.

Probability theory is based on a model of how random outcomes are generated, known as a random experiment. Outcomes are generated in such a way that all possible outcomes are known in advance, but the actual outcome isn’t known.

The following rules help you determine the probability of specific outcomes occurring:

- The addition rule

- The multiplication rule

- The complement rule

You use the addition rule to determine the probability of a union of two sets. The multiplication rule is used to determine the probability of an intersection of two sets. The complement rule is used to identify the probability that the outcome of a random experiment will not be an element in a specified set.
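As a concrete illustration of these rules, consider one roll of a fair die; the sketch below uses Python's fractions module to keep the probabilities exact:

    from fractions import Fraction

    sample_space = {1, 2, 3, 4, 5, 6}   # all equally likely outcomes
    A = {2, 4, 6}                        # event: the roll is even
    B = {4, 5, 6}                        # event: the roll is greater than 3

    def prob(event):
        return Fraction(len(event), len(sample_space))

    # Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
    print(prob(A | B))                         # 2/3
    print(prob(A) + prob(B) - prob(A & B))     # also 2/3

    # Multiplication rule: for independent events, P(A and B) = P(A) * P(B);
    # here A and B aren't independent, since P(A and B) = 1/3 but P(A) * P(B) = 1/4
    print(prob(A & B))                         # 1/3

    # Complement rule: P(not A) = 1 - P(A)
    print(1 - prob(A))                         # 1/2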

Random variables

A random variable assigns numerical values to the outcomes of a random experiment. For example, when you flip a coin twice, you’re performing a random experiment, since:

- All possible outcomes are known in advance.

- The actual outcome isn't known in advance.

The experiment consists of two trials. On each trial, the outcome must be a “head” or a “tail.”

Assume that a random variable X is defined as the number of “heads” that turn up during the course of this experiment. X assigns values to the outcomes of this experiment as follows:

Outcome      X
{TT}         0
{HT, TH}     1
{HH}         2

T represents a tail on a single flip

H represents a head on a single flip

TT represents two consecutive tails

HT represents a head followed by a tail

TH represents a tail followed by a head

HH represents two consecutive heads

X assigns a value of 0 to the outcome TT because no heads turned up. X assigns a value of 1 to both HT and TH because one head turned up in each case. Similarly, X assigns a value of 2 to HH because two heads turned up.

Probability distributions

A probability distribution is a formula or a table used to assign probabilities to each possible value of a random variable X. A probability distribution may be discrete, which means that X can assume one of a finite (countable) number of values, or continuous, in which case X can assume one of an infinite (uncountable) number of different values.

For the coin-flipping experiment from the previous section, the probability distribution of X could be a simple table that shows the probability of each possible value of X, written as P(X):

X    P(X)
0    0.25
1    0.50
2    0.25

The probability that X = 0 (that no heads turn up) equals 0.25 because this experiment has four equally likely outcomes (HH, HT, TH, and TT), and only one of them, TT, includes no heads. You compute the other probabilities in a similar manner.
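You can verify this distribution by brute force, enumerating every outcome of the experiment; here's a short sketch in Python:

    from itertools import product
    from collections import Counter
    from fractions import Fraction

    # The four equally likely outcomes of two coin flips: HH, HT, TH, TT
    outcomes = list(product("HT", repeat=2))

    # X = the number of heads in each outcome
    counts = Counter(outcome.count("H") for outcome in outcomes)

    for x in sorted(counts):
        print(x, Fraction(counts[x], len(outcomes)))   # 0: 1/4, 1: 1/2, 2: 1/4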

Discrete probability distributions

Several specialized discrete probability distributions are useful for specific applications. For business applications, three frequently used discrete distributions are:

- Binomial

- Geometric

- Poisson

You use the binomial distribution to compute probabilities for a process where only one of two possible outcomes may occur on each trial. The geometric distribution is related to the binomial distribution; you use the geometric distribution to determine the probability that a specified number of trials will take place before the first success occurs. You can use the Poisson distribution to measure the probability that a given number of events will occur during a given time frame.
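The probability formulas for all three distributions are short enough to write out directly; here's a sketch using only Python's math module, with illustrative numbers:

    from math import comb, exp, factorial

    def binomial_pmf(k, n, p):
        # Probability of exactly k successes in n trials, each with success probability p
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def geometric_pmf(k, p):
        # Probability that the first success occurs on trial k (k = 1, 2, ...)
        return (1 - p)**(k - 1) * p

    def poisson_pmf(k, lam):
        # Probability of exactly k events, given an average of lam events per interval
        return exp(-lam) * lam**k / factorial(k)

    print(binomial_pmf(2, 4, 0.5))   # 0.375: exactly two heads in four fair coin flips
    print(geometric_pmf(3, 0.5))     # 0.125: the first head appears on the third flip
    print(poisson_pmf(0, 2.0))       # about 0.135: no events when the average is 2 per interval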

Continuous probability distributions

Many continuous distributions may be used for business applications; two of the most widely used are:

- Uniform

- Normal

The uniform distribution is useful because it represents variables that are evenly distributed over a given interval. For example, if the length of time until the next defective part arrives on an assembly line is equally likely to be any value between one and ten minutes, then you may use the uniform distribution to compute probabilities for the time until the next defective part arrives.
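For the defective-part example, uniform probabilities reduce to a ratio of interval lengths; a quick sketch:

    def uniform_prob(a, b, low=1.0, high=10.0):
        # For a uniform distribution on [low, high], the probability that the
        # variable falls between a and b is the ratio of the interval lengths
        return (b - a) / (high - low)

    # Probability the next defective part arrives between 2 and 5 minutes from now
    print(uniform_prob(2.0, 5.0))   # 3/9, or about 0.333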

The normal distribution is useful for a wide array of applications in many disciplines. In business applications, variables such as stock returns are often assumed to follow the normal distribution. The normal distribution is characterized by a bell-shaped curve, and areas under this curve represent probabilities. The bell-shaped curve is shown in Figure 1-5.

9781118630693-fg0105.eps

Illustration by Wiley, Composition Services Graphics

Figure 1-5: The bell-shaped curve of the normal distribution.

The normal distribution has many convenient statistical properties that make it a popular choice for statistical modeling. One of these properties is known as symmetry, the idea that the probabilities of values below the mean are matched by the probabilities of values that are equally far above the mean.
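Because areas under the curve represent probabilities, you can compute standard normal probabilities directly from the math module's error function; a brief sketch:

    from math import erf, sqrt

    def standard_normal_cdf(z):
        # Area under the standard normal curve to the left of z
        return 0.5 * (1 + erf(z / sqrt(2)))

    # Probability of falling within one standard deviation of the mean (about 0.683)
    print(standard_normal_cdf(1) - standard_normal_cdf(-1))

    # Symmetry: the left-tail area at -1.96 matches the right-tail area at +1.96
    print(standard_normal_cdf(-1.96))      # about 0.025
    print(1 - standard_normal_cdf(1.96))   # about 0.025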

Using Sampling Techniques and Sampling Distributions

Sampling is a branch of statistics in which the properties of a population are estimated from samples. A population is a collection of data that someone has an interest in studying. A sample is a selection of data randomly chosen from a population.

For example, if a university is interested in analyzing the distribution of grade point averages (GPAs) among its MBA students, the population of interest would be the GPAs of every MBA student at the university; a sample would consist of the GPAs of a set of randomly chosen MBA students.

Several approaches can be used for choosing samples; they fall into two broad categories, probability sampling and nonprobability sampling.

A statistic is a summary measure of a sample, while a parameter is a summary measure of a population. The properties of a statistic can be determined with a sampling distribution — a special type of probability distribution that describes the properties of a statistic.

The central limit theorem (CLT) gives the conditions under which the mean of a sample follows the normal distribution, at least approximately. Either of the following is sufficient:

- The underlying population is normally distributed.

- The sample size is "large" (a common rule of thumb is at least 30).

A detailed discussion of the central limit theorem can be found at http://en.wikipedia.org/wiki/Central_limit_theorem.
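You can also watch the central limit theorem at work with a small simulation; this sketch draws samples from a decidedly non-normal population (uniform on 0 to 1) and shows that the sample means still behave like a normal distribution:

    import random
    import statistics

    random.seed(1)  # make the simulation reproducible

    def sample_mean(n):
        # Mean of one random sample of size n from a uniform(0, 1) population
        return statistics.mean(random.random() for _ in range(n))

    # Draw 10,000 samples of size 30 and examine their means
    means = [sample_mean(30) for _ in range(10_000)]

    print(statistics.mean(means))   # close to the population mean of 0.5
    print(statistics.stdev(means))  # close to sigma / sqrt(n) = 0.2887 / sqrt(30), about 0.053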

Statistical Inference: Drawing Conclusions from Data

Statistical inference refers to the process of drawing conclusions about a population from randomly chosen samples. In the following sections, I discuss two techniques used for statistical inference: confidence intervals and hypothesis testing.

Confidence intervals

A confidence interval is a range of values that’s expected to contain the value of a population parameter with a specified level of confidence (such as 90 percent, 95 percent, 99 percent, and so on). For example, you can construct a confidence interval for the population mean by following these steps:

1. Estimate the value of the population mean by calculating the mean of a randomly chosen sample (known as the sample mean).

2. Calculate the lower limit of the confidence interval by subtracting a margin of error from the sample mean.

3. Calculate the upper limit of the confidence interval by adding the same margin of error to the sample mean.

The margin of error depends on the size of the sample used to construct the confidence interval, whether the population standard deviation is known, and the level of confidence chosen.

The resulting interval is known as a confidence interval. A confidence interval is constructed with a specified level of probability. For example, suppose you draw a sample of stocks from a portfolio, and you construct a 95 percent confidence interval for the mean return of the stocks in the entire portfolio:

(lower limit, upper limit) = (0.02, 0.08)

The returns on the entire portfolio are the population of interest. The mean return in each sample drawn is an estimate of the population mean. The sample mean will be slightly different each time a new sample is drawn, as will the confidence interval. If this process is repeated 100 times, then on average, 95 of the resulting confidence intervals will contain the true population mean.
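Here's a rough sketch of those three steps in Python, using a z-based margin of error for simplicity; the sample returns are invented, and with a sample this small a t critical value (covered in Chapter 11) would be more appropriate:

    import statistics
    from math import sqrt

    # Hypothetical sample of returns drawn from a portfolio
    sample = [0.04, 0.07, 0.01, 0.05, 0.06, 0.03, 0.08, 0.02, 0.05, 0.09]

    n = len(sample)
    sample_mean = statistics.mean(sample)              # step 1: estimate the population mean
    std_error = statistics.stdev(sample) / sqrt(n)

    margin = 1.96 * std_error                          # 1.96: standard normal value for 95 percent
    print(sample_mean - margin, sample_mean + margin)  # steps 2 and 3: (lower limit, upper limit)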

Hypothesis testing

Hypothesis testing is a procedure for using sample data to draw conclusions about the characteristics of the underlying population.

The procedure begins with a statement, known as the null hypothesis. The null hypothesis is assumed to be true unless strong evidence against it is found. An alternative hypothesis — the result accepted if the null hypothesis is rejected — is also stated.

You construct a test statistic, and you compare it with a critical value (or values) to determine whether the null hypothesis should be rejected. The specific test statistic and critical value(s) depend on which population parameter is being tested, the size of the sample being used, and other factors.

If the test statistic is too extreme (for example, it’s too large compared with the critical value[s]) the null hypothesis is rejected in favor of the alternative hypothesis; otherwise, the null hypothesis is not rejected.

Technical Stuff: If the null hypothesis isn't rejected, this doesn't necessarily mean that it's true; it simply means that there is not enough evidence to justify rejecting it.

Hypothesis testing is a general procedure and can be used to draw conclusions about many features of a population, such as its mean, variance, standard deviation, and so on.
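As a sketch of the procedure, here's a two-tailed t-test of a single population mean in Python; the sample data and hypothesized mean are invented:

    import statistics
    from math import sqrt

    # Null hypothesis: the population mean is 0.05; alternative: it isn't
    sample = [0.04, 0.07, 0.01, 0.05, 0.06, 0.03, 0.08, 0.02, 0.05, 0.09]
    hypothesized_mean = 0.05

    n = len(sample)
    test_statistic = ((statistics.mean(sample) - hypothesized_mean)
                      / (statistics.stdev(sample) / sqrt(n)))

    # 2.262 is the two-tailed t critical value for n - 1 = 9 degrees of
    # freedom at the 5 percent level of significance
    critical_value = 2.262
    print(abs(test_statistic) > critical_value)   # True means reject the null hypothesis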

Simple regression analysis

Regression analysis uses sample data to estimate the strength and direction of the relationship between two or more variables. Simple regression analysis estimates the relationship between a dependent variable (Y) and a single independent variable (X).

For example, suppose you’re interested in analyzing the relationship between the annual returns of the Standard & Poor’s (S&P) 500 Index and the annual returns of Apple stock. You can assume that the returns of Apple stock are related to the returns to the S&P 500 because the index is a reflection of the overall strength of the economy. Therefore, the returns of Apple stock are the dependent variable (Y) and the returns of the S&P 500 are the independent variable (X). You can use regression analysis to measure the numerical relationship between the S&P 500 and Apple stock.

Simple regression analysis is based on the assumption that a linear relationship occurs between X and Y. A linear relationship takes this form:

Y = mX + b

Y is the dependent variable, X is the independent variable, m is the slope, and b is the intercept.

The slope tells you how much Y changes due to a specific change in X; the intercept tells you what the value of Y would be if X had a value of zero.

The goal of regression analysis is to find a line that best fits or explains the data. The population regression line is written as follows:

Yi = β0 + β1Xi + εi

In this equation, Yi is the dependent variable, Xi is the independent variable, β0 is the intercept, β1 is the slope, and εi is an error term.

A sample regression line, estimated from the data, is written as follows:

Ŷi = β̂0 + β̂1Xi

Here, Ŷi is the estimated value of Yi, β̂0 is the estimated value of β0, β̂1 is the estimated value of β1, and Xi is the independent variable.

The sample regression line shows the estimated relationship between Y and X; you can use this relationship to determine how much Y changes due to a given change in X. You can also use it to forecast future values of Y based on assumed values of X.

After estimating the sample regression line, the results are subjected to a series of tests to determine whether the equation is valid. If the equation isn’t valid, you reject the results and try a new model.
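Here's a compact sketch of estimating a sample regression line; statistics.linear_regression requires Python 3.10 or later, and the return data are invented:

    import statistics

    # Hypothetical annual returns: the S&P 500 (X) and Apple stock (Y)
    sp500 = [0.10, -0.05, 0.12, 0.07, -0.02, 0.15, 0.03]
    apple = [0.14, -0.08, 0.18, 0.09, -0.01, 0.20, 0.05]

    fit = statistics.linear_regression(sp500, apple)
    print(fit.slope, fit.intercept)   # estimates of the slope and intercept

    # Forecast Apple's return for an assumed S&P 500 return of 8 percent
    print(fit.intercept + fit.slope * 0.08)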

Multiple regression analysis

With multiple regression analysis, you estimate the relationship between a dependent variable (Y) and two or more independent variables (X1, X2, and so on).

For example, suppose that Y represents annual salaries (in thousands of dollars) at a corporation. A researcher has reason to believe that the salaries at this corporation depend mainly on the number of years of job experience and the number of years of graduate education for each employee. The researcher may test this idea by running a regression in which salary is the dependent variable (Y) and job experience and graduate education are the independent variables (X1 and X2, respectively). The population regression equation in this case would be written as

Yi = β0 + β1X1i + β2X2i + εi

The sample regression line would be

Ŷi = β̂0 + β̂1X1i + β̂2X2i

Using multiple regression analysis introduces several additional complications compared with simple regression analysis, but you can use it for a much wider range of applications than simple regression analysis.
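The least-squares estimates for a multiple regression come from solving a system of equations, which is easiest to show with a matrix library; this sketch assumes NumPy is installed, and the salary data are invented:

    import numpy as np

    # Hypothetical data for six employees: years of experience, years of
    # graduate education, and salary in thousands of dollars
    experience = np.array([1.0, 3.0, 5.0, 7.0, 10.0, 12.0])
    education = np.array([0.0, 2.0, 0.0, 2.0, 2.0, 4.0])
    salary = np.array([45.0, 60.0, 62.0, 78.0, 90.0, 110.0])

    # Design matrix: a column of ones (for the intercept) plus the two X variables
    X = np.column_stack([np.ones_like(experience), experience, education])

    # Least-squares estimates of the intercept and the two slope coefficients
    coefficients, *_ = np.linalg.lstsq(X, salary, rcond=None)
    print(coefficients)   # [intercept, experience slope, education slope]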

Forecasting techniques

You can forecast the future values of a variable, using one of several types of models. One approach to forecasting is time series models. A time series is a set of data that consists of the values of a single variable observed at different points in time. For example, the daily price of Microsoft stock taken from the past ten years is a time series.

Time series forecasting involves using past values of a variable to forecast future values.

Some forecasting techniques include:

- Trend models

- Moving average models

- Exponential smoothing models

A trend model is used to estimate the value of a variable as it evolves over time. For example, suppose annual data is used to estimate a trend model that explains the behavior of gasoline prices over time. The price is currently $3.50 per gallon, and you determine that on average, gasoline prices rise by $0.10 per year. A simple trend model that expresses this information would be written as:

Yt = 3.50 + 0.10t + εt

In this equation, Yt represents the estimated gas price at time t, where t represents a specific year (t = 0 represents the present time). The term 3.50 indicates the current price of gasoline; 0.10t indicates that the price of gasoline rises by $0.10 per year. The term εt is known as an error term; it reflects random fluctuations in the price of gasoline over time.
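Since the error term has an expected value of zero, it drops out when you forecast; here's a one-line sketch of the trend forecast:

    def trend_forecast(t):
        # Estimated gas price t years from now: $3.50 today plus $0.10 per year
        return 3.50 + 0.10 * t

    for t in range(1, 4):
        print(t, round(trend_forecast(t), 2))   # 3.6, 3.7, 3.8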

A moving average model estimates the value of a variable from the average of its most recent values. For example, suppose the price of gasoline over the past three years was as follows:

2010 $3.25

2011 $3.32

2012 $3.42

A three-period moving average model would produce an estimated value of ($3.25 + $3.32 + $3.42) / 3 = $3.33 for 2013.

An exponential smoothing model (also called an exponentially weighted moving average) is closely related to a moving average model. The difference is that older observations aren't given the same weight as newer observations. The calculation is slightly more complex, but it may give more realistic results.
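Here's a sketch of both techniques applied to the gasoline prices above; the smoothing constant of 0.5 is an arbitrary illustrative choice:

    prices = [3.25, 3.32, 3.42]   # 2010, 2011, 2012

    # Three-period moving average: the forecast for 2013
    print(round(sum(prices[-3:]) / 3, 2))   # 3.33

    # Exponential smoothing: each new observation is blended with the running
    # average, so older observations fade in influence
    alpha = 0.5
    smoothed = prices[0]
    for price in prices[1:]:
        smoothed = alpha * price + (1 - alpha) * smoothed
    print(smoothed)   # the smoothed value serves as the forecast for 2013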

The appropriate choice of model depends on the properties of the particular time series being used.