Part 1: A Qualitative Analysis
Enter the data
Examine the data and answer the following
- Interpret the smoking index for Electrical
and electronic workers.
- Interpret the mortality index for Farmers,
foresters, and fisherman.
- Which occupation has the highest smoking index?
Why do you think that group has the highest smoking index?
- Which occupation has the lowest smoking index?
Why do you think that group has the lowest smoking index?
- Which occupation has the highest mortality index?
Why do you think that group has the highest mortality index?
- Which occupation has the lowest mortality index?
Why do you think that group has the lowest mortality index?
- Select all columns of data (by clicking on the
gray column heading bar). **
Be sure all columns are selected so that data for each
occupational group remains consistent across the rows.
- Under the Data
menu, choose Sort.
- In the Sort
by window, choose a variable to sort by and decide whether
to sort ascending or descending.
Now that the smoking index is increasing in the
table, what do you notice about the mortality index?
Are there any occupational groups whose indices
do not seem to “fit” the general pattern you observed?
2: Is There a Relationship?
One way we made the data easier to interpret was
by sorting it in ascending order by one variable. Another way to help interpret data is to display it
graphically. A scatterplot
can provide a visual picture of the data in order to analyze any patterns
One way to measure the strength of a relationship
between two variables is to calculate the Pearson product-moment
correlation coefficient, or r, between them. (*Suggest
#1 at this time to explore the meaning of a correlation
coefficient.) Calculate the correlation coefficient (r)
for these variables. Based
on the meanings of these indices, interpret the value of
find an Excel function:
- Select an empty cell for the result of the
calculation to appear.
- Click on the function wizard icon
. Look under the descriptor that best describes the
function you are trying to find.
For example, the find the function for calculating the
Pearson product-moment correlation coefficient, click on Statistics, and scroll through the functions until you locate the
function called CORREL.
- The function wizard will walk you through
step-by-step for formatting all functions. (Hint: The arrays refer to cell ranges.)
| To Draw
a line on a scatterplot using the Line tool:
- Select the Draw tools
and notice the icons on the toolbar at the bottom of the screen.
- Choose the Line tool
Move the mouse pointer over the scatterplot so
the pointer turns into a crosshair. Click the mouse to anchor
the line at an appropriate point on the graph, then drag the
line forward and click to end the line segment.
- You can change the slope of the line by grabbing
one of the white end handles. You can move the position of the
line by selecting it and then dragging the line.
- Estimate two points that would lie on your
estimated drawn line. Use these two points to algebraically determine
the equation for this line. Use
your equation to predict the mortality index for a smoking index of
120. Compare this with your previous prediction.
- One way to accurately calculate a line of best
fit is by using a least squares regression line. Excel allows the user to add several types of trendlines
(regression) to a graph. Add
a linear trendline to the graph.
| To add
a trendline to a graph:
- Select the graph of the data.
Under the Chart menu, choose Add Trendline…
- Select the type of regression (in
this case choose linear).
- Click on the Options tab. Be sure
to select the option to display the equation.
- Click OK.
Chances are the equations from the estimated line
and the one calculated from the regression technique are not the same.
What makes a line the “best fit”? Explore Resource
#2 at this point to investigate the meaning of the least squares
- Interpret the linear equation for this data. Use
the regression equation to calculate the predicted mortality index for
an occupational group with a smoking index of 120. How does this compare with your previous predictions?
What are the pros and cons of making the prediction from the
graph of data and making the prediction from the regression equation?
3: Comparing Actual with Predicted Results
find the statistical functions that will calculate the slope and y-intercept of
the regression line. Use
these functions to calculate the values.
- Use the FunctionWizard
Using the slope and y-intercept values, enter a
formula to calculate the predicted mortality index for each of the
occupational groups. Put
the predicted indices in column D and label this column Predicted
Look at the columns of data containing the actual
Mortality Index and Predicted Mortality Index. How do the actual mortality values numerically compare with
the predicted values? One
way to compare such values is to calculate their differences (these
differences are called the residuals). Create another column and calculate these residuals.
What is the sum of the residuals?
Calculate and interpret this sum.
Create a scatterplot of the smoking index vs.
residuals. Interpret this
graph. In particular,
discuss the nature and effect of the “outliers”, that is the
points whose actual values are furthest from the predicted values.
Based on the analysis of the data, do you think
smoking is related to lung
cancer? Do you think
smoking causes lung cancer?
Support your argument.
Excel has an Analysis ToolPak that will do MANY
statistical calculations. Under
the Tools menu, select Data Analysis. Scroll
down to find the regression
analysis and click OK. Enter
the appropriate range for the Y data and X data.
Also select the Residuals, Residual Plot, and Line Fit Plot. When you click on OK, Excel will do the regression analysis
and output the results on a separate worksheet.
Analyze the output. Can
you find all the information you previously calculated and graphed?
Does the output match your previous calculations and graphs?
What other information is provided in the output?
Discuss the advantages and disadvantages of doing the regression
analysis by the method used in the activity and by using the Data
Analysis feature in Excel.
Are there any outliers or influential data
points? If so, what effect
do they have on the correlation? Try
discarding an occupational group that could possibly be influencing the
line of best fit. Recalculate