Center Home -> Content Areas Home -> Math Home -> Project Activities -> Excel Activities ->

Analyzing Smoking and Lung Cancer





Part 1: A Qualitative Analysis of Data

  • Enter the data in Excel.

  • Examine the data and answer the following questions:

  • Interpret the smoking index for Electrical and electronic workers.
  • Interpret the mortality index for Farmers, foresters, and fisherman.
  • Which occupation has the highest smoking index?  Why do you think that group has the highest smoking index?
  • Which occupation has the lowest smoking index?  Why do you think that group has the lowest smoking index?
  • Which occupation has the highest mortality index?  Why do you think that group has the highest mortality index?
  • Which occupation has the lowest mortality index?  Why do you think that group has the lowest mortality index?
  • Look at the data and discuss any relationships between occupational group and smoking index and between occupational group and mortality index. What other factors could cause mortality from lung cancer?

  • Describe any relationships between the smoking index and the mortality index.  Discuss any strategies used to find a relationship between these variables.

  • Did the way the data was presented in the table make it difficult to see any relationships?  Sort the data so that the smoking index is in ascending order.  What is the benefit of sorting data?

To sort data:
  • Select all columns of data (by clicking on the gray column heading bar). ** Be sure all columns are selected so that data for each occupational group remains consistent across the rows.
  • Under the Data menu, choose Sort.
  • In the Sort by window, choose a variable to sort by and decide whether to sort ascending or descending. 
  • Now that the smoking index is increasing in the table, what do you notice about the mortality index? 

  • Are there any occupational groups whose indices do not seem to “fit” the general pattern you observed?

Part 2: Is There a Relationship?

One way we made the data easier to interpret was by sorting it in ascending order by one variable.  Another way to help interpret data is to display it graphically.  A scatterplot can provide a visual picture of the data in order to analyze any patterns or trends.

  • Create a scatterplot.  Which variable (Smoking or Mortality) should be the independent variable (X)?  Why?

  • Describe the trend in the graph.  What does this trend suggest about the relationship between smoking and lung cancer?

  • How does the relationship you found by looking at the table compare to that observed by looking at the graph?

  • What are the advantages and disadvantages of each representation (tables and graphs)?

  • How strong do you think the relationship is between these two variables?  Explain how you determined the “strength” of the relationship.

  • One way to measure the strength of a relationship between two variables is to calculate the Pearson product-moment correlation coefficient, or r, between them.  (*Suggest using Resource #1 at this time to explore the meaning of a correlation coefficient.) Calculate the correlation coefficient (r) for these variables.  Based on the meanings of these indices, interpret the value of r.

To find an Excel function:
  • Select an empty cell for the result of the calculation to appear.
  • Click on the function wizard icon . Look under the descriptor that best describes the function you are trying to find.  For example, the find the function for calculating the Pearson product-moment correlation coefficient, click on Statistics, and scroll through the functions until you locate the function called CORREL.
  • The function wizard will walk you through step-by-step for formatting all functions. (Hint:  The arrays refer to cell ranges.)
  • The square of the correlation coefficient is the proportion of variance in one variable that is associated with the variance in the other variable.  What is the value of r2 for this data?  Interpret this value in your words.

  • If an occupational group has a smoking index of 120, what would you predict their mortality index to be?  Explain how you made your prediction.

  • Looking at the scatterplot of the data, Use the Line tool to place a “best fit” line through the data points to approximate the trend you observe and help with making predictions.
To Draw a line on a scatterplot using the Line tool:
  • Select the Draw tools and notice the icons on the toolbar at the bottom of the screen.
  • Choose the Line tool . Move the mouse pointer over the scatterplot so the pointer turns into a crosshair. Click the mouse to anchor the line at an appropriate point on the graph, then drag the line forward and click to end the line segment.
  • You can change the slope of the line by grabbing one of the white end handles. You can move the position of the line by selecting it and then dragging the line.
  • Estimate two points that would lie on your estimated drawn line. Use these two points to algebraically determine the equation for this line.  Use your equation to predict the mortality index for a smoking index of 120. Compare this with your previous prediction.

  • One way to accurately calculate a line of best fit is by using a least squares regression line.  Excel allows the user to add several types of trendlines (regression) to a graph.  Add a linear trendline to the graph.
To add a trendline to a graph:
  • Select the graph of the data. Under the Chart menu, choose Add Trendline…
  • Select the type of regression (in this case choose linear).
  • Click on the Options tab. Be sure to select the option to display the equation.
  • Click OK.
  • Interpret the linear equation for this data. Use the regression equation to calculate the predicted mortality index for an occupational group with a smoking index of 120.  How does this compare with your previous predictions?  What are the pros and cons of making the prediction from the graph of data and making the prediction from the regression equation?

  • Chances are the equations from the estimated line and the one calculated from the regression technique are not the same. What makes a line the “best fit”? Explore Resource #2 at this point to investigate the meaning of the least squares regression technique.

Part 3: Comparing Actual with Predicted Results

  • Use the FunctionWizard to find the statistical functions that will calculate the slope and y-intercept of the regression line.  Use these functions to calculate the values.

  • Using the slope and y-intercept values, enter a formula to calculate the predicted mortality index for each of the occupational groups.  Put the predicted indices in column D and label this column Predicted Mortality.

  • Look at the columns of data containing the actual Mortality Index and Predicted Mortality Index.  How do the actual mortality values numerically compare with the predicted values?  One way to compare such values is to calculate their differences (these differences are called the residuals).  Create another column and calculate these residuals.

  • What is the sum of the residuals?  Calculate and interpret this sum.

  • Create a scatterplot of the smoking index vs. residuals.  Interpret this graph.  In particular, discuss the nature and effect of the “outliers”, that is the points whose actual values are furthest from the predicted values.

  • Based on the analysis of the data, do you think smoking is related to lung cancer?  Do you think smoking causes lung cancer?  Support your argument.

Extensions:

  • Excel has an Analysis ToolPak that will do MANY statistical calculations.  Under the Tools menu, select Data Analysis.  Scroll down to find the regression analysis and click OK.  Enter the appropriate range for the Y data and X data.  Also select the Residuals, Residual Plot, and Line Fit Plot.  When you click on OK, Excel will do the regression analysis and output the results on a separate worksheet.  Analyze the output.  Can you find all the information you previously calculated and graphed?  Does the output match your previous calculations and graphs?  What other information is provided in the output?  Discuss the advantages and disadvantages of doing the regression analysis by the method used in the activity and by using the Data Analysis feature in Excel.

  • Are there any outliers or influential data points?  If so, what effect do they have on the correlation?  Try discarding an occupational group that could possibly be influencing the line of best fit.  Recalculate the regression.



Back to Project Activities | Back to Math Homepage

Send questions or comments here.
Last modified on July 27, 2001.