Module B4: Basic Data Analysis Techniques

2. Descriptive Statistics

Descriptive Statistics are the most frequently used procedures in SPSS. They are used for everything from initial analysis and checking the validity of data to extracting education data and constructing indicators from a household survey. Although the ‘Report’ function can provide similar statistics, the Descriptive Statistics procedures are very user-friendly and provide a greater variety of charts.

2.1 Frequencies

Frequencies are commonly used for the initial analysis of a data set. Frequencies provide statistics and graphical displays that are useful for describing all different types of variables.

The Frequencies procedure can produce such statistics as: frequencies (counts), percentages, cumulative percentages, mean, median, mode, sum, standard deviation, variance, range, minimum and maximum values, standard error of the mean, skewness and kurtosis (both with standard errors), quartiles and percentiles. It can also produce bar charts, pie charts, and histograms.

To improve the display of tables and charts, distinct values can be arranged in ascending or descending order by their category labels or by the value of their counts. The frequencies report can be suppressed when a variable has too many distinct values. Charts produced by this command can be labelled with frequencies (default) or percentages. To produce a simple frequency table:

  1. Click ‘Analyze’ on the main menu bar.
  2. Click ‘Descriptive Statistics’.
  3. Click ‘Frequencies’, and a new window will appear with the complete list of variables.
  4. Select (categorical) variables to produce frequency tables (each variable will have a table).
  5. Click the ‘Format’ button, and set the output formats for:
    1. The order of categories in the frequency table (ascending or descending order of values or counts).
    2. The way outputs will be organised if more than one variable is selected.
    3. The maximum number of values to display when the table has many categories.
  6. Click the ‘OK’ button to start creating frequency tables with the selected charts and format.

The following outputs will be obtained from the steps presented above.

Please note that, although three variables are selected, only two frequency tables were generated. This is because SPSS suppressed the frequency table of ‘HV105 – Age of household member’ as the number of categories is more than the set value of 15 (in this sample, it’s approximately 100).
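The logic of a frequency table, including the suppression rule described above, can be sketched outside SPSS. The following Python example is only an illustration: the function name `frequency_table`, the parameter `max_categories` and the sample values are assumptions, not part of SPSS or of the survey file.

```python
from collections import Counter

def frequency_table(values, max_categories=15):
    """Return rows of (value, count, percent, cumulative percent).

    Returns None when there are more distinct values than max_categories,
    mimicking the suppression of an over-long frequency table.
    """
    counts = Counter(values)
    if len(counts) > max_categories:
        return None  # too many distinct values: suppress the table
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    for value in sorted(counts):  # ascending order of values
        percent = 100.0 * counts[value] / total
        cumulative += percent
        rows.append((value, counts[value], round(percent, 1), round(cumulative, 1)))
    return rows

# Hypothetical sex codes (1 = male, 2 = female), not taken from the survey file.
sex = [1, 2, 2, 1, 2, 2, 1, 2]
for row in frequency_table(sex):
    print(row)

# Hypothetical ages with 40 distinct values: the table is suppressed.
ages = list(range(0, 40))
print(frequency_table(ages))
```

The second call returns `None` because the hypothetical age variable has more distinct values than the limit of 15, which is the same reason the table for HV105 is missing from the output.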

Generally, Frequencies has two key purposes: first, to generate frequency tables for categorical variables with a limited number of distinct values, such as sex, educational attainment and age group; and second, to produce summary statistics for continuous variables, that is, variables measured on interval or ratio scales or with a large number of distinct values.

Bar charts, pie charts and histograms can be created automatically for the categorical variables with a limited number of different items by clicking the ‘Charts’ button. To do this, select the chart type and other chart options after Step 5. In this example, ‘Pie chart’ is an appropriate type of chart to review gender composition (HV104) of the sample population while ‘Bar chart’ should be used for education levels (HV109). Therefore, you have to create charts for these two variables separately.

Similarly, we can choose the type of statistics that will be displayed by clicking the ‘Statistics’ button after selecting charts. The following exhibit shows the outputs for the variable ‘HV105 – Age of household member’ without a frequency table by age.
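As a rough illustration of what the ‘Statistics’ options compute for a continuous variable, the sketch below derives several of the listed statistics with Python's standard library. The age values are hypothetical, and the standard deviation and variance are assumed to be the sample versions (dividing by n - 1).

```python
import math
import statistics

# Hypothetical ages of household members (illustrative values only).
ages = [4, 9, 15, 15, 22, 30, 41, 56]

n = len(ages)
stdev = statistics.stdev(ages)  # sample standard deviation (n - 1)
summary = {
    "N": n,
    "Mean": statistics.mean(ages),
    "Median": statistics.median(ages),
    "Mode": statistics.mode(ages),
    "Std. Deviation": stdev,
    "Variance": statistics.variance(ages),
    "Range": max(ages) - min(ages),
    "Minimum": min(ages),
    "Maximum": max(ages),
    "Sum": sum(ages),
    "Std. Error of Mean": stdev / math.sqrt(n),
}
for name, value in summary.items():
    print(f"{name:>20}: {value:.2f}")
```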

2.2 Descriptives

Descriptives computes univariate statistics, such as the mean, standard deviation, minimum, and maximum, for numeric variables and displays them in a single table for easier comparison.

When there is no need to sort values into a frequency table, Descriptives is an efficient means of computing summary statistics for continuous variables. Almost all statistics provided in Descriptives can also be obtained from other procedures such as Frequencies, Means, and Examine.

Although Frequencies can also provide univariate statistics, Descriptives displays summary statistics for several variables in a single table. It can also calculate and save standardized values (Z-scores). Variables can be ordered by the size of their means (in ascending or descending order), alphabetically, or by the order in which the user selects the variables (default).

When Z-scores are saved, they are added to the current data set and can then be used for analyses and listings. When variables are recorded in different units (for example, ‘household members’ and ‘education in single years’) the Z-score transformation places variables on a common scale for easier visual comparison. Descriptives are especially efficient for analysing large files with tens of thousands of cases.

To use Descriptives:

  1. Click ‘Analyze’ on the main menu bar.
  2. Click ‘Descriptive Statistics’.
  3. Click ‘Descriptives’, and a new window will appear with the complete list of variables.
  4. Select continuous (interval or ratio scale) or dichotomous (just 0 and 1) variables.
  5. Click ‘Options’, (i) select the preferred statistics from the lists, (ii) define the order of the variables to be displayed in the output table, and (iii) click ‘Continue’.
  6. Optionally, tick ‘Save standardized values as variables’ to save the Z-scores (standardized values) of the selected variable(s) in the current data set.
  7. Click the ‘OK’ button to calculate the summary descriptive statistics.

As in most statistical analyses, it is important to generate descriptive statistics first, to check that the variables being studied contain only valid values before undertaking further analysis. For example, in the variable ‘HV108 – Education in single years’, code 97 is used for ‘Inconsistent values’, code 98 represents ‘DK or Don’t know’, and code 99 is ‘missing values’. In this case, 97, 98 and 99 are not valid years of study and should be excluded from analysis. Once these codes have been identified, we can declare all of them as ‘missing values’ and exclude them from further computations (see Module B2 to edit missing values).
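The effect of leaving such special codes in the data can be seen in a small sketch. The HV108 values below are hypothetical, and the set name `MISSING_CODES` is an assumption introduced for illustration; in SPSS itself the codes would be declared as missing values rather than filtered in code.

```python
import statistics

MISSING_CODES = {97, 98, 99}  # inconsistent, don't know, missing (as described above)

# Hypothetical HV108 values for a small subset of cases.
hv108 = [0, 5, 8, 10, 12, 97, 98, 3, 99, 6]

valid = [v for v in hv108 if v not in MISSING_CODES]

mean_all = statistics.mean(hv108)    # special codes wrongly treated as years of study
mean_valid = statistics.mean(valid)  # special codes excluded

print(f"N (all)   = {len(hv108)}, mean = {mean_all:.2f}")
print(f"N (valid) = {len(valid)}, mean = {mean_valid:.2f}")
```

With only ten cases, three stray codes move the mean from about 6 years to about 34, which is why the check matters most for small subsets.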

Two similar ‘Descriptive Statistics’ tables are presented in the above example: the first is constructed with the default missing values, so that codes 97 and 98 are treated as valid values, and the second is constructed after setting 97 and 98 as missing. Since the number of cases is large, the difference between the values in the summary statistics is small. If the same calculation were conducted using a subset of the data with a limited number of cases, however, the difference could be substantial. In the above output table, the mean value 0.61 of the variable ‘member still in school’ can be interpreted as ‘61% of 20,540 persons are still in school’.

The following example presents all available statistics (set in the options) in ‘Descriptives’.

Please note that the descriptive statistics calculated for the variable ‘HV024 – Division’ cannot be used for any meaningful analysis. ‘HV024’ is just a nominal variable with codes 1 to 6, representing six distinct divisions of Bangladesh. The average of these values, 3.48, has no meaning.

One of the significant features of SPSS’s Descriptives procedure is its ability to save standardized values (Z-scores) for the selected variables so that they can be used for further analysis. To add the Z-scores of a variable to the current data set, just tick the checkbox next to ‘Save standardized values as variables’. A series of new variables is generated, each named after the original variable with the character Z added as a prefix. The new variable for the Z-score of ‘HV009’, for example, is simply ‘ZHV009’.
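A Z-score is the value minus the variable's mean, divided by its standard deviation. The sketch below illustrates this and the Z-prefix naming with hypothetical data; it assumes the sample standard deviation (n - 1) is used, and the helper name `z_scores` is invented for the example.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption of this sketch
    return [(v - mean) / sd for v in values]

# Hypothetical data set held as a dict of columns.
data = {
    "HV009": [2, 4, 4, 5, 10],   # number of household members
    "HV108": [0, 5, 8, 10, 12],  # education in single years
}

# Add a standardized copy of each variable, named with a 'Z' prefix.
for name in list(data):
    data["Z" + name] = z_scores(data[name])

print([round(z, 2) for z in data["ZHV009"]])
print([round(z, 2) for z in data["ZHV108"]])
```

After standardization both variables have a mean of 0 and a standard deviation of 1, so household size and years of education can be compared on the same scale.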

2.3 Explore

Explore produces summary statistics and graphical displays, either for all cases or separately for groups of cases. It is particularly useful in data screening, outlier identification, description, assumption checking, and characterizing differences among subpopulations (groups of cases).

Data screening is used to identify the existence of unusual values, extreme values, data gaps, or other peculiarities. By exploring data, users can determine which statistical techniques are appropriate for analysing the data. It can help analysts decide whether to transform the data (in case the technique requires a normal distribution) or to use nonparametric tests.
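One common screening rule, the one behind boxplot outliers, flags values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to hypothetical data; the function name `flag_outliers` is an assumption, and the quartiles are computed with Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS uses.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical years of education; 97 is a special code left in by mistake.
years = [0, 2, 3, 5, 5, 6, 8, 9, 10, 97]
print(flag_outliers(years))
```

Here the stray code 97 is the only value flagged, which is the kind of peculiarity data screening is meant to expose before further analysis.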

Dependent variables, or variables to be explored [List (a) in the following chart], must be quantitative (interval or ratio-level measurements). Factor variables [List (b)], with short string or numeric values, are used to break the dependent variables into groups of cases. Factor variables should have a reasonably small number of distinct values, preferably fewer than 10 categories. The case label variable [List (c); only one variable is allowed], which labels outliers in boxplots, can be a short string, a long string (only the first 15 bytes are used), or numeric. To analyse a data set with Explore:

  1. Click ‘Analyze’ on the main menu bar.
  2. Click ‘Descriptive Statistics’.
  3. Click ‘Explore’. A new ‘Explore’ window will appear with the complete list of variables.
  4. Select continuous (interval or ratio scale) variables to produce univariate statistics.
    Because of the voluminous outputs that are produced by ‘Explore’, just one variable, ‘HV108 – Education in single years’, is used with simple, mostly default, settings.
  5. Click ‘Statistics’, set the preferred statistics from the lists, and click ‘Continue’.
  6. Click ‘Plots’, set the preferred types of plots from the lists, and click ‘Continue’.
  7. Click ‘Options’, set how missing values should be handled, and click ‘Continue’.
  8. Select the ‘Display’ option (only statistics or plots, or both) in the ‘Explore’ window.
  9. Click the ‘OK’ button to start ‘Explore’, and the following outputs will be displayed.

With all statistics and available charts selected, exploring ‘HV108 – Education in single years’ factored by ‘HV024 – Division’ produced 33 tables and charts, as in the following output (from ‘Case Processing Summary’ to ‘Spread-versus-Level Plot’).

2.4 Crosstabs

Crosstabs is useful for investigating the relationship between two or more categorical variables by providing information about the intersection of variables.

Frequencies and Explore are efficient for analysing univariate statistics, but those procedures cannot provide information about the relationship between categorical variables. For example, Frequencies could tabulate ‘number of household heads by education level’ and ‘number of household heads by sex’ or ‘number of households by economic status (wealth index)’, but cannot provide ‘households by household wealth status by sex and education level of household head’ or even a simple table like ‘distribution of households by sex of household head by economic status’.

The Crosstabs function uses the values of numeric or short string variables to define the categories of each variable. For example, codes ‘1 and 2’ or ‘male and female’ or ‘M and F’ are valid for the variable ‘sex’. Ordinal variables can be coded numerically or as strings: for example, numeric codes ‘1 to 5’ can be used for the variable ‘Wealth Index’, where ‘1 = poorest, 2 = poorer, 3 = middle, 4 = richer, and 5 = richest’, or string values ‘a to e’, where ‘a = richest, b = richer, c = middle, d = poorer, and e = poorest’.

In SPSS, the alphabetic order of string values is assumed to reflect the true order of the categories. Therefore, if a string variable with codes ‘L, M, H’ represents ‘low, medium and high’, the order of the categories in the output will be ‘H, L, M’ by alphabetical order and the results might be misinterpreted. In general, it is more reliable to use numeric codes and provide appropriate value labels to represent ordinal data.
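The ordering problem is easy to reproduce in any language that sorts strings alphabetically. The short Python sketch below uses the codes from the paragraph above; the dictionary `value_labels` is a hypothetical stand-in for SPSS value labels.

```python
# String codes sort alphabetically, which scrambles the intended order.
string_codes = ["L", "M", "H"]  # low, medium, high
print(sorted(string_codes))     # ['H', 'L', 'M']

# Numeric codes with value labels keep the intended order.
value_labels = {1: "low", 2: "medium", 3: "high"}
ordered_labels = [value_labels[code] for code in sorted(value_labels)]
print(ordered_labels)           # ['low', 'medium', 'high']
```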

Selection of Variables

For cross-tabulation, at least one variable must be selected for the rows and one for the columns of the output table. Other variables can be used as layers; these are known as ‘factor’ variables. The variables used in the Crosstabs procedure should be categorical (measured on a nominal or ordinal scale) with a limited number of distinct values (generally, fewer than 10). Discrete scale variables can also be used to obtain statistics, provided the range of values is not so large that the table output has to be suppressed. The factor variables must be categorical.

Statistics Option

In Crosstabs, statistics and measures of association are computed for two-way tables only. If a multi-way table is formed from row, column, and layer (control) variables, the Crosstabs procedure forms one panel of associated statistics and measures for each value of the layer variable (or each combination of values for two or more control variables). If, for example, ‘sex’ is a layer factor for a table of ‘educational attainment’ against ‘wealth index’, two separate two-way tables, one for males and one for females, are computed.
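The way a layer variable splits one cross-tabulation into several two-way tables can be sketched as follows. The cases and their codes are hypothetical, and the nested-counter structure is just one convenient representation, not how SPSS stores its tables.

```python
from collections import Counter, defaultdict

# Hypothetical cases: (sex, educational attainment, wealth index), all numeric codes.
cases = [
    (1, 1, 1), (1, 1, 2), (1, 2, 2), (1, 2, 3),
    (2, 1, 1), (2, 1, 1), (2, 2, 3), (2, 2, 3),
]

# One two-way table (row = education, column = wealth) per value of the layer (sex).
tables = defaultdict(Counter)
for sex, education, wealth in cases:
    tables[sex][(education, wealth)] += 1

for sex in sorted(tables):
    print(f"Layer: sex = {sex}")
    for (education, wealth), count in sorted(tables[sex].items()):
        print(f"  education={education}, wealth={wealth}: {count}")
```

Any statistics requested would then be computed separately within each of the two layer tables.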

Crosstabs is one of the procedures that produce an additional variety of statistics, such as:

Chi-square tests of independence/association. For 2 x 2 tables, selecting Chi-square calculates the Pearson chi-square, the likelihood-ratio chi-square, Fisher’s exact test, and Yates’ corrected chi-square (continuity correction). For tables with any number of rows and columns, selecting Chi-square calculates the Pearson chi-square and the likelihood-ratio chi-square.

Spearman’s rank correlation coefficient (rho) is calculated if both rows and columns contain ordinal variables (numeric data only). When both row and column variables are quantitative, Pearson’s correlation coefficient (r), a measure of linear association, is calculated.

For more explanations on statistics please see ‘PASW Statistics 17 Base User Guide’.

Cells Display Option

By default, Crosstabs displays the ‘count’, or the number of cases actually observed in each cell. Optionally, the number of ‘expected’ cases can also be displayed. Similarly, row, column and total percentages can be displayed in the cells together with the observed number of cases (count).

To uncover the patterns in data contributing to a Chi-square test, three types of residuals (deviates) that measure the difference between observed and expected frequencies can be displayed.

  • Unstandardized: the difference between an observed value and the expected value.
  • Standardized: the residual divided by an estimate of its standard deviation. Standardized residuals, also known as Pearson residuals, have a mean of 0 and a standard deviation of 1.
  • Adjusted standardized: the residual for a cell (observed value minus expected value) divided by an estimate of its standard error.
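The three residuals can be worked through by hand for a small table. The sketch below uses a hypothetical 2 x 2 table and the usual textbook formulas: expected count = row total x column total / N, standardized residual = residual / sqrt(expected), and adjusted standardized residual = residual / sqrt(expected x (1 - row share) x (1 - column share)). These formulas are assumed to correspond to the definitions above.

```python
import math

# Hypothetical 2 x 2 table of observed counts (rows x columns).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        residual = obs - expected                      # unstandardized
        standardized = residual / math.sqrt(expected)  # Pearson residual
        adjusted = residual / math.sqrt(
            expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
        )                                              # adjusted standardized
        chi_square += residual ** 2 / expected         # sum of squared Pearson residuals
        print(f"cell({i},{j}): expected={expected:.1f} residual={residual:+.1f} "
              f"std={standardized:+.2f} adj={adjusted:+.2f}")

print(f"Pearson chi-square = {chi_square:.2f}")
```

The Pearson chi-square is the sum of the squared standardized residuals, so cells with large residuals are the ones driving a significant test result.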

Non-integer weights Option

Cell counts are normally integer values. However, cell counts can be fractional values, if the data set is weighted by a variable with fractional values (e.g. 1.25). In this case, counts can be truncated or rounded either before or after calculating the cell counts, or fractional cell counts for both table display and statistical calculations can be used.
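The choice between adjusting case weights before accumulation and adjusting cell counts afterwards can change the table, as the sketch below shows. The cases, weights and the helper `weighted_counts` are hypothetical; Python's built-in `round` is used only for illustration and may treat exact halves differently from SPSS.

```python
import math
from collections import defaultdict

# Hypothetical weighted cases: (row category, column category, case weight).
cases = [("urban", "male", 1.3), ("urban", "male", 1.3),
         ("urban", "female", 0.75), ("rural", "male", 1.6),
         ("rural", "female", 1.6), ("rural", "female", 0.75)]

def weighted_counts(cases, adjust_weight=lambda w: w):
    """Sum the (optionally adjusted) case weights within each cell."""
    cells = defaultdict(float)
    for row, col, weight in cases:
        cells[(row, col)] += adjust_weight(weight)
    return dict(cells)

fractional = weighted_counts(cases)                          # no adjustment
round_cells = {k: round(v) for k, v in fractional.items()}   # round after summing
round_weights = weighted_counts(cases, adjust_weight=round)  # round before summing
trunc_cells = {k: math.trunc(v) for k, v in fractional.items()}

for cell in fractional:
    print(cell, round(fractional[cell], 2), round_cells[cell],
          round_weights[cell], trunc_cells[cell])
```

For the hypothetical urban male cell, the fractional count is 2.6: rounding the cell count gives 3, while rounding each case weight first gives 2.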

Using Crosstabs

Follow the steps below:

  1. Click ‘Analyze’ on the main menu bar.
  2. Click ‘Descriptive Statistics’.
  3. Click ‘Crosstabs’ and a new ‘Crosstabs’ window will appear with the complete list of variables.
  4. Select categorical variables (or scale variables with a limited number of different values) and send them to rows, columns and layers (click ‘Next’ to add another layer if needed). Layer variables can be organized either all on the same layer (one set of tables per layer variable) or on different layers (just one set of tables with cross-layer cells).
  5. Select appropriate statistics to be calculated. In this example, no statistics are selected, though because both row and column variables are ordinal, it is appropriate to calculate chi-square, correlations, Gamma and Kendall’s tau.
  6. Select the contents of the cells in the cross-tabulation.
  7. Set the row order to ascending or descending.
  8. Set whether clustered bar charts should be generated.
  9. Set whether to suppress tables (or display the main crosstab table).
  10. Click ‘OK’ to construct the tables and charts with the selected options.

In this example, no options are set and just two tables are produced: (1) the Case Processing Summary, and (2) a basic cross-tabulation table with simple counts in the cells. In cross-tabulation, missing values are handled list-wise (across variables), and thus it is important to observe the ‘number of valid cases’ in the ‘case processing summary’ statistics.

If different cell display options, such as observed and expected counts, row, column and total percentages, and residuals, are selected in Step 6, the following crosstab table is created after using the pivoting capabilities offered in SPSS and a few minor touches.

In this example, the table has been edited to (1) shorten a long value label; (2) hide the variable label of HV026; and (3) move ‘Statistics’ to ‘LAYER’ in the ‘Pivoting Trays’.

The percentage distributions of households within ‘Place of residence’ and within ‘Wealth index’ by ‘Sex of household head’ were extracted from the above pivot table.

By selecting both ‘Display clustered bar charts’ and ‘Suppress tables’ options, the following charts will be produced, but the output tables will be suppressed.

2.5 Ratio Statistics

Ratio Statistics provides a comprehensive list of summary statistics for describing the ratio between two scale variables with positive values.

Using Ratio Statistics, outputs can be sorted by values of a grouping variable in ascending or descending order. Grouping variables must have nominal or ordinal level measurements. It is best to use numeric codes or short strings. The ratio statistics report can be suppressed in the output, and the results can be saved to an external file.

It provides measures of central tendency (median, mean, weighted mean), confidence intervals for the mean and median, measures of dispersion (AAD – average absolute deviation; COD – coefficient of dispersion; PRD – price-related differential or index of regressivity; median-centred coefficient of variation; mean-centred coefficient of variation; standard deviation; range; minimum and maximum values) and the concentration index (the percentage of ratios that fall within a user-specified range, or within a user-specified percentage of the median ratio).
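Several of these measures follow directly from the case-level ratios. The sketch below uses the standard definitions (AAD as the mean absolute deviation of the ratios about their median, COD as the AAD expressed as a percentage of the median, PRD as the mean ratio divided by the weighted mean ratio), which are assumed to match those used by the procedure. The function name `ratio_statistics` and the data are hypothetical.

```python
import statistics

def ratio_statistics(numerators, denominators):
    """Compute a few ratio statistics from paired numerator/denominator values."""
    ratios = [a / b for a, b in zip(numerators, denominators)]
    median = statistics.median(ratios)
    mean = statistics.mean(ratios)
    weighted_mean = sum(numerators) / sum(denominators)  # total over total
    aad = statistics.mean([abs(r - median) for r in ratios])
    return {
        "median": median,
        "mean": mean,
        "weighted mean": weighted_mean,
        "AAD": aad,
        "COD (%)": 100 * aad / median,
        "PRD": mean / weighted_mean,
    }

# Hypothetical cluster totals: children attending school / children aged 6-15.
attending = [8, 15, 30, 12]
children = [10, 20, 30, 20]

stats = ratio_statistics(attending, children)
for name, value in stats.items():
    print(f"{name:>14}: {value:.4f}")
```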

Practical Example:

When analysing household survey data on participation in general education, the age-specific enrolment ratio for children aged 6-15 can be calculated by sex, using the total number of children aged 6-15 (var1), the number of those currently attending primary school (var2), and sex (or urban/rural residence, division, etc.) as the grouping variable. Variation in the distribution of ratios between males and females can also be observed.

There is no variable, however, that can store the ‘number of children at age x’ after summing up within the grouping variable. This variable must be computed. We can name it ‘pop’ and increment the value by 1 for each child who is aged 6-15. Use the ‘Compute’ command as in the following example:

Then define the variable label (‘Population aged 6-15’) and format (Display: 5 and Decimal: 0).

Warning: Be careful when using DHS survey data for the ‘current schooling status’, because the DHS asks the question ‘Whether xx is still in school or not?’ only of those who have been to school. Anyone who has never been to school is omitted or treated as ‘missing’.

To obtain the correct ‘current schooling status’ of every person, another variable must be created from ‘HV110 – Member still in school’ by setting ‘schooling = 1’ for the case ‘HV110 = 1’, and ‘schooling = 0’ for all other cases. Here, the new variable (let’s call it ‘schooling’) can be created by using the ‘Compute’ command twice. First, compute ‘schooling = 0’ for all cases, then compute ‘schooling = 1’ for those who are currently attending school, that is, HV110 = 1. Then, set appropriate properties for the new variable.
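The two derived variables and the resulting enrolment ratio by sex can be sketched as follows. The records are hypothetical, `None` stands in for a system-missing value, and the age filter is applied in code here, whereas in SPSS it would be done with Compute and a condition.

```python
from collections import defaultdict

# Hypothetical records: HV104 = sex (1 male, 2 female), HV105 = age,
# HV110 = still in school (1 yes, 0 no, None when the question was not asked).
members = [
    {"HV104": 1, "HV105": 7,  "HV110": 1},
    {"HV104": 1, "HV105": 12, "HV110": None},  # never attended school
    {"HV104": 2, "HV105": 9,  "HV110": 1},
    {"HV104": 2, "HV105": 14, "HV110": 0},
    {"HV104": 2, "HV105": 10, "HV110": 1},
    {"HV104": 1, "HV105": 30, "HV110": None},  # outside the 6-15 age range
]

totals = defaultdict(lambda: {"pop": 0, "schooling": 0})
for m in members:
    if 6 <= m["HV105"] <= 15:
        m["pop"] = 1        # one child aged 6-15
        m["schooling"] = 0  # first set 0 for all cases ...
        if m["HV110"] == 1:
            m["schooling"] = 1  # ... then 1 for those still in school
        totals[m["HV104"]]["pop"] += m["pop"]
        totals[m["HV104"]]["schooling"] += m["schooling"]

ratios = {sex: t["schooling"] / t["pop"] for sex, t in totals.items()}
print(ratios)
```

Because the child who never attended school is counted with ‘schooling = 0’ rather than dropped as missing, the denominator covers every child aged 6-15, which is the point of the warning above.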

After creating the new variables, use Ratio Statistics as follows:

  1. Click ‘Analyze’ on the main menu bar.
  2. Click ‘Descriptive Statistics’.
  3. Click ‘Ratio’. A new ‘Ratio Statistics’ window will appear with the complete variable list.
  4. Select two scale variables for ‘Numerator’ and ‘Denominator’, and a categorical (nominal or ordinal) variable for the ‘Group’ variable.
  5. Set whether to sort the group variable in ascending or descending order.
  6. Set whether to display results or not (or just to save the results in a new file).
  7. Set whether to save results to a new data file for further analyses.
  8. Click the ‘Statistics’ button and select the required statistics in the ‘Statistics’ window.
  9. Click the ‘OK’ button to construct statistics using the selected options.

The following exhibit shows both the ‘statistics options’ selected and the ‘results’ obtained.

Normally, the group variable is displayed on the rows and statistics on the columns. If several statistics are chosen, however, the output table may be difficult to read or print. In such a case, double-click the table to open the Pivot Table editor. Apply ‘Transpose Rows and Columns’ under the ‘Pivot’ menu to view the statistics on rows and groups on columns to make the table easier to read and more accessible.


