Introduction to SPSS
Research Computing Support Group
Introduction to SPSS is a two-part computer workshop taught by the
Research Computing Support Group of Information Technology and Communication,
University of Virginia. The workshop is an overview of layout and procedures
of SPSS for Windows or Macintosh, including file operations, data definition
procedures, running basic descriptive statistics, data transformation procedures,
and basic analytical procedures. Procedures are learned through hands-on work
related to an actual research question.
For a schedule of the next SPSS workshop, please check the courses page of the ITC Research Computing Support Group (http://www.itc.Virginia.EDU/research/courses.html).
This document is the first part of the Introduction to SPSS workshop; the second part is also available online.
Prerequisites: This document presumes that you are familiar with Windows 98/2000/XP and its commands. It will not review DOS or Windows concepts such as filenames, paths, booting up, erasing files, using the mouse, or scrolling. If you are not yet comfortable with Windows and must use SPSS, please see the pointers in the appendix. Unix users are strongly encouraged to attend the second day of the course as well as the first, since syntax commands are addressed in the second part.
2. The Basics of SPSS for Windows
2.1 The Various Windows That You Will Use
When SPSS starts up for the first time, two windows will start by default: the SPSS Data Editor and the SPSS Viewer.
This is what the Data Editor window looks like:

The SPSS Data Editor window resembles a spreadsheet. You can either type data directly into this window or open an existing data file in it. Existing files can be of various types, such as an SPSS data file (*.sav), an ASCII text file (*.txt, *.csv, or *.dat), an Excel file (*.xls), or a dBase III file (*.dbf). During the first day of this course, we will be using an existing SPSS data file. On the second day, we will address how to read in other file types. The Data Editor window has two tabs: "Data View" for viewing your data, and "Variable View" for viewing information about the variables.
The SPSS Viewer window is where SPSS places the output of associated commands. You may choose to have your commands included in your output (by selecting Edit -- Options, clicking on the "Viewer" tab and checking the box that reads, "Display commands in log."), and you may edit your output before you print it. This is what the Viewer window looks like:

Note: If the Viewer window does not open by default, open one from the File Menu by choosing New, then Output (as shown in the image below).

In addition to the SPSS Data Editor and SPSS Viewer windows, the SPSS Syntax Editor window is necessary if you wish to execute SPSS procedures using text commands. You can open a new SPSS Syntax Editor window from the File menu by choosing New, then Syntax. You can open an existing syntax file from the File menu by choosing Open, and selecting the name of the desired file. While some procedures are available only through syntax, we will concentrate on the point-and-click method, utilizing procedures available through the menu system and therefore not requiring a syntax window.
2.2 Entering SPSS commands
There are two principal ways in which you can execute SPSS procedures.
The first way to enter commands is by using the pull-down menus and toolbar buttons found at the top of all SPSS windows - indicated in the image below with arrows as 1 and 2, respectively. These are easy to use and have Help options built directly into each procedure.

The second way to enter commands is to issue text commands from within the SPSS Syntax Editor window. These text commands can come from one of three places:

Today, we will be using the pull-down menus and toolbar buttons, but will learn about using the Syntax Editor to enter commands during the second day of the course.
3. Open Existing SPSS Data File
To provide us with some actual data and a research question to be answered, we provide an existing SPSS save file (system file). These data come from an actual sexual discrimination lawsuit involving pay inequity between male and female workers at a bank. The female workers sued the bank alleging low pay based on gender rather than valid work, education, and experience differences. In order to answer our question about whether the bank was discriminating against women workers, we will need data about the people who work for the bank. This data is contained in a file called bank.sav. To download this file to your local machine, right-click your mouse on the file name below and select the option to "Save Target As" (Explorer) or "Save Link As" (Netscape). Save the file to a local directory such as c:\TEMP.
Once you have saved the bank data on your local machine, open the file in SPSS as follows:


A data set is made up of observations or cases. One case or observation is the basic unit of all data one wishes to analyze. A case consists of all the different data values for a particular subject, animal, time point, etc. Variables are made up of the data values that describe a particular characteristic for all of the cases. How you arrange your data and what you decide is the unit of analysis (the case or observation) is critically important! The questions you can ask and the analyses you can perform to get the answers are in large part determined by how you arrange your data.

This example illustrates how cases and variables are organized in the SPSS Data Editor window, Data View tab. Cases appear in the rows of the Data Editor; in this example, each row represents a bank employee. Variables appear in the columns of the Data Editor, with the variable name at the top of the column. Note that variable names may be up to 8 characters in length and must start with a letter, #, $, or @. (Warning: # and $ have special meanings, and should be used only with caution.) The first case is a male born February 3, 1952, with 15 years of education, earning $57K in job category 3, and so on.
In addressing our research question, the variables that we would first like to look at are GENDER and SALARY. These are shown above in the second and sixth columns of the Data Editor.
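The case-by-variable layout can also be pictured outside SPSS. The following is a minimal Python sketch, not part of the workshop itself: each dictionary is one case (a row), and each key is a variable (a column). The first case uses the values described in the text; the variable names bdate and jobcat, the numeric gender codes, and the entire second case are assumptions made up for illustration.

```python
from datetime import date

# Each dictionary is one case (one bank employee); each key is a variable.
# The first case follows the description in the text. The names "bdate" and
# "jobcat", the gender codes, and the whole second case are hypothetical.
cases = [
    {"gender": 1, "bdate": date(1952, 2, 3), "educ": 15, "jobcat": 3, "salary": 57000},
    {"gender": 2, "bdate": date(1958, 5, 23), "educ": 12, "jobcat": 1, "salary": 21450},
]


def column(cases, name):
    """Return the values of one variable across all cases (one column)."""
    return [case[name] for case in cases]


salaries = column(cases, "salary")  # one variable, all cases
first_case = cases[0]               # one case, all variables
print(salaries)
print(first_case["educ"])
```

Choosing a different unit of analysis (for example, one row per pay period rather than one row per employee) would change what each dictionary represents, and therefore which questions the data can answer.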
4. Generating Frequency Tables
A basic procedure for taking a summary look at a variable is to look at the number of cases associated with each value of a variable (or the frequency of each value in the data set). The Frequency procedure in SPSS generates a table with this information. Let's generate a frequency table for the variable GENDER:
1. Under ANALYZE - DESCRIPTIVE STATISTICS, click on FREQUENCIES.

2. In the dialog box that opens, highlight the variable GENDER in the left-hand box, click the arrow between the two boxes to move the variable name to the right-hand box, and click OK.

The output from the Frequencies procedure is the following:
[Output table: Frequencies]
There are several problems with this GENDER variable. First, it is not obvious which category of GENDER represents men and which represents women. Second, there are three categories (1, 2, and 99) but only two genders (male and female). To interpret this frequency table, we would have to rely on a codebook that identified the meaning of each category. The coding for GENDER in a codebook would look something like this:
GENDER:
1 = Male
2 = Female
99 = Missing data
You should add this type of information about each variable to the data file, so that it is always available and so that your output is more easily interpretable. Adding such information involves defining variable labels, value labels, and missing values... which is what we'll do next.
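What a frequency table counts can be sketched in a few lines of Python. This is only an illustration of the computation, not SPSS's own implementation, and the ten GENDER codes below are invented; they are not taken from bank.sav.

```python
from collections import Counter

# Hypothetical GENDER codes for ten cases; 99 is the unexplained third category.
gender = [1, 2, 2, 1, 1, 99, 2, 1, 2, 1]


def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, n, 100.0 * n / total) for v, n in sorted(counts.items())]


table = frequency_table(gender)
for value, count, percent in table:
    print(f"{value:>5} {count:>5} {percent:6.1f}%")
```

As in the SPSS output, the bare codes 1, 2, and 99 say nothing about what each category means, which is why labels are worth adding.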
5. Defining Labels and Declaring Missing Values
5.1 Variable and Value Labels
A good data set will include variable and value labels that provide a fuller
description of both the variable and the meaning of each value within a
variable (for nominal and ordinal data; value labels are not needed for
continuous data).
Variable Labels: Provides labels for variables. Unlike the variable names, which are limited to 8 characters, the label may be up to 120 characters long. In our example, we may want to give a fuller description of the variable GENDER such as "Gender of the respondent."
Value Labels: Provides labels for values that variables may take. These labels are printed in all procedures that you request. Labels may be up to 60 characters, although few procedures will print all 60 characters. In our example, we will want to indicate that the GENDER value of "1" represents males and the value of "2" represents females.
Let's create both variable and value labels for GENDER. The steps are shown here:
1. Double-click on the column heading for GENDER. This will make the column bold, and will open the "Variable View" (shown below). You can also go directly to the variable view by clicking on the variable view tab on the bottom left corner.
2. Click in the field under "Label" for the gender variable and type in a description of this variable (e.g., sex of respondent). This is a "variable label", as opposed to the "variable name", which is shown at the top of each column.
3. Click in the field under “Values” for gender and click on the button on the right of this field. This will open a dialog box.

4. Type in one of the values for the variable, specify a label for that value, and click "Add". (In the example shown, the label Female is given to cases with a value of 2, and values of 1 are labeled Male. You would of course need to know the coding used for observations in order to apply appropriate labels.) Repeat as necessary for all values, and then click "Continue" when done with all labels. (This will close the "Define labels" dialog box.)
5. Click on "OK" to close the "Define Variable" dialog box.
5.2 Missing Values
SPSS has two types of missing values that are automatically excluded from statistics computed by procedures: system-missing values and user-missing values. Any variable for which a valid value cannot be read from raw data or computed is assigned the system-missing value. Thus, generally, there is no reason to "set blanks to zero".
User-missing values are values that you tell SPSS to treat as missing for particular variables. They are values (other than blanks) that you coded into your data to indicate non-acceptable responses.
In our example, the GENDER value of "99" indicates missing data for the respondent, and we will want to indicate this fact to SPSS (and to anybody looking at our output). Let's declare this as a missing value:
1. Click in the "Missing" field for the gender variable in the Variable View, then click on the button on the right of this field to open the missing values dialog box.

2. Click on the white circle beside "Discrete Missing Values" and enter the missing value for GENDER, which here is 99. (Note that you are allowed up to three discrete user-defined missing values.) Select "Continue" when done.
3. Select "OK" when done.
After adding variable labels, value labels, and missing values, you should rerun Frequencies for GENDER in order to check the changes that you have made. In this example, the output should look like this:
[Output table: Frequencies]
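The effect of value labels and a user-missing value on a frequency table can be sketched as follows. This is a Python illustration under stated assumptions, not a description of SPSS internals: the data are invented, and the function name labeled_frequencies is hypothetical. Cases coded 99 are still counted, but they are left out of the "valid" percentages.

```python
from collections import Counter

VALUE_LABELS = {1: "Male", 2: "Female"}  # coding described in the text
USER_MISSING = {99}                       # declared user-missing value

# Hypothetical GENDER codes for ten cases.
gender = [1, 2, 2, 1, 1, 99, 2, 1, 2, 1]


def labeled_frequencies(values, labels, missing):
    """Return {label: (count, percent, valid_percent)}.

    percent is based on all cases; valid_percent is based only on
    non-missing cases and is None for the missing category.
    """
    counts = Counter(values)
    total = len(values)
    missing_n = sum(n for v, n in counts.items() if v in missing)
    valid_total = total - missing_n
    rows = {}
    for v, n in sorted(counts.items()):
        if v not in missing:
            label = labels.get(v, str(v))  # fall back to the raw code
            rows[label] = (n, 100.0 * n / total, 100.0 * n / valid_total)
    if missing_n:
        rows["Missing"] = (missing_n, 100.0 * missing_n / total, None)
    return rows


rows = labeled_frequencies(gender, VALUE_LABELS, USER_MISSING)
for label, (count, percent, valid_percent) in rows.items():
    print(label, count, percent, valid_percent)
```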
6. Generating Simple Summary Statistics
For some types of variables (especially continuous variables), we will want to obtain summary statistics other than the number of cases in each category of the variable. For example, we might be interested in the mean, median, or standard deviation of a particular variable. The variable SALARY has too many values for a frequency table to have any meaning, but we would be interested in knowing things like the mean of SALARY and the highest and lowest values.
You may be wondering when to use DESCRIPTIVES and when to use FREQUENCIES. It depends on what kind of descriptive statistics you are seeking. FREQUENCIES provides summary statistics such as the mean, standard deviation, and sum, just as DESCRIPTIVES does. In addition, it provides a frequency table giving counts and percentages for each category of a variable, as well as some summary statistics that are not available in DESCRIPTIVES, such as the mode and median. As a quick guide, use FREQUENCIES for categorical data (when you need frequency counts or percentiles) or when you need the mode or median; use DESCRIPTIVES for continuous and interval variables, or when you don't need frequency counts, the mode, or the median.
SPSS can generate such information through the Descriptives procedure:
1. From the menu bar, select ANALYZE - DESCRIPTIVE STATISTICS - DESCRIPTIVES.

2. Highlight the variable SALARY in the left-hand box, click the arrow between the two boxes, and SALARY will be moved to the variable list in the right-hand box. Click "Options" at the bottom right of this dialog box. This will open the Options dialog box.

3. In the Options dialog box, check the boxes beside at least the mean, minimum, and maximum. (You will also typically want the standard deviation.)
4. Click on "Continue" and then "OK" when done, to close these two dialog boxes and submit the descriptives request to SPSS.
The output from the Descriptives procedure is the following:

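The statistics requested in the Options dialog can be sketched in Python with the standard library. This is an illustration of the arithmetic only; the six salaries are invented, and the function name describe is hypothetical.

```python
import statistics

# Hypothetical salaries (dollars) for six employees.
salary = [57000, 40200, 21450, 21900, 45000, 32100]


def describe(values):
    """Return N, minimum, maximum, mean, and sample standard deviation."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # n - 1 in the denominator
    }


summary = describe(salary)
for name, value in summary.items():
    print(name, value)
```

With a continuous variable such as salary, nearly every value is distinct, so these few numbers summarize the variable far better than a frequency table would.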
7. Editing Pivot Tables
SPSS 11 displays the output from the descriptive statistics procedures in a pivot table, with cells divided by vertical lines. Sometimes the default width of the table columns is not enough to fit the values that will be inserted in the cells. For example, let's ask for descriptive statistics on Current Salary.

Click OK, and the results in the output window look like this:

Clicking only once on the table highlights the table box (notice the new rectangle around it).
When you single-click, you are not able to re-size the column!

What you need to do is double click the table to allow you to change the properties of the table.
Notice the cursor on the vertical line between the mean and Std. Deviation columns.

Now you should be able to drag the horizontal cursor to widen the column.

8. Comparing Gender Differences in Mean Salary
While the Descriptives procedure gives us a general picture of the SALARY variable, what we're really interested in is the difference in salaries between men and women. SPSS can also give us the means for various groups using the Means procedure. Let's see how mean salaries differ for men versus women:
1. From the menu bar, select ANALYZE - COMPARE MEANS - MEANS.

2. Select SALARY from the left-hand variable list and click the top arrow to make this the Dependent variable (the one whose distribution we wish to explain).

3. Select GENDER from the left-hand variable list and click the bottom arrow to make this the Independent variable (the one which we think explains the distribution of the dependent variable).
4. Click on "OK" to close this dialog box and submit the command request to SPSS.
The output from this procedure should be as follows:

As you can see, there is indeed a difference in salaries between men and women within the bank: men make almost $15,000 more than women on average ($41K vs. $26K). Possible explanations for this difference might include education, previous job experience, age, length of tenure with the bank, and sexual discrimination.
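The comparison that the Means procedure makes, the mean of a dependent variable within each category of an independent variable, can be sketched in Python. The six (gender, salary) pairs below are invented and will not reproduce the $41K and $26K figures from the bank data; the function name group_means is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (gender, salary) pairs; 1 = male, 2 = female as coded in the text.
cases = [(1, 57000), (1, 40200), (2, 21450), (2, 21900), (1, 45000), (2, 32100)]


def group_means(pairs):
    """Return {group: mean of the dependent variable} for each group."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    return {group: mean(values) for group, values in sorted(groups.items())}


means = group_means(cases)
difference = means[1] - means[2]  # male mean minus female mean
print(means, difference)
```

A raw difference like this one says nothing yet about why the groups differ, which is the question the rest of the analysis takes up.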
Let's evaluate the plausibility of education as an explanation of the difference in salaries between men and women: Is there a difference in the percentage of males and females who have completed high school and college?
To answer that question, let's start by looking at the actual education variable already in our data set. Look at the variable EDUC in the Data Editor window. What type of variable is EDUC? Is the current variable coded according to whether the individual has completed high school and college? No. The current variable indicates the number of years of education completed by the respondent.
We can, however, change this variable so that only three categories are present: less than 12 years of education (from which we infer that the respondent did not complete high school), 12 years or more (from which we infer at least a high school diploma), and 16 years or more (from which we infer that the respondent has a college degree). This process of regrouping values is called Recoding, and that's what we'll do next.
9. Recoding Variables
Variables can be recoded in either of two ways.
Into New Variable: Creates a new variable with new values based on the values of the original variable. In our example, a new variable would be created with three values (less than high school, high school degree, college degree or higher). The value of the new variable would be based on the value of the original variable EDUC. The original variable EDUC would remain in the data set and its values would continue to represent the number of years of education completed by the respondent.
Into Same Variable: Overwrites the original variable, replacing the original values with the new values that you have specified. In our example, the old values of EDUC, which represent the number of years of education completed by the respondent, would be replaced by the new values (less than high school, high school degree, college degree or higher).
Unless you have some compelling reason to do otherwise, it is almost always better to Recode Into a New Variable. Doing so preserves the original values should you ever want to use the original values or recode in another manner.
Let's recode the variable EDUC into a new variable:
1. From the menu bar, select TRANSFORM - RECODE - INTO DIFFERENT VARIABLE.

2. Highlight the variable EDUC in the left-hand white box, and click the arrow between the two boxes to bring the variable over to the middle white box.
3. Type a new Name and Label for the Output Variable and click "Change". (Note that the question mark in the middle box changes to the new Name you have provided.)

4. Click on "Old and New Values" to open a second dialog box, which allows you to select and apply your chosen regroupings or re-categorizations.
5. Under "Old Value", first select the radio button beside "Range", then specify the range of values of the old variable that constitute a value of the new variable.
6. Then under "New Value", specify the corresponding value of the new variable, and click "Add". Note that this recoding is added to a list under "Old -> New". Repeat steps 5 and 6 as necessary.

7. Note that, if you don't know the highest or lowest value, you can select a different radio button. Try this for recoding those respondents with 16 or more years of education.
8. Click "Continue" when done.
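The regrouping performed by these steps can be sketched in Python. This is an illustration rather than SPSS's own recode logic. It assumes the three categories are coded 1, 2, and 3 and that the middle category covers 12 through 15 years; both are interpretations of the text, and the EDUC values are invented.

```python
def recode_educ(years):
    """Map years of education to category 1, 2, or 3."""
    if years is None:
        return None  # keep missing values missing
    if years < 12:
        return 1     # less than high school
    if years < 16:
        return 2     # high school (12 through 15 years)
    return 3         # college degree or higher (16 or more years)


# Hypothetical EDUC values. The original list is left untouched,
# which mirrors recoding into a new variable rather than the same one.
educ = [8, 12, 15, 16, 19, None]
educ2 = [recode_educ(years) for years in educ]
print(educ)
print(educ2)
```

The open-ended final branch plays the same role as the "through highest" style of range in the dialog: no upper bound needs to be known in advance.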
After creating the new education variable, you should create variable and value labels for the new variable (in the same way that you did so for GENDER previously) and generate a frequency table to check that the new variable looks correct. The Frequency table for the new variable should look like this:
[Output table: Frequencies]
10. Generating a Cross-Classification Table (Crosstab)
Now that we have recoded the variable EDUC, we can determine the difference in the percentage of males and females who have completed high school and college. Again, making this comparison will help us assess the plausibility that differences in educational levels actually explain the difference in salaries between men and women.
Since both the new education variable and the gender variable are categorical variables, the appropriate procedure to assess differences in educational level across the two genders is to generate a Cross-Classification table, also called a CrossTab. We do this in the following manner:
1. From the menu bar, select ANALYZE - DESCRIPTIVE STATISTICS - CROSSTABS.

2. From the left-hand box in the dialog which opens, select EDUC2 for the Row and GENDER for the Column, using the arrow buttons as you have done previously. (Usually, by convention, one puts dependent variables in the rows and independent variables in the columns of a crosstab table.)

3. Click on the "Cells" button at the bottom of this dialog box.
4. In the "Cell Display" options dialog that pops up, check the boxes for Observed counts and Column percentages.

5. When you are done, click on "Continue" on this dialog box, then "OK" on the Crosstabs box.
The Output should look like this:
[Output table: Crosstabs]
From this table, we can see that while 32% of male employees have a college degree or higher, only 12% of female employees have such an education. Thus, it is at least possible that education might account for salary differences between male and female bank employees.
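The observed counts and column percentages in a crosstab can be sketched in Python. This is a minimal illustration with ten invented cases, so the percentages will not match the 32% and 12% from the bank data; the function name crosstab is hypothetical.

```python
from collections import Counter

# Hypothetical (educ2, gender) pairs: educ2 is 1 to 3; gender is 1 = male, 2 = female.
cases = [(3, 1), (2, 1), (2, 1), (1, 1), (2, 2),
         (2, 2), (1, 2), (3, 2), (2, 2), (2, 1)]


def crosstab(pairs):
    """Return (counts, column_percents), both keyed by (row, column)."""
    counts = Counter(pairs)
    column_totals = Counter(col for _, col in pairs)
    percents = {
        (row, col): 100.0 * n / column_totals[col]
        for (row, col), n in counts.items()
    }
    return counts, percents


counts, percents = crosstab(cases)
for (row, col), n in sorted(counts.items()):
    print(f"educ2={row} gender={col} count={n} column%={percents[(row, col)]:.1f}")
```

Column percentages divide each cell by its column total, so each gender's percentages sum to 100 and the two genders can be compared even when their group sizes differ.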
In order to fully assess whether education accounts for the salary difference, we can utilize a procedure called Linear Regression. Using Regression, we can also examine the influence of the other factors that we hypothesized might account for salary differences: age, previous job experience, and length of tenure with the bank.
11. Computing a New Variable
While there are already variables in the data set for job experience (PREVEXP) and tenure (JOBTIME), there is currently no variable for age, only one for date of birth. To remedy this situation, we will compute a new variable for age.
We will compute a new variable, AGE, in the following way:
1. From the menu bar, select TRANSFORM - COMPUTE.

2. Type the name of the new variable that you will create in the box under the "Target Variable" heading.
3. Create the mathematical expression that represents the new variable, either by typing that expression into the box under "Numeric Expression" or by using the keypad.

4. Click on "OK" when done.
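The numeric expression used in the workshop is not reproduced in this text, so the following Python sketch shows only one plausible way to derive age from a date of birth: count completed years as of a reference date. The function name age_in_years and the reference date are assumptions for illustration; the birth date is that of the first case described earlier in this document.

```python
from datetime import date


def age_in_years(birth_date, as_of):
    """Return completed years between birth_date and as_of."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# The reference date is an arbitrary assumption for illustration.
reference = date(2003, 1, 1)
age = age_in_years(date(1952, 2, 3), reference)
print(age)
```

Whatever expression is used, the choice of reference date matters: age at hiring, age at the time of the lawsuit, and age today are different variables.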
Now that we have created the variable, AGE, we can proceed with the Linear Regression which will be the final procedure that we will cover during the course.
12. Generating a Simple Linear Regression
To generate a Linear Regression that addresses our research question, we will need to open the Regression dialog box and indicate our dependent variable and the various independent variables that we believe predict the dependent variable. The steps are as follows:
1. From the menu bar, select ANALYZE - REGRESSION - LINEAR.

2. From the left-hand box, move SALARY to the Dependent box, and pull all of our predictor variables (i.e. GENDER, AGE, EDUC, JOBTIME, and PREVEXP) to the Independent box.
3. Click "OK" when done.

Depending on the options that you have selected for the regression, the output may have many different parts, but the following two parts are of primary interest to us right now:
[Output table: Regression]
The regression results indicate that education does have a significant effect on salary (the significance level of each parameter estimate is displayed in the column labeled "Sig."). However, even after controlling for the effects of educational differences between the sexes, gender continues to have a significant effect. Specifically, female employees make almost $9000 less than male employees, after controlling for education, previous job experience, and tenure with the bank. This suggests the possibility of either some other unmodeled difference between the sexes or sexual discrimination on the part of the bank.
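What "controlling for" means can be illustrated with a small least-squares fit in Python. This is a sketch under stated assumptions, not the computation SPSS performs: it uses only two predictors, codes gender as a 0/1 female indicator rather than the 1/2 coding in the data file, solves the normal equations directly, and reports no significance tests. The data are generated from a known formula so that the fit can be checked; the function names solve and ols are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def ols(rows, y):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # add the constant term
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical predictors: (female indicator, years of education).
predictors = [(0, 12), (0, 16), (1, 12), (1, 16), (0, 19), (1, 8), (1, 15), (0, 15)]
# Hypothetical salaries generated exactly from 10000 - 5000*female + 2000*educ,
# so the fit should recover those three numbers.
salary = [10000 - 5000 * f + 2000 * e for f, e in predictors]

intercept, b_female, b_educ = ols(predictors, salary)
print(round(intercept), round(b_female), round(b_educ))
```

The coefficient on the female indicator is the estimated salary difference between women and men with the same years of education, which is the sense in which the regression separates the gender effect from the education effect.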
13. Saving SPSS Files
When you are finished with your analyses, it is important to save all of your work. This may include the SPSS data file that you have modified, the output from the procedures, and the text commands that you have entered into the Syntax window.
The SPSS data file contains the actual data, variable and value labels, and missing values that appear in the SPSS Data Editor window. You should definitely save this file UNLESS you do not want to keep any of the modifications that you have made to your data. By default, the names of SPSS data files are given the extension, ".SAV".
The output from your procedures appears in the SPSS Viewer window. These results can be saved to a file. By default, the names of SPSS output files are given the extension, ".SPO".
If you have submitted commands to SPSS using syntax, you may want to save the commands that you have typed in the SPSS Syntax Editor window. These commands can be kept for future reference (indicating precisely what you did) and to replicate the same analyses. By default, the names of SPSS syntax files are given the extension ".SPS".
Right now, we will only be concerned with saving the data file that you have modified during this course. You may save the data file by going to the File menu, selecting Save, and typing in a file name (again, SPSS data files should always end with the ".SAV" suffix in order to identify them to SPSS).
14. Keeping Track of Everything That You Have Done
SPSS automatically keeps track of every command that you submit in what is called a "journal" file. This file is usually located in the C:\Windows\Temp directory and is called "spss.jnl". SPSS maintains the journal file in one of two ways: it either overwrites the old journal file each time you start a new session, or it appends your new commands to the old journal file. You can choose which method SPSS uses under Edit: Options. Under the General tab of the preference menu, you will find a section in which you can specify the location of the journal file and the method of logging. This is a useful feature both for replicating procedures and as a reference for what you actually did (e.g., if you forgot the precise values you used to recode a variable).
15. Getting Help
Statistical Consulting hours: For the hours when a statistical computing consultant is available, please contact the Research Computing Support Center by telephoning 243-8800 or e-mailing res-consult@virginia.edu
Helpful Web Pages:
16. Appendix
16.1 Windows pointers
You can rearrange the position of windows in SPSS any way you wish.
To bring a partially-hidden window to the front, click once anywhere in that window. (Note that some other window now becomes partially hidden).
To find a window that is completely hidden, choose WINDOWS from the menu bar, and then select the item you wish to view. (You can either use the mouse to click on one of the numbered options, or press the appropriate number on the keyboard.)
To move any window:
To re-size/reshape any window:
16.2 Statistical Computing at UVA
What is a statistical package? It is a computer program or set of programs that provides many different statistical procedures within a unified framework. The advantages of such packages are many. They are much easier to use than most programming languages. They allow you to run complex analyses without getting bogged down in the details of computations, and because of wide use, they are less likely to have unknown "bugs." The principal disadvantage of such packages is that they sometimes make doing statistics too easy. It is possible to apply complex procedures inappropriately or to properly apply a procedure and then misinterpret the results. They also do such a nice job presenting output that the unwary user may be lulled into a sense of complacency, leading to a failure to detect errors (such as reading the data incorrectly). A more common problem, however, is the over-analysis of data. Since analyses are so simple to run, it is very easy to generate a huge pile of output, with the numbers you really need lost somewhere in the middle.
What statistical packages are available at UVA and supported by ITC? Information Technology and Communication (ITC) provides access to a number of general purpose statistical packages as well as many other programs. For an up-to-date list, please visit our software licensing web site (http://www.web.virginia.edu/rescomp/ldb/swdb.asp). If you have any questions about getting a copy of one of these statistical packages, please contact the Research Computing Support Center at res-consult@virginia.edu or by calling us at 243-8800.