Language of Data

Interactive SAS tutorials supporting the OpenIntro Introduction to Modern Statistics textbook.

Language of Data

This tutorial does not assume any previous SAS experience, but you should be familiar with SAS from the SAS Bootcamp and will learn more about SAS over the course of the semester in PHS 703.

SAS Macros

Macros in SAS are packages of pre-written code that allow users to perform specific tasks without having to write the code for those tasks themselves. This may be a new concept to you, but you need to trust me that macros are extraordinary useful. We will rely on a few macros in these tutorials to do simple things like making working copies of course data sets and checking the contents of datasets.

Macros in SAS are identified by the % sign. For example, if you see %use_data(census2010) in a SAS program, this line is calling the %use_data macro with the census2010 argument. The name, arguments, and details of the macro can by found by looking at the macro code itself.

For example, here are 2 macros that we will use frequently in these tutorials:

It is important to note that these macros are not generally available to use unless and until you define them. We will do that by running the following short SAS program every time we start a new tutorial. This program tells SAS where the course dataset library is located, how to find the relevant data formats, and defines the two macros described above.

%include "~/my_shared_file_links/hammi002/sasprog/run_first.sas";

Run this code now by copying the code from here into SAS Studio in SAS ODA and clicking Running man(Run). You’ll do with with any code in a code block like above, by the way. You may want or need to open a new, blank program file in SAS Studio (using the F4 key) for each tutorial. As with any program in SAS, after running, check to log to ensure SAS did not encounter any errors.

Feel free to open this program file to see the SAS code associated with the %use_data and %glimpse macros.

Data in SAS

In this lesson, we’ll begin by introducing the terminology of datasets and data in SAS.

One of the datasets that we will work with in this tutorial comes from the High School and Beyond Survey, which is a survey conducted on high school seniors by the National Center of Education Statistics.

The data are organized in what we call a data set, where each row represents an observation (or record or case) and each column represents a variable (or field). If you ever use spreadsheets, such as Excel, this representation should be familiar to you.

In this lesson we’ll work with the High School and Beyond dataset, stored in the course SAS dataset library. The data are stored in a data set called hsb2.

To make a copy of this dataset to use, copy/paste and run the following code:

* Make a working copy of the HSB2 dataset for your use;
%use_data(hsb2);

Because we aren’t requesting any output, SAS Studio should take you to the Output Data tab, so you can preview the dataset. You can use the scroll bars on the right and bottom of the window to scroll through the records and variables in the dataset, respectively.

Loading data into SAS

Note that there are many ways of loading data into SAS, depending on where and how your data are stored. In this lesson, we’re using a dataset that is located in the PHS 701 dataset library. Other commonly used formats of data are plain text files (.txt), comma-separated value files (.csv), and Excel files (.xls or .xlsx).

In all the tutorials in these series, we will use existing data from the PHS 701 library, so you won’t need to worry about the myriad ways of loading data into SAS. However, learning more about loading data into SAS is important for when you’re working on an analysis with your own data. You should learn about this in PHS 703, but other resources are available (for example, from SAS documentation or from a different SAS tutorial).

Take a peek

When you want to work with data in SAS, a good first step is to take a peek at what the data look like. The %glimpse() macro we have defined is one good way of doing this. Copy/paste and run the code below:

* Gives you a glimpse into the data;
%glimpse(hsb2);

The output of %glimpse includes output from PROC CONTENTS . The top section of this output includes information about the dataset itself. Make sure you can find the following key information here:

The next section, titled Engine/Host Dependent Information, includes more technical information about the dataset and can generally be ignored.

The final section of the PROC CONTENTS output, titled List of Variables and Attributes, is essential, however. For each variable, it lists that variable’s name, data type (Numeric or Character), label (if assigned), and format (if used). Before we look into this section in more detail, also note that %glimpse prints out the variable values of the first 10 records.

Based on both this last section of PROC CONTENTS output and the first 10 records of the dataset, make sure you understand the following information:

Well, that ses variable is confusing, isn’t it? The data type is numeric, but the expected values are character. This is where the format comes in. SAS formats are tricky, so I want to make sure you understand how we will use them in the course datasets. The most important thing to know about SAS formats is that they assign a label to a specific data value.

For ses, this means assigning the label “Low” to the numeric value 1, “Middle” to the numeric value 2, and “High” to the numeric value 3. In this course, we actually include the underlying data value in the label, to help you understand this relationship. That’s why, in the sample records, you see “1: low”, “2: middle”, and “3: high”. The numeric values (1, 2, and 3) are what is stored in the SAS dataset to represent each of the different labels.

A key point to remember (and we’ll reiterate this later) is that when you interact with the ses variable, you will have to use the numeric, unformatted values in order for SAS to know what we’re talking about.

Practice time

Next, we’ll practice on another dataset, email50, which contains a subset of incoming emails for the first three months of 2012 for a single email account. Make a working copy of the dataset and examine its structure.

%use_data(email50);
%glimpse(email50);

Can you answer the following questions?

Types of variables

When you first start working with a dataset, it’s good practice to take a note of its dimensions; how many rows or observations and how many columns or variables the data frame has.

You learned how to do this in the previous section using the %glimpse macro. You also learned how to take a quick look at the list of variables in the dataset. In this section, we will delve deeper into the classification of variables as numerical or categorical. This is an important step, as the type of variable helps us determine what summary statistics to calculate, what type of visualizations to make, and what statistical method will be appropriate to answer the research questions we’re exploring.

There are two types of variables: numerical and categorical.

Numerical data

Numeric variables can be further categorized as continuous or discrete.

When determining whether a numerical variable is continuous or discrete, it is important to think about the nature of the variable and not just the observed value, as rounding of continuous variables can make them appear to be discrete. For example, height is a continuous variable, however we tend to report our height rounded to the nearest unit of measure, like inches or centimeters.

Categorical data

Categorical variables that have ordered levels are called ordinal.

Think about a survey question where you’re asked how satisfied you are with the customer service you received and the options are very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied. These levels have an inherent ordering, hence the variable would be called ordinal.

If the levels of a categorical variable do not have an inherent ordering to them, then the variable is simply called categorical. For example, do you consume caffeine or not?

Variables in hsb2

Let’s take a moment to go through the variables in the High School and Beyond dataset again:

%glimpse(hsb2);

Using the %glimpse macro, we can obtain a list of the variables in the dataset and also see what the values stored in these variables look like.

The first variable is id, which is an identifier variable for the student. Strictly speaking, this is a categorical variable, though the labeling of this variable is likely not that useful since we would not use this variable in an analysis of relationships between the variables in the dataset. You can think of this variable as being an anonymized version to having the names of the students in the dataset.

The next variable is gender, a categorical variable, with levels "male" and "female". It should be noted that the language of government surveys, such as High School and Beyond, is slow to change. So with these types of data, you will continue to see variables mislabeled as “gender” when they in fact measure the biological sex (male, female) of the participant.

There is no inherent ordering to the levels of this variable, no matter what anyone tells you! So, this is just a categorical variable. The same is true for the race variable, which has levels of "white", "african american", "hispanic", and "asian".

Socio-economic status (ses) on the other hand, has three levels "low", "middle", and "high" that have an inherent ordering, hence this variable is an ordinal categorical variable.

School type (schtyp) and program (prog) are also both categorical variables, with no inherent ordering to their levels.

The remaining variables are scores that these students received in reading (read), writing (write), math (math), science (science), and social studies (socst) tests. Since these scores are all whole numbers, and assuming that it is not possible to obtain a non-whole number score on these tests, these variables are discrete numerical.

Next we will practice identifying variables in a different dataset.

Variables in email50

Let’s have another look at the email50 data, so we can practice identifying variable types.

Use the %glimpse macro to view the variables in the email50 dataset.

%glimpse(email50);

Review the output to identify each variable as either numerical or categorical, and further as discrete or continuous (if numerical) or ordinal or not ordinal (if categorical).

Categorical data in SAS

A common step in many analyses that involve categorical data is a subgroup analysis, where we work with only a subset of the data. For example, analyzing data only from students in public schools or only for students who identified as female. We can obtain these subsets by either using where statements in SAS procedures or by creating a new copy of the dataset, filtered for the specific level(s) we’re interested in. The latter approach is fine for small datasets (and what we’ll do below), but can be wasteful of resources when working with very large datasets.

Suppose we want to do an analysis of only the students in public schools in the High School and Beyond dataset. Let’s first find out how many such students there are. One option for obtaining this information in SAS is PROC FREQ. This basic SAS procedure is very useful for presenting and analyzing categorical data. In this case we’re interested in the number of students for each level of the schtyp (school type) column:

* Frequency table for school type;
proc freq data=hsb2;
	tables schtyp / missing;
run;

You should see that there are 168 students in public schools and 32 in private schools.

The missing option on the tables statement is very helpful for checking to see if your variable has any missing data. In these data, there is no missing data for school type.

Next, let’s create a new dataset that includes only public school students:

* New dataset including only public school students;
data hsb2_public;
	set hsb2;
	where schtyp = 1;
run;

We can read the above code as: “create a new dataset, hsb2_public, by using the hsb2 dataset where school type is equal to 1. This is an instance where it is essential to know that the school type variable (schtyp) is formatted, and that the underlying numeric value of 1 is associated with the desired school type of “public”.

Let’s check the frequency distribution of schtyp to ensure that this subset worked:

* Frequency table for school type;
proc freq data=hsb2;
	tables schtyp / missing;
run;

As an aside, because the school type variable contains numeric values to represent different text labels (i.e., is formatted), the following code would not work to achieve this same subsetting:

* Cannot use formatted values to subset a dataset;
data hsb2_public;
	set hsb2;
	where schtyp = "public";
run;

You can try it if you’d like, but you’ll get an error that the where clause has incompatible variables. This just means that we were trying to subset a variable that contains numeric data using a text value.

Note that if we wanted to create a subset of the hsb2 data that only contained white students, we would use a text value, since data in the race variable are stored as character values:

* New dataset including only white students;
data hsb2_white;
	set hsb2;
	where race = "white";
run;

In this case, the text value in the where statement must match the values stored in the data exactly (i.e., same capitalization and, if applicable, spacing). This is another reason why checking the distribution of a character variable using PROC FREQ is helpful. The output shows you exactly what values appear in the dataset variable.

Now we will practice filtering and handling factors with a different categorical variable.

Filtering based on a factor

Next, we’ll practice working with a factor variable, number, from the email50 dataset. This variable tells us what type of number (none, small, or big) an email contains.

How would we create a new dataset called email50_big that is a subset of the original email50 dataset containing only emails with "big" numbers. This information is stored in the formatted number variable.

* Check number distribution in email50;
proc freq data=email50;
	tables number / missing;
run;

* New dataset including only big number emails;
data email50_big;
	set email50;
	where number = 3;
run;

Does it make sense why we limited the new dataset to records where the number variable equaled 3? How many records had “big” numbers?

You can always %glimpse the new dataset to set more detail about the new dataset:

%glimpse(email50_big)

Creating new variables from existing variables

Two common ways of creating a new variables from an existing variables are (1) discretizing numerical variable values and (b) combining categorical variable values.

Discretizing is the process of converting a numerical variable to a categorical variable based on certain criteria. For example, suppose we are not interested in the actual reading score of students, but instead whether their reading score is below average or at or above average. We could create a new variable based on the scores in read to achieve this.

Combining is the process of reducing the number of categories (values) in a categorical variable. For example, the email50 dataset has a categorical variable called number with levels "none", "small", and "big". But suppose we’re only interested in whether an email contains a number. We would create a new variable that combines the levels "some" and "big" into one level named "yes", and keeps the "none" level separate as a level named "no".

Neither of these processes will be covered here, since this is something covered in detail in PHS 703. If there is need to discretize or combine data in a 701 tutorial, we will give you the needed code.

You have successfully completed this tutorial.

< Back to Section 1