Data Processing in SAS: How & Why
Understanding a SAS Program
Before analyzing data in SAS, we should understand how a SAS program relates to data. Think of a surgeon (you!) getting ready with a surgery kit (your SAS program), where the patient is the data. We need to understand SAS data the way a doctor examines a patient!

Fundamentally, a SAS program is a sequence of steps that can be submitted for execution. There are two types of steps in a SAS program:
- Data steps are normally used to create SAS data sets.
- Proc steps are normally used to produce reports, but sometimes to create SAS data sets.
Data must be in the form of a SAS data set in order to be processed by SAS!
Employee No. | Job Role | Country | Salary | Global Code |
12345AQ | Seller | US | $345 | XYZ67 |
32345GR | Accounts | India | $280 | BYU76 |
23453RR | Analytics | UK | $343 | HHN65 |
23467QW | Operations | US | $290 | KLH34 |
32345PU | Finance | India | $305 | LLG87 |
SAS data sets are made up of rows and columns. Columns are frequently called “variables” or “fields”, and rows are frequently called “records” or “observations”.

Looking at the data set above, “Employee No.”, “Job Role”, “Country”, “Salary”, and “Global Code” are all variables. The information on each employee (each row) is considered a record.
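As a minimal sketch, the table above could be typed into a SAS data set with a data step like the one below. The data set name emp and the variable names empno, jobrole, country, salary, and code are hypothetical, chosen only for illustration. The salary is entered without the dollar sign and displayed with a dollar format.

```sas
data emp;
   /* Declare the variables in table order; character lengths are in bytes */
   length empno $ 7 jobrole $ 10 country $ 5 salary 8 code $ 5;
   input empno $ jobrole $ country $ salary code $;
   /* Display the numeric salary with a leading dollar sign */
   format salary dollar6.;
   datalines;
12345AQ Seller     US    345 XYZ67
32345GR Accounts   India 280 BYU76
23453RR Analytics  UK    343 HHN65
23467QW Operations US    290 KLH34
32345PU Finance    India 305 LLG87
;
run;

proc print data=emp;
run;
```

Each line after the datalines statement becomes one observation, and each name in the input statement becomes one variable.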
Variable Types: There are two types of variables:
- Character variables can be from 1 to 32,767 characters (bytes) long (the limit was 200 in SAS Version 6).
- Numeric variables are stored as floating point numbers in 8 bytes (by default) of storage.

In our data set, “Employee No.”, “Country”, and “Global Code” are examples of character variables, and “Salary” would be a numeric variable.

Date variables stored as SAS dates are special numeric variables. A SAS date value is interpreted as the number of days between January 1, 1960, and the date. The SAS date value of January 1, 1960 is 0, the value of January 2, 1960 is 1, and so on. Any date before January 1, 1960 has a negative value.
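The date arithmetic described above can be checked with a short data step that writes to the log. This is only a sketch; the variable names d0, d1, and dneg are made up for the example.

```sas
data _null_;
   /* Date literals are written as 'ddMONyyyy'd */
   d0   = '01JAN1960'd;
   d1   = '02JAN1960'd;
   dneg = '31DEC1959'd;
   /* Write the underlying numeric values to the log */
   put d0= d1= dneg=;
   /* Write the same value again, formatted as a date */
   put d1 date9.;
run;
```

The log should show d0=0 d1=1 dneg=-1, followed by 02JAN1960. The stored value is an ordinary number, and a format such as date9. only changes how it is displayed.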
Missing Data: Often, a data set will contain missing values. A missing value is displayed as a period (‘.’) for numeric data and as a blank (‘ ’) for character data.
Employee No. | Job Role | Country | Salary | Global Code |
12345AQ | Seller | | $345 | XYZ67 |
32345GR | Accounts | India | $280 | BYU76 |
23453RR | | UK | $343 | HHN65 |
23467QW | Operations | US | . | KLH34 |
32345PU | | India | . | |
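To see how missing values behave in a program, here is a small sketch using a cut-down version of the table. The data set names empmiss and flags and all variable names are hypothetical. With this simple list-style input, a single period in the data lines is read as a missing value for both character and numeric variables.

```sas
data empmiss;
   length empno $ 7 country $ 5 salary 8;
   input empno $ country $ salary;
   datalines;
12345AQ .     345
23467QW US    .
32345PU India .
;
run;

data flags;
   set empmiss;
   /* missing() returns 1 when the value is missing, 0 otherwise */
   nocntry = missing(country);
   nosal   = missing(salary);
run;

proc print data=flags;
run;
```

The first row should get nocntry=1 and nosal=0, and the other two rows nocntry=0 and nosal=1. The same checks can be written as comparisons: salary = . for a numeric variable and country = ' ' for a character variable.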
Editor, log, and output windows
When you open SAS, you’ll notice three separate screens: the program editor screen, the log screen, and the output screen.
The program editor screen is where SAS code is typed.
The log screen contains the SAS statements that have been submitted, as well as notes and messages about your SAS session.
The output screen displays output from proc step execution.
Submitting a SAS program
When you click on Locals->Submit, all code currently in the program editor will be executed! If you’d like to submit only a portion of the code, select the code you’d like to run, copy it, and then click on Locals->Submit Clipboard.

If you submit all code in the program editor (by selecting Locals->Submit), you’ll notice that the code disappears after you submit it! To get it back, simply select Locals->Recall Text.
SAS Data Sets
SAS data sets can be kept anywhere on your PC. (Note: many collections data sets are very large! Check the size before copying one to your own machine!) When referring to SAS data sets in a program, we often create a nickname, or libname, for the location of the data set.
For example, if you have a data set named data1 in your c:\My Documents folder, at the beginning of the program you may want to give this location a nickname, so that it will only have to be entered once. The following code does this.
libname read 'c:\My Documents';
Now, instead of mentioning c:\My Documents every time you want to access a data set in this folder, you reference “read” instead.

SAS creates a temporary SAS library (a place where data sets are stored) every time you open SAS. This library is automatically given the libname “work”, and its contents are deleted when you close SAS.

In any program, data sets that you use come from two places:
- The temporary (work) library
- The permanent library (libraries) mentioned in the libname statement
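The difference between the two kinds of library shows up in the data set name. The sketch below reuses the data1 data set and the c:\My Documents folder from the earlier example; the output names tmp1 and data2 are hypothetical.

```sas
libname read 'c:\My Documents';

data work.tmp1;
   /* One-level name "tmp1" would mean the same thing: the work library */
   set read.data1;
run;

data read.data2;
   /* Two-level name: stored in c:\My Documents and kept after SAS closes */
   set read.data1;
run;
```

A name with no libname in front of it is looked up in the work library, so tmp1 and work.tmp1 refer to the same temporary data set.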
The following code will print out a data set from the network:
libname read 't:\SAS';
proc print data=read.datasamp;
run;
In this sample code, the data set we are printing is called “datasamp”, and this data set is located at t:\sas.

Often we need to make changes to the data before we print it or produce a report. This is done in the SAS data step.

For example, say we would like to propose a new minimum payment of 1/24th of the salary in our sample data set. How could we see what 1/24th of the salary would be?
data tmp1;
   set read.datasamp;
   newpymt = salary / 24;
run;

proc print data=tmp1;
run;
We first use a data step to read in our sample data set, and then create a new variable “newpymt” that is equal to the salary divided by 24. When the “run” statement of the data step is executed, a new data set called “tmp1” (located in the temporary “work” library) is created, so that we don’t overwrite our initial data set. Then we print the new data set to see what the results are.

Note: both data set names and variable names can contain up to 32 characters (only 8 in SAS Version 6). They have to start with a letter or an underscore, and can contain any combination of letters, numbers, and underscores after that.
Comments in SAS programs
If you wish to make comments (notes to yourself that won’t be executed), you may do so by enclosing the comment like this:
/* Here is my comment. */
Avoid starting a comment in column 1: on a mainframe, a /* in columns 1 and 2 can be interpreted as something other than SAS code (for example, as a request to end the job)! It’s good programming practice to start all proc, data, and run statements in column 1 and indent all other lines. Putting blank lines between segments of the program makes it easier to read.
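Put together, these layout suggestions might look like the sketch below. It reuses the datasamp example from earlier, and the var statement in the proc print is added only to show the indentation.

```sas
libname read 't:\SAS';

data tmp1;
   /* Comments are indented so that no line starts with a slash */
   set read.datasamp;
   newpymt = salary / 24;
run;

proc print data=tmp1;
   /* Show only the original salary and the proposed payment */
   var salary newpymt;
run;
```

The data, proc, and run statements start in column 1, everything between them is indented, and a blank line separates each step from the next.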
Useful SAS Procs
While there are several procs in SAS that can be used, there are a handful that are particularly useful for collections reporting. Probably the most useful is proc print, which prints out the data set that you specify. The syntax looks like this:
proc print data=tmp1;
run;
In this case, the data set that would be printed is tmp1. Note that tmp1 must be a temporary SAS data set (in the work library), since we don’t mention a libname.

This proc print would print out all the variables in the data set. But what if we only wanted a few of the variables?
proc print data=read.datasamp;
var acctno due balance;
run;
This proc print would only print out the fields acctno, due, and balance.
You’ll notice that proc print output has a column labeled ‘Obs’ on the left-hand side. It is not a variable stored in the data set; proc print generates it automatically to number the observations (rows).
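If the observation counter is not wanted, proc print has a noobs option that leaves it out. The sketch below also uses the obs= data set option to print only the first rows of a large data set; the cutoff of 10 is arbitrary.

```sas
proc print data=read.datasamp noobs;
   /* noobs drops the Obs column from the listing */
   var acctno due balance;
run;

proc print data=read.datasamp (obs=10);
   /* (obs=10) stops reading after the 10th observation */
   var acctno due balance;
run;
```

Limiting the number of rows is a handy habit when the data set is large and you only want a quick look at the values.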
Another useful proc in SAS is proc contents. Proc contents displays the names of all the variables in your data set, as well as other information about each variable, and the data set as a whole. The syntax is below:
proc contents data=read.datasamp;
run;
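Two variations on proc contents may be worth trying, shown here as a sketch against the same read library. The varnum option lists the variables in the order they sit in the data set rather than alphabetically, and data=read._all_ with nods lists the data sets in the library without the detail for each one.

```sas
proc contents data=read.datasamp varnum;
run;

proc contents data=read._all_ nods;
run;
```

The second form is a quick way to find out which data sets a libname contains before deciding which one to print.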
Proc Freq and Proc Univariate
Proc freq and proc univariate are useful tools to help you see the distribution of the data. Proc freq can be used on categorical variables (variables that only have a few possible values), and proc univariate should be used with continuous variables (variables that have a large or infinite number of values).
Here’s the syntax for proc freq:
proc freq data=read.datasamp;
table due;
run;
Put the variables for which you’d like to see the distribution in the table statement.

Proc univariate provides many pieces of information on continuous variables, including mean, median, and the maximum and minimum values.
Here’s what the syntax looks like for proc univariate:
proc univariate data=read.datasamp;
var balance;
run;
The variables for which you’d like to see a distribution should go in the var statement.
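Both procs accept more than one variable. The sketch below assumes that due, balance, and salary all exist in read.datasamp, as the earlier examples suggest. The missing option on the table statement is optional and counts missing values as their own category.

```sas
proc freq data=read.datasamp;
   /* "/ missing" includes missing values in the frequency table */
   table due / missing;
run;

proc univariate data=read.datasamp;
   /* One set of statistics is produced for each variable listed */
   var balance salary;
run;
```

Listing a continuous variable such as balance in proc freq would produce one row per distinct value, which is why proc univariate is the better choice for it.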