Statskom

Kaggle – Counting data with SAS University Edition – PROC FREQ

Andrew N — Tue, 30 Sep 2014 10:01:33 +0000

This series of posts is aimed both at people who want to learn more about SAS and to show how people can use the free SAS University Edition to explore the data from the Kaggle Titanic: Machine Learning from Disaster competition.

In our previous tutorial Getting Started with SAS University Edition we looked at how to import a spreadsheet into SAS and how to use PROC FREQ to explore the relationship between the categorical variable sex and the chance of survival. In this tutorial we look in a bit more detail at the PROC FREQ procedure. This tutorial covers the following:

How to use PROC FREQ to create one way tables, two way tables and more.
What the data in each section of the PROC FREQ output means.
How to use the PROC FREQ output to explore your data and draw conclusion about it.
How to group data together so that its PROC FREQ output is more user friendly.
How to produce a dataset from the PROC FREQ.
How to produce a graph using PROC FREQ.

We assume that you already have SAS University Edition installed and have used PROC IMPORT to read the train.csv file and convert it to a SAS dataset (work.train). If you haven’t done that yet, please refer to the first tutorial.

PROC FREQ

Officially PROC FREQ is designed to ‘produce one-way to n-way frequency and contingency (crosstabulation) tables’ what this means in practice is that it is used to answer questions such as how many…? or what proportion of…? of a particular categorical variable falls into each category. For example, what proportion of passengers survived the titanic disaster. The ‘n-way’ section of the definition means that as well as looking at the proportion of a single variable that fall into each category, we can also look at the proportion of 2,3,…n variables which fall into particular categories, for example, a two way table might show what proportion of passengers in each class survived, whereas a threeway table might show the proportion of passengers of each gender who survived.

One way frequency tables

To produce a one way frequency table we use the TABLES statement along with a single variable. For example if we want to see the number and percentage of passengers who survived we would submit code like the following:

PROC FREQ DATA = train;
  TABLES survived;
RUN;

This would produce output similar to the below, from which we can quickly see that 38.4% of passengers in the training dataset survived and 61.6% died:

Two way frequency tables

To produce a two way frequency count, we would add a second variable to the TABLES statement.

PROC FREQ DATA = train;
  TABLES sex*survived;
RUN;

When producing a two way tables, notice how the variable that you specify first forms the rows of the resulting output and the variable that you specify last forms the columns. Leaving the total information aside for a moment, you will notice that the table contains four cells. Each cell contains data for a particular level of the row (sex) and column(survived) variable.

In our example our four levels are Female and did’t survive (upper left), female and did survive (upper right), male and didn’t survive, (lower left), male and survived (lower right).

You will notice that within each cell there are several numbers, PROC FREQ by default PROC FREQ will supply the following information:

the first number in each cell is the frequency, i.e. a count of the number of events (passengers) at the particular level of row (sex) and column (survived) variable (in our example this is the number of passengers who fall into each category, for example, 81 female didn’t survive (see the red square)).
the second number is the percentage, i.e. the number of events at the particular level of row and column variable as a proportion of the total number of events expressed as a percentage (in our example this is the frequency of passengers at each level divided by the total number of passengers present in the table, for example 52.53% of the total number of passengers are males who did not survive (see the blue square)).
the third number is the row percentage, i.e. the number of events present in a particular cell as a proportion of the total number of events in that row expressed as a percentage(in our example this is either the proportion of male passengers who survived or didn’t survive, divided by the total number of male passengers, or alternatively the proportion of female passengers who survived or didn’t survive, for example, 18.89% of males survived (see the green square)).
the fourth number is the column percentage, i.e. the number of events present in a particular cell as a proportion of the total number of events in that column expressed as a percentage (in our example this is either the proportion of survivors of a particular gender divided by the total number of survivors or the proportion of fatalities of either gender divided by the total number of fatalities, for example, of those that survived, 68.13% were female (see the black square)).

Finally the total row shows the row frequencies, i.e. the total number of males and females and the row percentages, i.e. the total number of males or females as a percentage of the total number of passengers. The total column shows the column frequencies, i.e. the total number of survivors or fatalities and the column percentages, and also the column percentages i.e. the total number of survivors or fatalities as a percentage of the total number of passengers.

n-way frequency tables

Now if we wish to explore the relationship between more than two variables we can also do that with PROC FREQ, for example if we wanted to explore the relationship between PCLASS, SEX and SURVIVED we add in the additional variable to the tables statement as follows:

PROC FREQ DATA = train;
  TABLES pclass*sex*survived /NOCOL NOPERCENT ;
RUN;

This will produce one two way frequency table (sex by survived) for each level of the PCLASS variable (1,2,3), giving a total of three tables as shown below. Note also that we used the NOCOL option to surpress the printing of the column percentages and NOPERCENT to surpress the printing of the overall percentages. We are not interested in these figures at the moment, so removing them helps to draw attention to the items of interest.

This table gives us useful information which is not available in the two way table, for example it tells us that:

among 1st class female passengers, 96.81% survived (blue square in first table).
among 2nd class female passengers, 92.11% survived (green square in second table).
among 3rd class female passengers, 50% survived (red square in third table).

Similarly we see that 2nd and 3rd class male passengers had a very low chance of survival, whereas around one in three first class male passengers survived.

Output datasets

Sometimes it can be useful to use the results of printed output in subsequent calculations. PROC FREQ has an option which enables you to create an output dataset containing the results of its calculations. To do this use the OUT= option and the dataset name, as shown below:

PROC FREQ DATA = train;
  TABLES pclass*sex*survived /out = train_freq ;
RUN;

This will create the dataset train_freq shown below in your work library (highlighted in the red box)

PROC FREQ with continuos data

PROC FREQ is works best with categorical data that takes a relatively small number of values, the age variable takes 88 different values (excluding the passengers with unknown age), lets see what happens when we use the PROC FREQ procedure on the age variable:

PROC FREQ DATA = train_bin;
  TABLES age *survived   ;
RUN;

The output it produces is so large that we can only fit a small section of it on the page at any one time. Does it contain any useful information? Perhaps but the number of passengers in each age group is so small that it’s difficult to draw any conclusions from it. A better approach is to first group the data into a smaller number of categories and then to use this grouped data to in your PROC FREQ. This process of grouping data is formally called binning. There are several ways to achieve this, one simple way is to use the IF…THEN…ELSE logic we learnt in the first tutorial.

This can be achieved using code similar to the below, here we create a new variable AGE_GRP which groups the ages of passengers into 10 year bins.

DATA train_bin;
LENGTH age_grp $20;
SET train;
IF .< age <= 10 THEN age_grp = "0-le10";
ELSE IF 10
The output of this procedure will be as shown below:

This table is definitely more informative than the table in which we treated each age as a separate category. We know from the one way frequency table we produced at the beginning of the tutorial that 38% of passengers in the training dataset survived. We can see from this table that 59% of the passengers in the training dataset who were under 10 survived. Suggesting that children were much more likely to survive the disaster. The previous tutorial showed that women were far more likely to survive than men, our data therefore suggests that the idea that ‘women and children’ had priority in the lifeboats is probably based on fact.
If you open the train_bin dataset you will notice that the age variable is often missing. SAS represents a missing numeric variable with the period symbol. When we assigned our observations to bins, we did not assign the observations with missing values to any bin, the AGE_GRP variable is therefore missing for these observations. PROC FREQ excludes this data from its output, the number of excluded observations is shown by the “Frequency Missing = 177″ line underneath the table.
Graphs with PROC FREQ
Finally PROC FREQ also offers the ability to create basic charts. Viewing your frequency counts in an alternative format can be very informative and with PROC FREQ this can be achieved with very little effort. Adding the option PLOTS = FREQPLOT will produce the barchart shown below.
PROC FREQ DATA = train;
TABLES embarked*survived /NOCOL NOPERCENT PLOTS = FREQPLOT ;
RUN;

From this bar chart we can quickly see that:

the proportion of passengers who embarked at Cherbourne and survived seems to be much greater than the proportion who embarked at other ports.

In the next tutorial we will look at how we can create and customise more complex graphs.
Summary
In this tutorial we have shown how PROC FREQ is a powerful tool for couting data and how ii can be used to create a printed table, a SAS dataset and a graph. We’ve also seen how this output can be used to identify patterns in your data and in particular to see whether certain subgroups have different characteristics.
Using the example code above as a basis try to examine the impact that different combinations of variables have on the probability of survival, by for example adding and removing variables to the tables statement and investigating the output. In the next tutorial we will look at how to create and customise more complex graphs and how to use these graphs to investigate your data.

The post Kaggle – Counting data with SAS University Edition – PROC FREQ appeared first on Statskom.

Kaggle – Getting started with SAS university edition

Andrew N — Wed, 24 Sep 2014 19:45:34 +0000

This is the first of our tutorials on using SAS university edition to explore the data from the Kaggle Titanic: Machine Learning from Disaster edition. The tutorial is designed to be roughly equivalent to the first excel lesson available on the Kaggle website. It should be useful both for people who want to learn SAS, but also for those who want to use SAS to enter the Kaggle competition. It explains how to use the SAS University Edition to do the following:

Reading a CSV file into SAS using PROC IMPORT.
Calculating the proportion of men and women who survived using PROC FREQ.
Using IF…THEN..ELSE logic to modify a dataset.
Using the KEEP statement to only keep variables that you want in your dataset.
Outputting a dataset to a CSV file in preparation for submitting it to Kaggle

Before you get started you will need to install a virtual machine and then the SAS university edition files. If you haven’t done this yet that you can follow the tutorial here.

Now launch the virtual machine and log onto the SAS University Edition.

Download the train.csv and test.csv file from the Kaggle website and store them within the shared folder you setup when installed SAS unviersity edition, usually this will be : C:\SASUniversityEdition\myfolders\ .

Now we will import the train.csv file. To import the CSV file, we will use the PROC IMPORT procedure. Enter the following code into the code window as shown below.

PROC IMPORT DBMS=csv OUT=train  replace
  DATAFILE="/folders/myfolders/train.csv";
  GETNAMES=YES;
RUN;

Here the DATAFILE option tells the procedure the name and location of the input file (note how this is relative to the shared folder you setup when you installed SAS University Edition).
The DBMS option tells the procedure the type of file it will be reading.
The OUT = option gives the name of the output file, as we are not specifying a LIBREF this dataset is created in the work library.
The REPLACE option tells the procedure that if the TRAIN file already exists in the work library then it should be overwritten.
The GETNAMES option tells the procedure that the first row of the CSV file contains the values that will be used to label the variables in the TRAIN dataset.

Execute the code by pressing F3, or by clicking on the little running man.

The program will import the CSV file into SAS and convert it to a SAS dataset. The log file will show something similar to the below, confirming that the import was successful, it will also have information on the variables which have been created.

891 rows created in WORK.TRAIN from /folders/myfolders/train.csv.

 NOTE: WORK.TRAIN data set was successfully created.
 NOTE: The data set WORK.TRAIN has 891 observations and 12 variables.
 NOTE: PROCEDURE IMPORT used (Total process time):
       real time           0.17 seconds
       cpu time            0.18 seconds

To check that the import has worked as expected, open the train dataset within the work library as shown below.

Next we look to see what effect the gender of the passenger has on their chances of survival. To do this we use the PROC FREQ procedure as follows:

Enter the following into the code window, highlight it and execute the block of code either with the F3 key or by pressing the running man icon.

PROC FREQ DATA =train ;
  TABLES survived*sex / NOROW NOPERCENT;
RUN;

This code produces a two way frequency table. Note that we use the NOROW and NOPERCENT options to suppress the printing of some unwanted information. The output window should automatically be shown and should display something similar to the below.

This table tells us lots of useful information.

The bottom right number tells us the total number of passengers in the dataset (891)
We can also see the number of females (314) and the number of males (577).
The number of people who survived (342) and the number that did not (549)
Most interestingly perhaps we can see the percentages of men and women who survived.

We notice that 74.2 % of women on board survived, but only 18.89% of men. It seams reasonable therefore that we might first attempt to work out which passengers from the training set survived, by guessing that all females survived and all males did not. To make a submission to Kaggle, you need to import the test.csv file, modify the dataset, output it back to a CSV file and then upload it to Kaggle, the process for doing this is explained below.

The process for importing the test.csv file is the same as the process for importing the train.csv file explained above with the exception that the file name is changed. Try to write and submit this code yourself, or scroll down to see how it should look.

PROC IMPORT DBMS=csv OUT=test replace
  DATAFILE="/folders/myfolders/test.csv";
  GETNAMES=YES;
RUN;

Next we want to use conditional programming logic to assign values to a variable SURVIVED according to the gender of the passenger. To do this we use IF…THEN…ELSE logic as shown in the code below to create a dataset called gender_submit

DATA gender_submit(KEEP=survived passengerid );
  SET test;
  IF sex = "female" THEN survived = 1;
  ELSE survived = 0;
 RUN;

Survival in this dataset is indicated when the SURVIVED variable takes a value 1. In this example we use IF…THEN… ELSE logic to set the value of the SURVIVED variable to 1 if the sex variable = “female” and to set the value of the SURVIVED variable to otherwise. When making a submission for this challenge, your CSV file can only contain the variables SURVIVED and PASSENGERID. the KEEP statement in the first line, ensure that the output dataset only contains the variables you are interested in. To confirm this has worked as expected, open the dataset gender_submit from the work library, it should look similar to the following;

Finally you need to convert the dataset gender_submit into a CSV. This is done with the PROC EXPORT procedure as shown below.

PROC EXPORT DATA=gender_sub DBMS=csv
  OUTFILE= "/folders/myfolders/gender_sub.csv" REPLACE;
RUN;

Notice how the OUTFILE option tells the procedure where to write out the CSV file. Your shared folder should now contain the CSV file. Upload this to Kaggle and make your first submission.

The next tutorial will explain how to use graphics to explore your data further.

The post Kaggle – Getting started with SAS university edition appeared first on Statskom.

Kaggle – Titanic: Machine Learning from Disaster

Andrew N — Tue, 23 Sep 2014 13:21:47 +0000

Kaggle is a platform for analytics and data science. It presents a series of competitions which are free to enter and often offer cash prizes. The competitions provide a great way to learn about programming data analysis and machine learning.

One of its most popular competitions involves using a dataset of subject characteristics to predict which passengers survived the sinking of the Titanics. Kaggle provides a series of tutorials explaining how to get started on this problem using Excel, Python or R, however there is no tutorial on how to get started with SAS.

SAS recently released SAS University Edition. It is a free version of SAS designed for those interested in learning to work with the SAS language. It includes Base SAS, SAS Stat, SAS Access, SAS IML and SAS Studio. It can be downloaded from the SAS website here.

We’ve decided to take advantage of the release of SAS University Edition to provide a series of introductory tutorials which explain how to use SAS to enter the Kaggle – Titanic: Machine Learning from Disaster competition. The tutorials are aimed at those people who are new to the SAS language and are interested in learning more. They will be posted here over the next few days.

The post Kaggle – Titanic: Machine Learning from Disaster appeared first on Statskom.