Search the site...

SASCRUNCH TRAINING
  • Home
  • Member's Area
  • How to Start
  • SAS Interface
  • Creating a Data Set
  • Practical SAS Training Course
  • SAS Certified Specialist Training Program
  • Proc SQL Course
  • Introduction to Time Series Analysis
  • SAS Project Training Course
  • Full Training / Membership
  • Sign up
  • About us
  • Contact us
  • Home
  • Member's Area
  • How to Start
  • SAS Interface
  • Creating a Data Set
  • Practical SAS Training Course
  • SAS Certified Specialist Training Program
  • Proc SQL Course
  • Introduction to Time Series Analysis
  • SAS Project Training Course
  • Full Training / Membership
  • Sign up
  • About us
  • Contact us
Task [2-4]


The next step to understand the data set is to examine the missing values.

Since the data set is fairly large, it is highly inefficient to check the missing values line by line.

We will use a FORMAT procedure to systematically identify the number of missing values in each columns.
proc format;
 value missing
  . = 'missing'
  other = 'present'
  ;
 value $missing
  ' ' = 'missing'
  other = 'present'
  ;
run;    

The FORMAT procedure creates two formats for missing values: one for numeric variable and another one for character variable.

Once the formats are created, a Proc Freq procedure can be run to find out the proportion of missing values in each columns:
proc freq data=a.online_retail;
 tables _all_ /missing;
 format _numeric_ missing. _character_ $missing.;
run;

The FREQ procedure creates eight tables that display the number of missing values in each variables.

Most of the columns have no missing values except for the DESCRIPTION and CUSTOMERID
Picture

The situation with CUSTOMERID is more alarming as there are almost 100,000 observations missing the customer id.

Now, we will also look at the number of purchases by country.
proc freq data=a.online_retail;
 tables country;
run;

From the table created, we know that most of the sales came from United Kingtom:
Picture

Let's also look at the summary statistics for UNITPRICE and QUANTITY sold.

A MEANS procedure is run on the two variables:
proc means data=a.online_retail min max mean median mode maxdec=2;
 var unitprice quantity;
run;

The minimum, maximum, mean, median and mode are computed for the two numeric variables.

Some statistics actually caught our eyes.
Picture

The minimum unit price and quantity are negative.

These are likely due to returns and fraud.

A simple Proc SQL step can be run to identify these observations from the table:
proc sql;
select *
from a.online_retail
where unitprice <0;
select *
from a.online_retail
where quantity <0;
quit;

Now, we will re-run the analysis for purchases by country without the records where the unit price or quantity are negative.
proc freq data=a.online_retail;
 tables country;
 where unitprice >0 and quantity >0;
run;

Overall, United Kingdom is still the country where most sales are generated:
Picture
Go to task 3
Do you need help?
Already a member? Go to member's area.