SAS Project Training 1
[2-6]
[2-6]
The next step to further understand the data set is to examine the missing values.
The data set has more than 370,000 observations.
It would be highly inefficient to check the missing values line by line.
Instead, we will use a FORMAT procedure to systematically identify the number of missing values in each column.
proc format;
value missing
. = 'missing'
other = 'present'
;
value $missing
' ' = 'missing'
other = 'present'
;
run;
value missing
. = 'missing'
other = 'present'
;
value $missing
' ' = 'missing'
other = 'present'
;
run;
The FORMAT procedure creates two formats for the missing values: one for the numeric variable and another for character variable.
Once the formats are created, a Proc Freq procedure can be run to find out the number of missing values in each column:
proc freq data=a.online_retail;
tables _all_ /missing;
format _numeric_ missing. _character_ $missing.;
run;
tables _all_ /missing;
format _numeric_ missing. _character_ $missing.;
run;
The FREQ procedure creates eight tables that display the number of missing values for each variable.
Most of the columns have no missing values except for the DESCRIPTION and CUSTOMERID:
The situation with CUSTOMERID is more alarming as there are almost 100,000 observations missing a customer id.
Now, we will also look at the number of purchases by country.
proc freq data=a.online_retail;
tables country;
run;
tables country;
run;
From the table created by the FREQ procedure, we know that most of the sales came from the United Kingdom:
Let's also look at the unit price and the quantity sold.
A MEANS procedure is run on the two columns:
proc means data=a.online_retail min max mean median mode maxdec=2;
var unitprice quantity;
run;
var unitprice quantity;
run;
The minimum, maximum, mean, median and mode are computed for the two numeric variables.
The minimum unit price and quantity sold caught our eye.
The minimum unit price and quantity are negative.
These negative values are likely due to returns and fraud.
A simple Proc SQL step can be run to identify these observations from the table:
proc sql;
select *
from a.online_retail
where unitprice <0;
select *
from a.online_retail
where quantity <0;
quit;
select *
from a.online_retail
where unitprice <0;
select *
from a.online_retail
where quantity <0;
quit;
The tables generated show all the records with a negative unit price and quantity:
Now, we will re-run the analysis for purchases by country without the negative unit price and quantity.
proc freq data=a.online_retail;
tables country;
where unitprice >0 and quantity >0;
run;
tables country;
where unitprice >0 and quantity >0;
run;
Overall, the United Kingdom is still the country where most sales are generated:
Fill out my online form.