SAS Project Training 1
[5-6]
[5-6]
The German data center has also submitted their sales data record to you for October, 2011.
Your job again is to update the online_retail data set to include the new German sales records.
Unlike France, the Germans have provided you with two data sets, GERMAN_SALES_201110 and GERMANY_PRODUCTS.
Download the two files below and save them in the same directory folder as the online_retail data set.
germany_products.sas7bdat |
germany_sales_201110.sas7bdat |
You should be able to access these two data sets in the same library we previously defined:
The German sales records are split into two data sets.
The germany_sales_201110 data set contains the following variables:
- INVOICENO
- STOCKCODE
- QUANTITY
- INVOICEDATE
- CUSTOMERID
- COUNTRY
The germany_products data set contains:
- STOCKCODE
- DESCRIPTION
- UNITPRICE
We will merge the two data sets and combine the data into one data set.
proc sort data=a.germany_sales_201110 out=germany_201110_sort;
by stockcode;
run;
proc sort data=a.germany_products out=germany_sort;
by stockcode;
run;
data germany_201110_combine;
length match_flag $20;
merge germany_201110_sort(in=a) germany_sort(in=b);
by stockcode;
if a and b then match_flag='in both';
else if a then match_flag='sales only';
else if b then match_flag='product list only';
run;
by stockcode;
run;
proc sort data=a.germany_products out=germany_sort;
by stockcode;
run;
data germany_201110_combine;
length match_flag $20;
merge germany_201110_sort(in=a) germany_sort(in=b);
by stockcode;
if a and b then match_flag='in both';
else if a then match_flag='sales only';
else if b then match_flag='product list only';
run;
The data step above merges the sales data set with the product data set that contains the product description and unit price.
It also creates a flag that identifies whether the observations are found:
- In both data sets.
- In the sales data set only.
- In the product list data set only.
Let's find out the match rates between the two data sets using the FREQ procedure.
proc freq data=germany_201110_combine;
tables match_flag;
run;
tables match_flag;
run;
The FREQ procedure breaks down the number of matched and unmatched observations:
There are 1,378 matched observations.
There are also 1,115 observations that are found only in the product list data set.
These are likely to be the products that were discontinued, and are no longer for sale. These observations can be ignored.
Finally, there are 56 observations that are found in the sales data set but not the product list data set.
These are the sales of the products that are not defined in the product list data set. We will also ignore them for now.
Now, we are going to ignore the unmatched observations and create a data set that contains only the matched observations.
data germany_201110_clean;
set germany_201110_combine;
where match_flag='in both';
drop match_flag;
run;
set germany_201110_combine;
where match_flag='in both';
drop match_flag;
run;
The germany_201110_clean data set now contains only matched observations.
Our goal, again, is to load the sales data from the German data set to the online_retail data set.
Before we do that, we must ensure the germany_201110_clean data set has the same structure as the online_retail data set.
Let's run a COMPARE procedure to compare the two data sets.
proc compare
base = a.online_retail
compare = germany_201110_clean
listvar
novalues;
run;
base = a.online_retail
compare = germany_201110_clean
listvar
novalues;
run;
The result shows that the two data sets are consistent in data set structure and variable attributes.
Now, we can update the online_retail_new data set to include the new sales data from Germany.
data online_retail_new;
set a.online_retail germany_201110_clean;
run;
set a.online_retail germany_201110_clean;
run;
A new online_retail data set (i.e. online_retail_new) again is created.
It combines the original online_retail data set with the germany_201110_clean data set.
Fill out my online form.