Task [3-4]
Now that we have a good understanding of the data set, we can look into how we can create the macro that aggregates the newly collected sales data into the online_retail data set.
Download the france_sales_201110 data set from the link below:
france_sales_201110.sas7bdat |
Save it in the same directory folder as the online_retail data set so that it can be accessed in the same library on SAS Studio:
The france_sales_201110 data set contains the data from the sales department in France.
We need to aggregate it into the master online_retail data set.
In order for us to combine the two data sets, the france_sales_201110 must be in the same structure as our master sales data set.
The two data sets should have the same number of columns and data structure.
Is this the case with the france_sales_201110 data set? We can find out by using the COMPARE procedure to compare the two data sets.
proc compare
base = a.online_retail
compare = a.france_sales_201110
listvar
novalues;
run;
base = a.online_retail
compare = a.france_sales_201110
listvar
novalues;
run;
The COMPARE procedure compares the online_retail data set with the france_sales_201110 data set.
There are a few things we noticed from the first table created:
There are 7 variables in common between the two data sets, which is good.
However, there is one variable in the France data set but not in the online_retail data set.
There are also three variables in the online_retail data set that are not in the France data set.
These variables have to get cleaned up.
In addition, there is one common variable in the two data sets that have conflicting type.
This means the variable is a numeric in one data set but character in the other.
Let's look at the second table and find out more details about the inconsistencies:
From the second table, we find out that INVOICEDATE variable appears only in the online_retail data set but not the France data set.
Instead of having the INVOICEDATE variable, the FRANCE data set has the four variables below:
- INVOICE_MON
- INVOICE_DAY
- INVOICE_YEAR
- INVOICE_TIME
The France data set essentially contains the same invoice information as the online_retail data set.
It just has the information split into four variables.
Now, we can verify the information by just looking at the data portion of each data set.
Below is the INVOICEDATE column from the online_retail data set:
The France data set contains the same information but split into the four columns:
The four columns have to be combined into one column before we can aggregate the France data set to the online_retail data set.
Now, the second table from the COMPARE procedure also displays the variable that has conflicting type between the two data sets:
The CUSTOMERID variable is a numeric in online_retail but a character in the France data set.
This also has to be cleaned up before we combine the two data sets into one.