SAS Project Training 1
[3-6]
[3-6]
The online_retail data set contains the sales records up to September 30, 2011.
Every month, the sales department in France will send over their monthly sales records to headquarters.
You will then need to update the online_retail data set to include the newly received sales data from France.
Download the france_sales_201110 data set from the link below:
france_sales_201110.sas7bdat |
Save it in the same directory folder as the online_retail data set so that it can be accessed in the same library in SAS Studio:
Before we can load the data set from France to the online_retail data set, we must ensure the two data sets have the same structure.
They should have the same number of columns, and each column should have the same attributes.
Does the france_sales_201110 data set have the same structure as the online_retail data set?
Let's find out.
A quick look at the france_sales_201110 data set tells us there are 11 columns in the data set:
This is more than we have in the online_retail data set, which has just eight columns.
A more complete comparison between the two data sets can be achieved by running the COMPARE procedure.
Below is an example.
proc compare
base = a.online_retail
compare = a.france_sales_201110
listvar
novalues;
run;
base = a.online_retail
compare = a.france_sales_201110
listvar
novalues;
run;
The COMPARE procedure compares the data set structure, column attributes and even the data points between the two data sets.
There are a few things we noticed from the first table created:
There are seven variables in common between the two data sets, which is good.
There is one variable in the online_retail data set that is not in the France data set.
There are also four variables in the France data set that are not in the online_retail data set.
Finally, there is one common variable in the two data sets that have conflicting types.
Let's look at the second table and find out more details about the inconsistencies.
From the second table, we find out that the INVOICEDATE variable appears only in the online_retail data set but not the France data set.
Instead of having the INVOICEDATE variable, the FRANCE data set has the four variables below:
- INVOICE_MON
- INVOICE_DAY
- INVOICE_YEAR
- INVOICE_TIME
The France data set essentially contains the same invoice information as the online_retail data set.
It just has the information split into four variables.
We can verify this by looking at the data portion of each data set.
Below is the INVOICEDATE column from the online_retail data set. It contains the invoice date and time.
The France data set contains the same information but split into the four columns:
Before we can concatenate the two data sets, these four columns have to be combined into one.
Now, the second table from the COMPARE procedure also displays a variable that has conflicting types between the two data sets:
CUSTOMERID is a numeric variable in online_retail, but a character variable in the France data set.
This also has to be cleaned up before we combine the two data sets into one.
In summary, below are the two things that need to be cleaned up before we concatenate the two data sets:
- Combine the four invoice variables from the france_sales_201110 data set into one variable.
- Convert the CUSTOMERID variable from the france_sales_201110 data set into a numeric variable.
Fill out my online form.