Project 2 section 15

Time Series Modeling
[15-15]

In this section, we return to the SALES data set we worked on at the beginning.

We will build an ARIMA model to forecast total sales based on the sales record over the 60-day period.

If you haven't created the SALES3 data set, copy and run the code below:

data sales;
retain month;
input Week Day TotalSales OrdertypeA OrdertypeB OrdertypeC;
if _n_ <= 19 then month = 1;
else if _n_ <= 39 then month = 2;
else month = 3;
datalines;
1 4 539.577 61.543 175.586 302.448
1 5 224.675 38.058 56.037 130.58
1 6 129.412 21.826 25.125 82.461
2 2 317.12 41.542 113.294 162.284
2 3 210.517 37.679 56.618 116.22
2 4 207.364 30.792 50.704 125.868
2 5 263.043 43.304 66.371 153.368
2 6 248.958 38.584 85.961 124.413
3 2 344.291 33.973 148.274 162.044
3 3 248.428 36.399 43.306 168.723
3 4 281.42 45.706 111.036 124.678
3 5 243.568 43.851 66.277 133.44
3 6 308.178 43.339 136.434 128.405
4 2 363.402 46.241 120.865 196.296
4 3 336.872 56.519 136.709 143.644
4 4 246.992 56.167 78.101 112.724
4 5 308.88 51.66 92.272 164.948
4 6 233.126 47.717 71.474 113.935
5 2 404.38 59.135 157.681 187.564
1 3 298.56 90.476 80.509 127.575
1 4 229.249 42.904 43.962 142.383
1 5 236.304 47.331 72.444 116.529
1 6 297.174 32.077 127.358 137.739
2 2 409.401 58.721 139.034 211.646
2 3 231.035 36.017 75.813 119.205
2 4 238.826 35.576 79.997 123.253
2 5 235.598 54.401 75.613 105.584
2 6 242.112 37.656 59.907 144.549
3 2 490.79 57.81 236.248 196.732
3 3 289.657 43.359 89.382 156.916
3 4 298.459 45.555 148.718 104.186
3 5 323.603 45.55 120.548 157.505
4 3 616.453 67.884 267.342 281.227
4 4 346.035 70.376 154.242 121.417
4 5 307.645 71.068 100.544 136.033
4 6 253.847 64.137 109.062 80.648
5 2 530.944 118.178 260.632 152.134
5 3 333.359 51.199 124.66 157.5
5 4 306.356 47.002 99.892 159.462
1 6 416.83 109.888 131.165 175.777
2 2 415.187 77.388 154.863 182.936
2 3 268.002 46.295 96.87 124.837
2 4 234.503 53.366 69.15 111.987
2 5 234.724 47.399 77.61 109.715
2 6 230.064 48.081 72.826 109.157
3 2 357.394 59.042 130.098 168.254
3 3 259.246 44.809 99.072 115.365
3 4 244.235 39.025 110.74 94.47
3 5 402.607 39.6 240.922 122.085
3 6 255.061 57.467 88.462 109.132
4 2 342.606 41.418 135.189 165.999
4 3 268.64 34.193 115.536 118.911
4 4 188.601 32.653 81.576 74.372
4 5 202.022 51.985 51.93 98.107
4 6 213.509 36.748 71.353 105.408
5 2 316.849 59.131 92.639 165.079
5 3 286.412 54.224 115.746 116.442
5 4 303.447 58.378 142.382 102.687
5 5 304.95 76.763 96.478 131.709
5 6 331.9 107.568 121.152 103.18
;
run;

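data sales2;
set sales;
/* Week restarts at 1 every month; week 5 of one month and week 1 of the next share a cumulative week. */
cumweek = week + 4*(month-1);
/* Business-day index: day takes values 2 to 6, and the first record (week 1, day 4) maps to time 1. */
time = day + 5*(cumweek-1)-3;
run;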

data day_temp;
do time = 1 to 63;
output;
end;
run;

data sales3;
merge day_temp sales2;
by time;
drop cumweek;
run;

proc delete lib=work data=sales sales2 day_temp;
run;
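To see how the time index works, take the first record (month 1, week 1, day 4): cumweek = 1 + 4*(1-1) = 1 and time = 4 + 5*(1-1) - 3 = 1. The last record (month 3, week 5, day 6) gives cumweek = 5 + 4*2 = 13 and time = 6 + 5*12 - 3 = 63. The merge with DAY_TEMP then leaves a row with missing sales for every time point that has no record. A minimal sketch for listing those gaps, assuming SALES3 was created with the code above:

```sas
/* Sketch: list the time points that have no sales record after the merge. */
proc print data=sales3 noobs;
    where totalsales is missing;
    var time;
run;
```

Working through the formula by hand, the gaps should fall at time = 33, 34 and 42.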

We will first use the ARIMA procedure to check whether there is autocorrelation in the time series.

proc arima data=sales3;
identify var=totalsales;
run;
quit;

From the descriptive statistics, we see that there are 63 observations in the SALES3 data set with three missing values.

Average daily sales are around $300, with a standard deviation of $88.85.

The p-values from the Ljung-Box test are above 0.05 at lags 6 and 12.

We therefore cannot reject the null hypothesis of no autocorrelation; the series is consistent with white noise.
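For reference, the Ljung-Box statistic behind these p-values can be written as follows. The notation is ours, not part of the SAS output: n is the number of observations, m is the number of lags tested (6 or 12 here), and \hat{\rho}_k is the sample autocorrelation at lag k.

```latex
Q_m = n(n+2) \sum_{k=1}^{m} \frac{\hat{\rho}_k^{2}}{n-k}
```

Under the null hypothesis of no autocorrelation, Q_m approximately follows a chi-square distribution with m degrees of freedom, so a large p-value means the autocorrelations up to lag m are jointly indistinguishable from zero.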

Since the time series does not show any systematic pattern, we could simply assume that daily sales fluctuate around $300 with a standard deviation of $88.85.

Under that assumption, the forecast for every future day is about $300.

This is the same as fitting an ARIMA(0,0,0) model to the data, with no differencing, AR, or MA terms.

We will now add an ESTIMATE statement to the ARIMA procedure and look at the results:

proc arima data=sales3;
identify var=totalsales noprint;
estimate;
run;
quit;

The parameter estimate for µ (mean) is $300.83.

This is the mean of the entire time series.
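One way to cross-check the estimate is to compute the sample statistics directly. The sketch below assumes the SALES3 data set from above; its output should agree with the descriptive statistics reported by PROC ARIMA (60 non-missing values, 3 missing, a mean of about 300.83 and a standard deviation of about 88.85).

```sas
/* Sketch: sample size, missing count, mean and standard deviation of total sales. */
proc means data=sales3 n nmiss mean std maxdec=2;
    var totalsales;
run;
```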

The AIC and SBC are 710.7096 and 712.8039, respectively.
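As a sanity check on these two numbers, recall the usual definitions of the criteria. The notation is ours: L is the maximized likelihood, k the number of estimated parameters, and n the number of residuals. Here we take k = 1 (only the mean is estimated) and n = 60 (the non-missing observations).

```latex
\begin{align*}
\mathrm{AIC} &= -2\ln L + 2k, \qquad \mathrm{SBC} = -2\ln L + k\ln n \\
\mathrm{SBC} - \mathrm{AIC} &= k(\ln n - 2) = \ln 60 - 2 \approx 2.0943
\end{align*}
```

This matches the reported gap, 712.8039 - 710.7096 = 2.0943.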

Now, let's forecast the last 15 time points (BACK=15 LEAD=15) and compute the MAE and RMSE against the actual values:

proc arima data=sales3;
identify var=totalsales noprint;
estimate noprint;
forecast back=15 lead=15 out=out_pred1;
run;
quit;

data test;
set out_pred1;
/* Keep only the 15 forecast time points (49 to 63). */
if _n_ >=49;
run;

proc sql;
select mean(abs(totalsales-forecast)) format 20.10 as mae,
sqrt(mean((totalsales-forecast)**2)) format 20.10 as rmse from test;
quit;

Scroll down to the bottom of the output.

The MAE and RMSE are 49.53 and 60.46, respectively.
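Written out, the two error measures computed by the PROC SQL step are shown below. The notation is ours: y_t is the actual total sales at time t and \hat{y}_t is the forecast, taken over the 15 time points kept in the TEST data set.

```latex
\mathrm{MAE} = \frac{1}{15}\sum_{t=49}^{63} \left| y_t - \hat{y}_t \right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{15}\sum_{t=49}^{63} \left( y_t - \hat{y}_t \right)^{2}}
```

Because the RMSE squares each error before averaging, it penalizes large misses more heavily and is never smaller than the MAE.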

Are we done? Can we get a better model?

Let's look at the residuals, which are the differences between the actual values and the predicted values.

proc sql;
select totalsales, forecast, residual
from test;
quit;

You can see that some residuals are fairly large.

Some of the sales predictions are off by more than $100.
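To pick out only the large misses, the same query can be filtered on the absolute residual. This is a small sketch that assumes the TEST data set created above; the $100 threshold is just the figure mentioned in the text.

```sas
/* Sketch: show only the forecasts that miss by more than $100. */
proc sql;
    select totalsales, forecast, residual
    from test
    where abs(residual) > 100;
quit;
```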

A different model might forecast sales more accurately.

In section 2, we learned that Monday's sales are substantially higher than those on the other days of the week.

We can model this weekly spike by applying seasonal differencing with a period of 5 business days.
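Seasonal differencing with a period of 5 replaces each value with its change from the same weekday one week earlier, that is, totalsales at time t minus totalsales at time t-5. PROC ARIMA does this internally when you write totalsales(5), but the differenced series can also be built by hand to see what it looks like. In the sketch below, SALES_DIFF and DIFF5 are illustrative names of our own.

```sas
/* Sketch: build the lag-5 differenced series by hand. */
data sales_diff;
    set sales3;
    diff5 = dif5(totalsales);   /* totalsales(t) - totalsales(t-5) */
run;

proc print data=sales_diff(obs=10) noobs;
    var time totalsales diff5;
run;
```

The first five values of DIFF5 are missing because there is no observation five days earlier. At time 6 the difference should be 207.364 - 539.577 = -332.213.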

proc arima data=sales3;
identify var=totalsales(5);
estimate;
forecast back=15 lead=15 out=out_pred2;
run;
quit;

Adding a seasonal differencing of 5 days to the model will change the appearance of the ACF and PACF plots.

Let's look at the tables generated.

The p-values from the Ljung-Box test are still insignificant.

This is good.

Let's look at the ACF and PACF plots:

Both plots show a spike at lag 5. However, based on the Ljung-Box test, we cannot reject the null hypothesis that the residuals are white noise.

The residual plots look worse. However, let's look at the AIC and SBC:

The AIC and SBC are 635.0945 and 634.0457, respectively.

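These are much lower (better) than the earlier model where AIC=710.7096 and SBC=712.8039, although criteria computed on differenced and undifferenced series are not strictly comparable, so the forecast errors below are the more reliable basis for comparison.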

Now, let's look at the residuals as well as the RMSE:

proc arima data=sales3;
identify var=totalsales(5) noprint;
estimate noprint;
forecast back=15 lead=15 out=out_pred2;
run;
quit;

data test;
set out_pred2;
/* Keep only the 15 forecast time points (49 to 63). */
if _n_ >=49;
run;

proc sql;
select mean(abs(totalsales-forecast)) format 20.10 as mae,
sqrt(mean((totalsales-forecast)**2)) format 20.10 as rmse from test;
quit;

Again, scroll to the very bottom.

The MAE and RMSE are much lower!

They are now at 5.51 and 5.95 as opposed to 49.53 and 60.46 from the earlier model.

The residuals also look great!

The residuals are mostly under 10 in absolute value, so the sales forecast is off by less than $10 at most time points.

Note: the residuals look too clean to be real. We suspect the sales data set might have been systematically generated, as opposed to being obtained as an actual sales record of a business.

Done! The model we have identified is the seasonal ARIMA(0,0,0)(0,1,0) model with a seasonal period of 5.
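In equation form, this model can be sketched as follows. The notation is ours: B is the backshift operator (B^5 y_t = y_{t-5}), \varepsilon_t is white noise, and \mu is the mean of the differenced series, which the ESTIMATE statement fits by default.

```latex
(1 - B^{5})\, y_t = \mu + \varepsilon_t
\quad\Longleftrightarrow\quad
y_t = y_{t-5} + \mu + \varepsilon_t
```

In other words, the forecast for a given day is the value from the same weekday one week earlier plus a constant, with earlier forecasts standing in for actual values once the forecast horizon extends beyond one week.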