Project 2 section 3 (temp)

Time Series Modeling
[3-10]

Properly setting up the time variable for the x-axis is crucial to your time series analysis.

Before we go ahead and create the time variable, let's take a look at how the data is structured.

If you haven't created the SALES data set yet, copy and run the code below:

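data sales;
    retain month;
    input Week Day TotalSales OrdertypeA OrdertypeB OrdertypeC;
    /* The first 19 records belong to month 1, the next 20 to month 2,
       and the remaining records to month 3. */
    if _n_ <= 19 then month = 1;
    else if _n_ <= 39 then month = 2;
    else month = 3;
    datalines;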
1 4 539.577 61.543 175.586 302.448
1 5 224.675 38.058 56.037 130.58
1 6 129.412 21.826 25.125 82.461
2 2 317.12 41.542 113.294 162.284
2 3 210.517 37.679 56.618 116.22
2 4 207.364 30.792 50.704 125.868
2 5 263.043 43.304 66.371 153.368
2 6 248.958 38.584 85.961 124.413
3 2 344.291 33.973 148.274 162.044
3 3 248.428 36.399 43.306 168.723
3 4 281.42 45.706 111.036 124.678
3 5 243.568 43.851 66.277 133.44
3 6 308.178 43.339 136.434 128.405
4 2 363.402 46.241 120.865 196.296
4 3 336.872 56.519 136.709 143.644
4 4 246.992 56.167 78.101 112.724
4 5 308.88 51.66 92.272 164.948
4 6 233.126 47.717 71.474 113.935
5 2 404.38 59.135 157.681 187.564
1 3 298.56 90.476 80.509 127.575
1 4 229.249 42.904 43.962 142.383
1 5 236.304 47.331 72.444 116.529
1 6 297.174 32.077 127.358 137.739
2 2 409.401 58.721 139.034 211.646
2 3 231.035 36.017 75.813 119.205
2 4 238.826 35.576 79.997 123.253
2 5 235.598 54.401 75.613 105.584
2 6 242.112 37.656 59.907 144.549
3 2 490.79 57.81 236.248 196.732
3 3 289.657 43.359 89.382 156.916
3 4 298.459 45.555 148.718 104.186
3 5 323.603 45.55 120.548 157.505
4 3 616.453 67.884 267.342 281.227
4 4 346.035 70.376 154.242 121.417
4 5 307.645 71.068 100.544 136.033
4 6 253.847 64.137 109.062 80.648
5 2 530.944 118.178 260.632 152.134
5 3 333.359 51.199 124.66 157.5
5 4 306.356 47.002 99.892 159.462
1 6 416.83 109.888 131.165 175.777
2 2 415.187 77.388 154.863 182.936
2 3 268.002 46.295 96.87 124.837
2 4 234.503 53.366 69.15 111.987
2 5 234.724 47.399 77.61 109.715
2 6 230.064 48.081 72.826 109.157
3 2 357.394 59.042 130.098 168.254
3 3 259.246 44.809 99.072 115.365
3 4 244.235 39.025 110.74 94.47
3 5 402.607 39.6 240.922 122.085
3 6 255.061 57.467 88.462 109.132
4 2 342.606 41.418 135.189 165.999
4 3 268.64 34.193 115.536 118.911
4 4 188.601 32.653 81.576 74.372
4 5 202.022 51.985 51.93 98.107
4 6 213.509 36.748 71.353 105.408
5 2 316.849 59.131 92.639 165.079
5 3 286.412 54.224 115.746 116.442
5 4 303.447 58.378 142.382 102.687
5 5 304.95 76.763 96.478 131.709
5 6 331.9 107.568 121.152 103.18
;
run;

Except for the very first week, the sales data is organized into weeks of five days:

If you scroll down, you will notice that some weeks are missing a day or two.

Let's scroll down to week 3 of month 2:

There are only four days in this week. There is no sales record for Friday.

In addition, the following week is missing the sales record for Monday.

The missing records are likely due to holidays.

The store is closed and there is no sales record for these days.
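
If you would rather not scroll through the data, a quick count of records per week gives the same picture. This is only a sketch; it assumes the SALES data set created above is in the WORK library:

```sas
/* Count the sales records in each week of each month.
   Assumes WORK.SALES was created with the DATA step above. */
proc freq data=sales;
    tables month*week / norow nocol nopercent;
run;
```

A full week shows five records. Keep in mind that weeks split across two months also show fewer than five records in this table, so a low count does not always point to a holiday.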

Creating the Time Variable

Now, there are a number of ways to create the time variable.

The simplest way to create the variable is to treat each observation as an independent time point.

Let's look at an example.

/* Number the observations 1, 2, 3, ... in the order they appear. */
data sales2;
    set sales;
    time = _n_;
run;

This creates a TIME variable that runs from 1 to 60:

The problem with treating each observation as an individual time point is that the resulting time variable does not reflect the seasonal pattern of the data.

Let's look at time 29 to 34:

Time 29 is a Monday:

In the previous section, we learned that there is a seasonal spike (i.e. higher sales) on Mondays.

When modeling the data, we would expect another spike at time 34 (which is five periods after time 29):

However, because of the missing records, time 34 is actually a Wednesday!

Treating each observation as an individual time point changes the seasonal pattern of the data.

This will affect the model that we are going to build.
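
To see the shift for yourself, you can print time 29 to 34 with the weekday spelled out. The sketch below is an illustration only: the data set name SALES_SEQ and the format name DAYFMT are made up for this example, and the SALES data set is assumed to exist already.

```sas
/* Hypothetical format that maps the DAY codes to weekday names. */
proc format;
    value dayfmt 1='Sunday'   2='Monday' 3='Tuesday' 4='Wednesday'
                 5='Thursday' 6='Friday' 7='Saturday';
run;

/* Rebuild the observation-number version of TIME in a separate
   data set so that SALES2 is left untouched. */
data sales_seq;
    set sales;
    time = _n_;
run;

/* Print time 29 to 34 together with the weekday of each record. */
proc print data=sales_seq noobs;
    where 29 <= time <= 34;
    var time month week day TotalSales;
    format day dayfmt.;
run;
```

The listing should show the weekdays running Monday through Thursday, then jumping to Tuesday and Wednesday, which is why time 34 does not land on a Monday.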

A better way is to create a sequential time variable that advances by one for every calendar day, so that each time point lines up with its day of the week (1=Sunday to 7=Saturday).

Let's look at an example.

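/* Step 1: number the weeks continuously across the months, then
   compute TIME as the position of the day in that calendar. */
data sales2;
    set sales;
    weekno = week + 4*(month-1);
    time = day + 7*(weekno-1);
    drop weekno;
run;

/* Step 2: build one row for every time point from 4 to 90,
   with DAY running from 1 (Sunday) to 7 (Saturday). */
data time_temp;
    do time = 4 to 90;
        day = mod(time, 7);
        if day = 0 then day = 7;
        output;
    end;
run;

/* Step 3: merge by TIME so that days without a sales record
   (weekends and holidays) stay in the data with missing sales. */
data sales3;
    merge time_temp sales2;
    by time;
    drop month week;
run;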

The breakdown of the code can be found here.
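
As a worked example of the formula, take the Monday (day 2) of week 3 in month 2: weekno = 3 + 4*(2-1) = 7, so time = 2 + 7*(7-1) = 44. The MERGE step also relies on TIME being in ascending order with no repeated values in SALES2. The check below is a rough sketch of how you might confirm that; the variable names PREV_TIME and NBAD are made up for this example.

```sas
/* Sanity check (illustrative): TIME in SALES2 should be strictly
   increasing, otherwise the MERGE ... BY TIME step would fail or
   combine the wrong rows. */
data _null_;
    set sales2 end=last;
    prev_time = lag(time);          /* TIME from the previous row */
    if _n_ > 1 and time <= prev_time then do;
        put 'Out of order or duplicate: ' prev_time= time=;
        nbad + 1;                   /* running count of problem rows */
    end;
    if last then put 'Rows breaking the ascending order: ' nbad=;
run;
```

The result is written to the SAS log. A count of 0 means the two months that share a calendar week (for example, week 5 of month 1 and week 1 of month 2) were mapped to distinct time points.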

The TIME column created begins at time 4.

It corresponds to the time point at Day 4 of the very first week.

Data are collected starting on a Wednesday.

Time 4-7 represents the sales records from Wednesday to Saturday in the very first week:

Note: the store is closed on Saturday, so TotalSales is set to missing for that day.

This row is also included in the data set:

Similarly, time 8-14 represents the second week of the data.

The DAY column runs from 1 (Sunday) to 7 (Saturday):

Additional rows are added for the weekend days as well:

The data set is now structured by weeks of 7 days (except for the very first week).
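
To look at the first two weeks without scrolling, you could print just those rows. A minimal sketch, assuming SALES3 was created by the code above:

```sas
/* Show the partial first week (time 4-7) and the full second week
   (time 8-14). Rows added for closed days have missing TotalSales,
   which SAS prints as a period. */
proc print data=sales3 noobs;
    where time <= 14;
    var time day TotalSales;
run;
```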

Let's scroll down to the weeks that contain holiday(s).

Time 43-49 represents the third week of month two:

The Friday in this week is a holiday. The store is closed.

However, an additional row is added to the data set as well:

The TIME column is structured in groups of 7 days (except for the very first week).

Whether the week contains weekends or holidays, there are 7 rows of data for the week.
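
One way to confirm the weekly structure is to count the rows in each 7-day block. The sketch below assumes SALES3 exists; WEEK_CHECK and WEEKNO are names made up for this check.

```sas
/* Assign each row to a 7-day block: time 1-7 is block 1,
   time 8-14 is block 2, and so on. */
data week_check;
    set sales3;
    weekno = ceil(time / 7);
run;

/* Count the rows in each block. */
proc freq data=week_check;
    tables weekno / nocum nopercent;
run;
```

Because the series starts at time 4 and stops at time 90, the first and last blocks are expected to hold fewer than 7 rows; every block in between should hold exactly 7.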

The benefit of creating the TIME column in this way is that the seasonal pattern can be modeled every seven periods.

The seasonal pattern is now incorporated into the TIME column.
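
To illustrate what "every seven periods" means in practice, the sketch below pairs each day's sales with the sales from seven rows earlier. It is not part of the original lesson code: SALES_LAG and SALES_LAG7 are hypothetical names, and SALES3 is assumed to exist with one row per time point.

```sas
/* Because SALES3 has one row for every time point, the value from
   seven rows back always belongs to the same weekday one week earlier. */
data sales_lag;
    set sales3;
    sales_lag7 = lag7(TotalSales);
run;

/* Compare each day in time 43-49 with the same weekday of the
   previous week. */
proc print data=sales_lag noobs;
    where 43 <= time <= 49;
    var time day TotalSales sales_lag7;
run;
```

With the observation-number version of TIME, the same lag would drift to a different weekday as soon as a holiday removed a row.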

Exercise

Is there a significant difference in sales between the different weeks of the month?

Create a frequency table for the total sales for each week of the month.

In addition, fit an ANOVA model and test the difference in sales between the five weeks of the month.
