The following coursebook is produced by the team at Algoritma for the Online Data Science Series: Time Series Analysis for Business Forecasting workshop. The coursebook is intended for a restricted audience only, i.e. the individuals who have received this coursebook directly from the training organization. It may not be reproduced, distributed, translated, or adapted in any form outside these individuals and organizations without permission.
Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.
Before you go ahead and run the codes in this coursebook, it’s often a good idea to go through some initial setup. Under the Training Objectives section we’ll outline the syllabus, identify the key objectives and set up expectations for each module. Under the Libraries and Setup section you’ll see some code to initialize our workspace and the libraries we’ll be using for the projects. You may want to make sure that the libraries are installed beforehand by referring back to the packages listed here.
Time series data is one of the most common forms of data found in every industry. It is considered a significant area of interest for most industries: retail, telecommunication, logistics, engineering, finance, and socio-economics. Time series analysis aims to extract the underlying components of a time series to better understand the nature of the data. In this workshop we will tackle a common industry case of business sales.
This 3-day online workshop is a beginner-friendly introduction to Time Series Analysis for Business Forecasting. By performing a time series analysis on your historical business data and comparing it with current trends, you will be able to make a more informed decision.
This is the first online data science series course on Time Series Analysis for Business Forecasting. The primary objective of this course is to provide participants with a comprehensive introduction to tools and software for performing time series analysis using the popular open-source tool R. The material will cover:
lubridate
By the end of this workshop we will have a Learn by Building section using the xxx dataset. We have prepared an Rmd template with several sections you need to fill in to complete the exercise. Using what you have learned, try to create a time series analysis report and complete it with business recommendations based on the results!
In this Library and Setup section you’ll see some code to initialize our workspace, and the packages we’ll be using for this project.
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.
You will need to use install.packages() to install any packages that are not yet downloaded onto your machine. To install packages, type the command below in your console then press ENTER.
## DO NOT RUN CHUNK
packages <- c("dplyr", "lubridate", "ggplot2", "prophet")
install.packages(packages)
Then you need to load the packages into your workspace using the library() function. Specific to this course, the rmarkdown package does not need to be loaded using library().
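As a sketch of the setup above, loading the installed packages into the current session might look like the following chunk (recall that rmarkdown does not need to be loaded explicitly):

# load the packages installed earlier into the current session
library(dplyr)
library(lubridate)
library(ggplot2)
library(prophet)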
Before we venture further into data analysis, we'll take one step back to familiarize ourselves with the tools we will use. By now, you should be able to install R and R Studio on your machine. But what is the difference between the two? R and R Studio are different applications, and each has its own role.
R is a programming language, where a set of instructions will be translated into computer language. Try to run the following chunk using the green play button on the top right or use the shortcut Ctrl + Shift + Enter:
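The chunk itself is not reproduced in this copy; reconstructed from the output below and the vector operation described afterwards, it is presumably:

# add 1 to every element of the vector
c(1, 2, 3) + 1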
#> [1] 2 3 4
The previous command written in R performs a vector operation of the following:
\(\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + 1 = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix}\)
Not to be confused with the R language, R Studio is a supplementary application made for streamlining the workflow of working with R; this type of application is commonly known as an Integrated Development Environment (IDE). The application you are opening right now is called R Studio, and under the hood, R Studio is communicating with an R session by sending the command in the chunk above through the console window at the bottom of the application. If you strip R Studio down to its very core, the R console at the bottom is the same R console you get when you open the original R application on your machine.
R is a statistical programming language created by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland (New Zealand). R was created for data analysis and, as such, is different from traditional programming languages. R is not just a statistical programming language; it is a complete environment for data analysts and the most widely used data analysis software today.
R provides numerous additional packages which add out-of-the-box functionality for various uses: statistical tests, time series analysis, beautiful visualization, and various machine learning tasks such as regression, classification, and clustering algorithms. The R community is noted for its active contributions in terms of packages.
Part of the reason for its active and rapidly growing community is the open-source nature of R. Users can contribute packages, many of which package some of the most advanced statistical tools and customized visualization templates that are not found in commercial, proprietary statistical computing software. In the first section, Library and Setup, we have prepared 4 main packages, 3 of which are developed by the R community.
Diving deeper into R's analytical capability, R has been used not only for descriptive analytics but also to develop machine learning and artificial intelligence projects at major software companies around the world. The main package we will be using in this course is prophet, developed by the core data science team at Facebook1. It supports a procedure for forecasting time series data designed to let an analyst apply a set of intuitive model parameters and options based on their domain knowledge, while retaining the ability to fall back on fully automated statistical forecasting.
RHadoop, ParallelR, Revolution R Enterprise, and a handful of other toolkits add powerful big data support, allowing data engineers to create custom parallel and distributed algorithms to handle parallel / map-reduce programming in R. This makes R a popular choice for big data analytics and high-performance, enterprise-level analytics platforms.
As we have mentioned in the earlier section, R is one of the most popular tools for working with data. In R, you will store data in an R object. The object is stored in memory for each R session under an assigned name. This is an example of creating an R object:
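The chunk is omitted in this copy; based on the variable name and value referenced below, it is presumably:

# assign a character value to an object named an_obj
an_obj <- "This is a value"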
If you run the chunk above, you should see a newly created variable in the environment pane called an_obj. This way, we can easily call every object we have created using its variable name:
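Calling the object by its name prints its value, as in the output below:

# print the value stored in an_obj
an_obj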
#> [1] "This is a value"
So 3 things to note:
1. You need to name an object and use the <- operator to assign a value
2. An object name needs to start with a letter, and can contain alphanumerics, dots (.), and underscores (_)
3. A value can take a different form than a sentence like "This is a value", which we will discuss in the next section
The most basic form of R object is a vector. In math, a vector is used to denote magnitude and direction, which is also how it is implemented in R. The previous an_obj stored the value "This is a value", and it stores that value as a character class. There are 5 basic variable classes in R:
# character
a_char <- c("Algoritma", "Indonesia", "e-Commerce", "Jakarta")
# complex
a_comp <- c(1+3i, (1+3i)*2)
# numeric
a_num <- c(-1, 1, 2, 3/4, 0.5)
# integer
an_int <- c(1L, 2L)
# logical
a_log <- c(TRUE, TRUE, FALSE)
To check each of the 5 variables' classes, use the class() function and see if each returns the correct class name:
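A minimal sketch of that check, using the five vectors created above:

class(a_char)
class(a_comp)
class(a_num)
class(an_int)
class(a_log)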
Later on, we'll see exactly why understanding the type is important when we import the dataset in the next section, Exploratory Analysis. We will also dive deeper into two other important classes later on: Factor and Date.
A vector, however, has a certain rule to follow: inside a single vector, all values need to be of one class. Let's take a look at the following example:
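The example chunk is not shown in this copy; judging from the output below and the vector referenced later, it is presumably:

# mix a logical, an integer, and numeric values in one vector
c(TRUE, 1L, 3, 4/5)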
#> [1] 1.0 1.0 3.0 0.8
See how the value of TRUE is transformed into 1? This process is known as implicit coercion, where a vector's values are forced into one class based on the most general value present. Let's take a look at the following illustration:
The illustration above shows the hierarchy of R object classes from the most specific (inward) to the most general (outward). Implicit coercion follows the rule of transforming values into the most general class, meaning the previous vector of c(TRUE, 1L, 3, 4/5) will be coerced into the most general class present, in this case numeric:
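A one-line sketch confirming the coerced class:

class(c(TRUE, 1L, 3, 4/5))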
#> [1] "numeric"
Knowledge Check:
Based on your understanding of vector classes and implicit coercion, what do you think each vector's class is?
Since we have learned about vectors and classes, we will now discuss other types of objects. Consider this case:
You are working in a retail company and are given the invoice data from 2019 in tabular format. How would you store that in R?
To break the same-class rule of vectors, we can use another type of object called a list. A list has no same-class rule like a vector; you can store several types of data in one object:
a_list <- list(
first_list = c(1,5,4,2,6),
second_list = c(TRUE, FALSE, FALSE),
third_list = c("O", "A", "B", "AB")
)
a_list
#> $first_list
#> [1] 1 5 4 2 6
#>
#> $second_list
#> [1] TRUE FALSE FALSE
#>
#> $third_list
#> [1] "O" "A" "B" "AB"
Let's take this one step further: a tabular data set usually contains a set of records with a set of variables; more general terms for these are rows and columns. So how do we properly read tabular data in R? We will need to use an object called a data frame. A data frame is a special case of a list, in which each element has to be the same length:
a_df <- data.frame(
id = 1:5,
name = c("Tiara", "Husain", "David", "Ajeng", "Tanesya"),
year = c(2017, 2018, 2018, 2018, 2019)
)
a_df
As seen in the output above, you have now stored 5 records in 3 variables, each having a distinct class type. This form should also be more recognizable to you: tabular. In the next section we will learn how to perform data wrangling on the data frame to prepare a time series data set.
In the previous section we learned about data structures in R. Now let's import some data to work with using the read.csv() function:
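The import chunk is omitted in this copy; assuming the file path described below, it presumably looks like:

# read the CSV file relative to the markdown file and store it as sales
sales <- read.csv("data_input/sales_train.csv")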
The read.csv() function has one mandatory parameter to be passed in: the path relative to your markdown file. In this case, if you take a look at the project directory you will see a data_input folder, and inside there is a sales_train.csv file. By passing the file's path as a parameter, R will know where to look for the data, then begin to import it and store it into the sales object. Now let's validate what we have learned so far: if we import tabular data in R, it will be stored as a data frame. If we take the sales object and check its class, we'll get the following output:
#> [1] "data.frame"
Now let's learn how to quickly inspect a data frame. If you're used to working with a spreadsheet application such as Microsoft Excel, you are probably used to viewing the whole sheet. Consider the case when the data runs to thousands or millions of rows: there is hardly any use in viewing the full tabular display. A very common limitation of Excel is that it limits the maximum number of viewable records, since viewing an entire sheet of tabular data requires huge computational resources. Instead, a data frame in R can be inspected as follows:
R is equipped with handy functions to inspect a data frame. In this course we'll be using dplyr package functions. The first one is glimpse(). If we pass in our data frame, the function will return a set of useful output:
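A minimal sketch of that call (dplyr is assumed to be loaded):

glimpse(sales)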
#> Observations: 2,935,849
#> Variables: 6
#> $ date <fct> 02.01.2013, 03.01.2013, 05.01.2013, 06.01.2013, 15.0...
#> $ date_block_num <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ shop_id <int> 59, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, ...
#> $ item_id <int> 22154, 2552, 2552, 2554, 2555, 2564, 2565, 2572, 257...
#> $ item_price <dbl> 999.00, 899.00, 899.00, 1709.05, 1099.00, 349.00, 54...
#> $ item_cnt_day <dbl> 1, 1, -1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1, 2, 1, ...
The data we are using is provided by one of the largest Russian software firms, 1C Company, and is made available through the Kaggle platform. Kaggle is one of the most popular data science platforms, providing various data science competitions and donated datasets, and a resourceful place to learn data science practice from the community. Take your time to sign up to the platform.
Now back to the sales data. Our glimpse() function is an improved version of the str() function from base R. To use str() you don't need to load the dplyr package, as it is a built-in function that comes with your R distribution. Try running str(sales) in a chunk to see its output. The key difference is that glimpse() has better readability than the base str(). Some insights we can get from the output are:
The data consist of 2,935,849 observations (or rows)
It has 6 variables (or columns)
The following are the glossary provided in the Kaggle platform:
date is the date of sale, provided in "dd.mm.yyyy" format
date_block_num is a consecutive month number used for convenience (January 2013 is 0, February 2013 is 1, and so on)
shop_id is the unique identifier of the shop
item_id is the unique identifier of the product
item_price is the price of the item on the specified date
item_cnt_day is the number of products sold on the specified date
Our main variable of interest is item_cnt_day, which we'll use to analyze the business demand from the sales record.
At first, getting used to glimpse() can be a bit frustrating. So let's take a look at the data frame using another pair of functions, head() and tail():
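A sketch of those two calls:

# first 6 records
head(sales)
# last 6 records
tail(sales)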
As the names suggest, head() and tail() will give you the first 6 and last 6 records of our data frame. This way, you can get a better sense of how your data looks in tabular form. Now let's use another technique to get a better picture of our data. Say we want to know how many unique shop_id values there are. A very simple technique to create a contingency table in R is the table() function, which will count the occurrence of each unique value in a vector. Recall that sales is stored as a data frame, which is a special case of a list where every element has the same length. Since we are focusing on the shop_id column, we'd like to extract that vector from our data frame; to do so, we use the dollar ($) sign and provide the column name to take all the values in the shop_id column:
#> [1] 59 25 25 25 25 25
The function above takes the first 6 values of our shop_id column. Now let's get back to returning the count of each unique shop using the table() function:
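The chunk producing the table below is presumably:

# count the occurrence of each unique shop_id
table(sales$shop_id)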
#>
#> 0 1 2 3 4 5 6 7 8 9 10
#> 9857 5678 25991 25532 38242 38179 82663 58076 3412 3751 21397
#> 11 12 13 14 15 16 17 18 19 20 21
#> 499 34694 17824 36979 59511 52734 22950 53227 63911 1792 58133
#> 22 23 24 25 26 27 28 29 30 31 32
#> 45434 6963 53032 186104 53910 105366 142234 49225 50860 235636 7947
#> 33 34 35 36 37 38 39 40 41 42 43
#> 5027 5752 58445 306 39638 46013 13440 4257 41967 109253 39282
#> 44 45 46 47 48 49 50 51 52 53 54
#> 39530 35891 66321 56695 21612 15849 65173 44433 43502 52921 143480
#> 55 56 57 58 59
#> 34769 69573 117428 71441 42108
The output shows the number of occurrences for each shop's records. For example, the shop with id 0 has a total of 9,857 records, the shop with id 1 has a total of 5,678 records, and so on. This can be handy for finding out which store has the most sales records in our data. Now let's take the contingency table a bit further. I shall use the dplyr grammar to extract the same result as our table() function:
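A sketch of the dplyr version described in the next paragraph:

sales %>% 
  group_by(shop_id) %>% 
  count()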
Now you can see there is a new data frame showing the number of record occurrences in our sales data. Think of this as the equivalent of Excel's pivot table for counting occurrences of a column's values. The syntax is also fairly simple; we take the sales data frame and use the piping (%>%) technique on it. The piping technique takes everything before the pipe and passes it as the first parameter of the next function. So to illustrate:
What we are doing is taking the sales data frame, grouping it by the shop_id column, and performing a count based on the unique values of shop_id using the count() function. Try running the following code and it will give you the exact same output as its dplyr piping counterpart:
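A sketch of that non-piped version, using the intermediate grouped_sales object mentioned below:

# the same operation without piping, via an intermediate object
grouped_sales <- group_by(sales, shop_id)
count(grouped_sales)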
An immediate advantage of using the piping operation is that we don't need to create an intermediate object, like grouped_sales in the chunk above, to store a temporary result. Now let's find the most popular shop in our data frame:
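A sketch of that chunk, sorting the counts in descending order:

sales %>% 
  group_by(shop_id) %>% 
  count() %>% 
  arrange(-n)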
The new function I am introducing here is arrange(), which helps us sort a data frame based on the values of a column. Since I wanted to sort based on the n column, I passed -n as the parameter to sort the data frame by the n column in descending order. If you remove the negative sign from -n, you will get a data frame sorted in ascending order.
Discussion: What do you need to do to be able to call the previous contingency table that we have created down the line? (Recall: Assigning R object)
Now before we move on to preparing our time series data, I would like to introduce one more useful dplyr function: filter(). This function gives you the ability to filter your data frame using a set of conditional rules. For example, from the last exercise we gained the information that shop 31 has the most sales records and the least popular shop is shop 36. Say you would like to analyze the time series attributes of shop 31's sales. To do that, we can apply the filter() function as follows:
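A sketch of that filtering step, assuming the result is stored as sales_31 and then inspected with glimpse():

sales_31 <- sales %>% 
  filter(shop_id == 31)

glimpse(sales_31)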
#> Observations: 235,636
#> Variables: 6
#> $ date <fct> 03.01.2013, 02.01.2013, 11.01.2013, 26.01.2013, 25.0...
#> $ date_block_num <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ shop_id <int> 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, ...
#> $ item_id <int> 4906, 4906, 4890, 4901, 4901, 4901, 4901, 4901, 4901...
#> $ item_price <dbl> 1794.00, 1789.00, 799.00, 1499.00, 1499.00, 1499.00,...
#> $ item_cnt_day <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1...
Now I have created a new object, sales_31, which has a total of 235,636 records instead of the original 2,935,849. This means we have now stored every sales record from shop 31 under the sales_31 object.
Knowledge Check:
Since we have learned some dplyr functions, try to match each function name with its use:
Function names:
A. glimpse()
B. group_by() + count()
C. arrange()
D. filter()
Time series data is defined as data observations collected at regular time intervals. Since we are talking about software sales data, we need to validate the assumption that the data is collected at regular time intervals. Notice how the date column specifies the day of sale for each item:
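A minimal sketch of that preview:

head(sales_31)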
As seen in the head() output, the first 6 rows do not seem to be in order. Since a time series assumes the data to be in order, we need to make sure the date column is ordered. Let's apply the arrange() function we learned earlier:
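A sketch of that sorting step:

sales_31 %>% 
  arrange(date)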
Now see how the data frame is sorted based on the date column, and all the first records are sales from the first of February. This is where understanding a column's type comes in handy. When you sort on a date column that is read as a "factor" (a character type that has been transformed into R's categorical type), it will sort based on alphabetical order. So if you have two dates, "01.02.2013" and "02.01.2013", in dd.mm.yyyy format, R by default reads them as characters, resulting in "01.02.2013" (the first of February) being sorted as the first row, while "02.01.2013" (the second of January) is sorted as the second row. You wouldn't want this, because if the dates were sorted correctly, "02.01.2013" should be the first entry instead.
To address this, we need to transform our date column from a "character" into a "date" type instead. Date is another type in R, used to store dates. Let's transform our date column into a date type. To do that, we use the dmy() function from the lubridate package:
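The chunk is omitted in this copy; based on the two observations listed below, it presumably mutates the column with dmy() and re-sorts the data, along these lines:

sales_31 <- sales_31 %>% 
  mutate(date = dmy(date)) %>%   # parse "dd.mm.yyyy" strings into Date values
  arrange(date)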
Observe two things in the previous chunk:
1. We used the mutate() function to alter the date column, reassigning it with the date column after applying the dmy() function
2. The date column is now displayed in the "yyyy-mm-dd" format, which is the default date display format in R
Once the column is stored as a date type, R will sort your data frame the way dates are supposed to be sorted. The first record showing up in our data frame is now the second of January instead of the first of February, which shows the "correct" way of sorting our data. Another useful property of a date type is that date functions operate on it much like numeric values. This means we can use another useful function called range() to calculate the minimum and maximum values of our date column:
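A one-line sketch of that range calculation:

range(sales_31$date)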
#> [1] "2013-01-02" "2015-10-31"
This, however, does not work if we still store the dates as a "character" type. Try running range(sales$date) like the chunk below, and you should get an error implying that the range() function cannot determine the maximum and minimum of a date column stored as characters (our original sales object still stores the date column as characters):
Now let's highlight again the most important part of the definition of a time series: it is an observation that is recorded at a regular time interval. Notice that our records contain multiple samples within the same day. This means our data frame violates the rules of a time series, since it is sampled multiple times a day: it records the sales of different items within the same day. An important step in preparing a time series is data aggregation, where we aggregate the sales from one day into one record. Now let's take a look at the code:
daily_demand <- sales_31 %>%
group_by(date) %>%
summarise(
demand = sum(item_cnt_day)
)
daily_demand
I am reusing the group_by() function to perform a grouped operation on the date column and piping it into the summarise() function. The combination of group_by() + summarise() is a common practice you will use repeatedly in the data aggregation process. What it does is take every unique value of the date column and summarise it in the new demand column, where the item_cnt_day values of the different items sold on the same date are summed. Now, for exercise purposes, let's try to create a monthly time series. Since we need to group the data by monthly sales, we need to engineer the date column to specify the month in which each sale happened. Fortunately, lubridate is equipped with a handy function, floor_date(), that can round our dates down to a specified unit, in this case by month. Take a look at the following code:
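The example chunk is omitted here; judging from the output below, it presumably creates a single example date (the object name example_date is an assumption):

# 12th of January 2013, parsed from dd-mm-yyyy
example_date <- dmy("12-01-2013")
example_date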
#> [1] "2013-01-12"
If we use the floor_date() function and pass in our 12th of January 2013, we get the first of January 2013 instead:
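A sketch of that call, reusing the example_date object assumed above:

floor_date(example_date, unit = "months")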
#> [1] "2013-01-01"
So to achieve an aggregated amount of sales for each month, I need to round each date down to the first date of its month using floor_date(), then group by that column and summarise the demand:
sales_31 %>%
mutate(first_date_month = floor_date(date, unit = "months")) %>%
group_by(first_date_month) %>%
summarise(
demand = sum(item_cnt_day)
)
Observe how instead of using the date column as the group key, we now use the newly created first_date_month column to summarise the monthly sales and obtain the total demand for each month. Now try another one by yourself: create an aggregated weekly amount of demand using the same technique, but instead of passing the unit = "months" parameter to the floor_date() function, use unit = "weeks" instead:
Note that in performing data aggregation, you can only transform a more frequent data sampling into a more sparse frequency, for example from daily sales into weekly or monthly sales.
On the other hand, it is not possible to create a new data frame that has a more frequent sampling from a data frame with a more sparse frequency, for example recovering daily sales from data that only stores monthly totals.
Knowledge Check:
Below is a list of unordered steps in preparing raw data into a time series. How would you order the pre-processing steps?
group_by() and summarise()
In this section we will perform the data preparation steps from scratch using a new data set. Please open the dive-deeper.Rmd file in your working directory. Take your time reading the background of our analysis and complete the preprocessing steps under the "Data Preparation" section.
One important aspect of time series analysis is performing a visual exploratory analysis. R is known for its graphical capabilities and has a very popular visualization package called ggplot2. This package introduces a grammar of graphics for data visualization. Let's take the daily demand data frame we created earlier and observe the following code:
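The chunk is not reproduced here; a minimal sketch of the blank canvas described below is:

# map date to the x axis and demand to the y axis; no geometry yet
ggplot(daily_demand, aes(x = date, y = demand))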
The first function I am introducing is ggplot(), which takes our data frame and maps the specified columns to the x and y axes through the aes() parameter. See how it takes our demand and date columns and creates a blank canvas based on those 2 columns. Next, I am going to put a point geometry on top of our blank canvas, marking the amount of sales for each date:
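A sketch of the same canvas with a point geometry added:

ggplot(daily_demand, aes(x = date, y = demand)) +
  geom_point()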
The ggplot() grammar of graphics gives you an intuitive way to prepare a blank canvas based on our data frame and use the + operator (not to be confused with dplyr's piping syntax) to add a geometry on top of it. The function then uses the data from our daily_demand data frame and plots a point at each coordinate. This visualization alone gives us an idea of how the daily demand in store 31 is distributed across the dates. We can see a significant increase in sales at the end of every year and a decrease in sales in the middle of the year. This component is known as the seasonality component of a time series. Seasonality is a recurring effect of a time series that happens at a given frequency of the observed natural period. Later in this course, we'll have a more in-depth discussion of the seasonality effect.
Let's take the visualization technique we learned earlier one step further. Fortunately, the ggplot2 package comes with a set of useful features to create a pretty visualization. Take a look at the following code:
daily_demand %>%
ggplot(aes(x=date, y=demand)) +
geom_line(color = "tomato3") +
labs(
title = "Daily Software Sales",
subtitle = "Store 31",
caption = "1C Company",
x = "Date",
y = "Total Sales"
) +
theme_minimal()
Simple code, yet a very powerful tool to produce a pretty visualization for your time series report. There are a lot of packages that provide built-in themes for you to play with. The one I'm using is theme_minimal(), which gives a minimalist accent to the visualization, with annotations specified using the labs() function. If you edit the code above and type theme_, you will get suggestions for the available themes on your machine. Other packages may include their own custom themes you can try, but for now let's stick with the default themes provided by the ggplot2 package.
Let's head back to our dive-deeper.Rmd file. Please complete the second task of Dive Deeper under the "Visual Exploratory Analysis" section. In this section, you are required to produce a plot of the data frame you created in the previous dive deeper session.
prophet
A very fundamental part of understanding time series is being able to decompose their underlying components. A classic way of describing a time series is using a Generalized Additive Model (GAM)2. This definition describes a time series as a summation of its components. As a starter, we will define a time series with 3 different components: the trend (T), the seasonality (S), and the error (E).
The idea of GAM is that each of them is added to describe our time series:
\(X_t = T_t + S_t + E_t\)
When we discuss time series forecasting, there is one main assumption to remember: we assume correlation among successive observations. This means the idea of forecasting a time series is based on its past behavior. So in order to forecast future values, we look at the existing trend and seasonality of the time series and use them to generate future values. In this course we will use Facebook's prophet package to perform time series forecasting. Prophet enhances the classical trend and seasonality components by adding a holiday effect. It models the effects of holidays which occur on certain dates, and this has proven to be really useful in many cases. Take, for example, the Lebaran season. In Indonesia, it is really common to see a Lebaran effect on sales. The effect, however, is a bit different from a classic seasonality effect because it follows an irregular schedule.
prophet Time Series
To use the prophet package, we first need to prepare our time series data in the specific data frame format required by the package. The data frame requires 2 columns:
ds, stored as the POSIXct type, a universal type of time stamp
y, the numeric value of the series we want to forecast (in our case, the demand)
Now let's use another dplyr function on our daily_demand data frame: rename(). The function is as straightforward as its name; it assigns new column names to our data frame:
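A sketch of that renaming step; the object name train_daily is an assumption based on how the fitted data frame is referred to later in this coursebook:

train_daily <- daily_demand %>% 
  rename(ds = "date", y = "demand")

glimpse(train_daily)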
#> Observations: 1,031
#> Variables: 2
#> $ ds <date> 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-05, 2013-01-06, 201...
#> $ y <dbl> 568, 423, 431, 415, 435, 312, 354, 262, 254, 315, 488, 313, 228,...
Next, let's initiate a Prophet model using the prophet() function and then pipe it into the fit.prophet() function with the prepared data frame as a parameter:
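A minimal sketch of that model initiation, assuming the prepared data frame is train_daily:

model_31 <- prophet() %>% 
  fit.prophet(train_daily)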
The model_31 object now stores the information from the daily demand data of store 31. The idea of fitting a time series model is to extract the pattern information of a time series in order to forecast over a specified period of time. For example, based on the existing data, we'd like to forecast 1 year into the future. To do that, we first need to prepare a data frame consisting of the future time stamps we'd like to forecast. Luckily, Prophet provides the make_future_dataframe() function to help prepare that data:
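A sketch of preparing 365 future daily time stamps:

future_31 <- make_future_dataframe(model_31, periods = 365)

glimpse(future_31)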
#> Observations: 1,396
#> Variables: 1
#> $ ds <dttm> 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-05, 2013-01-06, 201...
Now we have acquired a new future_31 data frame consisting of a date span from the beginning of our time series to 365 days into the "future". We will then use this data frame to perform the forecasting using the model_31 we created earlier:
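A sketch of the forecasting and plotting step described below:

forecast_31 <- predict(model_31, future_31)

plot(model_31, forecast_31)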
Now, observe how the plot() function takes our model_31 and the newly created forecast_31 object to create a ggplot object showing the forecasting result. The black points in the plot show the actual time series, and the blue line shows the "fitted" time series along with its forecasted values 365 days into the future. Now let's take a look at another view of our model:
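That other view is presumably the components plot:

prophet_plot_components(model_31, forecast_31)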
Recall that in a Generalized Additive Model, we take the time series components and sum them. In this case you can see that the model is extracting 3 types of components: trend, weekly seasonality, and yearly seasonality. This means that in forecasting future values it will use the following formula:
\(X(t) = T(t) + S_w(t) + S_y(t)\)
Where \(T(t)\) is the trend at point t, \(S_w(t)\) is the weekly seasonality at point t, and \(S_y(t)\) is the yearly seasonality at point t. If you take a look at the forecast_31 object you get from using the predict() function on our model_31, you get the values of each component at every time point t of our time series:
Now let's manually validate the first forecasted value, for the 2nd of January 2013, which is stored under the yhat column. Take a look at the following columns:
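A sketch of that selection; the exact set of columns in the original chunk may differ slightly:

forecast_31 %>% 
  select(ds, trend, weekly, yearly, yhat)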
I am using a new function from dplyr, select(), which gives us the ability to select specified columns of interest. In this case, I'd like to take the time series components along with the given time stamp in the ds column. Now if we take the first row of our data, we can see that the yhat column is in fact the summation of the trend, weekly, and yearly columns:
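A sketch of that manual validation; the three outputs below suggest the value was computed in a few equivalent ways:

# forecasted value for the first time stamp
forecast_31$yhat[1]

# trend plus the two seasonality components
forecast_31$trend[1] + forecast_31$weekly[1] + forecast_31$yearly[1]

# trend plus the combined additive terms
forecast_31$trend[1] + forecast_31$additive_terms[1]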
#> [1] 578.046
#> [1] 578.046
#> [1] 578.046
Let's head back to our dive-deeper.Rmd file. Please complete the third task of Dive Deeper under the "Baseline Prophet Model" section. In this section, you are required to create a simple Prophet model using the data you created earlier.
The trend component of our model, as plotted using the prophet_plot_components() function, shows a decreasing trend over the years. I mentioned at the beginning of this chapter that trend is defined as a long-term movement of the average over the years. The method implemented by prophet by default is a linear model. Let's make an illustration using a ggplot2 visualization:
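A sketch of such an illustration, fitting a straight line through the daily demand with geom_smooth():

ggplot(daily_demand, aes(x = date, y = demand)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +   # ordinary least squares trend line
  theme_minimal()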
The linear model fits a straight line across the x axis using the ordinary least squares method, meaning it tries to produce the smallest difference between the line and the actual demand values, resulting in the long-term "average" value at a given time point t across the dates. The linear model, or linear regression, is a common statistical tool to model a numerical value. The formula of a linear regression is as follows:
\(y = mx + C\)
In our time series context, where we try to model the trend component, y equals the trend, m equals the change for every increase of one time point, and C is the intercept. In a more specific manner, we could say that:
\(T(date) = m(date) + C\)
To illustrate the slope of our trend, which is an important part of understanding it, we can create the model using the lm() function as follows:
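A sketch of that linear model; the object name trend_model is an assumption:

trend_model <- lm(demand ~ date, data = daily_demand)
trend_model$coefficients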
#> (Intercept) date
#> 3039.4051283 -0.1687717
The coefficients of this model will be translated into:
\(\hat{demand} = -0.17 date + 3039.4\)
The first one is our slope \(m\), which is calculated to be -0.17, so for every increase of 1 unit of date (in this case, one day), the demand value decreases by 0.17. This shows a negative relationship between date and demand and leads us to believe that there is indeed a downward trend in our demand time series. prophet, however, implements changepoint detection, which tries to automatically detect points where the slope changes at a significant rate. It will try to split the series at several points, and the trend slope is calculated for each range:
By default, prophet specifies 25 potential changepoints which are placed uniformly across the first 80% of the time series. From these 25 potential changepoints, it then calculates the magnitude of each slope change and decides which changes are "significant":
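A sketch of visualizing those detected changepoints on top of the forecast plot:

plot(model_31, forecast_31) +
  add_changepoints_to_plot(model_31)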
In the plot above you can see that across the time series, the model detected 4 significant changepoints and separated the series into 5 different trend slopes. Prophet also provides a tuning parameter to adjust the detection flexibility: the changepoint.prior.scale parameter supplied when creating the model.
before_2015 <- daily_demand %>%
mutate(
year = year(date)
) %>%
filter(year < 2015) %>%
rename(
ds = "date",
y = "demand"
)
after_2015 <- daily_demand %>%
mutate(
year = year(date)
) %>%
filter(year >= 2015) %>%
rename(
ds = "date",
y = "demand"
)
ggplot(before_2015, aes(x=ds, y=y)) +
geom_point() +
theme_minimal()
If we observe only the first 2 years of our sales record, we wouldn't be able to say whether the upcoming years will display a downward trend. In fact, if we focus on the last semester of 2014, we can see a slightly increasing trend. To adjust the flexibility of our model, we can increase the value of changepoint.prior.scale. By default, it is set to 0.05; let's see how it affects the model if we increase it to 0.5:
model_before_2015 <- prophet(yearly.seasonality = TRUE,
changepoint.prior.scale = 0.5) %>%
fit.prophet(before_2015)
future_before_2015 <- make_future_dataframe(model_before_2015, periods = 365)
forecaset_before_2015 <- predict(model_before_2015, future_before_2015)
plot(model_before_2015, forecaset_before_2015) +
add_changepoints_to_plot(model_before_2015) +
geom_point(data = after_2015, aes(x = as.POSIXct(ds), y=y), color = "tomato3")
Notice how the changepoint detection now detects an increasing trend in the last semester of 2014 due to the increased detection flexibility. This leads the model to treat a "slightly" significant trend change as relevant when predicting future values. However, the actual future values, colored in red, did not follow the upward trend. In this case, increasing the changepoint prior scale parameter leads to a poor estimate of the future values. In other cases, this parameter can be useful. For example, as a business analyst with the required domain knowledge, you might actually know that the increasing trend at the end of 2014 is a legitimate trend that needs to be considered in the forecasting.
Now let's talk about the other time series component: seasonality. First, we will review the following plot:
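The plot under review is presumably the components plot of our earlier model:

prophet_plot_components(model_31, forecast_31)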
Observe how by default it creates 2 different seasonality components: weekly and yearly. By default, prophet will try to determine the existing seasonality based on the data provided. In our case, the data ranges from the start of 2013 to the end of 2015. Any daily sampled data will by default be detected as having a weekly seasonality, while yearly seasonality will by default be set to TRUE if the provided data covers more than 2 years of daily samples. The other regular seasonality Prophet is able to model is daily seasonality, which tries to model an hourly pattern of a time series. Since our data does not contain hourly records, the daily seasonality will by default be set to FALSE.
To understand the seasonality, let’s try to visualize the weekly pattern of our time series:
daily_demand %>%
mutate(
wday = wday(date, label = TRUE)
) %>%
ggplot(aes(x=date, y=demand)) +
geom_point(aes(color=wday))
Notice how I used the mutate() function to create a new column called wday. This column is generated using the wday() function from the lubridate package, which extracts the day of the week of a given date. Next, I map the newly created wday column to the color aesthetic in the geom_point() function in ggplot. This results in a plot with a color-coded point for each weekday. Viewing the plot, we can see that Saturday sales (points in yellow) and Friday sales (points in light green) seem to be relatively higher compared to the other weekdays, while Monday (points in navy blue) shows relatively smaller sales. This illustrates the seasonality effect: if a certain regular period of observation, in this case a weekly period, shows a repeating effect in each cycle, we assume that the seasonality effect is applicable for forecasting future values.
Now if you take the forecast_31 data frame and pay attention to the weekly column, we can extract the unique seasonality effect for each weekday:
forecast_31 %>%
mutate(
weekday = wday(ds, label = TRUE),
weekly = round(weekly, 5)
) %>%
filter(ds <= max(daily_demand$date)) %>%
select(weekday, weekly) %>%
distinct() %>%
arrange(-weekly)
As we observed in the previous plot, the Friday and Saturday weekly seasonality has a relatively large effect on the demand value compared to the others. Think of it as follows: once you remove the trend component of the time series, you can average the values of each weekday across the years. Then, you normalize the values using their own mean to recenter them and get the following:
daily_demand %>%
mutate(
detrend = demand - forecast_31$trend[1:nrow(daily_demand)],
weekday = wday(date, label = TRUE)
) %>%
group_by(weekday) %>%
summarise(
average_effect = mean(detrend)) %>%
arrange(-average_effect)
As we can see, the manually calculated weekly seasonality effect approximates the seasonality effect extracted by Prophet. The difference we see here is because, rather than averaging and normalizing the seasonality effect, Prophet uses a Fourier series to approximate it. It tries to approximate a continuous function of t, as seen in Prophet's components plot3. The Fourier series, however, is left out of the discussion in this coursebook. If you are confident in your algebra and trigonometry, I recommend reading the Fourier series article from Springer in the annotation.
The default seasonality modelled by prophet for daily sampled data is weekly and yearly. Consider this case: sales in your business are heavily affected by payday, and most customers tend to buy your product on a certain day of the month. Since this does not follow the default yearly and weekly seasonality, we will need to define a non-regular seasonality. To define a "custom" seasonality in prophet, we can use the add_seasonality() function, piped in before we fit the data:
model_31_monthly <- prophet(changepoint.prior.scale = 0.05,
yearly.seasonality = FALSE) %>%
add_seasonality(name = "monthly", period = 30.5, fourier.order = 5) %>%
fit.prophet(train_daily)
future_31_monthly <- make_future_dataframe(model_31_monthly, periods = 365)
forecast_31_monthly <- predict(model_31_monthly, future_31_monthly)
prophet_plot_components(model_31_monthly, forecast_31_monthly)
See how the components change from weekly and yearly seasonality to weekly and monthly seasonality instead. The component extracted from our original data frame shifts from the effect of each day of the year to the effect of each day of the month. This is done by calling the add_seasonality() function on the model and setting the yearly seasonality to FALSE. We provided period = 30.5, indicating that one season of the data spans a non-regular frequency of 30.5 observations. The value 30.5 is a common frequency quantifier for monthly seasonality, since some months have 30 days and some 31 (and some 28 or 29). The last parameter of the function is fourier.order. This parameter controls how the Fourier series is approximated; by adjusting the value, you can change the sensitivity of the seasonality series. The recommended and default orders for weekly and yearly seasonality are 3 and 10 respectively, but there is no single "correct" way of specifying the order. For the monthly seasonality, I am using fourier.order = 5; feel free to change the value up or down to see the difference in the extracted seasonality effect.
Now let’s take a look at the forecast plot:
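A one-line sketch of that plot:

plot(model_31_monthly, forecast_31_monthly)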
The plot we obtained above captures the effect of the weekly and monthly components, but seems to fail to capture the larger fluctuations beyond those two seasonality effects. This technique is better suited when you have a relatively small amount of data that does not add up to a total of 2 years; since the yearly seasonality cannot be calculated on such a small sample, we would rather use monthly and weekly seasonality instead. In our case, we can see that the model fails to capture the significant increase of sales at the end of each year, so yearly.seasonality is preferable. So let's create one last model that accommodates the weekly, monthly, and yearly seasonality effects:
model_31_monthly <- prophet(changepoint.prior.scale = 0.05,
yearly.seasonality = TRUE) %>%
add_seasonality(name = "monthly", period = 30.5, fourier.order = 5) %>%
fit.prophet(train_daily)
future_31_monthly <- make_future_dataframe(model_31_monthly, periods = 365)
forecast_31_monthly <- predict(model_31_monthly, future_31_monthly)
plot(model_31_monthly, forecast_31_monthly)
Discussion: Which model do you think works better in our case?
One of the advantages of using Prophet is the ability to model a holiday effect. This holiday effect is defined as a non-regular effect that needs to be manually specified by the user. This is fundamentally different from a seasonality effect, where the periods are expected to repeat until the end of the time series. Previously, I mentioned that the "Lebaran" season is one of the things that can be modelled using Prophet's holiday component. This is because every year it falls on an irregular date. To model the holiday component in Prophet, we need to manually provide the "dates of interest" to our model.
Now let's take a closer look at our data. You can see that at the end of every year, there is a significant increase in sales. On some dates, it even exceeds 900 sales a day. Let's use the filter() function to see the dates of any "anomalous" sales:
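A sketch of that filter, using the 900-sales threshold mentioned above:

daily_demand %>% 
  filter(demand > 900)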
As we can see in the previous table, the relatively large sales mostly happened at the very end of the year, between the 27th and 31st of December. Now let's assume that this phenomenon is the result of New Year's Eve, where most people spend the remaining budget of their Christmas or end-of-year bonus on our goods. First, we'll need to prepare a holiday data frame with a ds timestamp column, a unique holiday name identifier, and the expected window of the holiday's effect in the time series' units:
holiday <-
data.frame(
holiday = "newyeareve",
ds = dmy(c("31-12-2013", "31-12-2014", "31-12-2015", "31-12-2016")),
lower_window = -5,
upper_window = 0
)
holiday
The lower_window column specifies how many time units before the holiday are assumed to be affected, and the upper_window column specifies how many time units after the holiday are assumed to be affected. The value of lower_window needs to be smaller than or equal to zero, while upper_window needs to be greater than or equal to zero.
Once we have prepared our holiday data frame, we can pass it into the prophet() function:
model_31_holiday <- prophet(changepoint.prior.scale = 0.05,
holidays = holiday) %>%
add_seasonality(name = "monthly", period = 30.5, fourier.order = 5) %>%
fit.prophet(train_daily)
future_31_holiday <- make_future_dataframe(model_31_holiday, periods = 365)
forecast_31_holiday <- predict(model_31_holiday, future_31_holiday)
plot(model_31_holiday, forecast_31_holiday)
Observe how the model now has more confidence in capturing the holiday effect at the end of the year instead of relying on the yearly seasonality effect. If you plot the components, you will also get the holiday component listed as one of the time series components:
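A sketch of that components plot:

prophet_plot_components(model_31_holiday, forecast_31_holiday)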
See how the holidays show a significant spike in the 5-day window before New Year's Eve. The height of the spike also gives us an idea of the size of the holiday effect. A rather "weak" holiday effect can hurt the model's performance in the future.
Let's head back to our dive-deeper.Rmd file. Please complete the fourth task of Dive Deeper under the "Model Fine Tuning" section. In this section, you are required to propose a new improvement over the baseline model. Use one of the model improvements you have learned in the previous section: holiday, non-regular seasonality, or trend changepoint.
Recall how we performed a visual analysis of the performance of our forecasting model earlier. That technique is, in fact, a widely used technique for model cross-validation. It involves splitting our data into two parts: train and test. The train part is used to train our time series model so it can acquire the underlying patterns such as trend and seasonality, while the other part, test, is purposely kept aside for us to perform cross-validation and see how our model performs on unseen data. The objective is quite clear: we want a glimpse of what kind of error we should expect from the model. Now let's perform the splitting once more. Recall that our data ranges from early 2013 to the end of 2015. Say I am going to keep the records of 2015 as test data and use the rest for model training:
cutoff <- dmy("01-01-2015")
train <- daily_demand %>%
filter(
date < cutoff
) %>%
rename(
"ds" = date,
"y" = demand
)
test <- daily_demand %>%
filter(
date >= cutoff
) %>%
rename(
"ds" = date,
"y" = demand
)
ggplot(daily_demand, aes(x=date, y=demand)) +
geom_point(data = train, aes(x=ds, y=y)) +
geom_point(data = test, aes(x=ds, y=y), color="tomato3")
Note that the points in red will now be treated as unseen data and will not be passed to our Prophet model. Now let's create our model using the tuning we arrived at in the earlier section:
model_final <- prophet(changepoint.prior.scale = 0.05,
yearly.seasonality = TRUE,
holidays = holiday) %>%
add_seasonality(name = "monthly", period = 30.5, fourier.order = 5) %>%
fit.prophet(train)
future_final <- make_future_dataframe(model_final, periods = nrow(test) + 1)
forecast_final <- predict(model_final, future_final)
plot(model_final, forecast_final)
Since Prophet produces a ggplot2 output, we can simply add a geometry layer, passing our test data frame as the data reference for geom_point():
plot(model_final, forecast_final) +
geom_point(data = test %>% mutate(ds = as.POSIXct(ds)), aes(x=ds, y=y), color="tomato3")
Based on the plot above, we can see that the model is capable of forecasting the actual future values. But for the most part, you will need to quantify the error to have a conclusive result. To quantify an error, you calculate the difference between the actual demand and the forecasted demand. However, there are multiple metrics you can use to express this value. Some of the most common ones are:
\(RMSE = \sqrt{\frac{\Sigma{(y - \hat y)^2}}{n}}\)
\(SSE = \Sigma{(y-\hat y)^2}\)
\(MAPE = \frac{1}{n}\Sigma{\frac{| y-\hat y|}{y}} \times 100\%\)
Each of the metrics above is commonly used in evaluating a forecasting model. The idea behind choosing an error metric is to understand what is being quantified. Any of the metrics can be used to benchmark a forecasting model, as long as we are consistent in using it. Now let's evaluate the forecast on the test data frame using the forecasted values stored under the yhat column of the forecast_final data frame:
data frame:eval <- test %>%
mutate(
ds = as.POSIXct(ds)
) %>%
left_join(forecast_final) %>%
select(ds, y, yhat, yhat_upper, yhat_lower)
eval
Observe how I use the left_join() function to combine the test and forecast_final data frames using the key column ds. Think of it as a LOOKUP function that looks up the dates in the test data frame and matches them to their respective values in forecast_final. Last, I use the select() function to take only the columns I am interested in: the date, the actual value, and the forecasted value. We will discuss yhat_upper and yhat_lower at the very end of the course. For now, let's continue to calculate the error metric. I have created a custom mape() function for us to use, which calculates the MAPE using 2 parameters, y and yhat:
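A minimal sketch of that helper and its call, following the MAPE formula above (expressed as a fraction rather than a percentage):

# mean absolute percentage error expressed as a fraction
mape <- function(y, yhat) {
  mean(abs(y - yhat) / y)
}

mape(eval$y, eval$yhat)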
#> [1] 0.217863
The figure we just obtained can be translated as follows: "On average, you can expect the forecasted value to differ by about 20% from its actual value." This error value gives us an idea of how good or bad the performance of our model is. Recall that we performed the cross-validation technique where we purposely kept the test data set aside and used it to simulate the forecasting power of our model.
A model with low forecasting power will prove risky for any practical implementation. Most likely, a business user will use the model to make a decision. As the person responsible for proposing the model, you will need to provide accountability for it. An error metric such as MAPE gives us better insight into how our model performs against the actual distribution of the data. For example, a certain business user might not allow a model with a MAPE larger than 30% to be considered as a decision-making reference. Most times, you will need to evaluate the model by discovering which of its aspects can be improved upon. This activity of figuring out which aspects of a model can be improved is commonly referred to as model tuning. In our case, in order to create a better forecasting model, we may need to consider external influences such as new software releases, store campaigns, unrecorded holidays, etc. These kinds of external regressors can be embedded into the model using the holiday components.
Figuring out "poorly forecasted points" is always a good start in fine-tuning a time series model. Let's see how the forecasting power looks for each forecasted month of 2015:
eval %>%
mutate(
month = month(ds, label = T)
) %>%
group_by(month) %>%
summarise(
mape = mean(abs(y-yhat)/y)
) %>%
ggplot(aes(x = month, y = mape)) +
geom_line(group=1) +
geom_point() +
theme_minimal()
As we can see above, the forecasts for the first 2 months show a rather small MAPE (below 20%), with significant spikes when forecasting the months of May, June, and August. As analysts, we are required to fill in the "causal" problem statement that might not be identified by the existing model. To get a more detailed view of the problematic months, we are going to perform a more comprehensive evaluation of the model in the next section.
Recall that in our general additive model, we model the time series as follow:
\(X_t = T_t + S_t + H_t + E_t\)
In the previous section we have learned how to forecast using the existing components. We know that in performing forecasting, we put aside the error terms and use the extracted components:
\(X(t) = T(t) + S(t) + H(t)\)
The error in Prophet is accounted for as part of the uncertainty interval. A very close reference to this term in statistics is the confidence interval. The idea of the interval is that instead of estimating a single point, it is much safer to estimate an interval of values. For example, say we are forecasting next month's sales to total exactly 18,000 sales. By proposing an exact number, we have a huge probability of being wrong. Instead, we can forecast using an interval and say: next month's total sales should be between 15,000 and 20,000. This way we have a larger probability of being right.
Usually, a prediction interval is accompanied by a confidence level. By default, Prophet uses an 80% confidence level in calculating the interval. An 80% confidence interval means we can confidently say that 80% of the time the actual value should fall within the uncertainty interval. The interval in Prophet is stored in our forecast_final data frame under the yhat_upper and yhat_lower columns. We can specify the parameter interval.width = 0.95 to change it into a 95% confidence interval, which results in a wider prediction interval:
model_interval <- prophet(changepoint.prior.scale = 0.05,
yearly.seasonality = TRUE,
holidays = holiday,
interval.width = 0.95) %>%
add_seasonality(name = "monthly", period = 30.5, fourier.order = 5) %>%
fit.prophet(train)
future_interval <- make_future_dataframe(model_interval, periods = nrow(test) + 1)
forecast_interval <- predict(model_interval, future_interval)
plot(model_interval, forecast_interval)
In the previous section, we learned about the automatic detection of changepoints in the trend. As quoted from the official documentation of Prophet:
We assume that the future will see similar trend changes as the history. In particular, we assume that the average frequency and magnitude of trend changes in the future will be the same as that which we observe in the history. We project these trend changes forward and by computing their distribution we obtain uncertainty intervals.
The uncertainty interval estimated by Prophet is derived from the changepoints detected over the series. A time series with a lot of changepoints will result in a much wider interval, since that uncertainty is accounted for. Now let's visualize the months that display a significantly larger error than the rest using simple color-coded components:
eval %>%
mutate(
flag_extreme = ifelse(y > yhat_upper | y < yhat_lower, "1", "0"),
month = month(ds, label = T)
) %>%
ggplot(aes(x=ds, y=y)) +
geom_ribbon(aes(ymin=yhat_lower, ymax=yhat_upper), alpha = 0.3, fill="dodgerblue4") +
geom_line(aes(x=ds, y=yhat)) +
geom_point(aes(color=flag_extreme), alpha = 0.5, size = 2) +
facet_wrap(~month, scales = "free_x")
At the end, I am using the facet_wrap() function to help us get a better view of each month separately. Based on the MAPE we calculated earlier, we can shift our focus to the forecasted values of May, June, and August; each of these facets shows a significant number of forecasted points that fall outside the uncertainty interval. Lastly, we are going to give a recommendation for fine-tuning the model based on the observed error on our test data:
Let's head back to our dive-deeper.Rmd file. Please complete the fifth task of Dive Deeper under the "Model Evaluation" section. In this section, we provide you with a new set of data. Use it to calculate your model error and perform cross-validation on both your baseline and second model. Write up your final analysis of the forecasting model you have built.
Hastie, T. & Tibshirani, R. (1987), ‘Generalized additive models: some applications’, Journal of the American Statistical Association 82, 371–386.↩