
Time Series Forecasting: KNN vs. ARIMA


It is always hard to find a proper model for forecasting time series data. One reason is that models built on time series data are often exposed to serial correlation. In this article, we will compare k-nearest neighbors (KNN) regression, a supervised machine learning method, with a more classical stochastic-process approach, the autoregressive integrated moving average (ARIMA) model.

We will forecast the monthly prices of refined gold futures (XAUTRY) for one gram in Turkish lira, traded on BIST (Istanbul Stock Exchange). The data frame starts from 2013. You can download the relevant Excel file from here.

#Building the time series data
library(readxl)

df_xautry <- read_excel("xau_try.xlsx")
xautry_ts <- ts(df_xautry$price, start = c(2013, 1), frequency = 12)

KNN Regression

We are going to use the tsfknn package, which can be used to forecast time series in R. The KNN regression process consists of three components: the instance, the features, and the targets. Below is an example to understand these components and the process.

library(tsfknn)

pred <- knn_forecasting(xautry_ts, h = 6, lags = 1:12, k = 3)
autoplot(pred, highlight = "neighbors", faceting = TRUE)

The lags parameter indicates which lagged values of the time series are used as features, or explanatory variables. Because our time series data is monthly, we set the parameter to 1:12. The last 12 observations of the data build the instance, shown as purple points on the graph.

This instance is used as a reference vector to find the feature vectors closest to it. The distance between them is calculated with the Euclidean formula shown below:

\sqrt {\displaystyle\sum_{x=1}^{n} (f_{x}^i-q_x)^2 }

q_x denotes the instance and f_{x}^i the feature vectors, which are ranked in order by the distance metric. The k parameter determines the number of closest feature vectors to keep, called the k nearest neighbors.
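To make the metric concrete, here is a minimal sketch of the distance computation in plain R; the euclidean helper is hypothetical and not part of the tsfknn API:

#The instance: the last 12 observations of the series
instance <- tail(as.numeric(xautry_ts), 12)

#Hypothetical helper: Euclidean distance between a feature vector f and the instance q
euclidean <- function(f, q) sqrt(sum((f - q)^2))

#Distance from one example 12-month window of the series to the instance
window_1 <- as.numeric(xautry_ts)[1:12]
euclidean(window_1, instance)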

The nearest_neighbors function shows the instance, the k nearest neighbors, and the targets.

nearest_neighbors(pred)

#$instance
#Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2 
#272.79 277.55 272.91 291.12 306.76 322.53 345.28 382.02 384.06 389.36 448.28 
# Lag 1 
#462.59 

#$nneighbors
#  Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2
#1 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12 306.76
#2 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12
#3 223.97 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91
#   Lag 1     H1     H2     H3     H4     H5     H6
#1 322.53 345.28 382.02 384.06 389.36 448.28 462.59
#2 306.76 322.53 345.28 382.02 384.06 389.36 448.28
#3 291.12 306.76 322.53 345.28 382.02 384.06 389.36

Targets are the time series values that come right after a nearest neighbor; their number equals the value of the h parameter. The targets of the nearest neighbors are averaged to forecast the h future periods:

\displaystyle\frac{1}{k}\sum_{i=1}^{k} t^i
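The same averaging can be reproduced by hand from the nearest_neighbors output; the sketch below assumes the pred object created earlier and simply averages the H1-H6 target columns:

#Reproduce the forecast by averaging the targets of the k = 3 neighbors
nn <- nearest_neighbors(pred)
manual_forecast <- colMeans(nn$nneighbors[, paste0("H", 1:6)])
manual_forecast

#This should match the forecast stored in the prediction component
pred$prediction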

As you can see from the plot above, features or targets may overlap the instance. This is because the time series has no seasonality and follows a clear uptrend. The process described so far is called the MIMO (multiple-input-multiple-output) strategy, the forecasting strategy that KNN uses by default.

Decomposing and analyzing the time series data

Before we build the model, we first analyze whether the time series data is seasonal. Decomposition analysis is used to measure the strength of seasonality, as described below:
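Following the fpp2 convention, where T_t is the trend-cycle component, S_t the seasonal component, and R_t the remainder, the strength of trend and the strength of seasonality are defined as:

F_T=\max\left(0,\,1-\frac{\text{Var}(R_t)}{\text{Var}(T_t+R_t)}\right) \qquad F_S=\max\left(0,\,1-\frac{\text{Var}(R_t)}{\text{Var}(S_t+R_t)}\right)

The code below computes these variance ratios directly (without the max truncation).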

#Seasonality and trend measurements
library(fpp2)

fit <- stl(xautry_ts,s.window = "periodic",t.window = 13,robust = TRUE)
seasonality <- fit %>% seasonal()
trend <- fit %>% trendcycle()
remain <- fit %>% remainder()

#Trend
1-var(remain)/var(trend+remain)
#[1] 0.990609
#Seasonality
1-var(remain)/var(seasonality+remain)
#[1] 0.2624522

The stl function is a time series decomposition method. STL is short for "seasonal and trend decomposition using loess", where loess is a method for estimating nonlinear relationships. The t.window (trend window) argument is the number of consecutive observations used to estimate the trend and should be an odd number. The s.window (seasonal window) argument is the number of consecutive years used to estimate each value in the seasonal component; in this example, it is set to 'periodic' so that the seasonal pattern is the same across all years. The robust parameter is set to TRUE, which means outliers won't affect the estimates of the trend and seasonal components.

Examining the results from the code chunk above, we see a strong uptrend (0.99) and weak seasonality (0.26); any value less than 0.4 is usually treated as a negligible seasonal effect. Because of that, we will prefer a non-seasonal ARIMA model.

Non-seasonal ARIMA

This model combines differencing with autoregression and a moving average. Let's explain each part of the model.

Differencing: First of all, we have to explain stationary data. Data that contains no systematic patterns such as trend or seasonality is stationary. White noise is an extreme example of a stationary series: it has no autocorrelation at all.

Differencing is a simple arithmetic operation that takes the difference between two consecutive observations to make the data stationary:

y'_t=y_t-y_{t-1}

The above equation shows the first difference, i.e., differencing at lag 1. Sometimes the first difference is not enough to obtain stationary data, and we may have to difference the series one more time (second-order differencing).
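As a quick illustration (not in the original post), the diff function performs this operation in R, and ndiffs from the forecast package, loaded via fpp2, estimates how many differences the series needs:

#First difference of the series: y'_t = y_t - y_{t-1}
library(fpp2)

xautry_diff <- diff(xautry_ts)

#Estimated number of differences needed for stationarity
ndiffs(xautry_ts)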

In autoregressive models, the target variable is a linear combination of its own lagged values; that is, the explanatory variables are past values of the target variable itself. The AR(p) notation denotes an autoregressive model of order p, where \epsilon_t denotes white noise:

y_t=c + \phi_1 y_{t-1}+ \phi_2y_{t-2}+...+\phi_py_{t-p}+\epsilon_t

Moving average models, unlike autoregressive models, use past forecast errors (white noise) as predictor variables. The MA(q) notation denotes a moving average model of order q:

y_t=c+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+...+\theta_q\epsilon_{t-q}

If we integrate differencing with autoregression and the moving average model, we obtain the non-seasonal ARIMA model, which is short for autoregressive integrated moving average.

Here y'_t is the differenced series (remember it may have been differenced once or twice). The explanatory variables are both lagged values of y'_t and past forecast errors. The model is denoted ARIMA(p,d,q), where p is the order of the autoregressive part; d, the degree of differencing; and q, the order of the moving average part.
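Putting the three parts together, the full ARIMA(p,d,q) model for the differenced series can be written as:

y'_t=c+\phi_1 y'_{t-1}+...+\phi_p y'_{t-p}+\theta_1\epsilon_{t-1}+...+\theta_q\epsilon_{t-q}+\epsilon_t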

Modeling with non-seasonal ARIMA

Before we model the data, we first split it into training and test sets so we can measure the accuracy of the ARIMA model.

#Splitting time series into training and test data
test <- window(xautry_ts, start=c(2019,3))
train <- window(xautry_ts, end=c(2019,2))


#ARIMA modeling
library(fpp2)

fit_arima <- auto.arima(train, seasonal = FALSE,
                        stepwise = FALSE, approximation = FALSE)

fit_arima

#Series: train 
#ARIMA(0,1,2) with drift 

#Coefficients:
#          ma1      ma2   drift
#      -0.1539  -0.2407  1.8378
#s.e.   0.1129   0.1063  0.6554

#sigma^2 estimated as 86.5:  log likelihood=-264.93
#AIC=537.85   AICc=538.44   BIC=547.01

As seen in the code chunk above, the stepwise = FALSE and approximation = FALSE arguments make auto.arima search over all possible models rather than using its faster shortcuts. The drift component indicates the constant c, which is the average change in the historical data. From the results, we can see that there is no autoregressive part in the model, but a second-order moving average with first differencing.
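As an optional diagnostic that is not part of the original post, checkresiduals from the forecast package plots the model residuals and runs a Ljung-Box test to check that they behave like white noise:

#Residual diagnostics for the fitted ARIMA model
checkresiduals(fit_arima)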

Modeling with KNN

#Modeling and forecasting
library(tsfknn)

pred <- knn_forecasting(xautry_ts, h = 18, lags = 1:12, k = 3)

#Forecasting plot for KNN
autoplot(pred, highlight = "neighbors", faceting = TRUE)

Forecasting and accuracy comparison between the models

#ARIMA accuracy
f_arima <- fit_arima %>% forecast(h = 18) %>%
  accuracy(test)

f_arima[,c("RMSE","MAE","MAPE")]

#                 RMSE       MAE      MAPE      
#Training set  9.045488  5.529203  4.283023 
#Test set     94.788638 74.322505 20.878096 

For forecasting accuracy, we take the results of the test set shown above.

#Forecasting plot for ARIMA
fit_arima %>% forecast(h = 18) %>% autoplot() + autolayer(test)
#KNN Accuracy
ro <- rolling_origin(pred, h = 18, rolling = FALSE)
ro$global_accu

#  RMSE       MAE      MAPE 
#137.12465 129.77352  40.22795 

The rolling_origin function evaluates accuracy based on a rolling origin. The rolling parameter is set to FALSE, which makes the last 18 observations the test set and the remaining observations the training set, just as we did for the ARIMA model before. Had we left the rolling parameter at its default value of TRUE, the test set would not be a constant vector. Below is an example for h = 6 with rolling set to TRUE: the test sets dynamically shrink from 6 observations down to 1 and together form a matrix, not a constant vector.
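The following sketch assumes the object returned by rolling_origin exposes test_sets and global_accu components:

#A sketch of the dynamic case: with rolling = TRUE and h = 6, the test
#sets shrink from 6 observations down to 1 and form a matrix
ro_dyn <- rolling_origin(pred, h = 6, rolling = TRUE)
ro_dyn$test_sets
ro_dyn$global_accu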

#Accuracy plot for KNN
plot(ro)

When we compare accuracy measures like RMSE or MAPE, we can easily see that the ARIMA model performs much better than the KNN model on our non-seasonal time series data.


