Conf42 Machine Learning 2021 - Online

Advanced machine learning time series forecasting methods review

The presentation was prepared by AI Investments and team.

We have been working on time series forecasting for over 4 years and want to review the latest and most advanced time series forecasting methods, like ES-Hybrid, N-BEATS, the Tsetlin machine, and more.

We will also provide tips and tricks for forecasting difficult, noisy, and nonstationary time series, which can significantly improve the accuracy and performance of the methods. The complete time series forecasting methodology will be presented as well, along with the most efficient supporting tools. Also, a brief introduction to the ensembling of predictions will be given.


  • The most advanced and most efficient machine learning time series forecasting methods. The breakthrough came in 2018, when the results of the M4 competition were announced. A real, hands-on session on how to forecast different time series using these methods.
  • GluonTS is a complete framework for time series forecasting. It includes various models, also the most advanced neural network architectures like the transformer, and different methods of data transformation. You can download the library and start using it yourself.
  • The Tsetlin machine is a time series forecasting method based on the stochastic learning automata invented by the Russian scientist Tsetlin. It can be used both for supervised machine learning and for reinforcement learning. Having a more accurate method gives a significant edge in many areas.
  • I would like to present how we struggle with time series data on a daily basis. I will talk shortly about data processing, model choice, model evaluation, boosting accuracy, and explainable AI, which can be used with such data.
  • Ensembling even models with the same architecture, trained with different loss functions, input lengths, training hyperparameters, or transformations, can contribute to score improvement. Both TFT and N-BEATS aim to give explainable predictions.


This transcript was autogenerated. To make changes, submit a PR.
My name is Pawel Skrzypek. I'm the chief officer at AI Investments, and today, together with Anna Warno, data scientist from 7bulls, we will make a review of the most advanced and most efficient, at least in our opinion, machine learning time series forecasting methods. I will make a short introduction with slides, and then we will have a real, hands-on session on how to use some of these methods on real data sets. The brief agenda is as follows. I will tell a little bit more about time series, what they are, and about the statistical methods which were used very extensively until about two years ago, then about the M4 and M5 competitions. Then I will briefly go through the most advanced and effective machine learning time series forecasting methods, followed by the real, hands-on session on how to forecast different time series using these presented methods. For a long time, a very long time, the domination of the statistical methods was obvious. The machine learning methods achieved much lower results, and the most popular approach was to use the listed statistical methods. I will not go into the details, but I only want to highlight the fact that the most typical and most popular way of using the statistical methods was to ensemble different methods, which is also used very extensively by machine learning methods. The breakthrough came in 2018, when the results of the M4 competition were announced. The M competition is probably the most prestigious, scientifically backed competition for time series forecasting methods. It is organized by professor Spyros Makridakis from the University of Nicosia, and in the fourth edition of this competition, called M4, the first and the second place were won by so-called hybrid methods, that is, methods which use both statistical approaches and machine learning, and the machine learning part was much more important for these methods compared to the statistical one. M5, whose results were presented last year, was dominated by machine-learning-only methods. In the M4 competition the goal was to predict 100,000 time series, so a very, very big number of time series. We consider these results very comprehensive and reliable. The first method is the ES-Hybrid method by Slawek Smyl. It is the winning method from M4. It uses a statistical approach, exponential smoothing (Holt-Winters), for the data preprocessing, and also uses a very novel way of learning the neural network, with a special architecture, together with the parameters of the exponential smoothing; I will tell a little bit more about it on the next slide. This method also uses model ensembling very extensively, in a very unique and novel way. Another novelty of this method was to use an LSTM network, but not a typical LSTM network: one with dilations and residual connections. Both these concepts are very popular in image recognition for convolutional neural networks, and this application of those concepts was the first one, at least on a big scale, for LSTMs and time series forecasting. The results were obviously great, so Slawek Smyl won the M4 competition, and this architecture was heavily studied by other scientists and people working on time series forecasting. One more very important thing about the ES-Hybrid method is that it uses ensembling in a very advanced way. The best models for a given time series (please remember that it was used for 100,000 time series) were collected, and for the final predictions these models were used to achieve the most robust and accurate results. The second method, which is a purely machine learning method, is N-BEATS. It was published after the M4 competition and claims to get better results on the M4 data sets compared to ES-Hybrid. This is a purely machine learning method.
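The Holt-Winters-style smoothing that ES-Hybrid uses for preprocessing can be sketched as follows. This is a minimal illustration of the idea only, not Smyl's actual code: the smoothing coefficients, the toy series, and all names are invented for this example.

```python
# Minimal sketch of the exponential-smoothing idea behind ES-Hybrid:
# a level and a multiplicative seasonal component are updated recursively,
# and the series is deseasonalized/normalized before feeding a neural net.
# alpha/gamma values and the toy series are illustrative, not from the talk.

def smooth(series, season_len, alpha=0.5, gamma=0.5):
    level = series[0]
    seasonal = [1.0] * season_len          # multiplicative seasonal factors
    deseasonalized = []
    for t, y in enumerate(series):
        s = seasonal[t % season_len]
        new_level = alpha * (y / s) + (1 - alpha) * level
        seasonal[t % season_len] = gamma * (y / new_level) + (1 - gamma) * s
        level = new_level
        deseasonalized.append(y / (level * s))  # what the NN would see
    return level, seasonal, deseasonalized

# A toy series alternating 10 and 20: the seasonal factors at the "high"
# positions should drift above those at the "low" positions, and the
# level should settle between the two extremes.
level, seasonal, x = smooth([10, 20, 10, 20] * 10, season_len=4)
```

In the real method these smoothing parameters are learned jointly with the network weights, which is one of its key novelties.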
It has a unique stack- and block-based architecture, with different types of blocks: trend, seasonal, and generic. It also has some explainability and transfer learning features, and it uses advanced model ensembling on a very big scale as well: it ensembles over 100 models to make the predictions. This is the architecture of the N-BEATS method, and as you can see there are many stacks. Each stack is built of blocks, and there are residual connections within the stack, so the input from one layer is passed to the given layer and also skips this layer to go to the next layer. A very unique concept is also to pool the results of each block, combine them together, and use them as the output from the stack, and each stack adds its output to the global forecast, which is ensembled inside the given model. So N-BEATS clearly uses ensembling in a very good way and is a fully, purely machine learning method. The next method I wanted to mention is the complete framework called GluonTS. It is a complete framework for time series forecasting. It includes various models, also the most advanced neural network architectures like the transformer, and different methods of data transformation. It also allows for probabilistic time series modeling to determine the distribution. It has support for cloud computing training and inference, and also a very strong community. And as I said, this framework is ready to use: you can download the library and start using it. It's not easy to use, but it is easier to use than the previous two methods, which are available as source code that you need to download and compile yourself. Here you can see (I'm not sure if it's the latest diagram, but it shows) how many components are already included in GluonTS, and the framework is still being developed. The next time series forecasting method I wanted to mention is the Tsetlin machine. It is based on the stochastic learning automata invented by the Russian scientist Tsetlin in the previous century, so it is quite old, but for a long time this algorithm was used only for scientific purposes. Now it is used for both machine learning and reinforcement learning, so for supervised learning and reinforcement learning. From my point of view, the biggest innovation of this approach is that it allows us to create, or to learn, a stochastic distribution for each of the parameters. So this algorithm learns probabilistic distributions in a supervised way, and they are also constantly updated after each prediction. That's the reason this approach is considered self-learning and can be used for reinforcement learning as well as for predictions. The advantage is that we do not need to retrain the model after each prediction; the model is somehow retrained after each prediction anyway, because the weights of the probabilistic distributions are changed after each prediction. Very briefly, it works this way: we have the input, and for each parameter of the input a separate stochastic distribution is created. Based on the rules in the Tsetlin machine, the probabilistic distribution is updated, and for the final prediction, the output value for each parameter is sampled from the currently learned distribution and finally ensembled in a given way. The Tsetlin machine is a very different approach compared to, let's say, traditional machine learning, which is usually based on neural networks, stochastic gradient descent, and the backpropagation process, because it is learned in a different way, and it could be used as one additional method, for example, to be included in the ensembling. Anna, in the hands-on session, will present traditional machine learning methods.
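The stochastic learning automaton underlying the Tsetlin machine can be illustrated with a toy two-action example. This is only a sketch of the probability-updating idea described above (a linear reward-inaction scheme), not the full Tsetlin machine, and all names, rates, and reward probabilities are invented.

```python
import random

# Sketch of the underlying idea only: a two-action stochastic learning
# automaton (linear reward-inaction scheme). The full Tsetlin machine
# builds clauses from teams of such automata; this toy just shows how
# action probabilities are nudged after every prediction, which is why
# no explicit retraining step is needed.

def learn(env_reward_prob, steps=5000, lr=0.05, seed=0):
    rng = random.Random(seed)
    p = [0.5, 0.5]                       # probability of choosing action 0 / 1
    for _ in range(steps):
        action = 0 if rng.random() < p[0] else 1
        rewarded = rng.random() < env_reward_prob[action]
        if rewarded:                     # reward-inaction: update only on reward
            p[action] += lr * (1 - p[action])
            p[1 - action] = 1 - p[action]
    return p

# Action 1 is rewarded more often, so its probability should come to dominate.
p = learn(env_reward_prob=[0.2, 0.8])
```

The continuous, per-prediction update of the distribution is exactly the "self-learning" property mentioned above.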
So ES-Hybrid and N-BEATS, and one more method called the Temporal Fusion Transformer, which is considered one of the most advanced currently available machine learning methods. Anna will also introduce these methods, so I will skip the Temporal Fusion Transformer for now. That's all about the review of the methods. A very short summary of the highlights of forecasting: from my point of view, the most important thing is that currently time series forecasting methods are being developed very dynamically, and new methods appear. They are not, let's say, typical convolutional, LSTM, or transformer methods, but much more advanced, and their efficiency in terms of prediction accuracy is much higher than that of the statistical methods. Of course, forecasting methods have many, many areas of application. In AI Investments we are using them for financial time series, and we achieve over 60% accuracy on long, 10-year test periods. But of course time series forecasting methods are used in many different areas, like business, sales, retail, and also for social purposes like health, environment, and mobility, and many more. So having a more accurate method gives a significant edge in many areas, and that's the reason we are presenting it here. Okay, now it's time for the hands-on session by Anna. As I said, Anna will show the N-BEATS method, the TFT method, and also give some basic introduction to how to properly forecast time series using machine learning approaches. I hope you find our sessions valuable and can learn something interesting from them. So that's all from my side, and now it's time for Anna Warno's hands-on session. Hello. After the theoretical introduction, I would like to show something practical. I would like to present how we struggle with time series data on a daily basis.
I will talk shortly about data processing, model choice, model evaluation, boosting accuracy, and explainable AI, which can be used with time series data. To have some examples, I chose a publicly available data set. The selection criteria were multidimensionality, difficulty, and data size, and I will briefly show what can be done with such data. So, as I mentioned before, these data are open source. There are around 40,000 rows, one for each timestamp. The frequency of the data is 1 hour, and we have around 15 columns: six main air pollutants, six connected with the weather, and the rest expressing the date. Here we have some example columns plotted. As we can see, the data look messy; we have large amplitudes. After zooming in, the data plots look slightly better; however, there is no visible pattern at first sight. Only after aggregation, for example over a week, can you see some regularities. Normally we would now do some exploratory data analysis, et cetera; however, we don't have that much time, so we will focus only on the parts which are absolutely necessary, which are crucial for modeling. One of the first things which needs to be done is handling the missing data. Firstly, we need to understand the source of the missingness. Does it occur regularly? What are the largest gaps between consecutive non-NaN values? Here I have plotted some missing-data statistics, starting from a basic bar plot. As you can see, many columns do not contain any NaNs, but there exist columns with a significant amount of missing values, such as carbon monoxide. Next, a heat map. The heat map helps us determine which occurrences of NaNs in different columns are correlated. We can see a strong correlation between the columns describing the weather, such as pressure or temperature. Correlation of occurrences of missing values in different columns may also be expressed with a dendrogram, shown here. Apart from basic statistics and correlations, we can check the distribution for specific columns.
We can select a column, and here we have a histogram of the lengths of consecutive NaN runs. As we can see in this example, most of the consecutive NaN sequences are short; however, series as long as 60 also exist. The red plot shows the lengths of the gaps between missing values, so if it was a straight line, that would mean that non-missing values occur regularly. They do not in this case. So now we need to handle the missing data. We could apply standard basic methods like backward filling, or linear or polynomial interpolation. We could also use more advanced methods, for example based on machine learning models. Here we have examples. In the plot we can see a fragment of one of the time series. For visualization purposes we can artificially increase the amount of missingness: we can select the percentage of values which will be randomly removed and see how different simple imputation methods fill these values, starting from forward filling, through linear interpolation, to a spline of higher order, which gives us a smoother curve. From the analysis of missing values, we know that in our case the gap between two non-missing values is sometimes very large. Here we have a plotted example. We will not insert anything in that case, but just split the series into two shorter series. The second thing which needs to be done is data transformations. This step is crucial for some statistical models, which often require series in a specific format, for example stationary. For more advanced models it's often not essential, but it may also help with numerical issues. Here we have listed some basic transformations which can be applied to time series, and we can see how our series would look after a given transformation. We can also use more advanced transformations, like embeddings. An example of a simple but effective time transformation is encoding cyclical features like hours, days, et cetera on the unit circle, like in the presented GIF, and for that we are using this formula.
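The linear-interpolation filling and the unit-circle encoding of hours described above can be sketched with plain Python (in a real pipeline pandas' `interpolate()` would do the filling; the data values here are made up):

```python
import math

# Sketch of two preprocessing steps: filling short NaN gaps by linear
# interpolation, and encoding a cyclical feature (hour of day) as a
# point on the unit circle so that hour 23 and hour 0 end up close.

def linear_fill(values):
    """Fill internal runs of None by linear interpolation."""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while out[j] is None:         # assumes the run is bounded by numbers
                j += 1
            left, right = out[i - 1], out[j]
            for k in range(i, j):
                out[k] = left + (right - left) * (k - i + 1) / (j - i + 1)
            i = j
        i += 1
    return out

def encode_hour(hour):
    """Map an hour 0..23 onto the unit circle: (sin, cos) of its angle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

filled = linear_fill([10.0, None, None, 16.0])   # -> [10.0, 12.0, 14.0, 16.0]
h23, h0 = encode_hour(23), encode_hour(0)
```

With the raw hour value, 23 and 0 look maximally far apart to a model; on the unit circle they are neighbors, which is the whole point of the transformation.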
So, before the modeling for our task, we will fill the missing values with linear interpolation, normalize the features (sometimes we also use the Box-Cox transformation), and encode the cyclical features onto a unit circle, in our case hours and days. For modeling we will choose one column, nitrogen dioxide, and a prediction horizon equal to six. Firstly, we will train baselines and simpler statistical models to have some point of reference, and then we will move to neural network methods. Before the model results, a few words about the training setup: we'll use a train/validation/test split, and for evaluation we'll use a rolling window. Here we have plotted the train, validation, and test splits. We will start with extremely simple models: naive predictors. It's good to always look at them in time series forecasting. They are very easy to use, and it often happens that the metrics, graphs, and result statistics of our model look okay at first glance, but then it turns out that the naive prediction is better and our model is worthless. So it's a good practice to start first with a naive prediction. It's worth mentioning that an alternative to naive predictions would be the usage of metrics like the mean absolute scaled error. Apart from the naive baselines, we also train some classical models, for example SARIMA, Prophet, TBATS, or exponential smoothing, and these models will be fine-tuned with a rolling window, with hyperparameter grid search or Bayesian hyperparameter search. Okay, as we have seasonal data, we use two naive predictors: last-value repetition, and repetition of values from the previous day with the same hour. Here are the results; we'll compare them later with the other methods. For the advanced neural network models, we'll choose two methods: the Temporal Fusion Transformer and N-BEATS. Both of them are state-of-the-art models, but they have different advantages and complement each other very well.
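The two naive predictors and the mean absolute scaled error mentioned above can be sketched as follows. The toy hourly series is invented, with period 24 mimicking the daily seasonality of the air-quality data; the helper names are mine, not from the talk.

```python
# Sketch of the two naive baselines used in the talk, plus MASE, which
# scores a model's error relative to the in-sample seasonal-naive error
# (MASE < 1 means "better than the naive forecast").

def naive_last(series, horizon):
    return [series[-1]] * horizon                  # repeat the last value

def naive_seasonal(series, horizon, period=24):
    return [series[-period + i] for i in range(horizon)]  # same hour, previous day

def mase(actual, forecast, history, period=24):
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    scale = sum(abs(history[i] - history[i - period])
                for i in range(period, len(history))) / (len(history) - period)
    return mae / scale

# Toy data: a daily cycle plus a small 5-hourly wobble so the seasonal
# pattern is strong but not perfect.
history = [10 + (i % 24) + 0.1 * (i % 5) for i in range(96)]
actual = [10 + ((96 + i) % 24) + 0.1 * ((96 + i) % 5) for i in range(6)]
forecast = naive_seasonal(history, horizon=6)
score = mase(actual, forecast, history)
```

On this strongly seasonal toy data the seasonal-naive forecast scores far better than last-value repetition, which is exactly why both baselines are worth checking before training anything heavier.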
In this picture we can see the architecture of the Temporal Fusion Transformer, and as you can see it's quite complicated, but we will not talk about the technical details; we will focus on the advantages of this model. So, first of all, good results according to the paper, compared to statistical and neural network models. A very, very big advantage of the Temporal Fusion Transformer is the fact that it works with multivariate time series with different types of data: categorical, continuous, or static. The Temporal Fusion Transformer also has an implemented variable selection network, so it allows us to save time during data preparation. Its results are interpretable thanks to the attention mechanism. It also works with known future inputs, and that allows us to create conditional predictions. In general, it's applicable without modification to a wide range of problems, and we can obtain explainable predictions. The second chosen model was N-BEATS. N-BEATS outperformed the winning model from the prestigious M4 competition; it means that it achieved the highest scores on 100,000 time series from different domains, so it's a proof of quality. It's designed for univariate time series. Its results are also interpretable, thanks to special blocks which try to explain the trend and seasonality. To sum up: TFT and N-BEATS both have very good scores and try to deliver interpretable predictions, but they are optimized for different types of data. N-BEATS is optimized for univariate time series, and TFT is optimized for any type of time series with any types of data. Okay, so for the neural network training we use early stopping, a learning rate scheduler, and gradient clipping, and sometimes, but not in this case, we also use a Bayesian-based framework like Optuna for hyperparameter and network architecture optimization. Okay, so let's move to the results.
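The early-stopping part of this training setup can be sketched with plain Python; frameworks such as PyTorch Lightning provide this as a ready-made callback, and the validation-loss sequence below is made up for illustration.

```python
# Sketch of early stopping with patience: stop training once the
# validation loss has not improved for `patience` consecutive epochs,
# and keep the weights from the best epoch seen so far.

def train_with_early_stopping(val_losses, patience=3):
    """Return (epoch training stops at, best loss, epoch of best loss)."""
    best, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # no improvement for `patience` epochs
                return epoch, best, best_epoch
    return len(val_losses) - 1, best, best_epoch

# Loss improves until epoch 3, then degrades: training stops at epoch 6
# and the epoch-3 checkpoint would be restored (the 0.5 is never reached).
stopped, best, best_epoch = train_with_early_stopping(
    [1.0, 0.8, 0.7, 0.65, 0.66, 0.7, 0.72, 0.5])
```

The trade-off is visible in the example: patience saves compute but can stop before a late improvement, which is why the patience value itself is a tuning knob.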
The exact metric results will be presented later in the table, but here we have a GIF of the N-BEATS performance on the test set. This gray rectangle represents the prediction horizon, so its width is equal to six, because our horizon is equal to 6 hours. The same GIF was prepared for TFT. The data are noisy, but the predictions sometimes look okay, and the model often correctly predicts the future forecast direction, which is good. Here are some predictions for TFT from the test set which are actually very good. They were selected randomly, but luckily we got very good samples; for sure there are also worse examples in this test set. And here we have a table with different experiments with TFT and N-BEATS and different loss functions for the regression problem, along with the metric results, typical regression metrics like mean absolute error or mean absolute percentage error, et cetera, with the best scores highlighted in green. As we can see, the Temporal Fusion Transformer with quantile loss scored the best. The mean absolute errors for the naive predictions were around 25, so our neural networks clearly learned something and are significantly better than the naive predictions. The next question is: can we do better? Of course, we can try to optimize hyperparameters or the network architecture, but there is one thing which requires less time and is extremely effective: ensembling. Even models with the same architecture, trained with different loss functions, input lengths, training hyperparameters, or transformations, can contribute to score improvement in ensembling. And here we have a proof. These are experiments with TFT or N-BEATS, differing only in the loss function.
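The ensembling step described here, averaging the forecasts of models trained with different losses, can be sketched as follows. The per-model predictions are invented; the point is only that averaging can beat every single model when their errors are not perfectly correlated.

```python
# Sketch of simple forecast ensembling: element-wise averaging of
# predictions from models that differ only in their training loss.
# All prediction values below are made up for illustration.

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def ensemble(predictions):
    """Element-wise mean of several models' forecasts."""
    return [sum(col) / len(col) for col in zip(*predictions)]

actual = [10.0, 12.0, 11.0, 13.0]
model_preds = {
    "quantile": [11.0, 12.5, 10.0, 13.5],   # each model errs in different places
    "mae":      [ 9.0, 11.5, 12.0, 12.5],
    "rmse":     [10.5, 13.0, 10.5, 14.0],
}
combined = ensemble(list(model_preds.values()))
errors = {name: mae(actual, p) for name, p in model_preds.items()}
ensemble_error = mae(actual, combined)
```

Because the three toy models over- and under-shoot at different timestamps, their average lands closer to the truth than any of them alone, which mirrors the improvement reported in the talk.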
So, for example, we used quantile loss, or mean absolute error loss, or root mean square error loss, or a new loss function, the delayed loss, which is significantly different from these other losses, and we end up with over 15 percent mean absolute error improvement over the best single model. Even single models with a low score, like this one, TFT with delayed loss, the worst model from all experiments with TFT, contribute positively to the score improvement. As I mentioned, both TFT and N-BEATS aim to give explainable predictions, and here we have the results obtained from TFT. The first plot shows the values' importance in time: the higher the value here, the more important that time point was during the predictions. In this case, the most influential data were measured 168 hours, so seven days, before the prediction time, which suggests that we may have weekly seasonality here. And here we have the feature importances from the variable selection submodel. As expected, the most important feature is nitrogen dioxide, so our target. The decoder variable importance plot, for the known future values like those connected with time, shows which features were the most important among the known future inputs. With a small modification of the architecture, we can also see which features were the most important for a specific timestamp. As I mentioned, to obtain such a result we need to slightly change the architecture, so there is no guarantee that the model will work as well as the original one on any type of data, but for some examples it also works. It relies on the same mechanism as the original TFT, so it uses the attention layer for explainability. Okay, and that's all. Thank you all for your attention.

Pawel Skrzypek



Anna Warno

Data Scientist @
