Thursday, October 24, 2019
Our Data
Looking at the data visualization above, it is quite evident that English-language pages receive more hits than pages in any other language. We can use this information when predicting peak values in the time series, which is our goal, because we want to optimize the use of servers. If the web page we are predicting for is not in English, we can expect that fewer servers are required to host it.
Other classifications:
PACF and ACF cutoff for tunable parameters in ARIMA model
We look at the cutoff point of the PACF for the autoregressive (AR) order and the cutoff point of the ACF for the moving-average (MA) order. Both must be taken into account to get optimal results.
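To make the cutoff idea concrete, here is a minimal sketch of how the sample ACF and a rough cutoff lag could be computed by hand. The function names `sample_acf` and `cutoff_lag` and the 1.96/sqrt(n) band are our own assumptions for illustration; they are not the code used for the plots in this post. In practice a library routine such as `plot_acf` and `plot_pacf` from statsmodels would usually be used, and the PACF is not computed here.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def cutoff_lag(acf_values, n_obs, z=1.96):
    """Last lag before the ACF first falls inside the approximate 95% band.

    The band z / sqrt(n_obs) is a common rule of thumb (an assumption here).
    """
    band = z / math.sqrt(n_obs)
    for k in range(1, len(acf_values)):
        if abs(acf_values[k]) < band:
            return k - 1
    # No lag fell inside the band within the lags we looked at.
    return len(acf_values) - 1
```

The lag returned by `cutoff_lag` for the ACF would be a starting guess for q. The same cutoff rule applied to PACF values would suggest p.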
We cannot inspect every time series individually, so we picked a random page and visualized what is actually happening in it.
As we move forward with preparing our model, we plan to tune the p, d, q parameters of the ARIMA model by minimizing the error over the whole dataset.
Here's a link you can refer to for more information:
https://people.duke.edu/~rnau/411arim3.htm
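The planned tuning of p, d, q can be pictured as a plain grid search. The sketch below is only an outline under assumptions: `grid_search_orders` is a hypothetical helper, and `score` stands for any function you supply that fits an ARIMA model with the given order and returns its error (lower is better). No actual model fitting is shown.

```python
from itertools import product


def grid_search_orders(score, p_values, d_values, q_values):
    """Return (best_order, best_error) over all (p, d, q) combinations.

    `score` is a user-supplied callable (hypothetical here) that takes an
    order tuple (p, d, q) and returns an error value, lower being better.
    """
    best_order, best_error = None, float("inf")
    for order in product(p_values, d_values, q_values):
        error = score(order)
        if error < best_error:
            best_order, best_error = order, error
    return best_order, best_error
```

To minimize the error over the whole dataset, `score` could average the per-series errors for a given order, so that a single (p, d, q) configuration is chosen for all pages.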
Data Visualization Part-3: Seasonality and Trend Analysis
One major task we performed before building our forecasting model was smoothing. We need to detrend the data and account for the seasonal component, which also helps us deal with the noise in the data. As we saw, the original data contained a pronounced peak. That peak can distort our results, so we need to handle it.
For a better understanding, refer to: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
[Figure: Decomposition of Time-series Data]
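As a rough illustration of what such a decomposition does, here is a minimal additive sketch: a centered moving average for the trend, per-position averages for the seasonal part, and whatever is left as the residual. The function name `decompose_additive` and the restriction to odd periods are our assumptions. Treating daily hits as having a weekly cycle (period 7) is also an assumption, not something established above. A library routine such as `seasonal_decompose` from statsmodels is the more usual choice.

```python
def decompose_additive(series, period):
    """Split `series` into trend, seasonal and residual lists (additive model).

    Sketch only: handles odd periods, and leaves None where the centered
    moving-average window does not fit at the edges.
    """
    if period < 3 or period % 2 == 0:
        raise ValueError("this sketch only handles odd periods >= 3")
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # Trend: centered moving average over one full period.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half : t + half + 1]) / period

    # Seasonal: mean detrended value for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center the seasonal effects around zero
    seasonal = [means[t % period] - offset for t in range(n)]

    # Residual: what is left after removing trend and seasonality.
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual
```

A large isolated peak would show up mostly in the residual component, which is one way to spot values that may hamper the forecast.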
Data Visualization Part-2
A basic assumption made by many models used for time series analysis is that the series is stationary. If the original series is found to be non-stationary, we apply transformations to obtain a stationary series. We took a random time series and applied first- and second-order differencing to show how transformations can affect stationarity.
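Differencing itself is a very small operation. The sketch below shows it in plain Python, assuming a hypothetical helper named `difference`. It is not the code used to produce the plots in this post.

```python
def difference(series, order=1):
    """Apply differencing `order` times: each pass replaces x[t] by x[t] - x[t-1]."""
    result = list(series)
    for _ in range(order):
        # Each pass shortens the series by one element.
        result = [b - a for a, b in zip(result, result[1:])]
    return result
```

For example, first-order differencing removes a linear trend, and second-order differencing turns the quadratic series 0, 1, 4, 9, 16 into the constant series 2, 2, 2.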
To explore more about the importance of stationarity in time series prediction, follow the link below:
https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322
Data Visualization Part-1
The dataset is huge, so we cannot make visualizations for every series. We decided to visualize a single time series and generalize the observations to the others, although each series will still be analyzed separately.
Here is the original data for one time series plotted against date-time. In this case the variation is large: the number of hits fluctuates considerably, between 0 and 200, and the peak that we see can be considered an outlier.
Friday, October 18, 2019
Handling Missing values
We noticed that there were a lot of missing values, but these missing (NA) values were located at the beginning of the individual time series. This suggests that the web page was added to the domain after the given date, so it was best to replace them with 0.
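A minimal sketch of this replacement is shown below. It assumes missing values appear as `None` or NaN, and that only the leading run of missing values should become 0, so gaps later in a series are left untouched. The helper names `is_missing` and `fill_leading_missing` are hypothetical.

```python
import math


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_leading_missing(series, fill_value=0):
    """Replace the leading run of missing values with `fill_value`.

    Assumption: a leading gap means the page did not exist yet, so it had
    zero hits. Missing values after the first real observation are kept.
    """
    filled = list(series)
    for i, value in enumerate(filled):
        if not is_missing(value):
            break
        filled[i] = fill_value
    return filled
```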