Climatological Spatial Data Fitting using Machine Learning Techniques

The performance of military equipment is significantly affected by instantaneous environmental and atmospheric conditions. A training simulator should therefore be able to localize the simulated equipment's performance based on its geo-coordinates and time. Hence, a module that predicts the climate at a given geo-coordinate by learning from the climatological data of nearby stations becomes a critical part of a training simulator. This article shares the experience of developing a machine-learning-based module that caters to large geographical and climatic variations. Daily average temperature data for the year 2018 from 125 Indian locations was selected. The source was the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration (NOAA). Code was developed for basic statistical analysis and data wrangling, including removing format inconsistencies and imputing missing values. The pre-processed data was split into training, validation and testing sets, and 10-fold cross-validation was used.
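
The wrangling and splitting steps described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the data is synthetic rather than the NCEI records, the column names (`station`, `lat`, `lon`, `day`, `temp_c`), the 80/20 hold-out ratio and the choice of linear interpolation for imputation are all hypothetical, and the original module may have handled these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split


def make_synthetic_stations(n_stations=125, n_days=30, seed=0):
    """Build a stand-in dataset (NOT the NCEI data) with about 5% missing readings."""
    rng = np.random.default_rng(seed)
    station = np.repeat(np.arange(n_stations), n_days)
    day = np.tile(np.arange(1, n_days + 1), n_stations)
    lat = np.repeat(rng.uniform(8.0, 35.0, n_stations), n_days)
    lon = np.repeat(rng.uniform(68.0, 97.0, n_stations), n_days)
    # Hypothetical temperature trend: cooler towards the north, seasonal cycle, noise.
    temp = (35.0 - 0.6 * (lat - 8.0)
            + 5.0 * np.sin(2 * np.pi * day / 365.0)
            + rng.normal(0.0, 1.0, station.size))
    temp[rng.random(temp.size) < 0.05] = np.nan  # simulate missing values
    return pd.DataFrame({"station": station, "lat": lat, "lon": lon,
                         "day": day, "temp_c": temp})


def impute_missing(df):
    """Fill gaps per station by linear interpolation along the day axis."""
    out = df.sort_values(["station", "day"]).copy()
    out["temp_c"] = out.groupby("station")["temp_c"].transform(
        lambda s: s.interpolate(limit_direction="both"))
    # Fallback for a station with no valid reading at all: use the global mean.
    out["temp_c"] = out["temp_c"].fillna(out["temp_c"].mean())
    return out


def split_data(df, seed=0):
    """Hold out a test set, then prepare 10 folds over the remaining rows."""
    train_val, test = train_test_split(df, test_size=0.2, random_state=seed)
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    folds = list(kfold.split(train_val))  # list of (train_idx, val_idx) pairs
    return train_val, test, folds
```

Each entry of `folds` holds positional indices into `train_val`, so one fold's validation rows can be retrieved with `train_val.iloc[val_idx]`.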

The learning algorithm was implemented with the SVR (Support Vector Regression) class of the sklearn module in Python. The underlying optimization algorithm was Quadratic Programming from CVXOPT. The following kernels were used and compared: linear, 2nd-, 3rd- and 4th-order polynomial, and Gaussian Radial Basis Function (RBF). Finally, the RBF kernel was selected because the difference between the mean approximation error on the training set and the mean prediction error on the testing set was smaller than with the other kernels. The hyper-parameters of the optimization problem, i.e. the constant C and the deviation epsilon, were selected by cross-validation. The final training resulted in a VC dimension of 27 and an in-sample error of 1.80 with a variance of 0.0116.
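
A minimal sketch of the kernel comparison and the hyper-parameter search is given below. Several things in it are assumptions and not taken from the original work: the data is synthetic, the features are standardized, the error measure is mean absolute error, and the grids for `C` and `epsilon` are arbitrary. scikit-learn's `SVR` uses its bundled libsvm solver, so the CVXOPT quadratic-programming step mentioned above is not reproduced here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def synthetic_temperature(n=300, seed=0):
    """Synthetic (lat, lon, day) -> temperature samples; NOT the NCEI data."""
    rng = np.random.default_rng(seed)
    lat = rng.uniform(8.0, 35.0, n)
    lon = rng.uniform(68.0, 97.0, n)
    day = rng.integers(1, 366, n)
    X = np.column_stack([lat, lon, day])
    y = (35.0 - 0.6 * (lat - 8.0)
         + 5.0 * np.sin(2 * np.pi * day / 365.0)
         + rng.normal(0.0, 1.0, n))
    return X, y


# The five kernels named in the text, expressed as SVR keyword arguments.
KERNELS = {
    "linear": {"kernel": "linear"},
    "poly2": {"kernel": "poly", "degree": 2},
    "poly3": {"kernel": "poly", "degree": 3},
    "poly4": {"kernel": "poly", "degree": 4},
    "rbf": {"kernel": "rbf"},
}


def kernel_gaps(X, y, seed=0):
    """Return {kernel: (train_error, test_error, |test_error - train_error|)}."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    gaps = {}
    for name, params in KERNELS.items():
        model = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.1, **params))
        model.fit(X_tr, y_tr)
        train_err = mean_absolute_error(y_tr, model.predict(X_tr))
        test_err = mean_absolute_error(y_te, model.predict(X_te))
        gaps[name] = (train_err, test_err, abs(test_err - train_err))
    return gaps


def tune_rbf(X, y, seed=0):
    """Select C and epsilon for the RBF kernel by 10-fold cross-validation."""
    pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    grid = {"svr__C": [0.1, 1.0, 10.0, 100.0],
            "svr__epsilon": [0.01, 0.1, 0.5, 1.0]}
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    search = GridSearchCV(pipe, grid, cv=cv, scoring="neg_mean_absolute_error")
    search.fit(X, y)
    # best_score_ is a negated error, so flip the sign back.
    return search.best_params_, -search.best_score_
```

Picking the kernel with the smallest third entry in `kernel_gaps` mirrors the selection criterion described above. The ranking on synthetic data says nothing about the ranking on the real station data.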

A comparison was also made with four more machine learning techniques: Gaussian Process Regression (GPR), Kernel Ridge Regression (KRR), K-Nearest Neighbours (KNN) and Random Forest Regression (RFR), each with its own cross-validation process for model selection. The results indicated that the generalization error was fairly low for the SVR and KNN methods. GPR fitted the training data almost perfectly but barely generalized. KRR could neither fit nor generalize over the learning data, while RFR was extremely time-consuming because the number of estimators was large. In conclusion, while multiple ML methods were successful in minimizing the generalization error, more data would have helped to minimize the fitting errors as well. Additional variables that influence large-scale trends, such as the normalized difference vegetation index, the modified normalized difference water index, albedo and solar radiation, would also have changed the errors considerably.
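
The comparison of in-sample and out-of-sample errors across the four additional methods can be sketched as follows. This is an illustration only: the data is synthetic, the hyper-parameters are fixed placeholder values (no per-model selection is performed here, unlike in the work described), and mean absolute error is an assumed metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def synthetic_temperature(n=200, seed=0):
    """Synthetic (lat, lon, day) -> temperature samples; NOT the NCEI data."""
    rng = np.random.default_rng(seed)
    lat = rng.uniform(8.0, 35.0, n)
    lon = rng.uniform(68.0, 97.0, n)
    day = rng.integers(1, 366, n)
    X = np.column_stack([lat, lon, day])
    y = (35.0 - 0.6 * (lat - 8.0)
         + 5.0 * np.sin(2 * np.pi * day / 365.0)
         + rng.normal(0.0, 1.0, n))
    return X, y


def build_models():
    """Placeholder hyper-parameters; a real study would tune each model."""
    return {
        "GPR": GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                        normalize_y=True),
        "KRR": KernelRidge(kernel="rbf", alpha=1.0),
        "KNN": KNeighborsRegressor(n_neighbors=5),
        "RFR": RandomForestRegressor(n_estimators=50, random_state=0),
    }


def compare_models(X, y, seed=0):
    """Return {model: (mean in-sample error, mean out-of-sample error)}."""
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    results = {}
    for name, estimator in build_models().items():
        pipe = make_pipeline(StandardScaler(), estimator)
        scores = cross_validate(pipe, X, y, cv=cv,
                                scoring="neg_mean_absolute_error",
                                return_train_score=True)
        # Scores are negated errors, so flip the sign when averaging.
        results[name] = (-scores["train_score"].mean(),
                         -scores["test_score"].mean())
    return results
```

A large gap between the two numbers for a model indicates fitting without generalizing, which is the behaviour reported above for GPR. A model for which both numbers are high neither fits nor generalizes.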

Keywords: Data Wrangling, Climate Data, Machine Learning, Support Vector Regression, Radial Basis Function Kernel