FORCE 2020 Machine Learning Lithology Prediction Winning Solution

Ibrahim Olawale
7 min read · Nov 20, 2020

Exactly this time last year I had just begun my data science journey, so machine learning and programming were still quite new to me. Over the past year I spent a lot of time and resources learning and practicing various machine learning and deep learning techniques for solving problems. It therefore came as a shock when I received the news that my submission to the FORCE 2020 machine learning lithology prediction competition made it into the top 10, since my solution had ranked 24th on the open test leaderboard. The shock was all the greater because I had lost confidence in my model just a few days before the end of the competition: the scores of the top leaderboard teams looked as though a data hack had been discovered. I relentlessly tried to find that hack, experimenting with new and "strange" techniques during those final hours, but with no significant model improvement.

The objective of the competition was to correctly predict lithology labels from the provided well logs, NPD lithostratigraphy and well X, Y positions. The provided dataset contains well logs, interpreted lithofacies and lithostratigraphy for 90+ released wells from offshore Norway. The features include the well name (WELL), the measured depth, the x, y, z location of each wireline measurement, and the well logs CALI, RDEP, RHOB, DRHO, SGR, GR, RMED, RMIC, NPHI, SP, among others.

Having been a victim of overfitting to the public leaderboard in several past competitions, I took extra care in this one to avoid it. All models used for making predictions were extensively validated on different sets of validation wells. Also, because of the volatile nature of the default evaluation metric (the penalty matrix provided), a proper and confident validation routine was needed. Different train and validation sets were prepared from the full training data: each split consisted of 78 training wells and two validation sets of 10 randomly selected wells each.
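To make that validation routine concrete, here is a minimal sketch (not the exact competition code) of a well-based split plus a penalty-matrix scorer. The file path and separator are assumptions, `penalty_matrix` is assumed to be the organizers' 12x12 matrix loaded as a NumPy array, and the lithology labels are assumed to already be mapped to integer indices 0-11.

```python
import numpy as np
import pandas as pd

def matrix_score(y_true, y_pred, penalty_matrix):
    """Average penalty between true and predicted class indices (lower is better)."""
    return np.mean(penalty_matrix[y_true, y_pred])

# load the provided training data (path and separator are assumptions)
train = pd.read_csv("train.csv", sep=";")

wells = train["WELL"].unique()
rng = np.random.default_rng(42)
rng.shuffle(wells)

# 78 training wells plus two validation sets of 10 randomly selected wells each
val_wells_1, val_wells_2, train_wells = wells[:10], wells[10:20], wells[20:]

train_df = train[train["WELL"].isin(train_wells)].copy()
val_df_1 = train[train["WELL"].isin(val_wells_1)].copy()
val_df_2 = train[train["WELL"].isin(val_wells_2)].copy()
```

Splitting by well, rather than by row, keeps all samples from one borehole on the same side of the split, which is what makes the validation scores trustworthy for new wells.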

Initial Approaches

I approached the problem by testing different algorithms to find the one best suited to the challenge and the data provided. Random Forest, CatBoost, XGBoost and LightGBM were the initial models used. These models were trained without any cross-validation, as a baseline was all I had set out to get at first. Because of the size of the dataset, it was difficult to get the training processes for the gradient boosting algorithms to converge.

Different model ensemble methods were also tried. Several stacking styles were used, e.g. creating different subsets of the data based on the Group and Formation each data point belonged to, using different algorithms on the full training set and stacking their predictions, using different feature subsets based on the occurrence of logs in the wells, and using different feature subsets based on feature importance. XGBoost, CatBoost and random forest were used as the meta learners for the data subsets, and a logistic regression model was also employed as a meta learner. The problem was additionally framed as binary classification: 12 different models were built for the 12 labels, and the probabilities from each model were used as features for a meta learner that made the final predictions, as sketched below.
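A hedged sketch of that one-vs-rest stacking idea: one binary XGBoost model per lithology label, with the per-label probabilities fed to a logistic regression meta learner. `X_train`, `y_train`, `X_val` and `y_val` are placeholders for the prepared feature/label arrays, and the booster settings are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

labels = np.unique(y_train)  # the 12 lithology classes

def fit_binary_models(X_fit, y_fit):
    """Fit one binary XGBoost model per lithology class."""
    models = {}
    for label in labels:
        clf = XGBClassifier(n_estimators=100, max_depth=6)
        clf.fit(X_fit, (y_fit == label).astype(int))
        models[label] = clf
    return models

def stack_probabilities(models, X_out):
    """Stack each model's positive-class probability as a meta-feature column."""
    return np.column_stack([models[label].predict_proba(X_out)[:, 1] for label in labels])

binary_models = fit_binary_models(X_train, y_train)
meta_train = stack_probabilities(binary_models, X_train)  # out-of-fold probabilities would avoid leakage
meta_val = stack_probabilities(binary_models, X_val)

meta_learner = LogisticRegression(max_iter=1000)
meta_learner.fit(meta_train, y_train)
val_preds = meta_learner.predict(meta_val)
```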

I resorted to the random forest model for making predictions, as it performed better than every other initial approach. Other algorithms were later reintroduced with a K-fold CV technique; their performances were compared against the random forest model, and the XGBoost model came out best.

Final Approach

My approach focused on finding the right features/logs for training my XGBoost model and tweaking the model hyperparameters to get optimal performance.

The final submitted solution was a highly regularized XGBoost model trained with 10-fold cross-validation. The model underwent a lot of manual hyperparameter tweaking. Every hyperparameter change was validated on the validation sets and the results were compared against the open test leaderboard. All submitted solutions' scores were documented along with their validation scores, which allowed an effective comparison of the models and guided the selection of the final model for submission.
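A minimal sketch of that training routine, assuming `X`, `y` and `X_test` are the prepared NumPy feature/label arrays and `params` is the tuned hyperparameter dictionary collected in the next section. Each fold's class probabilities on the open test wells are averaged before taking the final class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
test_probabilities = np.zeros((len(X_test), 12))  # 12 lithology classes

for train_idx, valid_idx in kfold.split(X, y):
    model = XGBClassifier(**params)
    model.fit(X[train_idx], y[train_idx],
              eval_set=[(X[valid_idx], y[valid_idx])],
              verbose=False)
    # average the class probabilities across the 10 folds
    test_probabilities += model.predict_proba(X_test) / kfold.n_splits

test_predictions = test_probabilities.argmax(axis=1)
```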

Final Model Hyperparameter Tuning

After identifying XGBoost as the best model available to me for the competition, I set about fine-tuning it to get the best performance out of it. Because of the size of the dataset and limited computational resources, it was difficult to run GridSearchCV for hyperparameter tuning, so the hyperparameters were tuned manually and each configuration's performance was checked on the validation sets. The default metric, accuracy and F1 score were used for performance evaluation.

The max_depth was tuned with small (3, 4, 5) and large (8, 9, 10, 11, 12) values. The larger values performed better; a max depth of 11 had the best performance on the validation sets but gave very little improvement on the open test wells compared to a depth of 10. Given this small difference, and the risk of overfitting with a large max depth, 10 was selected as the final value. The model was also heavily regularized with a large regularization lambda (reg_lambda) value of 1500, which significantly improved it. The values were tuned manually between 500 and 2000 with a step of 200, and 1500 emerged as the optimal value.
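The manual sweep amounts to a simple loop; here is a hedged sketch of the reg_lambda search, where `base_params`, the train/validation arrays and the `matrix_score`/`penalty_matrix` helpers are placeholders carried over from the earlier sketches.

```python
from xgboost import XGBClassifier

# try reg_lambda from 500 to 2000 in steps of 200 and score each on the validation wells
for reg_lambda in range(500, 2001, 200):
    model = XGBClassifier(**{**base_params, "reg_lambda": reg_lambda})
    model.fit(X_train, y_train)
    val_preds = model.predict(X_val)
    print(reg_lambda, matrix_score(y_val, val_preds, penalty_matrix))
```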

The colsample_bytree and subsample hyperparameters were also tuned between 0.5 and 1 to regularize the training process; 0.9 was the optimal value for both. The shuffle parameter of the StratifiedKFold splitter was set to True instead of the default (False), so the folds were not built from contiguous, ordered samples and the training process did not converge too quickly. Then the number of trees and the learning rate were tuned to get the best number of iterations and step size. This greatly improved the validation scores but gave only a small improvement on the open test wells.
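The tuned values described above, gathered into the `params` dictionary used in the cross-validation sketch earlier. The exact number of trees and learning rate are not stated in this article, so those two entries are placeholders, and gpu_hist is an assumption based on the Colab GPU training mentioned later.

```python
params = dict(
    n_estimators=1000,       # placeholder: tuned, exact value not given here
    learning_rate=0.1,       # placeholder: tuned, exact value not given here
    max_depth=10,            # 11 scored marginally better but risked overfitting
    reg_lambda=1500,         # heavy L2 regularization, tuned over 500-2000
    subsample=0.9,
    colsample_bytree=0.9,
    tree_method="gpu_hist",  # assumption: GPU training on Colab (xgboost 1.x syntax)
)
```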

Feature Selection, Engineering and Augmentation

All features were initially used with the final model, aside from the SGR log, which was entirely missing from the open test wells. Other features with low occurrence in both training and test wells were then removed and the resulting model performances were compared; performance on the validation sets guided the selection of features used in the model. The SGR, DTS, ROPA and RXO logs, as well as the lithofacies confidence column, were removed from all wells (training and validation alike), which also acted as a form of regularization.
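A short sketch of that dropping step, applied to the DataFrames from the split sketched earlier; the confidence column name is assumed to match the competition CSV.

```python
drop_cols = ["SGR", "DTS", "ROPA", "RXO", "FORCE_2020_LITHOFACIES_CONFIDENCE"]

for df in (train_df, val_df_1, val_df_2):
    df.drop(columns=[c for c in drop_cols if c in df.columns], inplace=True)
```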

The GROUP, FORMATION and WELL columns were label encoded. These three features greatly improved performance for every model, on both the validation sets and the open test sets. Several feature generation techniques were tried; they slightly improved performance on the validation sets but failed to do better on the open test sets. It is also worth noting that open test performance was used as an additional check, since overfitting the validation sets was a concern as well. The X and Y columns were used to create cluster features that were fed into the model, but this resulted in worse performance.
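A minimal sketch of the categorical encoding, fitting one encoder per column on the train and validation frames together so the codes stay consistent across wells. (The X/Y cluster features mentioned above could be built with, e.g., a KMeans on the two coordinates, but since they hurt performance they are omitted here.)

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

for col in ["GROUP", "FORMATION", "WELL"]:
    le = LabelEncoder()
    # fit on all splits so every category seen in validation has a code
    le.fit(pd.concat([train_df[col], val_df_1[col], val_df_2[col]]).astype(str))
    for df in (train_df, val_df_1, val_df_2):
        df[col] = le.transform(df[col].astype(str))
```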

Paolo Bestagini’s feature augmentation technique from the 2016 SEG ML competition was also employed, and it significantly improved the model’s performance. The functions were copied and used directly with no alterations. Credit and thanks to the ISPL team for their work and code.
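The original augmentation adds, per well, both depth gradients and windowed (shifted) copies of the logs; below is a hedged sketch of just the gradient part, assuming the depth column is named DEPTH_MD and using a placeholder list of logs. See the linked repositories for the original functions.

```python
import numpy as np

def add_gradient_features(df, well_col="WELL", depth_col="DEPTH_MD", log_cols=None):
    """Append the per-well depth gradient of each log as an extra feature."""
    log_cols = log_cols or ["GR", "RHOB", "NPHI", "RDEP", "RMED"]
    out = df.copy()
    for _, group in df.groupby(well_col):
        depth = group[depth_col].values
        for col in log_cols:
            out.loc[group.index, col + "_grad"] = np.gradient(group[col].values, depth)
    return out

train_df = add_gradient_features(train_df)
```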

Initial Challenges

The competition was the first multiclass competition I had participated in, and it was also my first geoscience ML competition. In addition, the data is what I would call “big data”, judging by the number of data points present (over a million rows), so training the models was both time-consuming and computationally expensive. To get GPU access, I resorted to Google Colab for training (thanks to “Generous Google” for sparing their GPUs), although GPU usage was limited to about five hours per day, after which a usage limit was imposed. I also had to spend much of the time training my models and improving my code from my mobile phone, owing to poor power supply and a failing computer battery.

Final Challenges

The main challenge at the end of the competition was trying to optimize the loss function directly with the penalty matrix and default score provided, in order to improve model performance. This was attempted with neural networks and with the XGBoost model, but I was unable to get it done before the submission deadline. I am still looking into implementing it, and it would be great to know whether any of the other participants managed to do so.
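One possible direction, offered as a hedged sketch rather than the author's implementation: keep the standard multiclass log-loss objective but expose the penalty-matrix score as a custom evaluation metric, so that early stopping and model selection track the competition score. `penalty_matrix` is the 12x12 array assumed earlier, and the predictions are assumed to come from a multi:softprob objective. Turning the matrix into an actual training objective would additionally require a custom gradient/hessian.

```python
import numpy as np

def penalty_eval(preds, dtrain):
    """Custom eval: average penalty between true and argmax-predicted classes."""
    y_true = dtrain.get_label().astype(int)
    y_pred = preds.reshape(len(y_true), -1).argmax(axis=1)  # assumes multi:softprob output
    return "penalty", float(np.mean(penalty_matrix[y_true, y_pred]))

# usage with the low-level API (the keyword is feval in older xgboost releases,
# custom_metric in 1.6+), e.g.:
# booster = xgb.train(params, dtrain, evals=[(dvalid, "valid")], custom_metric=penalty_eval)
```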

Thanks to the organizers and everyone who shared notebooks for one thing or the other.

My final submission notebook is provided below (the code is still messy and should undergo refactoring soon):

https://github.com/olawaleibrahim/2020_FORCE_Lithology_Prediction/blob/main/FORCE_Submission_File.ipynb

Challenge GitHub repository:

https://github.com/olawaleibrahim/2020_FORCE_Lithology_Prediction/

Link to competition/github page:

https://github.com/bolgebrygg/Force-2020-Machine-Learning-competition

As per contest rules: The well log labels used in this repo are licensed CC-BY-4.0. The well log data used in this repo is licensed as Norwegian License for Open Government Data (NLOD) 2.0. Any publication involving the well log data must cite “Lithofacies data was provided by the FORCE Machine Learning competition with well logs and seismic 2020”. For citation please use: Bormann P., Aursand P., Dilib F., Dischington P., Manral S. 2020. 2020 FORCE Machine Learning Contest.
