INTRODUCTION
Classification is the process of studying the structure of a collection of sample data that has been partitioned into groups known as categories or classes. This learning process is usually achieved by building a model, which is then used to predict the group (or class label) of one or more previously unseen data examples whose labels are unknown (Aggarwal, 2015). Because portions of the data set are separated into different classes and assigned class identities, called class labels, classification is also referred to as supervised learning (Chauhan et al., 2021). One classification method is Extreme Gradient Boosting (XGBoost), an algorithm that is widely implemented in machine learning and used in various Kaggle competitions. Several studies have shown that the XGBoost technique gives better results than several other boosting techniques (T. Chen & Guestrin, 2016). A popular approach for estimating the performance of a classification algorithm, or for comparing the performance of two classification algorithms on a data set, is cross-validation (Wong, 2015). Cross-validation randomly divides the data set into k folds of almost equal size. In classification, one consideration when dividing the data set into k partitions is stratification: the data set must be partitioned into k folds with a balanced class composition in each partition. In other words, the class distribution of each partition should be the same across classes, which means it is also the same as the class distribution in the original data set.
The existing literature primarily focuses on the performance of either cross-validation or extreme gradient boosting algorithms individually, often overlooking the specific challenges posed by imbalanced data. Consequently, there is a need for comprehensive research that specifically investigates the combined impact of these two techniques on unbalanced data classification tasks. Understanding how cross-validation techniques can optimize model performance and improve generalization on imbalanced datasets, in conjunction with the powerful ensemble learning capabilities of extreme gradient boosting, can provide valuable insights for developing robust and accurate classification models. Ultimately, the results of this study aim to contribute to the existing body of knowledge and bridge the gap in understanding the combined analysis of cross-validation and extreme gradient boosting for unbalanced data classification in real-world applications.

A. Unbalanced Data
There is a large number of datasets in today's world, implying that data can no longer be handled directly by humans or manual applications. Technologies such as the World Wide Web, engineering and science applications and networks, business services, and more are driving exponential data growth, thanks to the development of robust storage and connectivity tools. Because of the scale of Big Data, organized knowledge and information cannot be easily obtained, understood, or extracted automatically (Fernández et al., 2018). It is known that data itself can produce information but not knowledge. Its real value lies in the possibility of extracting knowledge that is useful for decision-making or for the exploration and understanding of the phenomena that generate the data. At present, the volume of data managed by our systems has exceeded the processing capacity of traditional methods, and this also applies to data mining.
The emergence of Big Data also brings new problems and challenges to the class imbalance problem. First, the standard approaches to pre-processing and cost-sensitive learning had to be redesigned (sometimes entirely) to adapt their procedures to the new MapReduce-style distributed frameworks. Second, the data partitioning associated with this type of processing can result in a lack of minority class examples and/or the generation of small disjuncts. Finally, we must consider not only the increasing volume of data but also the nature of the problem itself. In current Big Data applications, the incoming data may be heterogeneous and/or atypical, i.e., what is known as Variety of data. This can force algorithms to handle graph-based structures (for social networks) or video sequences (for computer vision).

B. Extreme Gradient Boosting (XGBoost)
XGBoost is an algorithm that is very often applied in machine learning and Kaggle competitions for structured or tabular data. XGBoost is a gradient boosted decision tree implementation designed for speed, performance, and scalability.

Figure 1 Decision Tree XGBoost
Research conducted by (J. Chen, 2018) implemented a gradient boosting machine, an innovative system in terms of features and algorithm optimization that is ten times faster than other machine learning algorithms. XGBoost is a machine learning technique for solving regression and classification problems that produces a predictive model as an ensemble of weak prediction models, usually decision trees. It aims to build the model in the same way as other boosting techniques.

C. Cross Validation
Cross-validation is a model validation technique to assess how a model performs against other, independent datasets (Nurzhaputra, 2017). There are two types of cross-validation: K-Fold Cross Validation and Monte-Carlo Cross Validation. The model validation method used in this study is k-fold cross-validation. In k-fold cross-validation, a dataset D is randomly partitioned into k partitions. The machine learning model is trained on k-1 partitions, and its generalization ability is then tested, according to an evaluation metric, on the data partition that was not used in the training process. This process is repeated until every data partition has been used to test the model. In this study, the K-Fold Cross Validation method was used with a value of k = 10.
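As an illustration, the following is a minimal sketch of this procedure in Python using scikit-learn and the xgboost package; the data here are randomly generated placeholders, not the dataset of this study, and the exact code used by the authors is not shown in the paper.

```python
# Minimal sketch of k-fold cross-validation (k = 10) with placeholder data.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = np.random.rand(100, 5)           # placeholder feature matrix
y = np.random.randint(0, 2, 100)     # placeholder binary labels

kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = XGBClassifier(eval_metric="logloss")
    model.fit(X[train_idx], y[train_idx])     # train on k-1 partitions
    preds = model.predict(X[test_idx])        # test on the held-out partition
    scores.append(accuracy_score(y[test_idx], preds))

print(f"Mean accuracy over 10 folds: {np.mean(scores):.4f}")
```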

A. Research Workflow
Figure 3 shows the general research flowchart. The initial stage is inputting the dataset, followed by preprocessing the data to convert non-numeric data into numeric data. After this conversion, a new numerical dataset is formed that is ready to be processed. Because the new dataset no longer contains non-numeric data, the learning stage can be carried out with the XGBoost model, along with optimization of the parameters of the XGBoost model. The best parameters in XGBoost are then used to determine the optimal accuracy of the XGBoost model.

B. Data and Research Equipment Used
In this research, the dataset used is the Telco Customer Churn dataset obtained from the IBM repository.

This dataset contains information about telecommunications customers who have stopped subscribing and customers who are still subscribing. In this dataset, 27% of subscribers have unsubscribed and 73% are still subscribed. The data belongs to the unbalanced data category because it has a majority class and a minority class (Desprez et al., 2022).

C. Identifying Missing Values
Next, the dataset is checked for empty or missing data (missing values). Research shows that the XGBoost method can work on data containing missing values without any special handling. Another solution is to change all blank entries to zero (Rusdah & Murfi, 2020). Missing values in the dataset are found in attributes that have no value; these are then converted to 0 or Not a Number (NaN).
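As a sketch, missing values can be identified and handled with pandas as follows; the file name and the Total Charges column follow the dataset description and are assumptions, not the exact code of this study.

```python
# Minimal sketch: identify missing values and convert blanks to NaN or 0.
import pandas as pd

df = pd.read_csv("Telco_customer_churn.csv")   # illustrative file name

print(df.isnull().sum())                       # count missing values per attribute

# Blank entries in a text column become NaN when coerced to numeric;
# XGBoost can handle NaN values natively during training.
df["Total Charges"] = pd.to_numeric(df["Total Charges"], errors="coerce")

# Alternatively, replace the remaining missing values with zero.
df = df.fillna(0)
```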

D. Dependent and Independent Data
The next step is to separate the dependent and independent data and divide them into training data and test data. The categorical data in the training set is first converted to numeric data using the One Hot Encoding process (Rodríguez et al., 2018).
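A minimal sketch of this step is shown below; the column names follow the dataset description and are illustrative assumptions.

```python
# Minimal sketch: separate dependent/independent data and one-hot encode.
import pandas as pd

df = pd.read_csv("Telco_customer_churn.csv")   # illustrative file name

y = df["Churn Value"]                          # dependent variable (0 = stay, 1 = churn)
X = df.drop(columns=["Churn Label", "Churn Value",
                     "Churn Score", "Churn Reason"])   # independent variables

# One Hot Encoding: expand each categorical column into 0/1 indicator columns.
X = pd.get_dummies(X)
```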

E. Data Stratification
Data stratification is the separation of data into smaller, more clearly defined subsets based on a predetermined set of criteria. It aims to keep the composition of a data variable consistent when experiments are repeated.
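A minimal sketch of a stratified train/test split with scikit-learn, assuming the X and y from the previous sketch; the stratify argument preserves the class proportions in both partitions.

```python
# Minimal sketch: a stratified split keeps the 73%/27% class composition
# of the original dataset in both the training and test partitions.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```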

F. Train XGBoost Model
XGBoost can then be trained to create the models. The performance accuracy of the generated XGBoost model is then evaluated using the confusion matrix. The confusion matrix presents information only in numeric form (Luque et al., 2019). For this reason, if the performance of the classification algorithm is to be displayed in graphical form, the Receiver Operating Characteristic (ROC) curve or the Precision-Recall curve can be used (Cantarino et al., 2019).
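A minimal sketch of this evaluation step, assuming the stratified split from the previous sketch; the plotting details are illustrative.

```python
# Minimal sketch: train XGBoost, then evaluate with a confusion matrix
# (numeric summary) and a ROC curve (graphical summary).
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from xgboost import XGBClassifier

model = XGBClassifier(eval_metric="logloss")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))        # numeric performance summary

# ROC curve drawn from the predicted churn probabilities.
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")       # chance-level reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```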

G. Model Evaluation with Cross Validation
Before XGBoost starts training on the data, cross-validation is carried out to divide the dataset. Cross-validation is a model validation technique to assess how the model performs against other, independent datasets. This paper uses the K-Fold Cross Validation method with a value of k = 10.
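A minimal sketch of this evaluation with scikit-learn's cross_val_score, assuming X and y are already prepared as above:

```python
# Minimal sketch: 10-fold cross-validation of an XGBoost classifier,
# scored with ROC AUC on each held-out fold.
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

scores = cross_val_score(XGBClassifier(eval_metric="logloss"),
                         X, y, cv=10, scoring="roc_auc")
print("ROC AUC per fold:", scores)
print("Mean ROC AUC:", scores.mean())
```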

H. XGBoost Hyperparameter Optimization
Machine learning models have several kinds of hyperparameters. A hyperparameter is a parameter whose value must be determined by the model user before the training process takes place. Hyperparameter optimization is an important process because the same model may require different hyperparameter values on different data to perform well; however, selecting optimal hyperparameter values is a complex problem and takes a long time when the model has many hyperparameters (Claesen & De Moor, 2015). Hyperparameters differ from parameters: a hyperparameter's value is set or determined before the training or learning process, while a parameter's value is obtained after the training/learning process. The hyperparameters in XGBoost include learning_rate, gamma, max_depth, min_child_weight, max_delta_step, subsample, sampling_method, colsample_bytree, lambda, alpha, and so on. The hyperparameters optimized in this study follow the paper by Martinez-de-Pison et al. (2019) and are as follows.
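As a sketch, a grid search over a few of these hyperparameters can be set up as follows; the grid values are illustrative and not the exact search space used in the study.

```python
# Minimal sketch: exhaustive grid search with 10-fold cross-validation.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 6, 9],
    "min_child_weight": [1, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, cv=10, scoring="roc_auc")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)
```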

A. Dataset Telco Customer Churn
The data used in this study is Customer Churn data from a telecommunications company. The data was obtained from https://www.ibm.com/docs/en/cognos-analytics/11.1.0?topic=samples-telcocustomer-churn (accessed on 25 May 2022), a site that provides datasets for data science and machine learning. The data measures 7,043 x 33 in .csv format: 7,043 rows are observations (individuals) and 33 columns are features that describe the telco customer attributes. Feature names follow the format of the dataset.

The dataset has two churn score values, 0 and 1: 0 means the customer did not leave the service, while 1 indicates the customer left the service.

B. Training XGBoost Model
This test was conducted to see the effect of the Extreme Gradient Boosting (XGBoost) model on performance results. The Telco Customer Churn dataset is partitioned into two parts: training data and testing data. The test results for the XGBoost classification algorithm, according to the test parameters, are as follows. Based on Figure 4 and Figure 5, the XGBoost algorithm achieves an accuracy of 80.18% in predicting the data.

C. Optimization Hyperparameter using Cross Validation
XGBoost has hyperparameters that can be optimized to determine the model's best parameter values, by evaluating the models trained in each training run to see whether they perform well or not. After all parameters in the XGBoost model have been optimized, the initial testing data is evaluated using the optimal parameter values, so that the optimal parameter values are obtained. The table below shows the optimized parameter values; the bolded element in each row is the optimal parameter value.
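A minimal sketch of the randomized variant, which samples a fixed number of parameter combinations instead of exhaustively trying the whole grid; the distributions below are illustrative assumptions.

```python
# Minimal sketch: randomized hyperparameter search with 10-fold CV.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    "learning_rate": uniform(0.01, 0.3),   # sampled from [0.01, 0.31)
    "max_depth": randint(3, 10),           # integers 3..9
    "gamma": uniform(0, 5),
    "subsample": uniform(0.6, 0.4),        # sampled from [0.6, 1.0)
}

search = RandomizedSearchCV(XGBClassifier(eval_metric="logloss"),
                            param_dist, n_iter=50, cv=10,
                            scoring="roc_auc", random_state=42)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
```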

D. Training XGBoost Model Using Cross Validation
The cross-validation used is 10-fold cross-validation, so the data is divided into 10 equal parts: one part is used as test data and the remaining parts as training data. The model training process is repeated 10 times to obtain 10 model evaluation results. The ten evaluation results are averaged and used as a general evaluation measure. The cross-validation results are calculated based on the ROC AUC score.

E. Discussion
The test procedure is used to verify that the system being built runs properly by comparing the results obtained from the system against several parameters. To test the system, a system test was carried out consisting of the learning process and model testing according to the method described above.
From the algorithms that were tested and then combined with the hyperparameter optimization methods, four algorithm tests were obtained, and their accuracy was compared based on the AUC value, as shown in the following table. The test continued with hyperparameter optimization on XGBoost using GridSearchCV and RandomizedSearchCV. In these two optimization experiments, the accuracy results were no better than XGBoost with standard parameters. However, this study also evaluated the optimized XGBoost model for classifying individuals who have left the company. The results showed a significant improvement in the model's ability to accurately identify individuals who have left: out of a total of 467 individuals who left the company, the optimized model correctly identified 390 (84%), compared to only 239 (51%) correctly identified before optimization.
However, this performance improvement came at the expense of the model's ability to accurately classify individuals who have not left the company. Before optimization, the model correctly identified 1,178 individuals (91%) who stayed with the company; after optimization, the number of correctly classified non-leavers decreased to 937 (72%). There was thus a notable decrease in the accuracy of classifying individuals who remained in the company.
In the case of Telco Customer Churn, the focus is on identifying customers who are likely to stop using the company's services. In this context, recall becomes a more crucial metric than accuracy. The main objective is to correctly identify as many potential churners as possible, as missing these customers can have significant consequences for the business. False negatives, where churners are incorrectly classified as non-churners, represent lost opportunities for retention efforts. By prioritizing recall, the company can improve its ability to recognize and address customers who are at risk of churning. On the other hand, prioritizing recall may lead to a higher number of false positives, where non-churners are mistakenly classified as potential churners; this could result in wasted resources and retention strategies targeting the wrong customers. Even so, in Telco Customer Churn cases, a higher recall rate is more desirable, as it helps minimize the number of missed churners and ensures that appropriate actions are taken to retain at-risk customers, ultimately improving customer retention and reducing business losses.
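A minimal sketch of computing recall alongside accuracy for the fitted model, assuming the test split from the earlier sketches; churn (1) is treated as the positive class.

```python
# Minimal sketch: accuracy vs. recall on the held-out test data.
from sklearn.metrics import accuracy_score, recall_score, classification_report

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall (churners found):", recall_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```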

V. CONCLUSION
Based on the results of the research and analysis carried out in this study, the following conclusions are obtained: 1. Using the XGBoost method for data classification yields satisfactory results. The incorporation of cross-validation and parameter optimization techniques (either GridSearchCV or RandomizedSearchCV) can slightly enhance the model's accuracy. However, the increase in accuracy is not always proportional to the complexity of the applied optimization; therefore, efficiency and resource limitations must be considered when selecting the most suitable method for a specific classification scenario.
2. Optimization of hyperparameters with GridSearchCV and RandomizedSearchCV does not increase the accuracy of the XGBoost model on unbalanced datasets.
To improve on these results, several approaches can be considered. Firstly, ensemble methods such as bagging or boosting can be explored to combine multiple XGBoost models. This can harness the diversity of the individual models and potentially lead to improved predictive performance. The trade-off in the performance improvement of the XGBoost model can provide greater benefits for the company in managing its customers and ensuring the company's operational sustainability.
Secondly, conducting thorough feature engineering on the dataset could be beneficial. This involves creating new features, transforming existing ones, or selecting the most relevant features for the churn prediction task. Feature engineering can provide more informative representations of the data and help the model capture underlying patterns more effectively. By exploring and potentially combining these approaches, it is possible to further enhance the accuracy of the XGBoost model in classifying unbalanced data related to telco customer churn problems.

Table 2
Example Data from the Telco Customer Churn Dataset

Table 3
Dataset Composition

Table 4
XGBoost Hyperparameters

The features in the IBM Telco Customer Churn dataset are: CustomerID (id for one customer), Count, Country, State, City, Zip Code, Lat Long, Latitude, Longitude, Gender, Senior Citizen, Partner, Dependents, Tenure Months, Phone Service, Multiple Lines, Internet Service, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies, Contract, Paperless Billing, Payment Method, Monthly Charges, Total Charges, Churn Label, Churn Value, Churn Score, CLTV, and Churn Reason. These features and their data types can be seen in the table below.

Table 5
Features and Data Types in the dataset

Table 9
Test Scores with Cross Validation

Table 10
Test Scores of XGBoost with RandomizedSearchCV

Table 11
Test Scores of XGBoost with GridSearchCV

Table 12
Comparison of Accuracy and Recall based on AUC Score

From Table 12, it can be seen that the accuracy of XGBoost without optimization or cross-validation is 80.18%, while the XGBoost model evaluated with cross-validation obtained a better accuracy of 80.67%.