Development of Credit Scoring Model for Borrowers Using Machine Learning Techniques

Financial organizations such as banks have experienced an increase in demand for loans over the years. These organizations are highly interested in knowing whether a borrower can repay a loan if it is granted. Granting loans to defaulters can cripple the business; hence, financial organizations are compelled to evaluate the creditworthiness of clients using their credit history. Credit scoring is a technique for predicting the probability that a borrower will default. Several techniques, both statistical and machine learning, have been adopted over the years; machine learning techniques have been found to perform better than statistical ones because they solve the challenges faced by credit analysts by automating the processing and extraction of knowledge from data. The objective of this work is to improve upon the Artificial Neural Network machine learning technique by adopting a better feature extraction technique. The methodology adopted in this research uses Rough Set Theory (RST) for relevant and efficient feature selection and a Multi-Layer Perceptron (MLP) Neural Network for classification. To test the models, the Australian and German credit datasets were used in the Anaconda machine learning platform. The results obtained were compared with other machine learning models, namely Support Vector Machine, Random Forest, Decision Tree, Logistic Regression, Naive Bayes, K-Nearest Neighbour and Artificial Neural Network, using standard evaluation metrics to ascertain performance on the two datasets. The results show that this work outperforms all the other models on every metric considered. This research has therefore shown that the model is suitable for credit scoring and has improved performance.


INTRODUCTION
Due to the inadequacy of resources, the concept of credit has become paramount for individuals, families, governments, and businesses in realizing their objectives. Lending agencies assist borrowers in meeting expenses by providing loans to creditworthy borrowers. The decision to grant a loan is critical and is analysed carefully, since it determines whether these lending agencies can survive and make a profit. The subjective approach and the objective approach are the credit-granting methodologies that financial organizations might employ to assess credit applicants' repayment potential (Bougard, 2017). The subjective approach relies on the credit analyst's judgment to analyse a borrower's character, collateral and ability to repay (Pabuccu & Ayan, 2017). Recommendations from the borrower's employer or a previous lender can also be used to decide who is entitled to receive credit. The subjective method is plagued by inconsistent choices and inaccuracies made by different credit analysts for nearly identical applications. It is also beset by high training expenses, since a credit analyst must complete training before being able to approve an applicant. These flaws, the rising demand for loans, and the advancement of computer technology have motivated financial institutions to explore objective methodologies that seek to precisely anticipate the probability of default (Wu & Pan, 2019).
The urgency for a credit scoring model arises from the rapid growth of the industry, which has increased the workload of credit analysts. A borrower's creditworthiness should be ascertained before credit is granted, because failure to repay can be very risky. Some financial institutions continue to use the traditional approach because suitable data for machine learning techniques is unavailable (Moradi & Mokhatab Rafiei, 2019). Experts also found it easier to work with rules they defined themselves than with machine learning models over which they had little or no control. These were the two major reasons why expert systems were preferred over machine learning.
Machine learning techniques have the ability to solve these challenges faced by credit analysts by automating the processing and extraction of knowledge from data. The size of available data has increased tremendously over the years, making it suitable for machine learning techniques to learn from. However, these large datasets contain many noisy, irrelevant or misleading features, which results in high computational cost (İlter, Kocadağlı & Ravishanker, 2019). To reduce the cost of computation, feature extraction has become necessary when dealing with large datasets.
There are several machine learning models for classifying credit applicants, and choosing the right one can enhance credit scoring accuracy significantly (Hilscher & Wilson, 2017). No existing model is without shortcomings; therefore, a combination of models can produce an improved credit score (Nyoni & Matshisela, 2018). To address this issue, six machine learning algorithms were selected. The goal is to apply these machine learning approaches to publicly available credit datasets from Germany and Australia to see which classification methodology produces the best performance under standard evaluation metrics.
The rest of this paper is structured as follows: Section 2 reviews the related literature. Section 3 describes our approach to implementing the credit scoring model. Section 4 describes the implementation procedure. Section 5 presents the results and discussion of our findings, and the final section concludes the study.
Credit scores have been calculated using a variety of methods, incorporating both statistical and machine learning methodologies. Two researchers in 2020 carried out a review of credit scoring analysis using machine learning. They pointed out that the traditional, essentially statistical method of credit scoring has become less popular due to the difficulties faced by credit analysts, and that the current trend is to use machine learning to evaluate an individual's credit score. Their analysis was derived from the related works of previous researchers, and the result showed the efficacy of machine learning over statistical models. The research gap lies in their source of judgment, which was drawn solely from existing works; they did not carry out any experimental analysis to validate their claim (Rudra Kumar & Kumar Gunjan, 2020).
A paper published in 2017 showed the suitability of the Extreme Learning Machine (ELM) as a predictive model for credit scoring. Their results proved that ELM, when compared with the other models used in their research, produced better results. However, only three metrics were used to evaluate model performance. In our research, six performance metrics were employed to investigate the suitability of our credit scoring model for prediction (Bequé & Lessmann, 2017).
In research published in 2018, the authors investigated how commercial banks could predict loan repayment using machine learning techniques. They developed a credit model using KNN classifiers, having discovered from previous research that no perfect model exists. The credit dataset was analysed and the classifier implemented in R. The combination of KNN classifiers produced an accuracy of 75% (Arutjothi & Senthamarai, 2018). The limitation of their research is the low accuracy; improved accuracy is necessary to minimize the credit risk of financial institutions and maximize profit.
In another publication in 2018, the authors carried out a prediction analysis of borrower default using machine learning and deep learning models. The paper highlighted the key components in approving credit to borrowers as feature selection algorithms, classification models, evaluation metrics and credit analysts. They observed that the tree models were more stable than the deep learning models (Addo, Guegan, & Hassani, 2018).
The authors in Zhang, Yang & Zhou (2018) also researched credit rating using three machine learning models for feature selection: C5.0, CHAID and CART. They found that one cause of low accuracy was the presence of redundant attributes in the credit dataset, and proposed a credit model to optimize the Decision Tree model. The results showed C5.0 to be the better option. The setback in their approach was that they used only accuracy for performance evaluation.
Furthermore, previous research has used the Australian and German credit datasets to analyse credit risk (Thanawala, 2019; Pandey, Jagadev, Mohapatra & Dehuri, 2017). These authors focused only on accuracy as the measure of which model did better. Although classification accuracy is widely used because it is easy to understand and compute, it is not a reliable measure when dealing with an imbalanced dataset. In this research, the proposed model's performance was evaluated using various metrics.

RESEARCH METHODS
The architectural diagram for our proposed credit scoring model is presented in Figure 1. The model is divided into three parts: data preparation, feature selection and classification. In the data preparation part, data was collected from an online repository, cleaned of missing values by replacing each missing value with the corresponding mean or median over all instances, and the resulting samples were discretized using a Decision Tree model. Attribute reduction was carried out in the feature selection part using Rough Set Theory (RST), and the result of feature selection was then divided into a training set and a test set. The classification part ascertains the creditworthiness of a borrower using a Multilayer Perceptron (MLP) Neural Network.

The datasets used in this paper are the German and Australian credit datasets retrieved from the UCI Machine Learning Repository (https://archive.ics.uci.edu/), which have been used in most of the available scientific papers on credit score prediction. The major features of the datasets are listed in Table 1.

Once the data was collected, the datasets were queried for missing values. There were 37 instances in the Australian sample with one or more missing values. The replacement method substituted each missing value with the corresponding mean or median over all instances: the median for nominal attributes and the mean for numerical ones. The numerical variables in the credit datasets were then transformed into nominal variables by discretization, while the nominal variables were represented in numerical code to aid interpretability of the features. Normalization was then performed to obtain a range of values between 0 and 1.
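The missing-value replacement and normalization steps described above can be sketched in Python. This is a minimal illustration of the stated rule (median for nominal attributes, mean for numerical ones, then min-max scaling into [0, 1]); the attribute values shown are hypothetical.

```python
from statistics import mean, median

def impute(column, is_nominal):
    """Replace None entries: median for nominal codes, mean for numeric
    values, following the replacement rule described in the text."""
    present = [v for v in column if v is not None]
    fill = median(present) if is_nominal else mean(present)
    return [fill if v is None else v for v in column]

def min_max_normalize(column):
    """Scale a numeric column into the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: map everything to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical numeric attribute (e.g. loan duration in months) with a gap:
duration = impute([6, 12, None, 24, 48], is_nominal=False)
scaled = min_max_normalize(duration)
print(scaled)
```

After imputation the gap holds the column mean (22.5 here), and normalization maps the smallest and largest values to 0.0 and 1.0 respectively.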
A large number of attributes in a dataset may be unnecessary, and the computing time for classification increases if these redundant attributes are not deleted. As a result, feature selection is used to choose the crucial features required when creating models. In this research, RST was used to perform feature selection. Although there are numerous concepts in RST (Skowron & Dutta, 2018), this work relies on the indiscernibility relation: two objects are indiscernible with respect to an attribute subset P ⊆ C when they take the same value on every attribute in P,

IND(P) = {(x, y) ∈ U × U : ∀a ∈ P, a(x) = a(y)}    (1)

The feature-ranking threshold is taken as the midpoint of the ranking vector,

τ = (r_max + r_min) / 2    (2)

where r_max and r_min respectively denote the ranking vector's maximum and minimum values. The goal of this technique is to provide a measure independent of the actual distribution of the priority values and to approximate the selection of features whose priority values are larger than the median value. The result is used to form a reduct, a minimal subset of attributes providing the same equivalence-class structure as the original dataset; a dataset may have multiple reducts. A set of features R ⊆ C is called a reduct of C if it preserves that structure and no proper subset of R does. The set of all attributes indispensable in C, the core, is given by

CORE(C) = ⋂ RED(C)    (3)

where RED(C) is the set of all reducts of C.

Attribute reduction pseudocode for RST
Input: C — the set of all conditional attributes; D — the set of decision attributes; t — total number of instances
Process:
1. RED(C) ← {}; P ← {}
2. For each c ∈ C, get the values of c
3. Compute the ranking value of each c
4. cmax ← maximum ranking value over C
5. cmin ← minimum ranking value over C
6. P ← all features with ranking cmax
7. C ← C − P
8. Check whether indiscernibility in C does not exist using equation (1)
9. Set a new threshold using equation (2) and remove all features above the threshold
10. Add the removed features to P
11. Repeat steps (7) and (8) until indiscernibility exists
12. R ← P
13. Continue until all minimal reducts are found
14. Use equation (3) to get the CORE
15. Return CORE
Output: CORE — the feature subset

Data Split. Before classification, the pre-processed dataset was divided into two sections, a training dataset and a test dataset, in a 70:30 ratio. The training dataset was used for training the classifier, whereas the test dataset was used to assess classification accuracy.
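The rough-set ideas the pseudocode builds on can be sketched in Python. This is a minimal illustration of indiscernibility classes, decision-preserving reducts and their intersection (the core, equation (3)) on a hypothetical four-row credit table; it is not the paper's threshold-guided procedure, and the exhaustive subset enumeration shown is feasible only for tiny attribute sets.

```python
from itertools import combinations

def partitions(rows, attrs):
    """Group row indices into equivalence classes of the indiscernibility
    relation IND(attrs): rows are indiscernible when they agree on attrs."""
    classes = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, []).append(i)
    return list(classes.values())

def preserves_decision(rows, decisions, attrs):
    """True when every equivalence class of IND(attrs) is pure with respect
    to the decision attribute, i.e. attrs discern as well as the full set."""
    return all(len({decisions[i] for i in cls}) == 1
               for cls in partitions(rows, attrs))

def reducts(rows, decisions, all_attrs):
    """Enumerate minimum-size attribute subsets that preserve the decision
    (brute force, for illustration only)."""
    found = []
    for r in range(1, len(all_attrs) + 1):
        for subset in combinations(all_attrs, r):
            if preserves_decision(rows, decisions, subset):
                found.append(subset)
        if found:
            break  # stop at the smallest reduct size for this sketch
    return found

# Toy credit table: attrs 0=checking status, 1=savings, 2=employment (coded)
rows      = [(0, 1, 0), (0, 0, 0), (1, 1, 1), (1, 0, 1)]
decisions = ['good', 'bad', 'good', 'bad']
core = set.intersection(*(set(r) for r in reducts(rows, decisions, (0, 1, 2))))
print(core)
```

In this toy table only the savings attribute (index 1) separates good from bad borrowers, so it forms the single reduct and hence the core.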
Model Selection. The multi-layer perceptron (MLP) model trained with back-propagation was employed in this study. Let x = (x_1, x_2, …, x_n) be the set of inputs and h = (h_1, …, h_m) the set of nodes in a hidden layer. An MLP can have more than one hidden layer (Bekesiene, Smaliukiene, & Vaicaitiene, 2021); in this research, two hidden layers were used, and the derivation is expressed in equations (4) to (6).
Equations (4) and (5) show how each node value is calculated from the set of inputs x with their corresponding arbitrary weights; the result (n_{k,t}) is passed into the activation function. The activation function used is the sigmoid function.
Equation (6) is the result of the calculation over all the nodes in the input and hidden layers, where b_0 is the bias.
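The forward pass described by equations (4) to (6) can be sketched as follows: a weighted sum per node plus a bias, passed through the sigmoid, repeated for two hidden layers and one output node. The layer sizes and weights are arbitrary illustrative values, and back-propagation training is omitted.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: n_{k,t} = sum_j w_{kj} * x_j + b_k,
    passed through the sigmoid activation (equations (4)-(5))."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def mlp_forward(x, params):
    """Forward pass through two hidden layers and one output node,
    mirroring the architecture described above."""
    h = x
    for weights, biases in params:
        h = layer(h, weights, biases)
    return h

random.seed(0)
n_in, n_h1, n_h2 = 7, 5, 3          # e.g. 7 RST-selected Australian features
sizes = [(n_h1, n_in), (n_h2, n_h1), (1, n_h2)]
params = [([[random.uniform(-1, 1) for _ in range(fan_in)] for _ in range(n)],
           [0.0] * n) for n, fan_in in sizes]
score = mlp_forward([0.5] * n_in, params)[0]
print(score)  # sigmoid output: a probability-like credit score in (0, 1)
```

Because the output node is also sigmoid-activated, the network emits a value in (0, 1) that can be thresholded into a good/bad decision.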
Model Performance Evaluation. Accuracy, Precision, Recall, F1 score, Type I error and Type II error were the performance evaluation measures employed in this study. Each criterion has its strengths and weaknesses; in this study, a combination of these indicators, rather than a single measure, was found preferable for evaluating the performance of credit scoring models. All of these performance evaluation criteria derive from the confusion matrix, a square table used to present a classification model's results on the test dataset: it shows what the model predicts correctly and what it predicts incorrectly. Table 2 shows the structure of the confusion matrix, where TP is observations classified as bad whose true class is bad, FP is observations classified as bad whose true class is good, TN is observations classified as good whose true class is good, and FN is observations classified as good whose true class is bad.
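The six measures can all be derived from the four confusion-matrix counts, as this short sketch shows; the counts used are hypothetical, and "bad" borrowers are treated as the positive class per the definitions above.

```python
def metrics(tp, fp, tn, fn):
    """Derive the six evaluation measures used in this study from the
    confusion-matrix counts ('bad' borrowers are the positive class)."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy":  (tp + tn) / total,
        "precision": precision,
        "recall":    recall,
        "f1":        2 * precision * recall / (precision + recall),
        "type_I":    fp / (fp + tn),   # good borrowers wrongly rejected
        "type_II":   fn / (fn + tp),   # bad borrowers wrongly accepted
    }

# Hypothetical counts for illustration:
m = metrics(tp=80, fp=10, tn=95, fn=15)
print({k: round(v, 3) for k, v in m.items()})
```

Note that accuracy alone can hide a high Type II error on an imbalanced dataset, which is why the combination of measures is used here.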

RESULT & DISCUSSION
A total of 12 out of 20 conditional attributes met the criteria when feature selection was performed on the German credit dataset. These features are: status of existing checking account; duration in months; credit history; savings account and bonds; installment rate as a percentage of disposable income; job; other installment plans; other debtors/guarantors; property; present residence; age in years; and number of existing credits at this bank. For the Australian dataset, 7 out of 14 attributes were selected: A2, A3, A5, A6, A7, A10 and A14.

Figure 2 presents the confusion matrix of the model on the Australian dataset, for which 90% accuracy, 93% precision, 89% recall and 91% F1 score were obtained. Figure 3 shows the confusion matrix of the model on the German dataset, for which 87% accuracy, 84% precision, 91% recall and 88% F1 score were obtained. To ascertain how well our model performed, it was benchmarked on the Australian and German credit datasets against K-Nearest Neighbour (KNN), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Multi-Layer Perceptron (MLP) Neural Network, Decision Tree (DT) and Support Vector Machine (SVM). The proposed model performed best on all six performance metrics.
The proposed model has an accuracy of 90% and 87% on the Australian and German credit datasets respectively. This represents a significant improvement, of at least 4% in accuracy, precision, F1 score and recall on each dataset, compared with the other machine learning techniques. Furthermore, MLP+RST was compared with the results obtained in previous research by Aithal & Jathanna (2019) and outperformed those models, as shown in Table 4. Our feature extraction methodology also produced higher accuracy when applied to a larger dataset than the novel credit data model of Zhang et al. (2018). It is important to note that the credit history of the borrowers was not selected as an important feature; this might be misleading, because it is important to study a borrower's history before granting a loan. The error results equally reveal that the Type I and Type II errors are the lowest among all the models compared. Table 5 and Table 6 show the performance of the proposed model on the Australian and German datasets respectively, compared with some existing models.

CONCLUSION
MLP with RST was considered in this paper for credit scoring. The model was compared with several other machine learning models on the Australian and German credit datasets. The aim of this study was to improve credit scoring performance while saving computational cost. This has become necessary because the growing size of data makes it suitable for machine learning techniques, yet large datasets contain an abundance of noisy, irrelevant or misleading features, and processing them comes with high computational cost. The tests revealed that the MLP credit scoring model with rough-set features surpassed all the other models in accuracy, precision, recall and F1 score on the two credit datasets. The Australian and German credit scoring models achieved the highest accuracies, at 90% and 87% respectively, using a 70:30 split of training and test data.
MLP with RST is a better method for credit scoring applications, and its suitability was established practically in this work. Unlike previous studies that considered only predictive accuracy (Sharifi et al., 2021; Siregar & Wanto, 2017), this work also considered other performance metrics such as precision, recall and F1 score. We tested MLP on all the features and on the reduced feature set captured by RST. On every performance metric, MLP+RST outperformed the other models considered in this study. The findings also show that MLP, when fed with appropriate data, is a well-established classifier in the credit scoring application area. It is recommended that future research consider the use of other feature selection algorithms.