AUEB Library - Digital Repository

PYXIDA Institutional Repository
and Digital Library

Collections :	AUEB Institutional Repository; School of Business; Department of Management Science and Technology; Postgraduate dissertations

Title :	Default prediction using machine learning methods

Alternative title :	Εκτίμηση πιθανότητας πτώχευσης με τη χρήση μεθόδων μηχανικής μάθησης (Estimating the probability of bankruptcy using machine learning methods)

Creator :	Pavlogeorgatos, Dionysios (Παυλογεωργάτος, Διονύσιος)

Contributor :	Ntzoufras, Ioannis (Supervisor); Karlis, Dimitrios (Examiner); Chatziantoniou, Damianos (Examiner); Athens University of Economics and Business, Department of Management Science and Technology (Degree granting institution)

Type :	Text

Physical description :	97p.

Language :	en

Identifier :	http://www.pyxida.aueb.gr/index.php?op=view_object&object_id=10840

Abstract :	This thesis examines parametric and non-parametric models for distinguishing between good and bad credit applicants. Additionally, the significance of the input variables is assessed with two different approaches (WoE and Random Forest variable importance) in our attempt to find the optimal and most efficient model. Furthermore, we address the problem of imbalanced data with the Synthetic Minority Over-sampling Technique (SMOTE), a widely used algorithm for addressing class imbalance developed by Chawla et al. (2002). The empirical study was conducted using a data set obtained from the Kaggle website that contains information about "Home Credit", an international consumer finance provider founded in 1997 in the Czech Republic. The initial dataset consists of 307,511 observations and 122 different variables, which incorporate information about the invoice of the applicant and the credit decision process as well as information about the applicant. Statistical and machine learning algorithms are employed to generate predictions, and predictive power is evaluated based on the area under the ROC curve (AUC) and other evaluation metrics. The four algorithms used are logistic regression, random forest, LightGBM, and naive Bayes. With an AUC value of approximately 72%, some patterns have been identified that can differentiate between customers who are expected to meet their loan obligations and those who are not. The statistical logistic regression model was found to perform as well as more sophisticated models with a limited number of inputs, regardless of the approach chosen for the variable selection.
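
The abstract names Weight of Evidence (WoE) as one of the two variable-screening approaches but does not define it. As a minimal sketch, assuming the common convention WoE = ln(share of goods / share of bads) per category, with Information Value (IV) as the sum of (share of goods - share of bads) x WoE, the calculation could look like the following. The function names, the `smoothing` parameter, and the toy data are hypothetical illustrations and are not taken from the thesis.

```python
import math
from collections import Counter


def woe_table(categories, defaults, smoothing=0.5):
    """Return {category: (woe, iv_contribution)} for one categorical feature.

    defaults: 1 marks a "bad" applicant (default), 0 a "good" one.
    smoothing: hypothetical additive constant that avoids log(0) when a
    category holds no goods or no bads; set to 0.0 for the raw formula.
    """
    good, bad = Counter(), Counter()
    for cat, is_default in zip(categories, defaults):
        if is_default == 1:
            bad[cat] += 1
        else:
            good[cat] += 1

    cats = sorted(set(good) | set(bad))
    total_good = sum(good.values()) + smoothing * len(cats)
    total_bad = sum(bad.values()) + smoothing * len(cats)

    table = {}
    for cat in cats:
        p_good = (good[cat] + smoothing) / total_good  # share of all goods
        p_bad = (bad[cat] + smoothing) / total_bad     # share of all bads
        woe = math.log(p_good / p_bad)
        table[cat] = (woe, (p_good - p_bad) * woe)
    return table


def information_value(table):
    """Sum the per-category IV contributions into one score per feature."""
    return sum(iv for _, iv in table.values())


# Toy data (invented for illustration): category "A" is mostly good,
# category "B" is mostly bad.
toy_categories = ["A"] * 4 + ["B"] * 4
toy_defaults = [0, 0, 0, 1, 0, 1, 1, 1]
toy_table = woe_table(toy_categories, toy_defaults)
toy_iv = information_value(toy_table)
```

Under this sign convention a positive WoE marks a category with relatively more good applicants, and a higher IV suggests a more discriminating feature. Other sources flip the sign, so the convention is worth checking before comparing figures.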

Keywords :	Machine learning; Credit risk; Probability of default; Variable selection; Evaluation methods

Available from :	2023-11-13 23:19:48

Date of issue :	2023

Date of deposit :	2023-11-13 23:19:48

Rights :	Free access

License :

File: Pavlogeorgatos_2023.pdf

Type: application/pdf
