You work for a bank. You have a labelled dataset that contains information on already granted loan application and whether these applications have been defaulted. You have been asked to train a model to predict default rates for credit applicants.
What should you do?
A. Increase the size of the dataset by collecting additional data.
B. Train a linear regression to predict a credit default risk score.
C. Remove the bias from the data and collect applications that have been declined loans.
D. Match loan applicants with their social profiles to enable feature engineering.
Answer should be D
B is a classic malpractice for a junior data scientist:
https://thestatsgeek.com/2015/01/17/why-shouldnt-i-use-linear-regression-if-my-outcome-is-binary/
Answer should be D
B is a classic malpractice for a junior data scientist:
https://medium.com/analytics-vidhya/insiders-view-on-logistic-regression-and-how-do-we-deploy-regression-model-in-gcp-as-batch-c62a64563210
Answer is B.
You have to work with what data you have. So A & D are incorrect.
C is an optimization, not a solution.
You can use a linear regression model to predict the probability score that a loan will be default (unpaid) using the historical data.
A and B dont make sense.
C doesnt either, ¿What should we do with the applications? 🙂
D looks correct, in order to find new features for the model.