EDA of Lending Club Data - II

In the last post we looked at some initial cleanup of the data. We will start from there by loading the pickled dataframe.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_pickle("/home/ssingh/LendingClubData/Part1.pickle")

Let's first check which columns remain in our dataframe. Since there are still more than 100 variables left, we will initially focus on just the first 25.

In [2]:
print(df.columns)
print(df.columns.shape)

Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership',
       ...
       'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq',
       'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens',
       'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit',
       'total_il_high_credit_limit'],
      dtype='object', length=111)
(111,)
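The printed Index is truncated in the middle. As a small convenience (not needed for anything that follows), we can print every remaining column name like so:

# Print all column names, five per line, so none are elided
cols = df.columns.tolist()
for i in range(0, len(cols), 5):
    print(", ".join(cols[i:i + 5]))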

From the data dictionary, we can see that funded_amnt is the total amount committed to the loan so far, and funded_amnt_inv is the portion funded by investors. It is difficult to imagine a direct correlation between the charged interest rate and the actual funded amount; however, this amount can give us a range of the risk one takes on when investing. Given that the two committed amounts are very similar, we will drop the “funded_amnt” column. The installment column gives us a feel for how much of a burden the loan will be on the borrower; however, it is a direct function of the term and rate of the loan and hence should be dropped from any further analysis. We will also drop “pymnt_plan”, which merely indicates whether a payment plan has been put in place for the loan. The “grade” and “sub_grade” columns are LC-assigned grades for the loan. We can keep these as secondary variables to check the reliability of the models used by LC.

In [3]:
df.drop(['funded_amnt', 'installment', 'pymnt_plan'], axis=1, inplace=True)

In [4]:
df.iloc[:5, 11:19]

Out[4]:
  url                                                desc  purpose             title                    zip_code  addr_state  dti    delinq_2yrs
0 https://www.lendingclub.com/browse/loanDetail....  NaN  debt_consolidation  Debt consolidation       235xx     VA          12.03  0
1 https://www.lendingclub.com/browse/loanDetail....  NaN  credit_card         Credit card refinancing  937xx     CA          14.92  0
2 https://www.lendingclub.com/browse/loanDetail....  NaN  debt_consolidation  Debt consolidation       850xx     AZ          34.81  0
3 https://www.lendingclub.com/browse/loanDetail....  NaN  car                 Car financing            953xx     CA          8.31   1
4 https://www.lendingclub.com/browse/loanDetail....  NaN  debt_consolidation  Debt consolidation       077xx     NJ          25.81  0

For our purposes, we will not be doing any kind of natural language processing; hence, the desc and url variables are of no use to us.

In [5]:
df.drop(['url', 'desc'], axis=1, inplace=True)

Let us check which “purpose” values are typically used when requesting loans. We can view this with a count plot.

In [6]:
sns.set()
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
ax = sns.countplot(x="purpose", data=df, palette="Set2")
ax.set(yscale = "log")
plt.xticks(rotation=90)
plt.show()

We can also look for a correlation between the purpose and the interest rate of the loan using a box plot. We can clearly see this could be useful for building our model!

In [7]:
sns.boxplot(x="purpose", y="int_rate", data=df)
plt.xticks(rotation=90)
plt.show()

Let us also look for correlations between the “employment length”, “rate”, and “status” of loans. Status here, if you remember from the previous post, refers to the risk factor involved with the loan.

In [8]:
sns.set(style="ticks", color_codes=True)
sns.pairplot(df, vars=["int_rate", "emp_length"], hue="loan_status", diag_kind="kde")

Out[8]:
<seaborn.axisgrid.PairGrid at 0x7f2e52116278>

As expected, we find that good loans tend to have longer employment lengths. Interestingly, the interest rate tends to be all over the place for high risk loans. But, if you think about it, that is what we are trying to fix here!

Analyzing the title of loans could be tricky. Again, in the absence of any kind of natural language processing, let us drop this column as well.

The location of borrowers can reveal interesting patterns in the interest rates. The first three digits of the zip code give much more information than the state. However, where the zip code is missing, the state can provide a reasonable approximation. Let's check whether any rows are missing the zip code; if none are, we can simply drop the state information.

In [9]:
df['zip_code'] = df['zip_code'].str.replace('xx','')

In [10]:
df.drop(['title'],1, inplace=True)

In [11]:
df.zip_code.isnull().sum()

Out[11]:
0

In [12]:
df.drop(['addr_state'],1, inplace=True)

The “dti” column in the data dictionary is described as - “A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income”. Based on this description, the debt-to-income ratio is a direct measure of the loan risk.
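It is worth eyeballing that claim before moving on. Here is a minimal sketch; the bin edges are arbitrary choices for illustration, not part of the original analysis:

# Bin dti into coarse buckets and compare the interest rate distributions
dti_bins = pd.cut(df["dti"], bins=[0, 10, 20, 30, 100],
                  labels=["0-10", "10-20", "20-30", "30+"])
sns.boxplot(x=dti_bins, y=df["int_rate"])
plt.show()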

Let's check the effect of delinquencies over the last 2 years on the interest rate using a box plot:

In [13]:
sns.boxplot(x="delinq_2yrs", y="int_rate", data=df)

Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e502c3a58>

We can clearly visualize the effect of delinquencies over the last 2 years. Let us bin this data into three bins - Low, Medium and High. We will then move on to the next set of columns.

In [14]:
df["delinq_2yrs"] = pd.cut(df.delinq_2yrs, bins=3, labels=["Low", "Medium", "High"], include_lowest = True)

In [15]:
df.iloc[:5, 15:23]

Out[15]:
  earliest_cr_line  fico_range_low  fico_range_high  inq_last_6mths  mths_since_last_delinq  mths_since_last_record  open_acc  pub_rec
0 Aug-1994          750             754              0               NaN                     NaN                     6         0
1 Sep-1989          710             714              2               42.0                    NaN                     17        0
2 Aug-2002          685             689              1               NaN                     NaN                     11        0
3 Oct-2000          665             669              0               17.0                    NaN                     8         0
4 Nov-1992          680             684              0               NaN                     NaN                     12        0

The earliest credit line should play an important role in determining the rate. We will replace this column with something more quantitative - credit_age.

In [16]:
now = pd.Timestamp('20160501')
# Age of the credit history in months, approximating a month as 30 days
df["credit_age"] = pd.to_datetime(df.earliest_cr_line, format="%b-%Y")
df['credit_age'] = (now - df['credit_age']).dt.days.divide(30).astype("int64")
df.drop(['earliest_cr_line'], axis=1, inplace=True)
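A quick sanity check on the new column; it should contain only sensible, positive month counts:

print(df["credit_age"].describe())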

Let us try to find a trend between interest rate, fico ranges and loan status.

In [17]:
sns.pairplot(df, vars=["int_rate", "fico_range_low", "fico_range_high"], hue="loan_status", diag_kind="kde")

Out[17]:
<seaborn.axisgrid.PairGrid at 0x7f2e5239cef0>

We find the two FICO scores to be highly collinear. Further, high risk loans are concentrated toward the lower end of the FICO range. We can safely replace the two columns with the mean of the FICO scores.

In [18]:
df['fico'] = 0.5*(df['fico_range_high'] + df['fico_range_low'])
df.drop(['fico_range_high', 'fico_range_low'], axis=1, inplace=True)

Similar to the 2 year delinquency data, let us also look at the 6 month inquiry data. Other columns like mths_since_last_delinq and mths_since_last_record can be safely removed, as they will be correlated with the 2 year delinquency data.
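Before dropping them, a quick look at how sparse these two columns are supports that decision:

# Fraction of missing values in each column
print(df[["mths_since_last_delinq", "mths_since_last_record"]].isnull().mean())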

In [19]:
sns.boxplot(x="inq_last_6mths", y="int_rate", data=df)

Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e50ff59b0>

Let us find correlations between many of these similar variables.

In [20]:
sns.pairplot(df, vars=["int_rate", "pub_rec", "open_acc", "inq_last_6mths"], hue="delinq_2yrs", diag_kind="kde")

Out[20]:
<seaborn.axisgrid.PairGrid at 0x7f2e50dea1d0>

Both open_acc and inq_last_6mths have a strong correlation with delinq_2yrs, and hence can be safely dropped. pub_rec, too, has a distinct shape for each level of delinq_2yrs, showing interdependence, and hence we can drop it as well.
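As a rough numerical companion to the pair plot (delinq_2yrs is categorical after the earlier binning, so a groupby of means per level is the simplest check):

print(df.groupby("delinq_2yrs")[["pub_rec", "open_acc", "inq_last_6mths"]].mean())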

In [21]:
df.drop(['pub_rec', 'open_acc', 'inq_last_6mths',
         'mths_since_last_delinq', 'mths_since_last_record'], axis=1, inplace=True)

We will now move on to the next set of columns.

In [22]:
df.iloc[:5, 15:25]

Out[22]:
  revol_bal  revol_util  total_acc  initial_list_status  out_prncp  out_prncp_inv  total_pymnt  total_pymnt_inv  total_rec_prncp  total_rec_int
0 138008     29%         17         w                    12484.99   12484.99       4364.64      4364.64          2515.01          1849.63
1 6133       31.6%       36         w                    6892.58    6892.58        4163.94      4163.94          3507.42          656.52
2 16822      91.9%       20         f                    0.00       0.00           2281.98      2281.98          704.38           339.61
3 5753       100.9%      13         w                    10868.67   10868.67       4117.57      4117.57          1931.33          2186.24
4 16388      59.4%       44         f                    0.00       0.00           9973.43      9973.43          9600.00          373.43

Revolving balance and revolving utilization are measures of “how leveraged your credit cards are”. revol_util provides a relative measure of leverage, whereas revol_bal provides an absolute one. Before we proceed, we need to convert the ‘%’ strings to fractions.

In [23]:
df.revol_util = df.revol_util.str.replace('%', '').astype(float)
df.revol_util = df.revol_util * 0.01
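A quick describe() is a useful sanity check here; after the conversion, values should sit roughly in [0, 1], with a few above 1 for over-utilized lines (like the 100.9% seen above):

print(df.revol_util.describe())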

In [24]:
g = sns.pairplot(df, vars=["revol_bal", "revol_util", "total_acc"], hue="loan_status", diag_kind="kde")
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=90)

None of these variables shows a direct correlation with the risk level of the loan. Given their direct use in the FICO score calculation, we will keep them in our analysis.

Let us take a look at the initial listing status of the loan and check for a correlation with the risk level.

In [25]:
sns.countplot(x="initial_list_status", hue="loan_status", data=df)

Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e4fe23898>

For high risk loans as well as low risk ones, there does not seem to be any significant difference between the two types of initial listing, and hence we can drop this column.

In [26]:
df.drop(['initial_list_status'], axis=1, inplace=True)

The following variables refer to the current state of the loan and hence have no effect on its general state or risk level at issuance; they should therefore be dropped from our analysis. We will also not consider any joint-application data in this analysis.

In [27]:
df.drop(['out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
         'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries',
         'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt',
         'next_pymnt_d', 'policy_code', 'application_type', 'annual_inc_joint',
         'dti_joint', 'verification_status_joint'], axis=1, inplace=True)

In [28]:
df.iloc[:5, 18:24]

Out[28]:
  last_credit_pull_d  last_fico_range_high  last_fico_range_low  collections_12_mths_ex_med  mths_since_last_major_derog  acc_now_delinq
0 Feb-2016            684                   680                  0                           NaN                          0
1 Feb-2016            679                   675                  0                           59.0                         0
2 Dec-2015            539                   535                  0                           NaN                          0
3 Feb-2016            704                   700                  0                           36.0                         0
4 Feb-2016            684                   680                  0                           NaN                          0

First we need to convert the last credit pull date to a numeric value: months since the last credit pull. Let us check whether there are any NA values.

In [29]:
print("No. of Data with NA values = {}".format(len(df.last_credit_pull_d) - df.last_credit_pull_d.count()))

No. of Data with NA values = 27

We will replace these NA values with the oldest possible date for each account, i.e. the age of the credit history itself (now - credit history start). As a first step, we fill the NAs with a sentinel date far in the past:

In [30]:
df.last_credit_pull_d.fillna("Jan-1980", inplace=True)

In [31]:
df["last_credit_pull_d"] = pd.to_datetime(df.last_credit_pull_d, format="%b-%Y")
df['last_credit_pull_d'] = (now - df['last_credit_pull_d']).dt.days.divide(30).astype("int64")
df[df['last_credit_pull_d'] >= 7000].last_credit_pull_d = df[df['last_credit_pull_d'] >= 7000].credit_age

Let us compare the last fico scores to the overall fico score.

In [32]:
sns.pairplot(df, vars=["last_fico_range_high", "last_fico_range_low", "fico"], hue="loan_status")

Out[32]:
<seaborn.axisgrid.PairGrid at 0x7f2e5004d828>

As before, the last fico high and low scores are correlated with each other, and also with the overall fico score, and hence we can get rid of them.

In [33]:
df.drop(['last_fico_range_high', 'last_fico_range_low'], axis=1, inplace=True)

We can also get rid of the “collections_12_mths_ex_med” column, as it corresponds only to the current state of the loan. The other two variables, “mths_since_last_major_derog” and “acc_now_delinq”, should have no additional impact beyond the ones already considered.

In [34]:
df.drop(['collections_12_mths_ex_med', 'mths_since_last_major_derog',
         'acc_now_delinq'], axis=1, inplace=True)

In [35]:
df.iloc[:5, 19:29]

Out[35]:
  tot_coll_amt  tot_cur_bal  open_acc_6m  open_il_6m  open_il_12m  open_il_24m  mths_since_rcnt_il  total_bal_il  il_util  open_rv_12m
0 0             149140       NaN          NaN         NaN          NaN          NaN                 NaN           NaN      NaN
1 0             162110       NaN          NaN         NaN          NaN          NaN                 NaN           NaN      NaN
2 0             64426        NaN          NaN         NaN          NaN          NaN                 NaN           NaN      NaN
3 0             261815       NaN          NaN         NaN          NaN          NaN                 NaN           NaN      NaN
4 0             38566        NaN          NaN         NaN          NaN          NaN                 NaN           NaN      NaN

Again, we can go ahead and delete all the columns that relate only to the current state of the loan, including the ones with a large amount of missing data.

In [36]:
df.drop(['tot_coll_amt', 'open_acc_6m', 'tot_cur_bal', 'open_il_6m',
         'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il',
         'il_util', 'open_rv_12m'], axis=1, inplace=True)

In [37]:
df.iloc[:5, 19:29]

Out[37]:
  open_rv_24m  max_bal_bc  all_util  total_rev_hi_lim  inq_fi  total_cu_tl  inq_last_12m  acc_open_past_24mths  avg_cur_bal  bc_open_to_buy
0 NaN          NaN         NaN       184500            NaN     NaN          NaN           5                     29828.0      9525.0
1 NaN          NaN         NaN       19400             NaN     NaN          NaN           7                     9536.0       7599.0
2 NaN          NaN         NaN       18300             NaN     NaN          NaN           6                     5857.0       332.0
3 NaN          NaN         NaN       5700              NaN     NaN          NaN           2                     32727.0      0.0
4 NaN          NaN         NaN       27600             NaN     NaN          NaN           8                     3214.0       6494.0

Out of these variables, only “avg_cur_bal” is a viable additional feature for our model. We will also look at the distribution of the average current balance. However, in order to use it correctly, we need to check whether it has any NA values and replace them appropriately.

In [38]:
df.drop(['open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl',
         'inq_last_12m', 'acc_open_past_24mths', 'bc_open_to_buy',
         'total_rev_hi_lim'], axis=1, inplace=True)

In [39]:
print("No. of Data with NA values = {}".format(len(df.avg_cur_bal) - df.avg_cur_bal.count()))

No. of Data with NA values = 6

In [40]:
df.avg_cur_bal.fillna(df.avg_cur_bal.min(), inplace=True)

In [41]:
g = sns.pairplot(df, vars=["avg_cur_bal", "int_rate"], hue="loan_status")
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=90)

In [42]:
df.iloc[:5, 20:28]

Out[42]:
  bc_util  chargeoff_within_12_mths  delinq_amnt  mo_sin_old_il_acct  mo_sin_old_rev_tl_op  mo_sin_rcnt_rev_tl_op  mo_sin_rcnt_tl  mort_acc
0 4.7      0                         0            103.0               244                   1                      1               0
1 41.5     0                         0            76.0                290                   1                      1               1
2 93.2     0                         0            137.0               148                   8                      8               0
3 103.2    0                         0            16.0                170                   21                     16              5
4 69.2     0                         0            183.0               265                   23                     3               0

As before, we can get rid of the variables that will not make a significant impact on our model, and then look at the pair-wise effects of the rest. We will also replace the NAs in bc_util, this time with the minimum value.

In [43]:
df.drop(['chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
         'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl'],
        axis=1, inplace=True)

In [44]:
df.bc_util.fillna(df.bc_util.min(), inplace=True)

In [45]:
sns.pairplot(df, vars=["bc_util", "int_rate"], hue="loan_status")

Out[45]:
<seaborn.axisgrid.PairGrid at 0x7f2e4e1ab550>

In [46]:
df.iloc[:5, 21:27]

Out[46]:
  mort_acc  mths_since_recent_bc  mths_since_recent_bc_dlq  mths_since_recent_inq  mths_since_recent_revol_delinq  num_accts_ever_120_pd
0 0         47.0                  NaN                       NaN                    NaN                             0
1 1         5.0                   42.0                      1.0                    42.0                            4
2 0         17.0                  NaN                       3.0                    NaN                             0
3 5         21.0                  17.0                      1.0                    17.0                            1
4 0         24.0                  NaN                       17.0                   NaN                             0

In this list, the only variable of interest to us is the number of mortgage accounts, mort_acc.

In [47]:
sns.pairplot(df, vars=["mort_acc", "int_rate"], hue="loan_status")

Out[47]:
<seaborn.axisgrid.PairGrid at 0x7f2e4dd0ed30>

In [48]:
df.drop(['mths_since_recent_bc', 'mths_since_recent_bc_dlq',
         'mths_since_recent_inq', 'mths_since_recent_revol_delinq',
         'num_accts_ever_120_pd'], axis=1, inplace=True)

In [49]:
df.iloc[:5, 22:31]

Out[49]:
  num_actv_bc_tl  num_actv_rev_tl  num_bc_sats  num_bc_tl  num_il_tl  num_op_rev_tl  num_rev_accts  num_rev_tl_bal_gt_0  num_sats
0 1               4                1            2          8          5              9              4                    6
1 6               9                7            18         2          14             32             9                    17
2 1               4                1            4          12         4              8              4                    11
3 3               5                3            5          1          5              7              5                    8
4 4               7                5            16         17         8              26             7                    12

All of these variables relate to some kind of account count. Let's take a look at their interdependence.

In [50]:
sns.pairplot(df, vars=["num_actv_bc_tl", "num_actv_rev_tl", "num_bc_sats", "num_bc_tl",
                       "num_op_rev_tl"], hue="loan_status")

Out[50]:
<seaborn.axisgrid.PairGrid at 0x7f2e4de95208>