EDA of Lending Club Data

We will first look at various aspects of the LendingClub data using techniques of Exploratory Data Analysis (EDA). Please look at my past post for finding further details on EDA techniques. Different data files for this analysis have already been downloaded in the current folder.

In [4]:
import warnings
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Let’s also take a quick look at the data via shell scripts (file size, head, line count, column count).

In [2]:
!du -h /home/ssingh/LendingClubData/LoanStats3c_securev1.csv
#!tail -3 /home/ssingh/LendingClubData/LoanStats3c_securev1.csv
#!head -3 /home/ssingh/LendingClubData/LoanStats3c_securev1.csv

181M   /home/ssingh/LendingClubData/LoanStats3c_securev1.csv

Examining the data we see that most of feature names are intuitive. We can get the specifics from the provided data dictionary.

In [3]:
!wc -l < /home/ssingh/LendingClubData/LoanStats3c_securev1.csv
!head -2 /home/ssingh/LendingClubData/LoanStats3c_securev1.csv | sed 's/[^,]//g' | wc -c


Based on the above analysis, we find that we have a total of 235633-2-2 = 235659 rows and 116 - 1 = 115 columns of data! Let us first look at the detailed description of columns from the dictionary of the data.

In [5]:
df = pd.read_csv("/home/ssingh/LendingClubData/LoanStatsDict.csv", sep=",", engine='c', encoding = "ISO-8859-1", na_filter=False)
df = df.ix[1:,0:2]
from IPython.display import HTML


1acc_open_past_24mthsNumber of trades opened in past 24 months.
2addr_stateThe state provided by the borrower in the loan…
3all_utilBalance to credit limit on all trades
4annual_incThe self-reported annual income provided by th…
5annual_inc_jointThe combined self-reported annual income provi…
6application_typeIndicates whether the loan is an individual ap…
7avg_cur_balAverage current balance of all accounts
8bc_open_to_buyTotal open to buy on revolving bankcards.
9bc_utilRatio of total current balance to high credit/…
10chargeoff_within_12_mthsNumber of charge-offs within 12 months
11collection_recovery_feepost charge off collection fee
12collections_12_mths_ex_medNumber of collections in 12 months excluding m…
13delinq_2yrsThe number of 30+ days past-due incidences of …
14delinq_amntThe past-due amount owed for the accounts on w…
15descLoan description provided by the borrower
16dtiA ratio calculated using the borrowerÕs total …
17dti_jointA ratio calculated using the co-borrowers’ tot…
18earliest_cr_lineThe month the borrower’s earliest reported cre…
19emp_lengthEmployment length in years. Possible values ar…
20emp_titleThe job title supplied by the Borrower when ap…
21fico_range_highThe upper boundary range the borrowerÕs FICO a…
22fico_range_lowThe lower boundary range the borrowerÕs FICO a…
23funded_amntThe total amount committed to that loan at tha…
24funded_amnt_invThe total amount committed by investors for th…
25gradeLC assigned loan grade
26home_ownershipThe home ownership status provided by the borr…
27idA unique LC assigned ID for the loan listing.
28il_utilRatio of total current balance to high credit/…
29initial_list_statusThe initial listing status of the loan. Possib…
30inq_fiNumber of personal finance inquiries
31inq_last_12mNumber of credit inquiries in past 12 months
32inq_last_6mthsThe number of inquiries in past 6 months (excl…
33installmentThe monthly payment owed by the borrower if th…
34int_rateInterest Rate on the loan
35issue_dThe month which the loan was funded
36last_credit_pull_dThe most recent month LC pulled credit for thi…
37last_fico_range_highThe upper boundary range the borrowerÕs last F…
38last_fico_range_lowThe lower boundary range the borrowerÕs last F…
39last_pymnt_amntLast total payment amount received
40last_pymnt_dLast month payment was received
41loan_amntThe listed amount of the loan applied for by t…
42loan_statusCurrent status of the loan
43max_bal_bcMaximum current balance owed on all revolving …
44member_idA unique LC assigned Id for the borrower member.
45mo_sin_old_il_acctMonths since oldest bank installment account o…
46mo_sin_old_rev_tl_opMonths since oldest revolving account opened
47mo_sin_rcnt_rev_tl_opMonths since most recent revolving account opened
48mo_sin_rcnt_tlMonths since most recent account opened
49mort_accNumber of mortgage accounts.
50mths_since_last_delinqThe number of months since the borrower’s last…
51mths_since_last_major_derogMonths since most recent 90-day or worse rating
52mths_since_last_recordThe number of months since the last public rec…
53mths_since_rcnt_ilMonths since most recent installment accounts …
54mths_since_recent_bcMonths since most recent bankcard account opened.
55mths_since_recent_bc_dlqMonths since most recent bankcard delinquency
56mths_since_recent_inqMonths since most recent inquiry.
57mths_since_recent_revol_delinqMonths since most recent revolving delinquency.
58next_pymnt_dNext scheduled payment date
59num_accts_ever_120_pdNumber of accounts ever 120 or more days past due
60num_actv_bc_tlNumber of currently active bankcard accounts
61num_actv_rev_tlNumber of currently active revolving trades
62num_bc_satsNumber of satisfactory bankcard accounts
63num_bc_tlNumber of bankcard accounts
64num_il_tlNumber of installment accounts
65num_op_rev_tlNumber of open revolving accounts
66num_rev_acctsNumber of revolving accounts
67num_rev_tl_bal_gt_0Number of revolving trades with balance >0
68num_satsNumber of satisfactory accounts
69num_tl_120dpd_2mNumber of accounts currently 120 days past due…
70num_tl_30dpdNumber of accounts currently 30 days past due …
71num_tl_90g_dpd_24mNumber of accounts 90 or more days past due in…
72num_tl_op_past_12mNumber of accounts opened in past 12 months
73open_accThe number of open credit lines in the borrowe…
74open_acc_6mNumber of open trades in last 6 months
75open_il_12mNumber of installment accounts opened in past …
76open_il_24mNumber of installment accounts opened in past …
77open_il_6mNumber of currently active installment trades
78open_rv_12mNumber of revolving trades opened in past 12 m…
79open_rv_24mNumber of revolving trades opened in past 24 m…
80out_prncpRemaining outstanding principal for total amou…
81out_prncp_invRemaining outstanding principal for portion of…
82pct_tl_nvr_dlqPercent of trades never delinquent
83percent_bc_gt_75Percentage of all bankcard accounts > 75% of l…
84policy_codepublicly available policy_code=1 new products …
85pub_recNumber of derogatory public records
86pub_rec_bankruptciesNumber of public record bankruptcies
87purposeA category provided by the borrower for the lo…
88pymnt_planIndicates if a payment plan has been put in pl…
89recoveriespost charge off gross recovery
90revol_balTotal credit revolving balance
91revol_utilRevolving line utilization rate, or the amount…
92sub_gradeLC assigned loan subgrade
93tax_liensNumber of tax liens
94termThe number of payments on the loan. Values are…
95titleThe loan title provided by the borrower
96tot_coll_amtTotal collection amounts ever owed
97tot_cur_balTotal current balance of all accounts
98tot_hi_cred_limTotal high credit/credit limit
99total_accThe total number of credit lines currently in …
100total_bal_ex_mortTotal credit balance excluding mortgage
101total_bal_ilTotal current balance of all installment accounts
102total_bc_limitTotal bankcard high credit/credit limit
103total_cu_tlNumber of finance trades
104total_il_high_credit_limitTotal installment high credit/credit limit
105total_pymntPayments received to date for total amount funded
106total_pymnt_invPayments received to date for portion of total…
107total_rec_intInterest received to date
108total_rec_late_feeLate fees received to date
109total_rec_prncpPrincipal received to date
110total_rev_hi_lim ÊTotal revolving high credit/credit limit
111urlURL for the LC page with listing data.
112verification_statusIndicates if income was verified by LC, not ve…
113verified_status_jointIndicates if the co-borrowers’ joint income wa…
114zip_codeThe first 3 numbers of the zip code provided b…

Lets choose some of the most important variables from these.

The Response Variable:

  • Interest Rate (int_rate)

And some of possible important factors are:

  1. Annual Income (annual_inc)
  2. State (addr_state)
  3. Purpose (purpose)
  4. Description for Loan (desc)
  5. Amount Requested (loan_amount)
  6. Amount Funded (funded_amnt)
  7. Loan Length (term)
  8. Debt Income Ratio (dti)
  9. Home Ownership status (home_ownership)
  10. FICO high (fico_range_high)
  11. FICO low (fico_range_low)
  12. Last FICO low (last_fico_range_low)
  13. Last FICO high (last_fico_range_high)
  14. Average current balance (avg_cur_bal)
  15. Charge Offs in last Year (chargeoff_within_12_mths)
  16. Number of 30+ days past-due incidences (delinq_2yrs)
  17. Employment Length (emp_length)
  18. No. of Credit Inquiries (inq_last_6mths)
  19. Maximum current balance owed on all revolving (max_bal_bc)
  20. Total credit revolving balance (revol_bal)
  21. LC Verification status (verification_status)
  22. Revolving line utilization rate (revol_util)
  23. Percentage of account never delinquent (pct_tl_nvr_dlq)
  24. Months since most recent 90-day or worse rating (mths_since_last_major_derog)
  25. Total Credit Balance (total_bal_ex_mort)

We will first look at effects of some of these variables using EDA.

Later, if we find any need to use some additional variables, we will revisit this list.

First, lets load our data as a Pandas data frame:

In [5]:
df = pd.read_csv("/home/ssingh/LendingClubData/LoanStats3c_securev1.csv", skiprows=1, skipfooter=2)
df.info(verbose = False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235629 entries, 0 to 235628
Columns: 115 entries, id to total_il_high_credit_limit
dtypes: float64(44), int64(47), object(24)
memory usage: 206.7+ MB

In [6]:

0380981144086082715000150001500060 months12.39%336.64CC1...04100.00.0001965001491401000012000
1368055483955826410400104001040036 months6.99%321.08AA3...0483.314.300179407150301300011325
2376622244042532176507650765036 months13.66%260.20CC3...02100.0100.0008233164426490064031

3 rows × 115 columns

Lets us remove columns/data that is redundant for prediction of interest rates. For example, id, member_id for sure are of no importance to us. Let’s work through the columns in batches to keep the cognitive burden low:

In [7]:
# .ix[row slice, column slice] 

0380981144086082715000150001500060 months12.39%
1368055483955826410400104001040036 months6.99%
2376622244042532176507650765036 months13.66%
3376123544037547312800128001280060 months17.14%
4378221874058525196009600960036 months13.66%

We won’t need id or member_id as it has no real predictive power so we can drop them from this table int_rate was loaded as an object data type instead of float due to the ‘%’ character. Let’s strip that out and convert the column type. And Also do similar transformation for term variable to get rid of months.

In [8]:
df.drop(['id','member_id'], 1, inplace=True)
df.int_rate = pd.Series(df.int_rate).str.replace('%', '').astype(float)
df['term'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
df['term'] = df['term'].convert_objects(convert_numeric=True)

Moving on to next columns:

In [9]:

0336.64CC1MANAGEMENT10+ yearsRENT78000.0Source VerifiedDec-2014Current
1321.08AA3Truck Driver Delivery Personel8 yearsMORTGAGE58000.0Not VerifiedDec-2014Current
2260.20CC3Technical Specialist< 1 yearRENT50000.0Source VerifiedDec-2014Charged Off
3319.08DD4Senior Sales Professional10+ yearsMORTGAGE125000.0VerifiedDec-2014Current
4326.53CC3Admin Specialist10+ yearsRENT69000.0Source VerifiedDec-2014Fully Paid

At first, it seems we employment title should be important. Let us first have a look at how many unique values we have for these. We would like to convert emp_length into an integer variable.

In [10]:

Teacher             4569
Manager             3772
Registered Nurse    1960
RN                  1816
Supervisor          1663
Name: emp_title, dtype: int64
Care Aid                         1
Deputy Probation                 1
Front office staff               1
factor                           1
Independent Contractor/Driver    1
Name: emp_title, dtype: int64

This is just too many. Unless, we do some semantics based grouping of these titles, we would not be able to get any meaningful data out of this. If you think harder, this should be highly correlated with income. Just for the purpose of loan, it is highly unlikely (not impossible though!) that a high paying job title would be different than another!

In [11]:
df.drop(['emp_title'], 1, inplace=True)

Let first look at unique value of emp_legth variable.

In [12]:

10+ years    79505
2 years      20487
3 years      18267
< 1 year     17982
1 year       14593
4 years      13528
7 years      13099
5 years      13051
n/a          12019
8 years      11853
6 years      11821
9 years       9424
Name: emp_length, dtype: int64

Let us make this variable simple integers. We will replace na entries with 0, and all non numeric entries.

In [13]:
df.replace('n/a', np.nan, inplace=True)
df['emp_length'].replace(to_replace='^<', value=0.0, inplace=True, regex=True)
df['emp_length'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
df['emp_length'] = df['emp_length'].astype(int)

Lets us look at unique emp_length entries again:

In [14]:

10    79505
0     30001
2     20487
3     18267
1     14593
4     13528
7     13099
5     13051
8     11853
6     11821
9      9424
Name: emp_length, dtype: int64

We should convert verification status into three ordinal values. Lets see what are different possible values of the verification status. We will convert these to ordinals accordingly.

In [15]:

Source Verified    97741
Not Verified       70659
Verified           67229
Name: verification_status, dtype: int64

We will assign the lowest rating to Not Verified and the highest rating to Verified.

In [16]:
df["verification_status"] = df["verification_status"].astype('category')
df["verification_status"] = df["verification_status"].cat.set_categories(["Not Verified", "Source Verified", "Verified"], ordered = True)

Loan status and issue dates are not of any interest to us, as we only plan to use this data for making models that can predict interest rate. However, let us do some digging into this data to see some statistics of loans on Lending Club! Let us look at a histogram of different states of loans.

In [23]:
import seaborn as sns
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
total = float(len(df.index))
ax = sns.countplot(x="loan_status", data=df, palette="Set2");
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height+18, '%2.2f'%(height*100/total)+"%")

We can also look at a histogram of number of loans issued based on the month of the year. Just for fun, we will also look for any correlation between the issue month of loans and their states.

In [18]:
df['issue_d'].replace(to_replace='[^A-Z,a-z]+', value='', inplace=True, regex=True)
ax = sns.countplot(x="issue_d", data=df, palette="Set2");
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height+18, '%2.1f'%(height*100/total))

First, we find that more loans are issued during the holidays seasons: Oct and July! Nothing surprising there.

In [19]:
g = sns.factorplot("loan_status", col="issue_d", col_wrap=4, data=df,kind="count", aspect=1.25)
(g.set_axis_labels("", "")

<seaborn.axisgrid.FacetGrid at 0x7fe6351a72b0>

We do not find any obvious correlation between the issue date and their states at this time. Not interesting. Lets drop both issue date from our further analysis. Moving on to next set of columns.

We will also divide loan status into two categories: Low, and High risks.

In [20]:
defaultList = ["Default", "Charged Off", "Late (31-120 days)", "Late (16-30 days)", "In Grace Period"]
df.loc[df.loan_status.isin(defaultList), "loan_status"] = "High"
goodList = ["Current", "Fully Paid"]
df.loc[df.loan_status.isin(goodList), "loan_status"] = "Low"

In [21]:
df.drop(['issue_d'],1, inplace=True)

Let us stop here for this post. We will continue our EDA analysis in the next. We will save our pandas object as pickle and then catch up from there.

In [22]:

comments powered by Disqus