Many people in the p2p lending space, myself included, like to use the big data provided by the lending platforms to analyze trends, to better understand who our borrowers are and how they behave, and to limit our risk in the p2p lending marketplace. Because many of the early retail investors are or were tech oriented, given the online nature of the lending platforms, there may be a misunderstanding that you need to understand and manipulate big data in order to be a successful investor. This is not true, not even a little bit, if you are willing to educate yourself on credit and finance techniques in order to make better investment decisions in the p2p lending arena.
Big data cannot do everything. Here are 4 important factors you need to be aware of about big data in p2p lending.
Factor 1: Big Data Only Looks at Past Performance
Anyone who has ever been sold (or attempted to sell) a mutual fund, stock, bond or other investment by a broker has heard or seen the phrase 'Past performance is not indicative of future results'. And it's true, it isn't. The historical information we get on the loan listing, like pay history, FICO credit score and derogatories, tells us something about how our borrower has behaved in the past. Are they someone who has a history of paying their bills on time? It doesn't mean they will pay our loan on time, but it does increase the likelihood, especially if no major changes take place, like a job loss or major illness. No one knows what will happen in the future, so it's important to remember that this information is not predictive; it's reflective of the past.
Factor 2: The Pool of Loans is Small and Recent
Using Lending Club as an example, it took from its inception in 2006 to November of 2012 to issue its first $1 billion in loans. By May of 2013 (6 months later), it had issued its 2nd billion. By November 2013, 1 year after the first billion, it had issued its 3rd billion, for a total of $3 billion in loans issued. This means another $1 billion of loans was issued between May and November of 2013. All of this is found in the Lending Club monthly volume analysis on Lend Academy. What this means is that 2/3 of the total loan volume is 1 year old or less. Some would say there is no meaningful data we can gather at all, but many would agree that the most meaningful thing we can gather from all of this historical information is the likelihood of an early payment default. The pool of loans with long term data we can really use is very small, if it is available at all.
The biggest enemy to our loan portfolio is the early payment default, a default within the first 6 months. We now have data on $2 billion worth of loans that are at least 6 months old to see who has paid late and who has defaulted early on in the term. Much beyond this, it is tough to get data that is considered meaningful, given how recently most of the loans were originated.
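The volume arithmetic above can be checked with a few lines of Python. This is only a sketch: the three milestone figures are the ones quoted in the text, while the function and variable names are made up for illustration.

```python
# Cumulative Lending Club issuance milestones quoted in the text (USD billions).
milestones = [
    ("2012-11", 1.0),
    ("2013-05", 2.0),
    ("2013-11", 3.0),
]

def issued_by(month):
    """Cumulative volume at the latest milestone on or before `month` (YYYY-MM)."""
    total = 0.0
    for milestone_month, cumulative in milestones:
        if milestone_month <= month:
            total = cumulative
    return total

total_issued = issued_by("2013-11")
older_than_one_year = issued_by("2012-11")
at_least_six_months_old = issued_by("2013-05")

# Share of all volume that was issued within the last 12 months.
share_one_year_or_less = (total_issued - older_than_one_year) / total_issued

print(f"Share of volume 1 year old or less: {share_one_year_or_less:.0%}")
print(f"Volume at least 6 months old: ${at_least_six_months_old:.0f} billion")
```

Note that these are shares of dollar volume, not of the number of loans, since the milestones are stated in dollars.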
Factor 3: Big Data Isolates Factors That Are Dependent on Each Other
The availability of data is so great that you can research Debt Consolidation loans for CA borrowers with 750 FICO scores that were originated on a Tuesday. One of the supposed benefits of big data is that you can use it to research every meaningful or potentially meaningful variable. Some of the variables we have discussed here include FICO score, derogatories, public filings, state, homeownership and years of employment, and the list goes on. It's good and it's bad. The good is pretty obvious: we can research based on the factors we believe are important and will reduce our risk.
The problem with big data working with all of these variables is that it isolates them for us, for example when we want to check loans with public filings against loans without. This doesn't sound bad at first and actually sounds like a good thing. What's wrong with it is that a number of these factors are dependent upon each other, so isolating them does not make sense and devalues each variable.
Credit score, for instance, is dependent on many factors that we have discussed before. Some of them include recent pay history and derogatories, public filings, credit inquiries, revolving credit available and Debt to Income ratio. We are already factoring these things in when we filter by credit score, so further isolation of these variables is redundant at best and maybe even harmful to quality filter selection. If we want a high credit score and no derogatories, and we already know that having no derogatories raises a credit score, then we are double counting derogatories relative to the other credit related factors.
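A minimal sketch of the double counting idea follows. The listings are invented toy values, deliberately built so that derogatories go along with lower scores; they are not platform data and prove nothing about real borrowers. The 720 cutoff and all names are assumptions chosen for illustration.

```python
# Hypothetical toy listings: (FICO score, number of derogatories). Not real data.
listings = [
    (780, 0), (765, 0), (750, 0), (742, 0), (735, 0), (728, 1),
    (710, 0), (698, 1), (690, 2), (684, 1), (675, 0), (668, 3),
]

def passes_fico(row, minimum=720):
    """True when the listing meets the (assumed) minimum credit score."""
    return row[0] >= minimum

def passes_derog(row):
    """True when the listing shows no derogatories."""
    return row[1] == 0

# Apply the credit score filter first, then the derogatories filter on top.
fico_only = [row for row in listings if passes_fico(row)]
both_filters = [row for row in fico_only if passes_derog(row)]

# Share of listings without derogatories, before and after the score filter.
clean_share_all = sum(passes_derog(row) for row in listings) / len(listings)
clean_share_high_fico = len(both_filters) / len(fico_only)
removed_by_second_filter = len(fico_only) - len(both_filters)

print(f"No derogatories, all listings: {clean_share_all:.0%}")
print(f"No derogatories, after FICO filter: {clean_share_high_fico:.0%}")
print(f"Extra listings removed by the second filter: {removed_by_second_filter}")
```

When the second filter removes very little that the first did not already remove, it is mostly measuring the same thing twice. Running the same comparison on real listings is what would show whether that holds in practice.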
Factor 4: Correlation Isn't Causation
Scientists and researchers are well versed in this phrase. What it means is that just because 2 things are related does not mean that one thing happening causes the other to happen. One good example is my post on filtering by state, where we mentioned that CA has the highest rate of default. That does not mean living in CA causes a borrower to default. CA also has by far the highest number of loans issued, so you are reducing your pool of available loans significantly if you remove CA from your filter. It also means that all the good loans in CA that meet your filtering standards would be excluded. It's throwing the baby out with the bath water.
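To see the cost of excluding a whole state, here is a small sketch. The per-state counts are hypothetical numbers picked only so that CA has both the highest default rate and the most loans, as described above; they are not Lending Club figures.

```python
# Hypothetical per-state counts: state -> (loans issued, loans defaulted). Not real data.
by_state = {
    "CA": (1000, 60),
    "NY": (500, 25),
    "TX": (450, 20),
    "FL": (400, 22),
}

total_loans = sum(issued for issued, _ in by_state.values())
ca_issued, ca_defaulted = by_state["CA"]

# Default rate per state, and the state with the highest rate.
default_rate = {state: d / n for state, (n, d) in by_state.items()}
worst_state = max(default_rate, key=default_rate.get)

# What a "no CA" filter gives up: a share of the pool and every performing CA loan.
pool_share_lost = ca_issued / total_loans
performing_loans_excluded = ca_issued - ca_defaulted

print(f"Highest default rate: {worst_state} ({default_rate[worst_state]:.1%})")
print(f"Share of pool removed by excluding CA: {pool_share_lost:.1%}")
print(f"Performing CA loans excluded: {performing_loans_excluded}")
```

In this made-up example the state filter avoids 60 defaults but gives up 940 loans that performed, which is the baby going out with the bath water.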
Credit scoring is another great example of why a relationship between 2 things has to be read carefully. It's natural to think that a public filing would reduce a borrower's credit score, and it does. However, the impact can be anywhere from very little to very great depending on what it was and when it happened. A small medical collection from 5 years ago is not the same as a judgement against a borrower 6 months ago, but both are public filings. The recent judgement affects the borrower's credit score significantly, while the old collection has a minimal effect on the score, as well as on our perception of the borrower's ability to pay us back.
Not only do you not have to be an excellent manipulator of data, it's probably better if you aren't, because it's easy to fall into some of these traps when we look at all the big data that is available. Remember all of these factors when analyzing the data: big data is reflective and not predictive, our pool of loans is small and pretty recent, big data isolates factors that are already dependent upon each other, and correlation does not equal causation. If we do, we will be a step ahead of other investors in the p2p lending space and very well aware of the risks to our loan portfolios.