# Pseudo Facebook Data - Plots in Python

In this post, we will learn about EDA of single variables using simple plots like histograms, frequency plots and box plots.

The data sets used below are part of a project from the UD651 course on Udacity by Facebook.
The data from the project corresponds to a typical data set at Facebook. You can load the data through the following command. Notice that this is a `<TAB>`-delimited *csv* file. This data set consists of 99000 rows of data. We will see the details of the different columns using the command below.

```
import pandas as pd
import numpy as np

# Read the tab-delimited file
pf = pd.read_csv("https://s3.amazonaws.com/udacity-hosted-downloads/ud651/pseudo_facebook.tsv", sep='\t')

# Summarize data
pf.describe(include='all', percentiles=[]).T.replace(np.nan, ' ', regex=True)
```
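To see why the `sep='\t'` argument matters, here is a minimal sketch on a tiny in-memory table. The string `raw` and its three columns are made up for illustration and are not the real file.

```python
import io
import pandas as pd

# Hypothetical three-column, tab-delimited snippet (not the real data)
raw = "userid\tage\tgender\n1\t14\tmale\n2\t31\tfemale\n"

# With sep="\t" the three columns are split correctly
toy = pd.read_csv(io.StringIO(raw), sep="\t")

# With the default sep="," every row collapses into a single fused column
wrong = pd.read_csv(io.StringIO(raw))

print(toy.shape, wrong.shape)
```

If the shape of a freshly loaded frame shows only one column, a wrong delimiter is a likely cause.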

We need to convert some of the variables from numeric to category.

```
cats = ['userid', 'dob_day', 'dob_year', 'dob_month']
for col in pf.columns:
    if col in cats:
        pf[col] = pf[col].astype('category')

# Summarize data
pf.describe(include='all', percentiles=[]).T.replace(np.nan, ' ', regex=True)
```
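As a quick check that the conversion did what we wanted, we can inspect the dtypes. The sketch below uses a small made-up frame `toy` with the same column names as `pf`. It also shows an alternative to the explicit loop: `astype` accepts a column-to-dtype mapping.

```python
import pandas as pd

# Hypothetical stand-in for pf with a few of the same column names
toy = pd.DataFrame({
    "userid": [101, 102, 103],
    "dob_day": [1, 14, 1],
    "dob_year": [1990, 1985, 1990],
    "dob_month": [1, 7, 1],
    "friend_count": [0, 25, 310],
})

cats = ["userid", "dob_day", "dob_year", "dob_month"]

# astype accepts a {column: dtype} mapping, so no explicit loop is needed
toy = toy.astype({col: "category" for col in cats})

# The four listed columns are now categorical; friend_count stays numeric
print(toy.dtypes)
```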

The goal of this analysis is to understand user behavior and demographics. We want to understand what users are doing on Facebook and what they use. Please note that this is not a real Facebook dataset.

Our goal is to do some basic EDA (Exploratory Data Analysis) to understand any underlying patterns in the data. We will first look at a histogram of users' birthdays.

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
%matplotlib inline

ax = sns.countplot(x="dob_day", data=pf)
```

We see a peculiar spike on the 1st of the month. Let us plot this data in more detail, on a per-month basis.

```
# factorplot was renamed catplot (and size renamed height) in seaborn 0.9
g = sns.catplot(x="dob_day", col="dob_month", col_wrap=4, data=pf, kind='count', height=2.5, aspect=.8)
g.set(xticklabels=[])
```

This explains the above plot. Whether because of default settings or users' privacy concerns, numerous people have 1/1 as their birthday!
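The same observation can be checked numerically. Here is a minimal sketch on a made-up frame `toy` that has only the two birthday columns, stored as plain integers rather than categories. It counts users per date and computes the share born on January 1.

```python
import pandas as pd

# Hypothetical birthdays (month, day); not taken from the real data
toy = pd.DataFrame({
    "dob_month": [1, 1, 1, 3, 7, 12, 1, 5],
    "dob_day":   [1, 1, 1, 1, 19, 25, 15, 1],
})

# Count users per (month, day) pair, most common date first
date_counts = toy.groupby(["dob_month", "dob_day"]).size().sort_values(ascending=False)
top_date = date_counts.index[0]

# Share of users whose birthday is January 1: the mean of a boolean mask
jan1_share = ((toy["dob_month"] == 1) & (toy["dob_day"] == 1)).mean()

print(top_date, jan1_share)
```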

Now, let us explore the distribution of friend counts in this data.

```
ax = sns.distplot(pf["friend_count"], kde=False, bins=100)
plt.xlim(0,1000)
```

We see that the data has some outliers near 5000. This is an example of long-tailed data. We want our analysis to focus on the bulk of Facebook users, so we need to limit the axes of these plots. Additionally, we want to look at these data as a function of gender, after removing any rows where gender is NA.

```
df = pf[pf.gender.notnull()]
g = sns.FacetGrid(df, col="gender")
g = g.map(plt.hist, "friend_count", bins=100, color="b")
plt.xlim(0,1000)
```
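One way to judge whether an axis limit such as 1000 keeps the bulk of the users is to look at quantiles. The sketch below uses simulated long-tailed counts. The log-normal draw and its parameters are assumptions chosen only to produce a long tail, not a model of the real `friend_count`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical long-tailed counts standing in for friend_count
friend_count = pd.Series(rng.lognormal(mean=4.5, sigma=1.2, size=10_000)).round()

# Median, 90th and 99th percentiles show how far the tail stretches
q = friend_count.quantile([0.5, 0.9, 0.99])

# Fraction of users that remain visible with an upper axis limit of 1000
share_below_1000 = (friend_count <= 1000).mean()

print(q)
print(share_below_1000)
```

In long-tailed data the mean sits well above the median, which is another quick sign that a few large values dominate.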

If we want summary statistics of our data, such as the mean for each gender, we can combine `groupby` with the `describe` command.

```
pf.groupby('gender').friend_count.describe()
```

Let us now look at the tenure of usage (measured in years) of Facebook.

```
df = pf[pf.tenure.notnull()]
ax = sns.distplot(df["tenure"]/365, kde=False, bins=36)
plt.xlim(0,7)
plt.xlabel('Number of years using Facebook', fontsize=12)
plt.ylabel('Number of users in sample', fontsize=12)
```

We will now look at any pattern in the ages of Facebook users in this dataset.

```
ax = sns.distplot(pf["age"], kde=False, bins=100)
plt.xlim(13,113)
plt.xlabel('Age of Users in Years', fontsize=12)
plt.ylabel('Number of users in sample', fontsize=12)
```

One general theme here is that most of these variables have a long tail. In such circumstances, it is better to look at the data after a transformation, such as putting it on a log scale. Let us do such an analysis of `friend_count`.

```
ax = sns.distplot(pf["friend_count"], kde=False, hist_kws={"alpha": 0.9})
```

```
ax = sns.distplot(pf["friend_count"], kde=False, bins=np.logspace(0,4), hist_kws={"alpha": 0.9})
plt.xscale('log')
```
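An alternative to log-spaced bins is to transform the variable itself. The sketch below is one possible approach, not the method used above: it applies `log10(x + 1)` to a handful of made-up counts and compares the skewness before and after. Adding 1 keeps users with zero friends, who would otherwise fall outside bins that start at 1.

```python
import numpy as np
import pandas as pd

# Made-up friend counts with one very large value
friend_count = pd.Series([0, 1, 3, 8, 20, 45, 90, 200, 650, 4900])

# Adding 1 before taking the log keeps zero counts in the data
log_fc = np.log10(friend_count + 1)

# Skewness drops sharply once the long tail is compressed
print("skew before:", friend_count.skew())
print("skew after: ", log_fc.skew())
```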

Let us try to compare the distributions of male and female friend counts.

```
def plotDensity(x, color=None, label=None, bins=np.linspace(0,1000,200), **kws):
    # Weight each observation by 100/n so bar heights are percentages of users
    w = 100*np.ones_like(x)/x.size
    plt.hist(x, bins=bins, alpha=0.4, histtype='step', linewidth=2, label=label, color=color, weights=w, **kws)

# FacetGrid's size argument was renamed height in seaborn 0.9
g = sns.FacetGrid(df, col=None, hue='gender', height=6.0, xlim=(6,600), ylim=(0,5), legend_out=True)
g = (g.map(plotDensity, 'friend_count')).add_legend()
g = g.set_axis_labels('Friend Count', '% of users')
```
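The weighting trick used in `plotDensity` can be checked without plotting. Here is a minimal sketch with `np.histogram` on a few made-up counts. The array `x` and the four coarse bins are chosen only for illustration.

```python
import numpy as np

x = np.array([5, 12, 40, 40, 75, 120, 300, 980])  # made-up friend counts
bins = np.linspace(0, 1000, 5)                    # edges 0, 250, 500, 750, 1000

# Same weighting as plotDensity: each observation contributes 100/n percent
w = 100 * np.ones_like(x) / x.size
pct, edges = np.histogram(x, bins=bins, weights=w)

print(pct)        # percentage of observations in each bin
print(pct.sum())  # 100 here, because every value falls inside the bins
```

Note that values outside the bin range are dropped, so with bins ending at 1000 the percentages add up to less than 100 whenever some users have more friends than that.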

Similarly, we can compare the distributions of `www_likes`.

```
# FacetGrid's size argument was renamed height in seaborn 0.9
g = sns.FacetGrid(df, col=None, hue='gender', height=6.0, xlim=(1,15000))
g = (g.map(plotDensity, 'www_likes', bins=np.logspace(0,5,50))).add_legend()
g = g.set_axis_labels('www Likes Count', '% of users')
plt.xscale('log')
```

We can also look at the total number of likes numerically per gender, as follows:

```
pf.groupby('gender').www_likes.sum()
```

We can also compare two distributions graphically using box plots, and look at the actual values using `groupby` with `describe`. Here, we are trying to understand which gender initiated more friendships.

```
ax = sns.boxplot(x='gender', y='friendships_initiated', data=df)
plt.ylim(0,200)
```

```
pf.groupby('gender').friendships_initiated.describe()
```
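The numbers behind a box plot can also be computed directly. The sketch below uses a small made-up frame `toy` to derive the quartiles, the interquartile range and the upper whisker limit per gender. It assumes the default whisker rule of 1.5 times the IQR.

```python
import pandas as pd

# Hypothetical data: five users per gender
toy = pd.DataFrame({
    "gender": ["female"] * 5 + ["male"] * 5,
    "friendships_initiated": [10, 40, 60, 90, 400, 5, 30, 50, 70, 150],
})

grouped = toy.groupby("gender")["friendships_initiated"]

# One row per gender, one column per quantile
stats = grouped.quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ["q1", "median", "q3"]
stats["iqr"] = stats["q3"] - stats["q1"]

# Points above q3 + 1.5 * IQR are drawn as individual outliers
stats["upper_fence"] = stats["q3"] + 1.5 * stats["iqr"]

print(stats)
```

In this toy data the largest value in each group lies above its upper fence, so a box plot would draw both as outlier points.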

Next, we want to understand whether users have used certain features of Facebook. If we look at the summary of the `mobile_likes` variable, the median is close to 0, indicating that many users have a value of 0 for this variable. We can also look at the logical value of whether this quantity is non-zero. Additionally, we can create a new variable called `mobile_check_in` that takes the value 1 if `mobile_likes` is non-zero.

```
pf.mobile_likes.describe()
```

```
(pf.mobile_likes > 0).value_counts()
```

```
pf['mobile_check_in'] = pd.Series(np.where(pf['mobile_likes'] > 0, 1, 0)).astype('category')
pf.mobile_check_in.value_counts()
```

We can find the fraction of people who have done a mobile check-in.

```
frac = (pf.mobile_check_in == 1).sum()/pf.mobile_check_in.size
print("Fraction of Mobile Check-ins = ", frac)
```

We find that about 65% of people have used mobile devices for check in and hence it would be a good decision to continue development of such products.
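The fraction can also be computed without creating a new column, since the mean of a boolean Series is the share of `True` values. A minimal sketch on made-up values:

```python
import pandas as pd

mobile_likes = pd.Series([0, 3, 0, 12, 7, 0, 1, 25])  # made-up values

used_mobile = mobile_likes > 0

# Two equivalent ways to get the fraction of users with at least one mobile like
frac_sum = used_mobile.sum() / used_mobile.size
frac_mean = used_mobile.mean()

print(frac_sum, frac_mean)
```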

In summary, we have learned to make inferences about single-variable data using a combination of plots (histograms, box plots and frequency plots) along with various numerical summaries.
