Reddit Survey: Introduction to Pandas

The data set used here is part of a project from the UD651 course on Udacity, offered by Facebook.

The project data comes from a survey from reddit.com. You can load it with the command below. We will first look at the different attributes of this data using the pandas describe() method.

In [45]:
import pandas as pd
import numpy as np

#Read csv file
reddit = pd.read_csv("https://s3.amazonaws.com/udacity-hosted-downloads/ud651/reddit.csv").astype(object)
#summarize data
reddit.describe(include='all', percentiles=[]).T.replace(np.nan,' ', regex=True)

Out[45]:
                     count   unique                 top     freq
id                 32754.0  32754.0               32756      1.0
gender             32553.0      2.0                   0  26418.0
age.range          32666.0      7.0               18-24  15802.0
marital.status     32749.0      6.0              Single  10428.0
employment.status  32603.0      6.0  Employed full time  14814.0
military.service   32749.0      2.0                  No  30526.0
children           32535.0      2.0                  No  27488.0
education          32610.0      7.0   Bachelor's degree  11046.0
country            32577.0    439.0       United States  20967.0
state              20846.0     52.0          California   3401.0
income.range       31139.0      8.0       Under $20,000   7892.0
fav.reddit         28393.0   1833.0           askreddit   2123.0
dog.cat            32749.0      3.0        I like dogs.  17151.0
cheese             32749.0     11.0               Other   6563.0

The describe() method gave us an overview of all the data available to us. Because every column was cast with astype(object) when the file was read, each one is summarized as categorical data (count, unique, top, freq) rather than as numbers.
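To see why the cast matters, here is a minimal sketch on a made-up three-row frame (the name toy and its values are assumptions for illustration, not survey data). It compares the summary rows describe() produces before and after astype(object).

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({"id": [1, 2, 3], "children": ["No", "No", "Yes"]})

# By default describe() summarizes only the numeric column (mean, std, ...).
numeric_summary = toy.describe()

# After casting to object, every column gets count/unique/top/freq instead.
object_summary = toy.astype(object).describe()

print(list(numeric_summary.index))
print(list(object_summary.index))
```

Without the cast, the id column would be summarized with a mean and quartiles, which is not meaningful for an identifier.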

Let us look at the age.range variable in more detail. We can list the different levels of this variable using the cat.categories property of a pandas Series.

In [46]:
reddit["age.range"].astype('category').cat.categories

Out[46]:
Index(['18-24', '25-34', '35-44', '45-54', '55-64', '65 or Above', 'Under 18'], dtype='object')

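This shows that the variable has 7 possible values. Rows where no data is available (NA) are not listed as a category, but the describe() output above reports a count of 32666 for age.range against 32754 ids, so some responses are missing.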
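A minimal sketch of how missing values could be checked directly, using a made-up four-element Series (the name ages and its values are assumptions, not the survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing response.
ages = pd.Series(["18-24", "25-34", np.nan, "18-24"], dtype=object)

# Missing values never show up as a category level.
levels = list(ages.astype("category").cat.categories)

# Count them explicitly, or include them in the frequency table.
n_missing = int(ages.isna().sum())
counts = ages.value_counts(dropna=False)

print(levels)
print(n_missing)
print(counts)
```

On the real data, the same calls on reddit["age.range"] should show how many rows lack an age range.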

A more pictorial view can be obtained with a count plot (a bar chart of how often each level occurs).

In [57]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
%matplotlib inline

newOrder = ["Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"]
ax = sns.countplot(x="age.range", data=reddit, order=newOrder)

Similarly, we can also plot a distribution of income range.

In [51]:
ax = sns.countplot(x="income.range", data=reddit)
# Rotate the x tick labels so the long income labels do not overlap
locs, labels = plt.xticks()
plt.setp(labels, rotation=90)

One problem with plots like the one above is that the levels are not ordered unless an order is given explicitly. This can be fixed using ordered categoricals (the pandas counterpart of R's ordered factors) instead of regular categorical variables. Additionally, we need more readable x-axis labels for plotting income.range.
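As a rough sketch of what an ordered categorical looks like in pandas, the example below builds one from a made-up Series (the sample values and the names toy, order and ordered are assumptions for illustration). It relies on pd.CategoricalDtype with ordered=True.

```python
import pandas as pd

# Hypothetical sample of age ranges, deliberately out of order.
toy = pd.Series(["25-34", "Under 18", "18-24", "25-34"])

# Declare the levels in their natural order and mark them as ordered.
order = ["Under 18", "18-24", "25-34"]
ordered = toy.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Sorting the counts by index now follows the declared level order,
# not alphabetical order.
sorted_counts = ordered.value_counts().sort_index()

print(sorted_counts)
print(ordered.min())
```

Once a column is ordered this way, sorting and comparisons follow the declared level order rather than alphabetical order.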

In [52]:
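# New labels are matched by position to the existing categories,
# which astype('category') sorts alphabetically
newLevels = ["100K", ">150K", "20K","30K", "40K", "50K", "70K", "<20K"]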
reddit["income.range"] = reddit["income.range"].astype('category')
reddit["income.range"] = reddit["income.range"].cat.rename_categories(newLevels)
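Positional renaming depends on knowing the sorted order of the existing categories. A possibly safer alternative, sketched here on a made-up Series, is to pass a dictionary to rename_categories so each old label is mapped explicitly. "Under $20,000" appears in the summary table above; the other label's spelling is an assumption used only for illustration.

```python
import pandas as pd

# Hypothetical sample; the second label's exact spelling is assumed.
income = pd.Series(
    ["Under $20,000", "$20,000 - $29,999", "Under $20,000"]
).astype("category")

# Map old labels to new ones by name rather than by position.
mapping = {"Under $20,000": "<20K", "$20,000 - $29,999": "20K"}
renamed = income.cat.rename_categories(mapping)

print(list(renamed))
```

With a dictionary, a mismatch in the assumed ordering cannot silently attach a label to the wrong level.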

In [55]:
newOrder = ["<20K", "20K","30K", "40K", "50K", "70K", "100K", ">150K"]
ax = sns.countplot(x="income.range", data=reddit, order=newOrder)
