Classify Unbalanced Cases with Harvard edX MOOC Dataset

2016/05/16 · 6 minute read

data-viz education highered machine learning python

I’ve run a couple classification ML algorithm on the dataset. What makes this problem interesting is that most of the students did not pass the course. I’ve re-sampled the positive cases multiple times to make the algorithms punish the false positive cases more severely.

Datasource¶

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/26147

There's definition for each of the variables here.

Goal¶

Now, I simply want to understand, base on the attributes in this dataset, can I succesfully predict whether the student will pass/fail (certified == 1/0) the course?

Common Pitfall: Unbalanced Cases!¶

In machine learning - classification especially, a common pitfall is to benchmark classification algorithm by accuracy when dealing with unbalanced cases, even if your algorithm don't work, if your data is extremely unbalanced, you could still get 97% accuracy just by returning true to all cases (assuming 3% false cases). It's often advised to compare precision and recall (here, I look at ROC - tpr vs fpr) instead of accuracy. However, what's more complicated is how to improve the performance - Here, I am demonstrating one way of dealing with it - increase the probability (punishment) of drawing minor cases by repeating sampling.

This will be a fun exercise to explore different classification algorithms and do some visualization around.

In [1]:

# import requirement
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import *
%matplotlib inline
import seaborn as sns

In [2]:

# load data
df = pd.read_csv(‘data/HMXPC13_DI_v2_5-14-14.csv’)
print(df.shape)
df.head()

(641138, 20)

Out[2]:

	course_id	userid_DI	registered	viewed	final_cc_cname_DI	LoE_DI	YoB	gender	start_time_DI	last_event_DI	nevents	ndays_act	nplay_video	nchapters	roles	incomplete_flag
0	HarvardX/CB22x/2013_Spring	MHxPC130442623	1	0	United States	NaN	NaN	NaN	2012-12-19	2013-11-17	NaN	9.0	NaN	NaN	NaN	1.0
1	HarvardX/CS50x/2012	MHxPC130442623	1	1	United States	NaN	NaN	NaN	2012-10-15	NaN	NaN	9.0	NaN	1.0	NaN	1.0
2	HarvardX/CB22x/2013_Spring	MHxPC130275857	1	0	United States	NaN	NaN	NaN	2013-02-08	2013-11-17	NaN	16.0	NaN	NaN	NaN	1.0
3	HarvardX/CS50x/2012	MHxPC130275857	1	0	United States	NaN	NaN	NaN	2012-09-17	NaN	NaN	16.0	NaN	NaN	NaN	1.0
4	HarvardX/ER22x/2013_Spring	MHxPC130275857	1	0	United States	NaN	NaN	NaN	2012-12-19	NaN	NaN	16.0	NaN	NaN	NaN	1.0

Whoo, not many missing values!

In [3]:

def quick_profile(df):
    f, ax = plt.subplots(1,2, figsize=(10,3))
    p = sns.barplot(y=np.sum(df.isnull(),axis=0)/df.shape[0], x= df.columns, ax = ax[0])
    p.set_xticklabels(df.columns, rotation=90);
    p.set_title(‘% of missing values by column’);
    sns.barplot(y=np.bincount(df[‘certified’]), x=[0,1], ax = ax[1]);
    ax[1].set_title(‘certified freq’);

quick_profile(df)

Ohoh, Less than 3% negative cases! This means, we need to boost negative cases if possible.

Now, I am not interested to classify those that does nothing beyond registration. By definition, that’s nevents is not null.

In [4]:

df = df[~df.nevents.isnull()]
quick_profile(df)

Now, let’s profile¶

Discrete Variables

In [5]:

f, ax = plt.subplots(1,2, figsize=(10,4))
sns.countplot(data=df[[‘LoE_DI’, ‘certified’]], x = ‘LoE_DI’, hue = ‘certified’, ax = ax[0]);
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=30);
sns.countplot(data=df[[‘gender’, ‘certified’]], x = ‘gender’, hue = ‘certified’, ax = ax[1]);

In [6]:

plt.figure(figsize=(12,4))
p = sns.countplot(data=df[[‘final_cc_cname_DI’, ‘certified’]], x = ‘final_cc_cname_DI’, hue = ‘certified’)
p.set_xticklabels(p.get_xticklabels(), rotation=90);

Numeric Variables

In [7]:

df.grade.replace(‘nan’, np.nan, inplace=True)
df.grade.replace(’ ‘, np.nan, inplace=True)
df.grade = df.grade.astype(‘float64’)

sns.pairplot(
    data=df[[‘YoB’, ‘grade’, ‘certified’, ‘nevents’, ‘ndays_act’, ‘nplay_video’, ‘nchapters’, ‘nforum_posts’]].dropna(),
    hue=“certified”)

Out[7]:

As what we’ve seen, some data processing needs to take place

In [8]:

# drop nplay_videio since 60% of rows are missing
num = df[[‘certified’, ‘YoB’, ‘nevents’, ‘ndays_act’, ‘nchapters’]]
quick_profile(num)

In [9]:

# dummy code nominal variables
def dict_lookup(df,f):
    unique = df[f].unique()
    lookup = {}
    i = 0
    for x in unique:
        lookup[x] = i
        i += 1
<span class="k">return</span> <span class="n">df</span><span class="p">[</span><span class="n">f</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">lookup</span><span class="p">[</span><span class="n">x</span><span class="p">]),</span> <span class="n">lookup</span>
lookups = {}
for x in [‘final_cc_cname_DI’, ‘LoE_DI’, ‘gender’]:
v, lookups[x] = dict_lookup(df, x)
df[‘%s_coded’%x] = v

In [11]:

data = df[[‘course_id’, ‘userid_DI’, ‘certified’, ‘YoB’, ‘nevents’, ‘ndays_act’, ‘nchapters’, ‘final_cc_cname_DI_coded’, ‘LoE_DI_coded’, ‘gender_coded’]]

# from the profiling, you might notice some poeple are born in the future ???
# clean those out
data.loc[data[‘YoB’] >= 2013, ‘YoB’] = 2013

# use most freq value to replace missing value
for x in [‘YoB’, ‘nevents’, ‘ndays_act’, ‘nchapters’]:
    replace_val = (np.bincount(data[x].dropna().astype(‘int64’)).argmax())
    data.loc[data[x].isnull(), x] = replace_val

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py:461: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

In [12]:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse=False)
disc = enc.fit_transform(data[[‘final_cc_cname_DI_coded’, ‘LoE_DI_coded’, ‘gender_coded’]])

# stack onehotcoded + numeric values
num = data.drop([‘final_cc_cname_DI_coded’, ‘LoE_DI_coded’, ‘gender_coded’], axis=1)
d3 = np.hstack([num.values, disc])

In [13]:

# generate headers so we don’t lost track of the data transformation afterwards
new = True

for x in [‘final_cc_cname_DI’, ‘LoE_DI’, ‘gender’]:
    t = []
    for y in lookups[x].keys():
        t.append([x, y,lookups[x][y]])
    t2 = pd.DataFrame(t).sort_values(2)
    if new:
        headers = t2.copy()
        new = False
    else:
        headers = pd.concat([headers, t2], axis = 0)

header = [‘course_id’, ‘userid_DI’, ‘certified’, ‘YoB’, ‘nevents’, ‘ndays_act’, ‘nchapters’]
header.extend(headers[1].values.tolist())
header

Out[13]:

['course_id',
 'userid_DI',
 'certified',
 'YoB',
 'nevents',
 'ndays_act',
 'nchapters',
 'United States',
 'France',
 'Australia',
 'India',
 'Canada',
 'Other North & Central Amer., Caribbean',
 'Other South Asia',
 'Japan',
 'Other Africa',
 'Colombia',
 'Russian Federation',
 'Other Europe',
 'Germany',
 'Other Middle East/Central Asia',
 'Other East Asia',
 'Bangladesh',
 'China',
 'United Kingdom',
 'Spain',
 'Ukraine',
 'Poland',
 'Egypt',
 'Other South America',
 'Pakistan',
 'Greece',
 'Portugal',
 'Unknown/Other',
 'Brazil',
 'Nigeria',
 'Mexico',
 'Other Oceania',
 'Indonesia',
 'Philippines',
 'Morocco',
 nan,
 "Bachelor's",
 'Secondary',
 "Master's",
 'Doctorate',
 'Less than Secondary',
 nan,
 'm',
 'f',
 'o']

In [14]:

# best practive: create a hold out set
from sklearn.cross_validation import train_test_split
train, hold_out = train_test_split(d3, test_size=.3, random_state=0)

# balance the classification class in training set
# Repeated sample positive cases 24 times!
n = train[train[:,2] != 1]
y = train[train[:,2] == 1]
y2 = np.vstack([y for i in range(24)])
d4 = np.vstack([y2, n])

# take a look what I did
f, ax = plt.subplots(1,2, figsize=(8,4))
sns.barplot(y=np.bincount(d4[:,2].astype(‘int’)), x = [0,1], ax = ax[0]);
ax[0].set_title(‘train’);
sns.barplot(y = np.bincount(hold_out[:,2].astype(‘int’)), x = [0,1], ax = ax[1]);
ax[1].set_title(‘hold_out’);

Train the classification model¶

Now that we’ve seen enough, let’s try to train the model.

In [15]:

# create train/test within training set
train_tr, train_x = train_test_split(train, test_size=.3, random_state=0)

Decision Tree

In [52]:

# first, let’s try a decision tree
from sklearn.tree import DecisionTreeClassifier

# let’s draw a training curve and see if this works
clss = dict()
test_err = []
train_err = []

for x in tqdm(range(5,50,5)):
    clss[x] = DecisionTreeClassifier(max_depth=x)
    clss[x].fit(train_tr[:,3:], train_tr[:,2].astype(‘int64’))
    test_err.append(1-clss[x].score(train_x[:, 3:], train_x[:, 2].astype(‘int’)))
    train_err.append(1-clss[x].score(train_tr[:,3:], train_tr[:,2].astype(‘int’)))

100%|████████████████████████████████████████████| 9/9 [00:04<00:00,  1.97it/s]

In [53]:

def training_curve(test, train, param, ax):
    ax.plot(param, test_err, ’-o’);
    ax.plot(param, train_err, ’-o’);
    ax.legend([‘test error’, ‘train error’]);
    ax.set_title(‘error curve’);

def roc(model, true, data, ax):
    from sklearn.metrics import roccurve
    fpr, tpr,  = roc_curve(true, model.predict(data));
    ax.plot(fpr, tpr, ’-o’);
    ax.set_xlabel(‘fpr’);
    ax.set_ylabel(‘tpr’);
    ax.set_title(‘hold_out roc curve’);

f, ax = plt.subplots(1,2, figsize=(10,4))
training_curve(test_err, train_err, range(5,50,5), ax[0])
roc(clss[10], hold_out[:,2], hold_out[:,3:], ax[1])

Random Forest

In [54]:

# it does an okay job on hold_out, so let’s see how to improvee this even more!
# unfortunately, there’s no existing prune module in sklearn
# let’s try DT’s ensemble version -> RF

from sklearn.ensemble import RandomForestClassifier

# clss2 = dict()
test_err = []
train_err = []

for x in tqdm(range(5,50,5)):
    clss2[x] = RandomForestClassifier(max_depth=x)
    clss2[x].fit(train_tr[:,3:], train_tr[:,2].astype(‘int64’))
    test_err.append(1-clss2[x].score(train_x[:, 3:], train_x[:, 2].astype(‘int’)))
    train_err.append(1-clss2[x].score(train_tr[:,3:], train_tr[:,2].astype(‘int’)))

100%|████████████████████████████████████████████| 9/9 [00:07<00:00,  1.19it/s]

In [55]:

f, ax = plt.subplots(1,2, figsize=(10,4))
training_curve(test_err, train_err, range(5,50,5), ax[0])
roc(clss2[15], hold_out[:,2], hold_out[:,3:], ax[1])

SVM

In [45]:

# alright, not really helping
# let’s try SVM

from sklearn.svm import LinearSVC

clss3 = dict()
test_err = []
train_err = []

for c in tqdm(range(80,120,5)):
    clss3[c] = LinearSVC(C=(c*.01))
    clss3[c].fit(train_tr[:,3:], train_tr[:,2].astype(‘int64’))
    test_err.append(1-clss3[c].score(train_x[:, 3:], train_x[:, 2].astype(‘int’)))
    train_err.append(1-clss3[c].score(train_tr[:,3:], train_tr[:,2].astype(‘int’)))

100%|████████████████████████████████████████████| 8/8 [06:40<00:00, 49.96s/it]

In [51]:

f, ax = plt.subplots(1,2, figsize=(10,4))
training_curve(test_err, train_err, range(80,120,5), ax[0])
roc(clss3[115], hold_out[:,2], hold_out[:,3:], ax[1])