College Enrollment Forecast (Institution Level)

Using ARIMA to forecast college enrollment for 2015 and 2016 at institution level.


Postsecondary School Enrollment Forecast (Institution Level)

NCES publishes its school enrollment data every year here: https://nces.ed.gov/ipeds/datacenter/

However, the latest you data can find is always 1 year behind, in order to get the latest estimate, below shows my implementation using ARIMA to forecast current and next year's enrollment number.

NCES has forecast on national level and few demographic attribute, however, there's no state level forecast.

This is a follow up from the state level forecast. Now we wanted to see if there's a way to forecast institutional level enrollment.

I am going to try 2 approaches:

  1. Use the state level forecast and weight the one year and two year ahead forecast by $w_i^s$ as defined here: $$ \begin{align} w_i^c &= \frac {\displaystyle \sum_{y \in year} count_{y, i}} {\displaystyle \sum_{y \in year} \displaystyle \sum_{i_s \in inst \forall s \in state} count_{y, i_s}}\\ \text {where i }&\text{ = institute}\\ \text {year }&\text{ = [1991,2014] (actual enrollment data available)}\\ \end{align} $$ This approach doesn't work, might need a different weight factor.
  2. Cluster-Trend-Forecast (see details below)

Approach 1

1. Read State Level Forecast

See how I forecasted state level enrollment.

In [1]:
# load package
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from statsmodels import api as sm
import numpy as np
import pickle
# load mpld3 for drawing
import mpld3
mpld3.enable_notebook()

%autoreload 1
%aimport vaildation_mod

# turn off warning
import warnings
warnings.filterwarnings('ignore')

# Load the institution info
intinfo = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='INST_CHAR')
lab = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='INST_CHAR_COLUMNS')
# Load Data
enrollment = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='EnrollmentByInstitution')
inst_state = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='Institution')
state_level_forecast = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='STATE_FORECAST_1516')

inst_state.columns = ['UnitId', 'Name', 'City', 'State']
enrollment = enrollment.merge(inst_state.drop('Name', axis = 1))
enrollment = enrollment.replace('-', 0)

schools = enrollment[['UnitId', 'Institution Name']]
In [2]:
# overall trend
plt.plot(np.sum(enrollment.iloc[:,2:-2]).values, '-o');

2. Calculate $w_i^s$

In [3]:
# total school enrollment
er1 = enrollment.copy()
er1['total_school_enrollment'] = np.sum(er1.drop(['UnitId', 'Institution Name', 'City', 'State'], axis = 1), axis = 1)

# count per state
state_sum = pd.DataFrame(np.sum(state_level_forecast.drop([2015,2016], axis = 1), axis = 1))
state_sum.columns = ['state_sum']
state_sum['State'] = state_sum.index
er1 = er1.merge(state_sum)

# calculate wis as defined above
er1['wis'] = er1.total_school_enrollment/er1.state_sum
# preview
er1.head()
Out[3]:
UnitIdInstitution NameFall 1990Fall 1991Fall 1992Fall 1993Fall 1994Fall 1995Fall 1996Fall 1997...Fall 2010Fall 2011Fall 2012Fall 2013Fall 2014CityStatetotal_school_enrollmentstate_sumwis
0100654Alabama A & M University48865215506855935543540052635094...58144922485350205333NormalAL13724160765200.022585
1100663University of Alabama at Birmingham1535615922157351591315362155021527414933...1754317575179991856818698BirminghamAL40521160765200.066685
2100690Amridge University104107106144140183182240...749775703631625MontgomeryAL1145260765200.001885
3100706University of Alabama in Huntsville81398624802682327492721867136464...76147629763673767348HuntsvilleAL18338360765200.030179
4100724Alabama State University45874822548856085037541655525273...57055425581660755519MontgomeryAL13801460765200.022713

5 rows × 32 columns

In [4]:
# MAPE with real 13, 14 enrollment
# Note: more logical step would be forecasting them and validate them, but here's just for a quick check

# calculate forecast by real data

def times_wis(x, yr):
    return state_level_forecast.ix[x.State, yr] * x.wis

er1['fall_2013_w'] = er1.apply(lambda x: times_wis(x, 2013), axis = 1)
er1['fall_2014_w'] = er1.apply(lambda x: times_wis(x, 2014), axis = 1)

# calculate mape
print('MAPE of 2013', sum(er1['fall_2013_w']-er1['Fall 2013'])/er1.shape[0])
print('MAPE of 2014', sum(er1['fall_2014_w']-er1['Fall 2014'])/er1.shape[0])
MAPE of 2013 90.8809816965
MAPE of 2014 90.0824075709

Ok... so probably this is not the best approach.

Approach 2

1. Scale and cluster

It's possible to do more processing and get different result, but we shall see which makes more sense.

1. Scale

  1. Scale to the entire row vector to [0,1] range
  2. Rebase (pick a year) to 0

2. Culster

  1. Cluster based on the n previous period
  2. Use either gmm or kmeans
    Very good reference on GMM (EM - Gaussian): http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html
    If using gmm, since we don't have concers for high dimensions, it'll use full covariance matrix. See what that means here: http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note08-2up.pdf
  3. Generate n cluster solution

3. Forecast

  1. Produce aggregate forecast result
  2. Disaggregate forecast result using weight similar to the definition above but substitute the state by cluster
In [5]:
# prepare a matrix to cluster
er2 = enrollment.copy()
er2 = er2.transpose()
er2.columns = er2.ix['UnitId']
er2.drop(['UnitId', 'Institution Name', 'City', 'State'], inplace = True)
er2.index = [pd.Period(x) for x in range(1990,2015)]
er2.tail(5)
Out[5]:
UnitId100654100663100690100706100724100733100751100760100812100830...485306485342485351485379485388485397485412485421485458485467
201058141754374976145705030127244736215817...0000000000
201149221757577576295425031647245133895305...0000000000
201248531799970376365816033503202134155005...0000000000
201350201856863173766075034752190631705084...0000000000
201453331869862573485519036047172631285057...531814615575113679424310

5 rows × 7682 columns

In [6]:
def how_many_zeros(x):
    return sum(x == 0)

# Filter out those with less than 100 enrollment in the past
er2 = er2.ix[:, np.sum(er2, axis = 0) > 100]

# Filter out those with no data in the past 5 years
er2 = er2.ix[:,er2.tail(3).apply(lambda x: how_many_zeros(x))==0]

# filtered 7682-6970 = 1124 rows
er2.shape
Out[6]:
(25, 6970)
In [7]:
# validation module pipline the process and paramatize important variables.
# demo
demo = vaildation_mod.predict_by_cluster(er2)
WMAPE on Cluster
WMAPE of 2013 = 0.020565
WMAPE of 2014 = 0.038220
WMAPE on Institution
WMAPE of 2013 = 0.074574
WMAPE of 2014 = 0.106712
In [8]:
demo.draw_cluster_error()
In [9]:
# see what did scaling do
demo.draw_trend(0,1)

Notice that relatively, there's no change to the trend, just that now it's possible to compare different trend.

In [10]:
# see what cluster does
demo.draw_cluster_trend(15, 10, schools)

The last n period trend are similar

In [11]:
# change clustering algo (kmeans) with large n?
# conceptually, it should be more predictive, but it's not
# worse than GMM but similar to kmeans on n_clusters = default
p_km = vaildation_mod.predict_by_cluster(er2, cluster_method = 'kmeans')
WMAPE on Cluster
WMAPE of 2013 = 0.024972
WMAPE of 2014 = 0.043780
WMAPE on Institution
WMAPE of 2013 = 0.082462
WMAPE of 2014 = 0.114252
In [12]:
# different base?
p_n = vaildation_mod.predict_by_cluster(er2, rebase_to_n = 3, cluster_n_period = 20)
WMAPE on Cluster
WMAPE of 2013 = 0.018067
WMAPE of 2014 = 0.036291
WMAPE on Institution
WMAPE of 2013 = 0.071927
WMAPE of 2014 = 0.104767
In [13]:
p_n.draw_cluster_error()
In [14]:
p_n.draw_cluster_trend(26, 15, schools)
Adding umemployment
In [15]:
# add umemployment
umemployment = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='Unemployment Rate')
umemployment.index = [pd.Period(x) for x in umemployment.Year]
umemployment.drop(['Year'], axis = 1, inplace = True)
umemployment.columns = ['umemployment_rate']
In [16]:
print('umemployment lag0')
exog = umemployment[34:].copy()
p_um0 = vaildation_mod.predict_by_cluster(er2, exog = exog, rebase_to_n = 3, cluster_n_period = 20)
print('umemployment lag0 + lag1')
exog['umemployment_rate_lag1'] = umemployment[33:-1].values
p_um01 = vaildation_mod.predict_by_cluster(er2, exog = exog, rebase_to_n = 3, cluster_n_period = 20)
umemployment lag0
WMAPE on Cluster
WMAPE of 2013 = 0.021534
WMAPE of 2014 = 0.039217
WMAPE on Institution
WMAPE of 2013 = 0.074741
WMAPE of 2014 = 0.106586
umemployment lag0 + lag1
WMAPE on Cluster
WMAPE of 2013 = 0.024189
WMAPE of 2014 = 0.042203
WMAPE on Institution
WMAPE of 2013 = 0.076140
WMAPE of 2014 = 0.107207
Adding disposable income
In [17]:
dp = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='DISPOSABLE_INCOME')
dp.columns = ['Year', 'DSPI']
dpg = dp.groupby('Year')
dp = dpg.mean().reset_index()
dp.index = [pd.Period(x, freq = 'A') for x in dp.Year]
dp.drop(['Year'], axis = 1, inplace=True)
In [18]:
print('umemployment lag0 + lag1 + dspi lag 5')
exog = umemployment[34:].copy()
exog['umemployment_rate_lag1'] = umemployment[33:-1].values

exog['dspi_lag5'] = dp[26:-6].values
p_um0_ds5 = vaildation_mod.predict_by_cluster(er2, exog = exog, rebase_to_n = 3, cluster_n_period = 20)
print ('umemployment lag0 + lag1 + dspi lag 5 + dspi lag 4')
exog['dspi_lag4'] = dp[27:-5].values
p_um0_ds45 = vaildation_mod.predict_by_cluster(er2, exog = exog, rebase_to_n = 3, cluster_n_period = 20)
umemployment lag0 + lag1 + dspi lag 5
WMAPE on Cluster
WMAPE of 2013 = 0.049708
WMAPE of 2014 = 0.053833
WMAPE on Institution
WMAPE of 2013 = nan
WMAPE of 2014 = nan
umemployment lag0 + lag1 + dspi lag 5 + dspi lag 4
WMAPE on Cluster
WMAPE of 2013 = 0.040126
WMAPE of 2014 = 0.047338
WMAPE on Institution
WMAPE of 2013 = nan
WMAPE of 2014 = nan

Final Model

In [19]:
exog = umemployment[34:].copy()
exog['umemployment_rate_lag1'] = umemployment[33:-1].values
p = vaildation_mod.predict_by_cluster(er2, exog = None, rebase_to_n = 3, cluster_n_period = 20, hold_out_n=2)

# pickle it, takes a while to run
pickle.dump(p, open('inst_model.p', 'wb'))
# p = pickle.load(open('inst_model.p', 'rb'))

cluster = pd.DataFrame([p.ds.columns, p.ds_c.tolist()])
cluster = cluster.transpose()
cluster.columns = ['UnitID', 'cluster']
ds = cluster.merge(intinfo, how = 'left')
WMAPE on Cluster
WMAPE of 2013 = 0.018067
WMAPE of 2014 = 0.036291
WMAPE on Institution
WMAPE of 2013 = 0.071927
WMAPE of 2014 = 0.104767

Error Analysis

Where are the errors?

In [20]:
p.draw_cluster_error()
In [21]:
p.draw_cluster_trend(10, 10, schools)
In [22]:
p.look_at_institute(19, inst_name = schools)
In [23]:
# add two more years umemployment projection from FED

umemployment.loc[pd.Period(2015)] = 5.0
umemployment.loc[pd.Period(2016)] = 4.8
# umemployment.tail(2)
In [24]:
# produce the forecast

exog = umemployment[34:].copy()
exog['umemployment_rate_lag1'] = umemployment[33:-1].values

p = vaildation_mod.predict_by_cluster(er2, exog = exog, rebase_to_n = 3, cluster_n_period = 15, hold_out_n = 0)
forecast = p.get_forecast()
forecast.to_csv('institution_forecast.csv')

Profiling on Cluster on Final Model

This section will used some institution info such as age, gender, institution type (occupational/academic etc.) downloaded from NCES and try to make sense of cluster (or not?).

In [25]:
# variables to profile
for i in range(len(ds.columns)):
    print(i, ds.columns[i])
0 UnitID
1 cluster
2 Institution Name
3 Occupational (IC2014)
4 Academic (IC2014)
5 Continuing professional (IC2014)
6 Recreational or avocational (IC2014)
7 Adult basic remedial or high school equivalent (IC2014)
8 Secondary (high school) (IC2014)
9 Institutional control or affiliation (IC2014)
10 Open admission policy (IC2014)
11 Percent of undergraduate students receiving federal  state  local  institutional or other sources of grant aid (SFA1314)
12 Completion of college-preparatory program (ADM2014)
13 Secondary school GPA (ADM2014)
14 Secondary school rank (ADM2014)
15 Secondary school record (ADM2014)
16 Recommendations (ADM2014)
17 Formal demonstration of competencies (ADM2014)
18 Admission test scores (ADM2014)
19 Applicants total (ADM2014)
20 Published in-district tuition 2014-15 (IC2014_AY)
21 Published in-state tuition 2014-15 (IC2014_AY)
22 Published out-of-state tuition 2014-15 (IC2014_AY)
23 Grand total (EF2014A  All students total)
24 Grand total men (EF2014A  All students total)
25 Grand total women (EF2014A  All students total)
26 Asian total (EF2014A  All students total)
27 Asian men (EF2014A  All students total)
28 Asian women (EF2014A  All students total)
29 Black or African American total (EF2014A  All students total)
30 Black or African American men (EF2014A  All students total)
31 Black or African American women (EF2014A  All students total)
32 Hispanic total (EF2014A  All students total)
33 Hispanic men (EF2014A  All students total)
34 Hispanic women (EF2014A  All students total)
35 White total (EF2014A  All students total)
36 White men (EF2014A  All students total)
37 White women (EF2014A  All students total)
38 Students enrolled exclusively in distance education courses (EF2014A_DIST  All students total)
39 All students enrolled (EF2014A_DIST  All students total)
40 Students enrolled in some but not all distance education courses (EF2014A_DIST  All students total)
In [26]:
vaildation_mod.cluster_profiling(ds, lab, ds.columns[9])
In [27]:
vaildation_mod.cluster_profiling(ds, lab, ds.columns[25], ds.columns[35], ds.columns[29], ds.columns[26])
In [28]:
p.draw_cluster_trend(21, 10, schools)
In [29]:
# school lookup
p.ds_c[np.where(p.ds.columns == schools[schools['Institution Name'] == "Princeton University"]['UnitId'].values[0])]
Out[29]:
array([21], dtype=int64)

Python module: https://github.com/l1990790120/l1990790120.github.io/blob/master/_includes/nb/COLLEGE_ENROLLMENT_FORECAST_INST_LEVEL_vaildation_mod.py

comments powered by Disqus