College Enrollment Forecast (State Level)

Using ARIMA to forecast college enrollment for 2015 and 2016 at state level.


Postsecondary School Enrollment Forecast (State Level)

NCES publishes its school enrollment data every year here: https://nces.ed.gov/ipeds/datacenter/

However, the latest you data can find is always 1 year behind, in order to get the latest estimate, below shows my implementation using ARIMA to forecast current and next year's enrollment number.

NCES has forecast on national level and few demographic attribute, however, there's no state level forecast.

1. Aggregate College Enrollment by State

In [3]:
# load package
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from statsmodels import api as sm
import numpy as np

# turn off warning
import warnings
warnings.filterwarnings('ignore')

# load data
enrollment = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='EnrollmentByInstitution')
inst_state = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='Institution')
inst_state.columns = ['UnitId', 'Name', 'City', 'State']
state_enrollment = enrollment.merge(inst_state.drop('Name', axis = 1))
# a preview
state_enrollment.head(2)
Out[3]:
UnitIdInstitution NameFall 1990Fall 1991Fall 1992Fall 1993Fall 1994Fall 1995Fall 1996Fall 1997...Fall 2007Fall 2008Fall 2009Fall 2010Fall 2011Fall 2012Fall 2013Fall 2014CityState
0100654Alabama A & M University48865215506855935543540052635094...57065124532758144922485350205333NormalAL
1100663University of Alabama at Birmingham1535615922157351591315362155021527414933...1624616149168741754317575179991856818698BirminghamAL

2 rows × 29 columns

Transpose the dataframe, use state as column and convert to dateindex

In [5]:
# some plastic surgery on the df
# group sum by state
to_group = state_enrollment.drop(['UnitId', 'Institution Name', 'City'], axis = 1)
to_group = to_group.replace('-', 0)
state_enrollment_grp = to_group.groupby('State')
df = state_enrollment_grp.agg(sum).reset_index()

df = df.transpose()
# columns as state
df.columns = df.ix[0,:].values
df = df.drop(['State'])
# add time series index
df.index = [pd.Period(x, freq = 'A') for x in range(1990,2015)]
df.head()
Out[5]:
AKALARASAZCACOCTDCDE...TNTXUTVAVIVTWAWIWVWY
19903047120227797916121926887018021632367021638447858440180...2206758848681241783521311684356102792222990338447931326
199130970207636100888126727380820224362440781611757660140918...2308088999671335993561931797365292821953085988661232118
199230711213071105410129527953919888202509381618878048441428...2372419197431347583532801856364922840643069218855831548
199330759218898107596126428239619028892521741674818084343768...2409569317771449343485381867356453055943084778872030738
199428654215188105610124928910018784252511731660987582344403...2394749454991512313546171915347853095113036978612930730

5 rows × 59 columns

2. Prepare the Model

In [6]:
def produce_forecast(hold_out_n, inog, order, exog = None):
    """
    CV module
    input: 
    hold_out_n (hold out n years from training for cross validation)
    inog: endogenous variable (the enrollment number to forecast)
    exog: exogenous variable (other variable to forecast, if any)
    order: (p, d, q) parameter of ARIMA model
    output:
    print (average MAPE across all states)
    mape by state
    """
    
    def forecast_state(inog, exog, order, hold_out_n):
        # prep exog
        if exog is not None:
            train_exog = exog[:-hold_out_n]
            hold_out_exog = exog[-hold_out_n:]
        else:
            train_exog = hold_out_exog = None
        
        # forecast wrapper
        m = sm.tsa.ARIMA(inog, order=order, exog=train_exog)
        model = m.fit()
        
        return model.forecast(hold_out_n, exog = hold_out_exog, alpha = .95)[0]
    
    forecast_y = []
    
    # data needed
    Y = inog.ix[:-hold_out_n,]
    yTrue = inog.ix[-hold_out_n:,]
    
    if exog is not None:
        X = exog
    else:
        X = None
    
    for state in Y.columns:
        stateY = Y[state]
        yHat = forecast_state(stateY.values.astype('float64'), X, order, hold_out_n)
        forecast_y.append(yHat)
    
    test = yTrue
    yHat = np.array(forecast_y).transpose()
    error_matrix = abs(yTrue-yHat)/yHat
    
    print('MAPE for ARIMA %i, %i, %i' % order)
    print((np.sum(error_matrix, axis = 1)/Y.shape[1]))
    
    plt.figure(figsize = (15,3))
    plt.title('mape');
    plt.plot(error_matrix.ix[0], '-o');
    plt.plot(error_matrix.ix[1], '-o');
    plt.xticks(range(Y.shape[1]), df.columns, rotation = 'vertical');
    plt.legend(error_matrix.index)
    return error_matrix
In [7]:
error_100 = produce_forecast(2, df, (1,0,0), exog = None)
MAPE for ARIMA 1, 0, 0
2013    0.023323
2014    0.039384
Freq: A-DEC, dtype: float64

Looks good!

3. Add National Umemployment Rate as Exog

In [8]:
# load umemployment data
umemployment = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='Unemployment Rate')
umemployment.index = [pd.Period(x, freq = 'A') for x in umemployment.Year]
umemployment.drop(['Year'], axis = 1, inplace=True)
In [11]:
# umemployment lag 1
error_100_ue1 = produce_forecast(2, df, (1,0,0), exog = umemployment.ix[33:-1,0])
MAPE for ARIMA 1, 0, 0
2013    0.022335
2014    0.038261
Freq: A-DEC, dtype: float64
In [13]:
# umemployment lag 0 + lag 1
error_100_ue01 = produce_forecast(2, df, (1,0,0), exog = np.vstack([umemployment.ix[33:-1,0], umemployment.ix[34:,0]]).transpose())
MAPE for ARIMA 1, 0, 0
2013    0.020552
2014    0.035016
Freq: A-DEC, dtype: float64

Slight improvement.

4. Add Disposable Income as Exog

In [14]:
# load disposable income data
dp = pd.read_excel('DATA_SOURCE\ENROLLMENT_FORECAST_DS.xlsx', sheetname='DISPOSABLE_INCOME')
dp.columns = ['Year', 'DSPI']
dpg = dp.groupby('Year')
dp = dpg.mean().reset_index()
dp.index = [pd.Period(x, freq = 'A') for x in dp.Year]
dp.drop(['Year'], axis = 1, inplace=True)
In [17]:
# umemployment lag 0 + lag 1 + lag 5 disposable income
# not helping, but lag 5 has the lowest avg mape
exog = np.vstack([umemployment.ix[33:-1,0], umemployment.ix[34:,0], dp.ix[27:-5,0]]).transpose()
error_100_ue01 = produce_forecast(2, df, (1,0,0), exog = exog)
MAPE for ARIMA 1, 0, 0
2013    0.028857
2014    0.056395
Freq: A-DEC, dtype: float64

Write Forecast File

In [18]:
# 2015/6's umpmployment rate forecast from fed = [5.0, 4.8]
# http://www.federalreserve.gov/monetarypolicy/files/fomcprojtabl20150917.pdf

umemployment.ix[pd.Period(2015, freq = 'A')] = 5.0
umemployment.ix[pd.Period(2016, freq = 'A')] = 4.8

exog = np.vstack([umemployment.ix[34:,0], umemployment.ix[33:-1,0]]).transpose()
order = (1,0,0)
forecast_y = []

for state in df.columns:
    stateY = df[state]
#     yHat = forecast_state(stateY.values.astype('float64'), X, order, hold_out_n)
    m = sm.tsa.ARIMA(stateY.values.astype('float64'), order=order, exog=exog[:-2])
    model = m.fit()
    yHat = model.forecast(2, exog = exog[-2:], alpha = .95)[0]
    forecast_y.append(yHat)
In [19]:
output = pd.DataFrame(forecast_y)
output = output.transpose()
output.columns = df.columns
output.index = [pd.Period(2015, freq = 'A'), pd.Period(2016, freq = 'A')]
pd.concat([df, output]).transpose().head()
# pd.concat([df, output]).transpose().to_csv('output.csv')
Out[19]:
1990199119921993199419951996199719981999...2007200820092010201120122013201420152016
AK30471309703071130759286542948428758281852781127446...315243181333674349743557233859348903433134174.634098.4
AL202277207636213071218898215188213244209603207582207054214385...269365312618314383328947320590312102307496306600303406301607
AR97916100888105410107596105610106284109092113609117069119083...155393161652172056180365184173181259176924174173172296171204
AS121912671295126412491232123912489091172...176718062189219320911795148812761271.971280.21
AZ268870273808279539282396289100287673293032296582307991324855...634693716185844548804301807995746342704905685630671578665477

5 rows × 27 columns

comments powered by Disqus