Research talk:Reading time/Work log/2018-10-14

Sunday, October 14, 2018

Model Selection

I iterated a bit on the work I did on model selection earlier this week.

I fixed a bug in the ks tests . Now they make more sense and they agree with the AIC. Also note that the KS tests frequently accept the null hypotheses that the data was generated by a lognormal distribution or a Exponentiated Weibull distribution. This gives a pretty good statistical rationale for using these models. However, note that according to this answer on crossvalidated, if the parameters of the model were fit on the data then the assumptions of the KS are violated. I fit the models on a 75% "train set" and ran the tests on a 25% "test set," but at this point I'm not sure if I think this means the assumptions of the KS test are valid. I need to do some more reading or formal analysis to convince myself. The crossvalidated post uses bootstrapping and I could do that too, but it will take a big chunk of time. At this point I'm satisfied that the Exponentiated Weibull and Log Normal distributions are good enough fits for the data on a large majority of cases. I would be comfortable assuming these parametric models even if they are not perfect fits.

The code is below.

	AIC_rank		BIC_rank		ks_rank		ks_pvalue
	mean	median	mean	median	mean	median	mean	median
model
expon	4.896694	5.0	4.888430	5.0	4.946281	5.0	0.016276	0.000000
exponweib	1.471074	1.0	1.607438	2.0	1.491736	1.0	0.305993	0.222638
gamma	4.020661	4.0	4.012397	4.0	3.896694	4.0	0.025147	0.000000
lognorm	1.623967	2.0	1.504132	1.0	1.685950	2.0	0.263722	0.150254
weibull_min	2.987603	3.0	2.987603	3.0	2.979339	3.0	0.054931	0.000008

	ks_pass95	ks_pass97_5
	mean	mean
model
expon	0.033058	0.033058
exponweib	0.747934	0.814050
gamma	0.082645	0.107438
lognorm	0.681818	0.719008
weibull_min	0.194215	0.239669

import itertools as it

def select_model(dt_all):

    time_threshhold = np.inf

    models = [stats.weibull_min, stats.exponweib, stats.lognorm, stats.gamma, stats.expon]
    dfs = {mod.name:df for mod,df in zip(models,[2,3,2,2,1])}
    res = {}
    for wiki in set(dt_all.wiki):

        dtwiki = dt_all.loc[(dt_all.wiki==wiki) &
                            (~dt_all.visiblelength.isnull()) &
                            (dt_all.visiblelength>0) &
                            (dt_all.visiblelength<=time_threshhold),["visiblelength"]]
        trainassign = np.random.binomial(1,0.75,len(dtwiki))
        
        dtwiki_train = dtwiki.iloc[trainassign==1]
        dtwiki_test = dtwiki.iloc[trainassign==0]
        
        fit_models = [m.fit(dtwiki_train,floc=0) for m in models]
        
        ks_tests = [stats.kstest(dtwiki_test['visiblelength'],mod.name,fit) for mod,fit in zip(models,fit_models)]
        
        logliks = []
        AICs = []
        BICs = []
        for mod,fit in zip(models,fit_models):
            print(fit)
            loglik = mod.logpdf(dtwiki_test,*fit).sum()
            logliks.append(loglik)
            AICs.append(2 * dfs[mod.name] - 2 * loglik)
            BICs.append(np.log(len(dtwiki_test)) * dfs[mod.name] - 2 * loglik)
            
        print(logliks)
        logliks = np.array(logliks)
            
        print(logliks)
        lrs = []
            
            # these are chi-square tests with appropriate df
        lrtests = {}
        for pair in list(it.combinations(zip(models,logliks),2)):
            df = dfs[pair[0][0].name] - dfs[pair[1][0].name]
            if df < 0: 
                df = -df
                a = pair[1]
                b = pair[0]
            else:
                a = pair[0]
                b = pair[1]
            k = "{0}:{1}".format(a[0].name, b[0].name)
            # if lr > 0 then a is more likely than b
            lr = np.exp(a[1] - b[1])
            lrs.append((k,lr))
            pvalue = stats.chi2.sf(-2*(a[1] - b[1]),df=1+df)
            # it's the p-value that 
            lrtests[k] = pvalue
        print(len(models))
        print(len(ks_tests))
        res[wiki] = [pd.DataFrame({"model" : [mod.name for mod in models],
                                   "params" : fit_models,
                                   "ks" : [ks.statistic for ks in ks_tests],
                                   "ks_pvalue" : [ks.pvalue for ks in ks_tests],
                                   "loglik" : logliks,
                                   "AIC" : AICs,
                                   "BIC" : BICs}),
                     lrtests]
    return res

for r in res.keys():
    res[r][0]['wiki'] = r

table = pd.concat([res[r][0] for r in res.keys()]).reset_index()
table = table[[8,6,1,2,3,4,5,7]].reset_index()

gb = table.groupby(['wiki'])['AIC','BIC','ks'].rank().reset_index()
gb['AIC_rank'] = gb['AIC']
gb['BIC_rank'] = gb['BIC']
gb['ks_rank'] = gb['ks']
gb = gb.drop(['AIC','BIC','ks'],1)

table = table.merge(gb, on='index')
table.groupby(['model'])['AIC_rank','BIC_rank','ks_rank'].agg(['mean','median'])
table['ks_pass95'] = table['ks_pvalue'] >=0.05
table['ks_pass97_5'] = table['ks_pvalue'] >=0.025
table.groupby(['model'])['AIC_rank','BIC_rank','ks_rank','ks_pvalue'].agg(['mean','median'])

As a final step in model selection process. I made a series of plots that visualize the distribution fit. For example, here is the plot for English Wikipedia.

This chart compares the Exponentiated Weibull model to the empirical distribution of the data. We can see clearly that the model is a good fit to the right side of the distribution, where most of the data is. However, there is a long tail where the model does less well. This 4 panel chart shows 4 ways of visualizing fit of an Exponentiated Weibull distribution on page dwell time data from English Wikipedia. The top two plots compare the pdf (probability plot) and quantiles (qq) plot of the model to the empirical distribution. The bottom left shows the pdf of the model with a histogram of the data and the bottom right shows the cdf of the model with a cumulative histogram.