## Random parameter models with nonlinear functional form and heterogeneous overdispersion

##### Abstract

A vehicle crash results from the combined effect of many factors, including those related to the driver, the vehicle, roadway geometry, and environmental conditions. However, existing databases cover only a small share of the many elements that influence crash outcomes. The missing information becomes unobserved heterogeneity. Without an appropriate statistical modeling approach, unobserved heterogeneity can cause serious specification bias, leading to erroneous inference and predictions. Thus, finding proper methods to accommodate unobserved heterogeneity is one of the major challenges in vehicle crash analysis.
Recent studies have demonstrated the methodological advantages of random parameter negative binomial (RPNB) models in capturing unobserved heterogeneity. By allowing the mean function of the crash frequency to vary through certain distributional assumptions, the random parameter approach reflects the heterogeneous effect of a given roadway element on crash outcomes across observations. In particular, this approach has been very useful in quantifying the effect of roadway geometrics on crash frequencies via the count model structure. The negative binomial (NB) model, which includes the well-known overdispersion parameter, is one such structure and has been used widely in crash frequency analysis. While numerous studies have explored the utility of the random parameter approach, all of the published studies have assumed constant overdispersion in random parameter frameworks. Where overdispersion has been allowed to vary as a function of observational attributes, the mean function has been modeled using a fixed-parameter NB specification. The statistical superiority of the random parameter framework relative to the fixed-parameter NB model with variable overdispersion has already been established in the published literature. Constraining residual heterogeneity to be the same through a fixed overdispersion parameter implies that unobserved effects not captured via the mean function are the same across observations. This is a highly restrictive assumption. For example, two locations with identical geometric attributes may have unobserved effects that are only partly revealed through a random parameter in the mean function, so the residual heterogeneity may still vary across observations. Thus, a more comprehensive and unrestricted treatment of heterogeneity, one that allows both the mean function parameters and the overdispersion parameter to vary across observations simultaneously, is potentially beneficial in minimizing the impact of unobserved heterogeneity.
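The difference between a fixed and an observation-specific overdispersion parameter can be illustrated with a short sketch. It is not the estimation procedure used in this dissertation: it assumes the common NB2 parameterization, in which the variance is mu + alpha * mu^2, and the function names and all numerical values are hypothetical.

```python
import math


def nb2_logpmf(y, mu, alpha):
    """Log-probability of count y under an NB2 model.

    Assumption: NB2 parameterization, Var(y) = mu + alpha * mu**2.
    """
    r = 1.0 / alpha  # inverse of the overdispersion parameter
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def nb2_loglik(counts, means, alphas):
    """Sum of log-probabilities, with one alpha per observation."""
    return sum(nb2_logpmf(y, m, a) for y, m, a in zip(counts, means, alphas))


# Hypothetical crash counts and fitted means for four segments.
counts = [0, 3, 7, 1]
means = [1.0, 2.5, 4.0, 1.5]

# Fixed overdispersion: every segment shares the same alpha.
ll_fixed = nb2_loglik(counts, means, [0.5] * len(counts))

# Heterogeneous overdispersion: alpha differs by segment.
ll_hetero = nb2_loglik(counts, means, [0.2, 0.5, 0.9, 0.3])
```

The fixed-dispersion case is simply the special case in which every entry of `alphas` is identical, which is why it can be viewed as a restriction of the heterogeneous specification.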
An alternative view of unobserved heterogeneity is that functional form assumptions may contribute in part to unobserved effects. Accounting for the proper functional form may therefore capture the types of heterogeneity that arise from nonlinearities and from the variation of nonlinearities over time (temporal instability of nonlinearities). In such a case, a linear mean function may be too restrictive when the effect of a particular roadway variable is in fact highly nonlinear. A priori assumptions about the functional form of geometric variables are hard to justify, since no prior theory is available to guide the definition of the nonlinearity. Therefore, an empirical search for the functional form is a necessity in traffic safety analysis. From the functional form perspective, the hypothesis is that a proper functional form for the mean function, combined with a variable overdispersion parameter, can cover unobserved heterogeneity well enough to provide a specification that is as plausible as the traditional random parameter framework.
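As an illustration of what an empirical search over functional forms involves, the sketch below builds the candidate transformations used in the fractional polynomial approach. It assumes the conventional power set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where power 0 denotes the natural logarithm; the function names are hypothetical, and the model-fitting step that would compare the candidates is omitted.

```python
import math

# Conventional fractional polynomial power set (assumed here); 0 means log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Single fractional polynomial term: x**p, or log(x) when p == 0."""
    if x <= 0:
        raise ValueError("fractional polynomial terms require x > 0")
    return math.log(x) if p == 0 else x ** p


def fp2_terms(x, p1, p2):
    """Second-degree basis; a repeated power adds a log(x) multiplier."""
    first = fp_term(x, p1)
    if p1 == p2:
        return first, first * math.log(x)
    return first, fp_term(x, p2)


def candidate_fp2_pairs():
    """All unordered power pairs, including repeated powers."""
    return [
        (p1, p2)
        for i, p1 in enumerate(FP_POWERS)
        for p2 in FP_POWERS[i:]
    ]


# Each pair defines one candidate functional form for a covariate.
pairs = candidate_fp2_pairs()
```

In a full search, each candidate pair would be fitted and the forms compared on a fit criterion such as the deviance, so the data rather than a prior assumption select the shape of the relationship.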
Finally, one needs to consider the effects of temporal correlation in crash frequency analysis. Using multiple years of data for a given site induces temporal correlation. Panel forms of the RPNB are available and have been used to tackle this issue. The random parameter negative binomial panel model (RPNB-panel) with a fixed overdispersion parameter is therefore a reasonable baseline for both of the hypotheses presented above, and the proposed alternative heterogeneity models are evaluated against it. This dissertation therefore targets two methodological developments in crash frequency analysis: a) a multivariable fractional polynomial copula (MFP-Copula) framework that can address heterogeneity and temporal correlation through a functional form perspective; and b) a heterogeneous random parameter NB panel model that addresses heterogeneity and temporal correlation through a random parameter mean function coupled with a variable overdispersion parameter using a panel framework.
As the first original contribution, the MFP-NB-copula fills a gap in the extant literature: conventional MFP approaches cannot exploit intra-observation correlation, which can lead to biased and inefficient parameter estimates and outcome predictions in a crash frequency panel setting. The second original contribution is the HRPNB-panel model, which allows for random parameters and an overdispersion parameter that varies across segments simultaneously. Further, the overdispersion parameter is modeled with a log-linear specification that is nonlinear in the variables, as opposed to the log-linear specification with a linear-in-variables form usually found in heterogeneous dispersion parameter models.
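The contrast between the two dispersion specifications can be sketched as follows. The abstract does not state which nonlinear transformation is used, so the power transform below is an assumption made purely for illustration, as are the function names and coefficient values.

```python
import math


def alpha_linear(x, g0, g1):
    """Log-linear dispersion, linear in the variable: ln(alpha) = g0 + g1 * x."""
    return math.exp(g0 + g1 * x)


def alpha_nonlinear(x, g0, g1, power):
    """Log-linear dispersion, nonlinear in the variable.

    Assumption: the covariate enters through a power transform,
    ln(alpha) = g0 + g1 * x**power, with power == 0 meaning log(x).
    """
    z = math.log(x) if power == 0 else x ** power
    return math.exp(g0 + g1 * z)


# Hypothetical segment covariate (must be positive for the transform).
x = 4.0
a_lin = alpha_linear(x, g0=-1.0, g1=0.1)
a_sqrt = alpha_nonlinear(x, g0=-1.0, g1=0.1, power=0.5)
```

Both forms keep the overdispersion parameter strictly positive through the exponential, and the linear-in-variables form is recovered when `power` equals 1.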
The study area covers the entire Washington State Interstate system, with crash counts, geometric information, and traffic volume information for two years: 2014 and 2015. Seven Interstates with comprehensive information (I-5, I-82, I-90, I-182, I-205, I-405, and I-705) were evaluated, covering a total of 763.83 centerline miles. Fixed-length segmentation into 1-mile segments yields 763 segments and a balanced 2-year panel of 1,526 observations. A total of 338 initial variables were identified from the Washington State Department of Transportation (WSDOT) database; they represent detailed roadway geometry and traffic volume information within each segment. These include horizontal curvature, vertical curvature, roadway width and number of lanes, travel shoulder width, pavement material type, and segment location information such as route number, county number, and urban versus rural setting.
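The structure of the balanced panel described above can be sketched in a few lines. The segment identifiers are hypothetical placeholders; only the segment count and the two study years come from the text.

```python
from itertools import product

N_SEGMENTS = 763      # 1-mile segments reported in the study
YEARS = (2014, 2015)  # study period

# One row per (segment, year); segment ids 1..763 are placeholders.
panel_index = list(product(range(1, N_SEGMENTS + 1), YEARS))


def is_balanced(index, years):
    """True when every segment is observed in every study year."""
    observed = {}
    for segment, year in index:
        observed.setdefault(segment, set()).add(year)
    return all(seen == set(years) for seen in observed.values())


n_observations = len(panel_index)
balanced = is_balanced(panel_index, YEARS)
```

A balanced layout of this kind, with exactly two observations per segment, is what the panel and copula formulations rely on to model correlation between the years for the same segment.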
The two newly proposed methodological approaches are compared to the widely used random parameter negative binomial panel model (RPNB-panel). The empirical results suggest that the model fit of the MFP-NB-copula is substantially better than that of the MFP-NB, which ignores intra-observation correlations. However, the MFP-NB panel model is still marginally inferior to the RPNB-panel model. The HRPNB-panel model shows a significant improvement in statistical fit relative to the conventional fixed-dispersion RPNB-panel model. The standard deviations of the random parameters appear to decrease in the heterogeneous RPNB-panel model compared to the fixed-dispersion RPNB-panel model. This suggests that allowing the overdispersion effect to vary across segments may decrease the “spread” and increase the “peakedness” of the distributions of the random parameters influencing the mean crash frequency. This is consistent with the expectation that constraining the overdispersion parameter to be fixed across segments is restrictive in that it masks the true nature of the random parameter distributions. The results indicate that the HRPNB-panel model is a more effective tool for capturing the effects of geometric features in a comprehensive manner.
Several advancements can build on this dissertation as a methodological basis. For example, spatial correlation is not accounted for in the proposed frameworks, although it has been shown to be a significant issue in count models of crash frequency. Incorporating it would be a significant improvement over the models presented in this dissertation. The models in this dissertation address fundamental questions about how unobserved heterogeneity can be approached as a specification issue, since the published literature to date has not explicitly addressed the specification effects of an incorrect functional form on both the mean function and the overdispersion parameter.