Spotlight on Economics: Are appropriate statistical models used in empirical analysis with hierarchical data?

Farm Forum

With increasing computing power and cloud storage, we now have the luxury of big data for linking economic issues with empirical analysis on a daily basis.

With big data comes the big responsibility of matching appropriate statistical models to the structure of the data. Big data commonly has a hierarchical, or nested, structure.

What is hierarchical or nested data?

As the word suggests, a hierarchy is a system or organization in which data, people or groups are ranked one above the other according to location, status or authority. In the North Dakota data, for example, counties are part of a crop reporting district (CRD), the crop reporting districts are part of the state, and North Dakota is one of the 50 states in the U.S.

In North Dakota, we have nine crop reporting districts, with several counties within each crop reporting district. For example, the West Central crop reporting district has five counties: Dunn, McKenzie, McLean, Mercer and Oliver. The East Central crop reporting district also has five counties: Barnes, Cass, Griggs, Steele and Traill.

Dunn and Oliver counties are nested within the West Central crop reporting district. Based on the 2012 Census of Agriculture, Dunn has the most farms (628) and Oliver the fewest (290) in this district. Barnes and Traill counties, which are nested within the East Central crop reporting district, have the most (855) and fewest (468) farms, respectively. Finally, each farm has data through time.
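The nesting described above can be written down as a small data structure. This is only an illustrative sketch: the layout and the names `north_dakota` and `flatten` are assumptions made for this example, and counties whose farm counts are not given here are left empty rather than filled in.

```python
# Nesting from the column: state -> crop reporting district (CRD) -> county.
# Farm counts are the 2012 Census of Agriculture figures quoted in the text;
# counties whose counts the text does not give are stored as None.
north_dakota = {
    "West Central": {"Dunn": 628, "McKenzie": None, "McLean": None,
                     "Mercer": None, "Oliver": 290},
    "East Central": {"Barnes": 855, "Cass": None, "Griggs": None,
                     "Steele": None, "Traill": 468},
}


def flatten(state_name, districts):
    """Return one row per county, keeping every level of the hierarchy as a field."""
    rows = []
    for crd, counties in districts.items():
        for county, farms in counties.items():
            rows.append({"state": state_name, "crd": crd,
                         "county": county, "farms": farms})
    return rows


rows = flatten("North Dakota", north_dakota)
known = [r for r in rows if r["farms"] is not None]
for r in known:
    print(r["state"], "|", r["crd"], "|", r["county"], "|", r["farms"])
```

Keeping the district and state labels on every county row is what lets a statistical model later recognize which observations belong together.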

Under what context do we observe hierarchical or nested data?

If the producer would like to evaluate changes in yields, income, cost or profits, considering information from his or her own farm is OK. However, if the issue is broader – to evaluate the importance of a new insurance policy or changes in one of the crop insurance provisions such as coverage level – the analysis should take into account not only the producer’s own information but information from other producers within a county, crop reporting district and state in that hierarchical order.

For example, in crop insurance, the data is hierarchically structured with policies (catastrophic vs. buy-up, and within buy-up, yield vs. revenue) nesting the coverage level (55 to 85 percent) and unit structure (whole farm, enterprise, basic or optional). However, the appropriate statistical procedure seldom is used.

What is the need to link the appropriate statistical model with hierarchical or nested data?

If hierarchical or nested farm data is not analyzed with an appropriate statistical procedure, the spatial random variation across the levels of the hierarchy is not accounted for. Traditional statistical models then give inaccurate results: Estimates can be too low or too high, and the analysis can indicate a relationship where none is present, or miss one that is.
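A small simulation can illustrate the "relationship when it is not present" problem. This is a sketch on made-up data, not an analysis of real yields: the nine districts, 30 observations per district, and the variance settings are all assumptions chosen for illustration. The outcome has no true link to the predictor, yet a regression that ignores the shared district-level variation flags a "significant" link far more often than the nominal 5 percent.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_once(n_districts=9, n_per=30, district_sd=2.0, noise_sd=1.0):
    """Simulate one nested data set with NO true effect of x on y.

    Returns the slope and its naive standard error from ordinary least
    squares, which treats all observations as independent.
    """
    x_district = rng.normal(size=n_districts)            # predictor varies by district only
    u = rng.normal(scale=district_sd, size=n_districts)  # shared district-level variation
    x = np.repeat(x_district, n_per)
    y = np.repeat(u, n_per) + rng.normal(scale=noise_sd, size=n_districts * n_per)

    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    naive_se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1], naive_se


results = np.array([simulate_once() for _ in range(2000)])
slopes, naive_se = results[:, 0], results[:, 1]

# Share of simulations in which the naive test "finds" a relationship at the 5% level
false_positive_rate = np.mean(np.abs(slopes / naive_se) > 1.96)
print("false positive rate:", false_positive_rate)
print("actual spread of slopes:", slopes.std())
print("average naive standard error:", naive_se.mean())
```

The naive standard error is much smaller than the actual spread of the estimated slopes, which is why the test rejects too often.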

Most empirical research does not take into account the hierarchical structure of the data, even in the presence of spatial random variation across farms within a county, crop reporting district and state. Hierarchical linear model (HLM) statistical procedures should be used to account for this structure.
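For readers who want to see the idea in symbols, a generic textbook form of a hierarchical linear model with random intercepts is sketched below. It is an assumed illustration, not necessarily the specification used in any particular study. Here y is the outcome (for example, yield) for unit i in county j of district k, x is an explanatory variable, and all symbols are introduced only for this example.

```latex
y_{ijk} = \beta_0 + \beta_1 x_{ijk} + v_k + u_{jk} + e_{ijk},
\qquad
v_k \sim N(0, \sigma_v^2), \quad
u_{jk} \sim N(0, \sigma_u^2), \quad
e_{ijk} \sim N(0, \sigma_e^2)
```

The terms v_k and u_{jk} capture variation shared within a district and within a county, respectively. A traditional regression corresponds to the special case in which both of those variances are zero.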

HLM procedures are used widely in educational, social, behavioral and health research to account for hierarchical structure in the data. However, they never are considered for modeling or analyzing farm-level crop yields, even though farms are nested within a county, CRD and state.

A recent paper I wrote, “Hierarchical Crop Yield Linear Model,” investigates which statistical procedure, the traditional statistical procedures or the hierarchical linear model, accounts for the observed spatial random variation in crop yields. The procedure is identified using alternative statistical tests (the Akaike information criterion, a covariance test of the spatial random variation and out-of-sample performance using a holdout sample). An empirical application to U.S. county yields of 20 crops grown across 48 states from 1957 to 2013 suggests the need to account for spatial random variation based on the multi-level hierarchy.
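As a brief illustration of how the Akaike information criterion ranks competing models, the sketch below applies the standard formula, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The log-likelihoods and parameter counts are hypothetical numbers invented for this example; they are not results from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fitted values, for illustration only
candidates = {
    "traditional model": aic(log_likelihood=-1250.0, n_params=3),
    "hierarchical model": aic(log_likelihood=-1180.0, n_params=5),
}

best = min(candidates, key=candidates.get)
for name, value in candidates.items():
    print(name, value)
print("preferred:", best)
```

The criterion penalizes the extra parameters of the hierarchical model, so it is preferred only when the gain in fit outweighs that penalty, as it does with these made-up numbers.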

The 20 crops evaluated were: barley, dry edible beans, corn, upland cotton, flaxseed, alfalfa hay, all hay, oats, peanuts, potatoes, rice, rye, sorghum, soybeans, sugar beets, sunflower, all wheat, spring wheat, durum and winter wheat. Statistical tests revealed the presence of spatial random variation across counties nested within a crop reporting district and state. In addition, the statistical tests suggest the use of a three-way hierarchical linear model (HLM3) that considers the commonalities that arise because of the hierarchical structure of counties within a crop reporting district and state.

The results from this paper suggest that identifying, evaluating and using hierarchical linear models is important. This holds whenever nested or hierarchical data is available and spatial random variation is present across farms within a county, crop reporting district and state.