Combining large data sets

Presenter(s)

Kilian Seng

Zeppelin University Friedrichshafen

Author(s)

Kilian Seng

Zeppelin University Friedrichshafen

Panel

Open Panel

Abstract

The variables needed to analyze a question of interest are often not all available within a single data set. It can therefore be useful to combine panel data sets from different sources. One method for doing so is two-stage auxiliary instrumental variables estimation (2SAIV), which was proposed by Franklin in 1989 but has rarely been applied since. Its central requirement is that the different samples are drawn from the same population. Most applications use up to ten standard demographic variables common to both data sets. Depending on the variable to be estimated, this might not be enough. Especially when survey data on attitudes have to be combined with other data, the results might be poor, leading to low relevance of the instrument. One source of complication is multicategorical variables with many categories. Here there is a trade-off, when trying to obtain good estimates, between more, smaller categories containing specific information and fewer, larger categories containing less precise information. A typical example for micro data is the use of job classification schemes (ISCO), which carry very specific information in up to several thousand categories, as a first-stage regressor. For illustration I use panel data on employees in Germany and the USA from the SOEP and the PSID, which are combined with attitudinal data about unemployment from the ISSP. In the case of Germany, I use up to 18 variables that exist in both the SOEP and the ISSP to generate an instrument for worry about job loss. As a test of the relevance of the instrument, the estimate can be compared to a similar question that already exists in the SOEP (but not in the PSID).
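The two-step idea described above can be illustrated with a minimal sketch on synthetic data. It is not Franklin's exact estimator and does not use the SOEP, PSID or ISSP: all variable names, sample sizes and coefficients are assumptions made for illustration, and the second-stage standard errors would still need a correction for the generated regressor. The attitudinal variable is regressed on the common variables in the auxiliary sample, the fitted coefficients are used to predict it in the main sample, and the prediction enters the outcome regression.

```python
import numpy as np

def fit_ols(X, y):
    """OLS with intercept; returns the coefficient vector (intercept first)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict(beta, X):
    """Linear prediction from coefficients returned by fit_ols."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

rng = np.random.default_rng(0)

# Hypothetical setup: 5 variables common to both samples, same population.
gamma = np.array([0.8, -0.5, 0.3, 0.6, -0.4])  # assumed true first-stage effects
n_aux, n_main = 2000, 3000

# Auxiliary sample (attitude survey): common variables and the attitude item.
Z_aux = rng.normal(size=(n_aux, gamma.size))
worry_aux = Z_aux @ gamma + rng.normal(size=n_aux)

# Main sample (panel): common variables and an outcome; the attitude item is
# simulated here only so that the relevance check below is possible.
Z_main = rng.normal(size=(n_main, gamma.size))
worry_main = Z_main @ gamma + rng.normal(size=n_main)
outcome = 2.0 + 0.5 * worry_main + rng.normal(size=n_main)

# Stage 1: estimate the attitude equation in the auxiliary sample.
first_stage = fit_ols(Z_aux, worry_aux)

# Predict the attitude in the main sample from the common variables.
worry_hat = predict(first_stage, Z_main)

# Stage 2: use the prediction as regressor in the main sample.
second_stage = fit_ols(worry_hat.reshape(-1, 1), outcome)
effect_estimate = second_stage[1]

# Relevance check: correlate the prediction with the comparable item that is
# observed in the main sample.
relevance_corr = np.corrcoef(worry_hat, worry_main)[0, 1]

print(f"second-stage effect: {effect_estimate:.3f}")
print(f"correlation of prediction with observed item: {relevance_corr:.3f}")
```

With only a few weak common variables the correlation in the last step drops, which is the low-relevance problem noted above; adding many dummy columns for a fine-grained classification raises the first-stage fit but leaves few observations per category.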
