ECPR

Combining Data from Different Sources: Different Techniques, Different Worlds

Course Dates and Times

Friday 26 February: 13:00-15:00 and 15:30-17:00
Saturday 27 February: 09:30-11:30 and 12:30-14:30
7.5 hours over two days

Susanne Rässler

susanne.raessler@uni-bamberg.de

University of Bamberg

Combining data from different sources means building a complete data file from separate files that contain the same units (e.g. record linkage), contain different units (e.g. data fusion), or overlap only in part. In each case, statistical or probabilistic matching or imputation techniques may be used. The course therefore begins by defining these terms. The main focus is then on data fusion. Traditionally, data fusion is based on the variables common to all files. It is well known that such approaches impose conditional independence, given the common variables, on the specific variables that are never jointly observed, although these variables may be conditionally dependent in reality. Hence we treat data fusion as a problem of missing data by design and suggest multiply imputing the specific variables, using informative prior information to account for violations of the conditional independence assumption. Moreover, four levels of validity are introduced, and simulation studies as well as practical fusion and matching projects are presented as showcases.


Instructor Bio

Susanne Rässler (PhD in statistics, Habilitation in statistics and econometrics) is full Professor at Otto-Friedrich-University Bamberg, where she leads the Department of Statistics and Econometrics.

Prior to that, she was head of the Competence Centre Empirical Methods of the Institute for Employment Research, and head of the Department for Product and Program Analysis of the German Federal Employment Agency.

Susanne has published books on survey sampling, statistical matching/data fusion, and prediction techniques, along with articles on a wide range of topics.

From 2007 to 2013 she was a member of the German census committee and served as a member of the German data council.

Susanne's particular research interests are

  • methods for handling missing data
  • multiple imputation
  • Bayesian methods
  • matching techniques for causal analysis
  • data fusion

Combining data from different sources means building a complete data file from separate files that contain the same units (e.g. record linkage), contain different units (e.g. data fusion), or overlap only in part. In each case, statistical or probabilistic matching or imputation techniques may be used.

On Day 1, after an introduction to the practical and organisational aspects of the course, definitions are given to overcome the “Babylonian confusion” surrounding terms such as statistical matching, data fusion, fusion, data integration, propensity score matching, single imputation and multiple imputation. It is then discussed and shown that many real-world research problems can be seen as missing data problems, and that the same holds for statistical matching. The main focus of the course is on data fusion. Traditionally, data fusion is based on the variables common to all files. It is well known that such approaches impose conditional independence, given the common variables, on the specific variables that are never jointly observed, although these variables may be conditionally dependent in reality. This implicit assumption will be illustrated by a small simulation study, from which four levels of validity are derived. These validity levels, which a data fusion procedure may or may not achieve, will be discussed in greater detail.
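The implicit conditional independence assumption can be illustrated with a few lines of code. The following is a minimal sketch and not the course's own simulation study: it assumes one common variable Z, one specific variable per file (X in file A, Y in file B), standard normal data, and nearest-neighbour matching on Z as the "traditional" fusion method. All function names and parameter values are illustrative assumptions.

```python
import bisect
import math
import random


def pearson(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def draw_units(n, rho_xz, rho_yz, resid_corr, rng):
    """Draw n units (z, x, y) with unit variances.

    resid_corr is the correlation of X and Y given Z, i.e. the part
    that a fusion based on Z alone cannot recover.
    """
    units = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        e1 = rng.gauss(0, 1)
        e2 = resid_corr * e1 + math.sqrt(1 - resid_corr ** 2) * rng.gauss(0, 1)
        x = rho_xz * z + math.sqrt(1 - rho_xz ** 2) * e1
        y = rho_yz * z + math.sqrt(1 - rho_yz ** 2) * e2
        units.append((z, x, y))
    return units


def nearest_neighbour_fusion(recipients, donors):
    """Attach to each recipient (z, x) the y of the donor (z, y) closest in z."""
    donors_sorted = sorted(donors)
    donor_z = [d[0] for d in donors_sorted]
    fused = []
    for z, x in recipients:
        i = bisect.bisect_left(donor_z, z)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(donor_z)]
        best = min(candidates, key=lambda j: abs(donor_z[j] - z))
        fused.append((x, donors_sorted[best][1]))
    return fused


# Illustrative parameter values (assumptions, not taken from the course).
RHO_XZ, RHO_YZ, RESID_CORR = 0.6, 0.6, 0.8
rng = random.Random(12345)

file_a = draw_units(2000, RHO_XZ, RHO_YZ, RESID_CORR, rng)  # y is hidden in practice
file_b = draw_units(2000, RHO_XZ, RHO_YZ, RESID_CORR, rng)  # x is hidden in practice

# Correlation of X and Y when both are observed on the same units.
true_corr = pearson([x for _, x, _ in file_a], [y for _, _, y in file_a])

# Fusion: file A keeps (z, x) and receives y from its nearest donor in file B.
fused = nearest_neighbour_fusion(
    [(z, x) for z, x, _ in file_a],
    [(z, y) for z, _, y in file_b],
)
fused_corr = pearson([x for x, _ in fused], [y for _, y in fused])

print(f"correlation in complete data:   {true_corr:.3f}")
print(f"correlation in fused file:      {fused_corr:.3f}")
print(f"value implied by cond. indep.:  {RHO_XZ * RHO_YZ:.3f}")
```

With these settings the population correlation of X and Y is 0.36 + 0.8 · 0.64 = 0.872, whereas conditional independence given Z implies 0.6 · 0.6 = 0.36. Up to sampling noise, the fused file should reproduce the latter value, not the former, because the donor's Y carries no information about the recipient's X beyond what Z explains.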

On Day 2, the identification problem inherent in data fusion is shown, and bounds for the inestimable correlations between the variables not jointly observed are proposed and carefully explained. Since we treat data fusion as a problem of missing data by design, we suggest multiply imputing the specific variables, using informative prior information to account for violations of the conditional independence assumption. A short introduction to the theory of multiple imputation will also be given. For the third level of validity, a quality measure is proposed, and simulation studies will illustrate the usefulness of these new ideas. Another, shorter part of the course is devoted to propensity score matching and Rubin’s Causal Model (RCM). The similarity between the potential outcome approach of the RCM and the data fusion situation will be shown. However, there is a small but important difference between the two situations, which makes propensity score matching a perfect choice for the RCM but not a good choice for data fusion. Participants will receive guidance on which matching or imputation technique to use in which situation. Finally, a “virtual”, i.e. anonymised, data fusion project will be discussed with the participants, pointing out many practical problems that may occur in data fusion projects. On Saturday afternoon, participants will apply the advice from the virtual project themselves, working hands-on with real-world data in the computer lab. After the matching project in the lab is finished, another real-world project on the empirical application of the RCM is presented as a showcase.
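The idea of bounds for an inestimable correlation can be made concrete in the simplest textbook case. The sketch below assumes a single common variable Z and one specific variable per file; the bounds taught in the course may be derived for a more general setting. It relies only on the fact that the 3×3 correlation matrix of (X, Y, Z) must be positive semi-definite, which restricts the unobserved correlation of X and Y to an interval centred on the value implied by conditional independence. The function names are illustrative assumptions.

```python
import math


def correlation_bounds(rho_xz, rho_yz):
    """Admissible range of corr(X, Y) given corr(X, Z) and corr(Y, Z).

    The midpoint rho_xz * rho_yz is the value implied by conditional
    independence of X and Y given Z.
    """
    centre = rho_xz * rho_yz
    half_width = math.sqrt((1 - rho_xz ** 2) * (1 - rho_yz ** 2))
    return centre - half_width, centre + half_width


def correlation_determinant(rho_xz, rho_yz, rho_xy):
    """Determinant of the 3x3 correlation matrix of (X, Y, Z).

    It is non-negative exactly when rho_xy lies inside the bounds.
    """
    return (1 + 2 * rho_xy * rho_xz * rho_yz
            - rho_xy ** 2 - rho_xz ** 2 - rho_yz ** 2)


# Illustrative values (assumptions): both specific variables correlate 0.6 with Z.
lower, upper = correlation_bounds(0.6, 0.6)
print(f"corr(X, Y) must lie in [{lower:.2f}, {upper:.2f}]")
print(f"determinant at lower bound: {correlation_determinant(0.6, 0.6, lower):.2e}")
```

The stronger the common variable predicts X and Y, the narrower the interval becomes. With weak common variables the data say almost nothing about the joint relationship, which is the identification problem in a nutshell.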

Little specific prior knowledge is required. Participants should only have:

  • Experience with inferential statistics
  • Basic knowledge of the linear regression model
  • Basic understanding of matrix algebra

Furthermore, the participants should simply be willing to reflect and talk about their research questions.

Day: Friday
Topics:
  • Babylonian confusion
  • Record linkage, data fusion and statistical matching
  • Statistical matching as a special case of missing data
  • Traditional approaches to data fusion
  • Implicit assumption: conditional independence
Details: See course description

Day: Saturday morning
Topics:
  • Identification problem of data fusion
  • Possible solution: using bounds
  • Measuring quality of data fusion
  • Simulation studies
Details: See course description

Day: Saturday afternoon
Topics:
  • Rubin’s Causal Model and propensity score matching
  • Praxis I: virtual data fusion project
  • Praxis II: real world data projects
  • Conclusions and outlook
Details: See course description

Day: Friday
Reading: Rässler, S. (2002). Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches. Lecture Notes in Statistics 168, Springer, New York. Chapters 1, 2 and 3.

Day: Saturday morning
Reading: Rässler, S. (2004). Data Fusion: Identification Problems, Validity, and Multiple Imputation, Austrian Journal of Statistics, 33, 153-171.

Day: Saturday afternoon
Reading: Rosenbaum, P.R., Rubin, D.B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70, 41-55.

Software Requirements

For Saturday afternoon, a computer lab is needed with software such as R; SPSS would also do.

Hardware Requirements

None

Literature


D’Orazio, M., Di Zio, M., Scanu, M. (2006). Statistical Matching – Theory and Practice, Wiley, Chichester.

Kadane, J.B. (1978). Some Statistical Problems in Merging Data Files, 1978 Compendium of Tax Research, U.S. Department of the Treasury, 159-171. Reprinted 2001, Journal of Official Statistics, 17, 423-433.

Little, R.J.A., Rubin, D.B. (2002). Statistical Analysis with Missing Data. Wiley, Hoboken, New Jersey.

Moriarity, C., Scheuren, F. (2001). Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure, Journal of Official Statistics, 17, 407-422.

Moriarity, C., Scheuren, F. (2003). A Note on Rubin's Statistical Matching Using File Concatenation With Adjusted Weights and Multiple Imputation, Journal of Business and Economic Statistics, 21, 65-73.

Rodgers, W.L. (1984). An Evaluation of Statistical Matching, Journal of Business and Economic Statistics, 2, 91-102.

Rubin, D.B. (1974). Characterizing the Estimation of Parameters in Incomplete-Data Problems, Journal of the American Statistical Association, 69, 467-474.

Rubin, D.B., Thayer, D. (1978). Relating Tests Given To Different Samples, Psychometrika, 43, 3-10.

Rubin, D.B. (1996). Multiple Imputation After 18+ Years, Journal of the American Statistical Association, 91, 473-489.

Rubin, D.B. (1986). Statistical Matching Using File Concatenation With Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, 4, 87-95.

Recommended Courses to Cover After this One

Winter School WB106. Causal Inference for political and social sciences