ECPR


Effective Data Management with R

Course Dates and Times

Thursday 27 - Saturday 29 July

10:00-12:00 and 14:00-17:00

Please see Timetable for full details.

Constantin Manuel Bosancianu

manuel.bosancianu@outlook.com

Central European University

The course targets participants who need an intensive introduction to the most commonly used and versatile R packages for data management. Taking an applied approach, we deal with five stages encountered in most data analysis projects: a) reading in data, b) cleaning, checking and recoding variables, c) summarizing and aggregating data, d) reshaping data, and e) merging/matching multiple data sources. Throughout the three days, I introduce a set of effective and powerful R packages for data management, such as data.table, stringr, dplyr, tidyr, and rio. The focus will constantly be on making data cleaning and preparation as efficient as possible, with minimal errors along the way. Special attention will be devoted to handling very large data sets in R, as well as to cleaning and recoding less common types of data, such as string variables, time stamps, tweets or geographical data.


Instructor Bio

Constantin Manuel Bosancianu is a postdoctoral researcher in the Institutions and Political Inequality unit at Wissenschaftszentrum Berlin.

His work focuses on the intersection of political economy and electoral behaviour: how to measure political inequalities between citizens of developed and developing countries, and what links political and economic inequalities.

He is interested in statistics, data visualisation, and the history of Leftist parties. Occasionally, he teaches methods workshops on regression, multilevel modelling, or R.

@cmbosancianu

It's impossible to put a precise number on it, but it's a safe bet that somewhere between 40% and 60% of the work in most data analysis projects is devoted to mundane tasks of data management. In some cases involving custom data sets obtained through Internet scraping, this share can easily grow to 70–80%. The goal of this course is to present a set of commonly used and effective R packages that minimize the hassles and errors that invariably appear in the course of processing data. By the end of the class, participants should be comfortable reading data into R in a variety of formats, cleaning and recoding such data, summarizing and aggregating it, and merging it with other data sources. Although these might seem like simple goals for a three-day class, the multitude of data types that can be encountered in the course of an analysis means that they are anything but simple. String variables (e.g. electoral district names), timestamps (e.g. from Twitter posts) and geographic coordinates (e.g. placement of polling stations) frequently have to be merged with administrative data that can come in a variety of formats, such as Excel, SQL or JSON. This course offers you a set of tools that can help you with these tasks, and others like them.

In the course, we follow the typical structure of most data preparation projects: a) reading data into R, b) cleaning, checking and recoding the variables needed for analysis, c) summarizing and aggregating the data, d) reshaping data based on our specific needs, and e) merging/matching the data with other data sources. For each task, we introduce and test a series of packages that can assist the user. To begin with, we go over some of the most common types of data sources that users have to input into R. The focus is less on common types of statistical data, e.g. .SAV or .POR (for SPSS) or .DTA (for Stata). Rather, we concentrate on typical formats encountered on the Internet and produced by government agencies or organizations: .CSV and .XLSX (Excel), .JSON, .HTML, .MDF (SQL database), and even .PDF. In the course of this section, I demonstrate the capabilities of the rio package, along with foreign, readstata13, readxl, XML, rjson, RMySQL, and googlesheets.
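As a brief illustration of what this one-function workflow looks like (the file name and values here are invented, and the sketch assumes the rio package is installed):

```r
library(rio)

# Write a small CSV to disk first, so the example is self-contained
write.csv(data.frame(id = 1:3, turnout = c(62, 71, 58)),
          "example.csv", row.names = FALSE)

# rio::import() infers the format from the file extension;
# the same call would work for .xlsx, .dta, .sav, .json, ...
dat <- import("example.csv")
nrow(dat)   # 3
```

The appeal of rio is precisely that the call does not change when the input format does.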

In the second part of the class, which is also frequently the most time-consuming in an applied project, we deal with tasks of cleaning, checking and recoding variables. Data that has already been partially cleaned and validated for us, as encountered in SPSS or Stata data sets, will pose few problems. In most other situations, we are dealing with freshly collected information, which requires heavy cleaning and checking. I cover here the most common problems that appear in text data (tweets, electoral district names, or city names), along with handling time stamps that appear on web pages, Facebook posts or tweets. In the course of this work, I rely on the stringr, lubridate, and zoo packages. Following this, I focus on how to clean and validate variables that will be used in one's analyses. The section concludes with how to restructure data based on the principles of "tidy data", so as to render it in a format suitable for most R analysis and graphical functions. For these tasks I demo the functions available in the janitor, validate and tidyr packages for R.
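A small sketch of this kind of cleaning, with invented district names and timestamps (it assumes the stringr and lubridate packages are installed):

```r
library(stringr)
library(lubridate)

# Invented messy district names and character timestamps
districts <- c("  bucharest-1 ", "CLUJ-NAPOCA", "iasi ")
stamps    <- c("2017-07-27 10:00:00", "2017-07-28 14:30:00")

# Trim stray whitespace and harmonise capitalisation
clean <- str_to_title(str_trim(districts))

# Parse the strings into proper date-time objects
when <- ymd_hms(stamps)
hour(when[2])   # 14
```

Once timestamps are real date-time objects rather than strings, sorting, differencing and extracting components (hour, weekday, month) all become one-line operations.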

In the third part of the class we focus on a slightly smaller set of topics. We start with how to use R for the analysis of very large data sets, for which the most common R functions are very slow. I introduce in this section the data.table package, which considerably speeds up some data management tasks. We continue with a few common ways of summarizing and aggregating data, with the help of the dplyr and plyr packages. These are frequently needed when checking and recoding time-series cross-section or pooled cross-sectional data sets. data.table and plyr will also prove useful when merging different data sets, for the purpose of adding new variables to our analyses. Finally, at the end of the course, we look at a few ways in which the reshape2 package can help us with data reshaping, such as converting between "long" and "wide" data formats.
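To give a flavour of the aggregation step, here is a minimal sketch with invented country-year turnout figures (it assumes dplyr is installed):

```r
library(dplyr)

# Invented pooled cross-sectional data: country-year turnout
elections <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2004, 2000, 2004),
  turnout = c(60, 64, 70, 68)
)

# Collapse to one row per country: mean turnout across years
by_country <- elections %>%
  group_by(country) %>%
  summarise(mean_turnout = mean(turnout))
by_country
```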

This course is certainly not a statistics course, in the sense that we do not cover any statistical concepts or methods. Rather, the point of the course is to make R a friendlier work environment for the part of the research project that comes before statistics enters the scene: preparing the data for analysis. Due to this focus, participants who are looking for an R-based introduction to statistics should opt for an alternative class. Nor can the class be considered an introductory R course: I do not cover the object-based logic of R, the different types of objects R works with, or how R can be used for basic statistical analyses and graphs. The focus will constantly be on data cleaning, recoding, reshaping, and merging.

 

Participants should possess introductory-level knowledge of the R statistical environment: what types of objects it uses, how to manipulate these objects, and the basic rules of R syntax (e.g. how to use R functions, the difference between “=” and “==”).
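The distinction mentioned above can be summarised in a few lines of base R:

```r
# "<-" (or "=") assigns a value; "==" tests for equality
x <- 5      # assignment: x now holds 5
x == 5      # comparison: returns TRUE
x == 6      # comparison: returns FALSE

# Inside a function call, "=" names arguments
seq(from = 1, to = 10, by = 3)   # 1 4 7 10
```

Participants comfortable with the level of this snippet have the prerequisites covered.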

Timetable
Monday: 1) Data input in R; 2) Cleaning data

We cover reading data into R from a variety of formats: .CSV, .SAV, .DTA, .JSON, .MDF, .HTML, .PDF, .XLSX.

Once the data is in R's memory, we go over a few useful packages for data cleaning, particularly for a few types that are difficult to wrangle, such as text or time stamps.

Tuesday: 3) Data validation and recoding; 4) Handling very large data sets

We look at a few of the packages and functions through which data validation can be done in R: checking that the variables we work with have the values we would expect them to have. In case these expectations are not met, we cover how to correct the discrepancies and how to recode data for our purposes.
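A minimal base-R sketch of this kind of check (the variable and its plausible range are invented; the validate package covered in class wraps such rules in a more systematic framework):

```r
# Invented survey variable: respondent age, plausibly 18-99
age <- c(25, 34, 17, 120, 48)

# Flag values outside the expected range
ok <- age >= 18 & age <= 99
which(!ok)      # positions 3 and 4 violate the expectation

# Recode impossible values to NA instead of dropping rows
age[!ok] <- NA
```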

In the second part of the day, we focus on how to use the data.table package for working with very large data sets in R.
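As a hedged sketch of the syntax involved (the data are invented and kept tiny; with a genuinely large file, data.table's fread() reads CSVs far faster than read.csv()):

```r
library(data.table)

# Invented polling-station counts; a real workflow would start
# with something like dt <- fread("large_file.csv")
dt <- data.table(
  region = c("N", "N", "S", "S"),
  votes  = c(100, 150, 200, 250)
)

# Fast grouped aggregation with data.table's [i, j, by] syntax
totals <- dt[, .(total = sum(votes)), by = region]
totals
```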

Wednesday: 5) Data summarizing and aggregation; 6) Data merging/matching; 7) Data reshaping

We cover the ways in which data can be summarized and aggregated based on our needs: either for analysis or for spotting problems with the data.

We continue with functions available for merging data from different sources, with the goal of introducing new variables in our data.
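A small base-R sketch of such a merge, with invented survey and country-level data (the packages covered in class offer faster or more flexible equivalents):

```r
# Invented example: attach a country-level variable to survey rows
survey <- data.frame(id = 1:4, country = c("A", "A", "B", "C"))
macro  <- data.frame(country = c("A", "B"), gdp = c(30, 45))

# all.x = TRUE keeps every survey row; countries missing from
# 'macro' (here "C") get NA on the new variable
merged <- merge(survey, macro, by = "country", all.x = TRUE)
```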

Finally, we cover data reshaping, as a final step before data is ready to be fed into a statistical analysis or a function for plotting.
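The "long" versus "wide" conversion can be sketched as follows, with invented turnout data (the sketch assumes the reshape2 package is installed):

```r
library(reshape2)

# Invented "wide" data: one turnout column per election year
wide <- data.frame(country = c("A", "B"),
                   y2000 = c(60, 70),
                   y2004 = c(64, 68))

# Wide -> long: one row per country-year observation
long <- melt(wide, id.vars = "country",
             variable.name = "year", value.name = "turnout")

# Long -> wide again with dcast()
wide_again <- dcast(long, country ~ year, value.var = "turnout")
```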

Readings

The class emphasizes learning by doing, which is why no mandatory readings are assigned, particularly for topics such as data cleaning or recoding.

Software Requirements

All software used in the class is open source and freely downloadable from the Internet. Participants are welcome to use their own laptops during the sessions; in that case, please make sure the software is correctly installed on your computer and starts without errors.

R version 3.3.2 (or newer).

Rstudio version 1.0.136 (or newer).

 

Hardware Requirements

At least an Intel Core 2 Duo processor and a minimum of 2 GB of RAM. Around 300-400 MB of free disk space, for installing additional R packages and storing data. Any laptop bought after 2011 ought to meet these minimum requirements.

 

Literature

Over the course of the class, participants may find the following materials useful. They are not mandatory, but can serve as a resource in case additional information is needed for some of the functions I present:

1. Manual for the foreign package: https://cran.r-project.org/web/packages/foreign/foreign.pdf

2. Manual for the rio package: https://cran.r-project.org/web/packages/rio/rio.pdf

3. Manual for the dplyr package: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf

4. Manual for the plyr package: https://cran.r-project.org/web/packages/plyr/plyr.pdf

5. Manual for the stringr package: https://cran.r-project.org/web/packages/stringr/stringr.pdf

6. Manual for the lubridate package: https://cran.r-project.org/web/packages/lubridate/lubridate.pdf

7. Manual for the data.table package: https://cran.r-project.org/web/packages/data.table/data.table.pdf

8. Cheat sheet for the data.table package: https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf

9. Manual for the tidyr package: https://cran.r-project.org/web/packages/tidyr/tidyr.pdf

10. Wickham, Hadley. Tidy Data. Available at: http://vita.had.co.nz/papers/tidy-data.pdf

11. Manual for the janitor package: https://cran.r-project.org/web/packages/janitor/janitor.pdf

12. Manual for the reshape2 package: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf