Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Course Dates and Times

Friday 26 February: 13:00-15:00 and 15:30-17:00
Saturday 27 February: 09:30-11:30 and 12:30-14:30
7.5 hours over two days

Holger Döring

doering@uni-bremen.de

Universität Bremen

This short (two-day) course is an introduction to programming and its application in the social sciences. The course has two goals. First, it gives an overview of the potential of new (big) data in the social sciences and an understanding of the tools required to work with such data. Second, I introduce the Python programming language and some of its core concepts, such as variables, functions, conditionals and iterations, along with the respective Python syntax. We use applications of (Python) programming to work with textual data (web scraping, Twitter, news data). I present short code snippets to show how Python packages can be used to gather, process, analyse and present data. This two-day course cannot teach you (Python) programming all at once, but it will help you start learning (Python) programming yourself.


Instructor Bio

Holger Döring is a senior researcher in political science at the University of Bremen.

His research and teaching interests focus on political institutions, democratic delegation, public policy and new types of data collection.

He has successfully applied programming technologies (particularly Python and R) to establish ParlGov.org and PartyFacts.org, two modern data infrastructures for political science.

This short (two-day) course is an introduction to (Python) programming and its application in the social sciences, taught by a social scientist. I introduce new (big) data concepts and Python programming tools for a modern data workflow. Social scientists increasingly rely on new data sources, such as social media data and various kinds of text-based information, that move beyond traditional rectangular data matrices. Technically, these sources may include semi-structured data such as web pages, or structured data in various formats available through APIs (e.g. Twitter, World Bank, Manifesto Project), RSS feeds, text corpora or databases (e.g. ParlGov). Innovation in the social sciences is increasingly driven by creatively transforming and combining information from different (online) sources, or by establishing new web-based data with modern programming tools. In the course we focus on applications of programming in the social sciences to work with textual data (web scraping, Twitter, news data).
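As a rough illustration of moving from nested, API-style data to a rectangular data matrix, the sketch below flattens a small JSON document into one row per party and election year and writes it out as CSV. The JSON content, the field names and the helper functions `flatten` and `to_csv` are invented for this example and do not come from any of the APIs named above; the code assumes Python 3 and uses only the standard library.

```python
import csv
import io
import json

# Hypothetical API response, invented for illustration only.
RESPONSE = """
{
  "country": "Exampleland",
  "parties": [
    {"name": "Party A",
     "results": [{"year": 2009, "share": 33.8}, {"year": 2013, "share": 41.5}]},
    {"name": "Party B",
     "results": [{"year": 2009, "share": 23.0}]}
  ]
}
"""


def flatten(document):
    """Turn the nested document into one flat row per party and election year."""
    rows = []
    for party in document["parties"]:
        for result in party["results"]:
            rows.append({
                "country": document["country"],
                "party": party["name"],
                "year": result["year"],
                "share": result["share"],
            })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text, a format statistical packages can read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["country", "party", "year", "share"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten(json.loads(RESPONSE))
csv_text = to_csv(rows)
print(csv_text)
```

Each nested list becomes a loop, and the values from the outer levels (country, party) are repeated on every row. That repetition is what makes the result rectangular and therefore easy to load into Stata, R or pandas.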

One of the core prerequisites for working with new (big) data is command of a modern programming language and an understanding of the wide range of programming libraries (packages) that can be used for different types of data preparation, analysis and visualisation. In recent years, the Python programming language has become a prominent tool for data processing, alongside statistical packages such as Stata, R and SPSS. Statistical packages excel at analysing data and producing graphical visualisations, but are more limited when it comes to new (big) data that arrives in various formats. As a modern programming language with a well-established ecosystem, Python is particularly useful for all data-related work in the social sciences. The language and its external libraries can be used for web scraping, textual analysis, machine learning, presentation of data in web pages and data analysis. Python includes many modern programming concepts that make it particularly well suited to social scientists and relatively easy for them to learn. By learning Python, you will better understand programming techniques, and this knowledge will be very helpful for computing-related tasks in the social sciences. Python is one of the few programming languages that is loved and used by both beginners and expert programmers.

The course has two goals. First, it gives you an overview of the vast potential of new (big) data in the social sciences and an understanding of the tools required to work with such data. Second, I introduce the Python programming language and some of its core concepts. The course will enable you to better understand how programming tools are used to work with new (big) data. It will allow you either to move into programming yourself or to work more closely with social science researchers who draw extensively on programming in their work. I present different applications and short code snippets to help you understand how various Python packages can be used to process, analyse and present data, in particular textual information.

Applications of (Python) programming in the course focus on web scraping, social media (e.g. Twitter, blogs) and textual data analysis, as well as presenting data in documents such as web pages and reports. I discuss the full stack of a modern data workflow: processing, analysing and presenting data. Processing data allows us to convert external data sources into a format suitable for data analysis. New (big) data is often structured in a way that differs from traditional data formats, such as the rectangular data matrix most commonly used by social scientists. Online information is often organised as objects and comes in various formats (esp. JSON, XML). Some websites allow users to access and query their data through an online interface (API), which makes it possible to download structured data with defined queries. Twitter is the most prominent example of a service that provides an API as a modern means of data access. Other information is only semi-structured and requires web scraping techniques to transform text on web pages into data. Python provides a superb set of packages (e.g. beautifulsoup, requests) to automate this type of data access and the transformation of web pages into datasets. Over the last decade, Python has progressed significantly as a tool for data analysis and now has a prominent place in the sciences. There are well-established packages for data processing (pandas, numpy, scipy), statistical analysis (statsmodels), machine learning (scikit-learn), textual analysis (nltk) and network analysis (networkx). These packages are regularly improved and simplify many time-consuming tasks in data-related work.
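To make the web scraping step more concrete, here is a minimal sketch that turns an HTML table into a list of records. So that it runs without any installation, it uses the standard library's `html.parser` rather than the beautifulsoup and requests packages mentioned above; that substitution is a choice made for this example. The page content and the class name `TableParser` are invented, and the code assumes Python 3.

```python
from html.parser import HTMLParser

# Hypothetical page content, invented for illustration. In practice the HTML
# would first be downloaded from a website.
PAGE = """
<html><body>
<h1>Cabinets</h1>
<table>
  <tr><th>Cabinet</th><th>Start</th></tr>
  <tr><td>Cabinet I</td><td>2005</td></tr>
  <tr><td>Cabinet II</td><td>2009</td></tr>
</table>
</body></html>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


parser = TableParser()
parser.feed(PAGE)

# The first row holds the column names; the remaining rows are the records.
header, *records = parser.rows
dataset = [dict(zip(header, record)) for record in records]
print(dataset)
```

The resulting list of dictionaries is already close to a dataset: each dictionary is one observation and each key is one variable. Note that all scraped values are strings, so a field such as the start year would still need converting to a number before analysis.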

In the course, I focus on core concepts of (Python) programming. I give an introduction to concepts such as variables, functions, conditionals and iterations, as well as the respective Python syntax. I also discuss how Python syntax differs from existing programming tools used in the social sciences (esp. R and Stata). Finally, I introduce Python’s package ecosystem and core packages for applications in the social sciences, and give a short overview of software (editors, IDEs) for programming with Python. Short code snippets and small programming exercises demonstrate how Python is used for data processing. The course is aimed at social scientists who are interested in the potential of new (big) data and want to develop a better understanding of programming and its application in the social sciences. It is particularly well suited for those who have some regular experience in R or Stata programming and have previously applied programming concepts such as ‘if’ statements, ‘for’ loops and functions. This two-day course cannot teach you (Python) programming all at once, but it will help you get started and motivate you to learn (Python) programming yourself. “I (we) help you to get from A to B so that you can go from B to Z”, as Software Carpentry, the granddaddy of teaching skills for scientific computing, put it.
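The four core concepts named above (variables, functions, conditionals and iterations) can be seen together in a few lines. The following is only an illustrative sketch, not material from the course: the vote shares, the threshold and the function name `passes_threshold` are invented, and the code assumes Python 3.

```python
# Variables: names bound to values of different types.
threshold = 5.0  # hypothetical electoral threshold in percent
vote_shares = {"Party A": 41.5, "Party B": 4.8, "Party C": 8.6}  # invented data


# Function: a reusable block of code with parameters and a return value.
def passes_threshold(share, minimum=threshold):
    """Return True if a vote share reaches the minimum."""
    return share >= minimum


# Iteration ('for' loop) combined with a conditional ('if' / 'else').
represented = []
for party, share in vote_shares.items():
    if passes_threshold(share):
        represented.append(party)
    else:
        print(party, "is below the threshold")

print(represented)
```

Readers coming from R will notice that Python marks the body of a function, loop or conditional by indentation rather than by curly braces.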

Required literature:
Russell, Matthew A. 2013. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. 2nd ed. Sebastopol, CA: O’Reilly Media.
Downey, Allen B. 2012. Think Python. Sebastopol, CA: O’Reilly Media. www.greenteapress.com/thinkpython

Students should have had some previous exposure to programming. This may include working with Stata and R scripts or some other programming language. Ideally, students will already have used concepts such as variable types, loops and functions in their own programming. Students with no previous knowledge will get a first introduction and a roadmap for self-study of (Python) programming in the social sciences.

Day | Topic | Details
Friday | New (big) data, programming and Python | Friday afternoon
Saturday morning | Python fundamentals, packages, data formats, APIs | Saturday morning
Saturday afternoon | Web scraping, text analysis, data presentation | Saturday afternoon

Day | Readings
Friday | Downey (2012) chapters 1-11; get an understanding of the Python language
Saturday morning | Russell (2013) chapters 1, 9
Saturday afternoon | Russell (2013) chapters 5, 8

Software Requirements

Anaconda (Python 2.7) http://continuum.io/downloads

Literature


Gries, Paul, Jennifer Campbell, and Jason Montojo. 2013. Practical Programming: An Introduction to Computer Science Using Python 3. 2nd ed. Pragmatic Bookshelf.

McKinney, Wes. 2012. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Beijing: O’Reilly Media.

Mitchell, Ryan. 2015. Web Scraping with Python: Collecting Data from the Modern Web. Beijing: O’Reilly Media.

O’Neil, Cathy, and Rachel Schutt. 2013. Doing Data Science: Straight Talk from the Frontline. Sebastopol, CA: O’Reilly Media.