ECPR Winter School
University of Bamberg, Bamberg
3 - 10 March 2017




WC105 - Python Programming for Social Sciences: Collecting, Analyzing and Presenting Social Media Data

Instructor Details

Holger Döring

Institution:
Universität Bremen

Instructor Bio

Holger Döring is a senior researcher in political science at the University of Bremen.

His research and teaching interests focus on political institutions, democratic delegation, public policy and new types of data collection.

He has successfully applied programming technologies (particularly Python and R) to establish ParlGov.org and PartyFacts.org, two modern data infrastructures for political science.


Course Dates and Times

Monday 6 to Friday 10 March 2017
Generally classes are either 09:00-12:30 or 14:00-17:30
15 hours over 5 days

Prerequisite Knowledge

Students need some basic programming experience. This may include working with Stata or R scripts or with some other programming language. Ideally, concepts like variable types, loops and functions have been used before. Students with no previous programming experience should follow an online tutorial before the course.

Short Outline

This course introduces Python programming for analyzing social media data and presents applications in the social sciences. The course has three goals. First, it gives an overview of the potential of new online data in the social sciences and an understanding of the tools required to work with heterogeneous data sources. Second, it introduces the Python programming language and core concepts of computer programming, such as conditions, loops and functions. Finally, we apply Python programming to analyze social media data (web scraping, Twitter, news data). The course is highly interactive: I present concepts and short code snippets, and students apply them in class exercises and in their homework. All steps of a modern data pipeline are covered: automatically collecting and cleaning online sources, analyzing and visualizing cleaned data, and presenting the results as online content.

Long Course Outline

The course provides an introduction to Python programming and its applications in the social sciences, taught by a social scientist. I introduce basic concepts of computer programming and present tools for a modern data workflow in Python. Social scientists increasingly rely on new data sources, such as social media data, other online content and various kinds of text-based information, that move beyond traditional tabular data matrices. Technically, these sources may include semi-structured data such as web pages, or structured data made available in different ways, such as APIs (e.g. Twitter, World Bank, Manifesto Project), RSS feeds, text corpora or databases (e.g. ParlGov). Innovation in the social sciences is increasingly driven by creatively collecting, transforming and combining information from online sources, or by establishing new web-based data collections with modern programming tools.

One of the core prerequisites for working with online data is command of a modern programming language and an understanding of the wide range of programming libraries (packages) that can be used for different types of data preparation, analysis and visualization. In recent years, the Python programming language has become a prominent tool for data processing, alongside statistical packages such as Stata, R and SPSS. Statistical packages excel at analyzing data and providing graphical visualization, but are more limited when it comes to heterogeneous data sources that are available in various formats. Python, a modern programming language with a well-established ecosystem, is particularly useful for data-related work in the social sciences. The language and its external libraries can be used for web scraping, textual analysis, machine learning, data analysis and the presentation of data in web pages. Python includes many modern programming concepts that make it particularly well suited to, and relatively easy to learn for, social scientists. By learning Python, you will better understand programming techniques, and this knowledge will be beneficial for computing-related tasks in the social sciences. Python is one of the few programming languages that are loved and used by both beginners and expert programmers.

The course has three goals. First, it will give you an overview of the vast potential of new online data in the social sciences and an understanding of the tools that are required to work with different data sources. Second, I introduce the Python programming language and some of the core concepts in computer programming. Finally, we will use applications of Python programming to work with social media data (web scraping, Twitter, online news). The course will enable you to better understand how programming tools are used to work with new data sources. It will allow you either to move into programming yourself or to work more closely with social science researchers who draw extensively on programming in their work. I present different applications and discuss code snippets to help you understand how various Python packages can be used to process, analyze and present different types of data, in particular textual information.

The applications of Python programming in the course focus on analyzing and visualizing social media data (e.g. Twitter, blogs, web scraping). In addition, I demonstrate how to present the results in dynamically generated documents such as web pages and reports. I introduce the full stack of a modern data workflow: collecting, processing, analyzing and presenting data. New data sources are often structured in a way that differs from traditional data formats, such as the tabular data matrix most commonly used by social scientists. Online information is often provided in structured formats (esp. JSON, XML). Some websites offer an online interface (API) for accessing and querying their data, which allows structured data to be downloaded with defined queries. Twitter is the most prominent example of a service that provides an API as a modern means of data access. Other information is only semi-structured and requires web scraping techniques to transform text on web pages into structured data. Python provides a superb set of packages (e.g. beautifulsoup, requests) to automate this type of data access and the transformation of web pages into datasets. Over the last decade, Python has progressed significantly as a tool for data analysis and now has a prominent place in the sciences. There are well-established packages for data processing (pandas, numpy, scipy), statistical analysis (statsmodels), machine learning (scikit-learn), textual analysis (nltk) and network analysis (networkx). These packages are regularly improved and simplify many time-consuming tasks in data-related work.
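To make the difference between structured and semi-structured sources concrete, here is a minimal sketch. It is not course material: the JSON text, the HTML fragment and all function names are made up for illustration, and it uses only Python's standard library (json and html.parser) instead of the requests and beautifulsoup packages mentioned above, so that it runs without downloads or network access.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response (invented data, not real Twitter output).
API_RESPONSE = """
{"statuses": [
  {"id": 1, "user": {"name": "alice"}, "text": "Budget debate today #parliament"},
  {"id": 2, "user": {"name": "bob"}, "text": "New coalition agreement #politics"}
]}
"""


def flatten_statuses(raw_json):
    """Turn nested JSON into a list of flat rows (dicts), one per post."""
    data = json.loads(raw_json)
    return [
        {"id": s["id"], "user": s["user"]["name"], "text": s["text"]}
        for s in data["statuses"]
    ]


# Hypothetical web page fragment: semi-structured data inside HTML markup.
HTML_PAGE = """
<table>
  <tr><th>Party</th><th>Seats</th></tr>
  <tr><td>Party A</td><td>120</td></tr>
  <tr><td>Party B</td><td>85</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


posts = flatten_statuses(API_RESPONSE)
seats = scrape_table(HTML_PAGE)
print(posts)
print(seats)
```

Both functions end with the same shape of result, a list of rows with named columns, which is the form that a package such as pandas expects. Note that scraped values arrive as text (the seat counts are strings) and usually need cleaning before analysis.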

In the course, I focus on core concepts of Python programming. In the first part, I give an introduction to programming concepts such as variables, functions, conditionals and iterations, as well as the respective Python syntax. I also discuss how Python syntax differs from that of existing programming tools used in the social sciences (esp. R and Stata). In the second part, I introduce Python’s package ecosystem and its core packages for applications in the social sciences, and give a short overview of software (editors, IDEs) for programming with Python. Short code snippets and small programming exercises demonstrate how Python is used for data processing. We apply the programming concepts and make use of Python packages to work with different social media data (blogs, Twitter, Facebook).
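As a rough illustration of these core concepts, the following sketch combines variables, functions, a conditional and loops to count hashtags. The example posts and the function names are invented for this illustration; it is not taken from the course exercises.

```python
# Made-up example posts; a real project would load collected data instead.
posts = [
    "Budget debate today #parliament #budget",
    "New coalition agreement #politics",
    "Question time was lively #Parliament",
]


def extract_hashtags(text):
    """Return the lower-cased hashtags found in one post."""
    hashtags = []
    for word in text.split():        # iteration over the words of a post
        if word.startswith("#"):     # conditional: keep hashtags only
            hashtags.append(word.lower())
    return hashtags


def count_hashtags(texts):
    """Count how often each hashtag occurs across all posts."""
    counts = {}                      # dictionary: hashtag -> frequency
    for text in texts:
        for tag in extract_hashtags(text):
            counts[tag] = counts.get(tag, 0) + 1
    return counts


hashtag_counts = count_hashtags(posts)
for tag, n in sorted(hashtag_counts.items()):
    print(tag, n)
```

Unlike Stata or R, Python marks the body of a function, loop or condition by indentation alone, and lists are indexed from zero.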

Course participants will acquire a basic knowledge of Python programming, will develop an understanding of various formats of structured and semi-structured online data and learn how to gather, analyze and visualize this data in Python. You will work on a course project during class sessions and in your homework and apply your new programming skills to a particular social media analysis.

The course is aimed at social scientists who are interested in the potential of new data and want to develop a better understanding of programming and its application in the social sciences. It is particularly well suited to those who have some regular experience in R or Stata programming and have previously applied programming concepts such as ‘if’ statements, ‘for’ loops and functions.

Required literature:

Downey, Allen B. 2015. Think Python. Sebastopol, CA: O’Reilly Media. greenteapress.com/wp/think-python-2e

Mitchell, Ryan. 2015. Web Scraping with Python: Collecting Data from the Modern Web. Beijing: O’Reilly Media.

Russell, Matthew A. 2013. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. 2nd ed. Sebastopol, CA: O’Reilly Media.

 

Day-to-Day Schedule

Day 1: Introduction to ‘data science’ and Python programming

Software Requirements

Anaconda (Python 3.5) http://continuum.io/downloads (freeware)

Participants need to install this on their laptop prior to the course. In case of difficulty, participants should seek support from their local IT teams before coming to Bamberg. As a last resort, some troubleshooting can be organized during the first course session.

Hardware Requirements

Participants need to bring their own laptop.

 

Literature

Gries, Paul, Jennifer Campbell, and Jason Montojo. 2013. Practical Programming: An Introduction to Computer Science Using Python 3. Pragmatic Bookshelf.

VanderPlas, Jake. 2016. Python Data Science Handbook. Sebastopol, CA: O’Reilly Media.

O’Neil, Cathy, and Rachel Schutt. 2013. Doing Data Science: Straight Talk from the Frontline. Sebastopol, CA: O’Reilly Media.

Additional Information

Disclaimer

The information contained in this course description form may be subject to subsequent adaptations (e.g. taking into account new developments in the field, specific participant demands, group size etc.). Registered participants will be informed in due time in case of adaptations.

Note from the Academic Convenors

By registering for this course, you certify that you possess the prerequisite knowledge required to follow it. The instructor will not teach these prerequisite items. If you are not sure whether you possess this knowledge to a sufficient level, we suggest you contact the instructor before you proceed with your registration.

