The course provides an introduction to Python programming and its applications in the social sciences, taught by a social scientist. I introduce basic concepts of computer programming and present tools for a modern data workflow in Python. Social scientists increasingly rely on new data sources, such as social media data, other online content and various text-based information, that move beyond traditional tabular data matrices. Technically, these sources may include semi-structured data such as web pages, or structured data available through APIs (e.g. Twitter, World Bank, Manifesto Project), RSS feeds, text corpora or databases (e.g. ParlGov). Innovation in the social sciences is increasingly driven by creatively collecting, transforming and combining information from online sources or by establishing new web-based data collections with modern programming tools.
One of the core prerequisites for working with online data is the command of a modern programming language and an understanding of the wide range of programming libraries (packages) that can be used for different types of data preparation, analysis and visualization. In recent years, the Python programming language has become a prominent tool for data processing, alongside statistical packages such as Stata, R and SPSS. Statistical packages excel at analyzing data and at providing graphical visualization, but are more limited when it comes to heterogeneous data sources that are available in various formats. Python, a modern programming language with a well-established ecosystem, is particularly useful for all kinds of data-related work in the social sciences. The language and its external libraries can be used for web scraping, textual analysis, machine learning, data analysis and the presentation of data in web pages. Python includes many modern programming concepts that make it particularly well suited to social scientists and relatively easy for them to learn. By learning Python, you will better understand programming techniques, and this knowledge will be beneficial for computing-related tasks in the social sciences. Python is one of the few programming languages that are loved and used by both beginners and expert programmers.
The course has three goals. First, it will give you an overview of the vast potential of new online data in the social sciences and an understanding of the tools that are required to work with different data sources. Second, I introduce the Python programming language and some of the core concepts in computer programming. Finally, we will use applications of Python programming to work with social media data (web scraping, Twitter, online news). The course will enable you to better understand how programming tools are used to work with new data sources. It will allow you either to move into programming yourself or to work more closely with social science researchers who draw extensively on programming in their work. I present different applications and discuss code snippets to help you understand how various Python packages can be used to process, analyze and present different types of data, in particular textual information.
The applications of Python programming in the course focus on analyzing and visualizing social media data (e.g. Twitter, blogs, web scraping). In addition, I demonstrate how to present the results in dynamically generated documents such as web pages and reports. I introduce the full stack of a modern data workflow: collecting, processing, analyzing and presenting data. New data sources are often structured in a way that differs from traditional data formats, such as the tabular data matrix most commonly used by social scientists. Online information is often presented in different structured formats (esp. JSON, XML). Some websites allow users to access and query their data through an online interface (API) that returns structured data in response to defined queries. Twitter is the most prominent example of a service that provides an API as a modern means of data access. Other information is only semi-structured and requires web scraping techniques to transform text on web pages into structured data. Python provides a superb set of packages (e.g. beautifulsoup, requests) to automate this type of data access and the transformation of web pages into datasets. Over the last decade, Python has progressed significantly as a tool for data analysis and now has a prominent place in the sciences. There are well-established packages for data processing (pandas, numpy, scipy), statistical analysis (statsmodels), machine learning (scikit-learn), textual analysis (nltk) and network analysis (networkx). These packages are regularly improved and simplify many time-consuming tasks in data-related work.
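To illustrate the step from structured online formats to a tabular data matrix, the following is a minimal sketch using only Python's standard library. The JSON text stands in for an API response and is entirely hypothetical (the field names `posts`, `user` and `tags` and all values are invented for illustration); a real API such as Twitter's returns a different and richer structure, and the helper `flatten_posts` is an assumption of this example rather than part of any package.

```python
import json

# Hypothetical API response: field names and values are invented for
# illustration and do not correspond to any real service.
raw = """
{"posts": [
  {"id": 1, "user": {"name": "alice"}, "text": "Election day!", "tags": ["politics", "vote"]},
  {"id": 2, "user": {"name": "bob"}, "text": "New poll out", "tags": []}
]}
"""


def flatten_posts(payload):
    """Turn nested post records into flat rows, one dictionary per post."""
    rows = []
    for post in payload["posts"]:
        rows.append({
            "id": post["id"],
            "user_name": post["user"]["name"],  # pull a nested value up one level
            "text": post["text"],
            "n_tags": len(post["tags"]),        # summarize a list as a count
        })
    return rows


# json.loads parses the JSON text into Python dictionaries and lists
rows = flatten_posts(json.loads(raw))
for row in rows:
    print(row)
```

Each row now has the same set of columns, so the result could be passed on to a package such as pandas or written to a CSV file for use in R or Stata.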
In the course, I focus on core concepts of Python programming. In the first part, I give an introduction to programming concepts such as variables, functions, conditionals and iterations as well as the respective Python syntax. I also discuss how Python syntax differs from that of existing programming tools used in the social sciences (esp. R and Stata). In the second part, I introduce Python’s package ecosystem and its core packages for applications in the social sciences, and give a short overview of software (editors, IDEs) for programming with Python. Short code snippets and small programming exercises demonstrate how Python is used for data processing. We apply the programming concepts and make use of Python packages to work with different social media data (blogs, Twitter, Facebook).
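The programming concepts covered in the first part (variables, functions, conditionals and iterations) can be sketched in a few lines of Python. The example below is only an illustration: the seat counts are made up, and the function names `seat_share` and `classify` as well as the thresholds are assumptions chosen for this sketch.

```python
# Variables: a dictionary mapping party labels to seat counts
# (hypothetical numbers, invented for illustration)
party_seats = {"A": 120, "B": 95, "C": 35}
total_seats = sum(party_seats.values())


# Function: compute a party's seat share in percent
def seat_share(seats, total):
    return round(100 * seats / total, 1)


# Conditional: label a party by the size of its seat share
# (the thresholds are arbitrary and serve only as an example)
def classify(share):
    if share >= 50:
        return "majority"
    elif share >= 20:
        return "large"
    else:
        return "small"


# Iteration: loop over all parties and collect the results
results = {}
for party, seats in party_seats.items():
    share = seat_share(seats, total_seats)
    results[party] = (share, classify(share))
    print(party, share, classify(share))
```

Readers familiar with R or Stata may notice that Python uses indentation rather than braces or `end` statements to mark the body of a function, condition or loop.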
Course participants will acquire a basic knowledge of Python programming, develop an understanding of various formats of structured and semi-structured online data, and learn how to gather, analyze and visualize these data in Python. You will work on a course project during class sessions and in your homework, applying your new programming skills to a particular social media analysis.
The course is aimed at social scientists who are interested in the potential of new data and want to develop a better understanding of programming and its application in the social sciences. It is particularly well suited to those who have some regular experience in R or Stata programming and have previously applied programming concepts such as ‘if’ statements, ‘for’ loops and functions.
Downey, Allen B. 2015. Think Python. Sebastopol, CA: O’Reilly Media. greenteapress.com/wp/think-python-2e
Mitchell, Ryan. 2015. Web Scraping with Python: Collecting Data from the Modern Web. Beijing: O’Reilly Media.
Russell, Matthew A. 2013. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. 2nd ed. Sebastopol, CA: O’Reilly Media.