CTAWG Data

Why/how did we choose this data?

In the fall of 2015, CTAWG members decided to explore computational text analysis methods through one common data set. We focused our collective efforts on the Congressional Record and Congressional Hearings, for the following reasons:

"Big" data set: There is a large amount of data.
Publicly available data: The data is in the public domain, so there are no copyright concerns.
Multiple forms of communication: The record includes both spoken and written text.
External linkages: The data can potentially be linked to voter records, metadata about members of Congress, committees served on, etc.

What data files should I use?

We are still in the preliminary stages of scraping and pre-processing Congressional Hearings from the US Government Publishing Office (GPO). For Spring 2016, we will therefore be working with two branches of data:

ICPSR Data v1.0

Version: 1.0
Copyright: Restricted to researchers in the ICPSR community (including UC Berkeley). To receive the data, follow the link above and request access using a berkeley.edu email address.
In 2011, Matthew Gentzkow and Jesse Shapiro compiled text data from the Congressional Record (104th-110th Congresses). This data is available to researchers at universities affiliated with the ICPSR collective (including UC Berkeley). This data set will be used as a bridge, enabling us to work through more advanced stages of a text analysis project while we prepare our own, more exhaustive original data set using the US Government Publishing Office records. For more information, visit the entry for this data set at the ICPSR.

CTAWG Data v0.1

Version: 0.1
Copyright: Unrestricted.
Description: We are currently developing our own data set from Congressional Hearings. To do this, we built a web scraper to compile hearings housed at the US Government Publishing Office site. Version 0.1 provides a very small sample of these records to offer a glimpse into the format and potential of this data.