Transcripts of the Sitting Justices

This post takes a look at Supreme Court transcripts as a corpus for natural language processing. Lately I’ve been playing around with the NLTK Python module, and I thought this might be an interesting data set (given that I’m also an avid follower of SCOTUS).

Supreme Court transcripts are made available by supremecourt.gov and can be downloaded in PDF form. Extracting data from the PDFs is not an exact science since the format varies a bit. I can get a majority of the cases broken up by speaker for what is available for download. The script does its best to compensate for transcription errors, typos, etc. Nothing is perfect though so it’s only useful to look at this data in the aggregate.
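As a rough illustration of the speaker split, here is a minimal sketch. It assumes the text has already been pulled out of the PDF and that each turn starts with an upper-case label such as `JUSTICE SCALIA:` or `MR. LANDAU:`. The regular expression and the name `split_by_speaker` are illustrative only and are not taken from the actual script.

```python
import re
from collections import defaultdict

# Assumed label shape: an upper-case title, an upper-case surname, then a colon.
SPEAKER_RE = re.compile(
    r"^(?P<label>(?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.|MRS\.|GENERAL) [A-Z][A-Z'\-]+):\s*",
    re.MULTILINE,
)

def split_by_speaker(text):
    """Return {speaker label: [statement, ...]} for one transcript."""
    statements = defaultdict(list)
    matches = list(SPEAKER_RE.finditer(text))
    for i, m in enumerate(matches):
        # A statement runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = " ".join(text[m.end():end].split())  # collapse line breaks
        if body:
            statements[m.group("label")].append(body)
    return dict(statements)

sample = """JUSTICE SCALIA: Why is that
the right test?
MR. LANDAU: Your Honor, that is not the test.
JUSTICE SCALIA: I see.
"""
by_speaker = split_by_speaker(sample)
```

A label with no text before the next label produces no statement here, which is one simple way to cope with some of the extraction glitches.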

Click here for transcripts organized by speaker name, or here for transcripts organized by case.

Stats from all available transcript data

                    statements  sentences    words  stopwords  unique words
JUSTICE_ROBERTS          10338      21925   127500     167153          8683
JUSTICE_ALITO             3042       6706    51472      70438          5897
JUSTICE_SCALIA           11870      27347   149102     220758          9239
JUSTICE_THOMAS               8         19      114        169            86
JUSTICE_KENNEDY           6081      12686    76322     113082          7125
JUSTICE_BREYER            9105      31739   181260     276786          9202
JUSTICE_GINSBURG          7336      17898   118725     169126          8740
JUSTICE_KAGAN              912       2222    17048      22474          3026
JUSTICE_SOTOMAYOR         3263       7145    43937      63213          5277
other speakers           59531     185848  1552665    1952296         21500

“Other speakers” are petitioners and respondents (not justices)
“stopwords” are high frequency words like the, to, also, etc.
“words”, “unique words” do not include “stopwords”
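For readers who want to reproduce counts like these, the five columns could be computed along the following lines. This is only a sketch: the tiny stopword set and the naive sentence splitter are stand-ins for whatever tokenizer and stopword list the real scripts used, and `statement_stats` is a made-up name.

```python
import re

# Tiny illustrative stopword list; a real run would use a much longer one.
STOPWORDS = {"the", "to", "also", "a", "of", "is", "that", "and", "it", "in"}

def statement_stats(statements):
    """Count statements, sentences, words, stopwords and unique words."""
    sentences = 0
    words = []       # tokens that are not stopwords
    stopwords = 0
    for statement in statements:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", statement)
        sentences += len([p for p in parts if p.strip()])
        for token in re.findall(r"[a-z']+", statement.lower()):
            if token in STOPWORDS:
                stopwords += 1
            else:
                words.append(token)
    return {
        "statements": len(statements),
        "sentences": sentences,
        "words": len(words),
        "stopwords": stopwords,
        "unique words": len(set(words)),
    }

stats = statement_stats(["Why is that the right test? It is not the test.", "I see."])
```

As in the table, "words" and "unique words" are counted after stopwords have been set aside.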

Stats from 500 randomly selected statements

                    sentences  words  stopwords  unique words
JUSTICE_ROBERTS          1061   6971       7900          1952
JUSTICE_ALITO            1086   9563      11618          2443
JUSTICE_SCALIA           1146   7435       9354          1989
JUSTICE_KENNEDY          1045   7508       9239          2121
JUSTICE_BREYER           1729  11315      14582          2329
JUSTICE_GINSBURG         1206   9298      11493          2483
JUSTICE_KAGAN            1245  10867      12727          2250
JUSTICE_SOTOMAYOR        1150   8189      10051          2170
other speakers           1484  13256      15047          3178

Justice Thomas is not included in the above data set because there are only 8 statements from him in the generated corpus
“Other speakers” are petitioners and respondents (not justices)
“stopwords” are high frequency words like the, to, also, etc.
“words”, “unique words” do not include “stopwords”

Linguistic diversity is a coarse measure of a varied vocabulary. The chart below displays the total number of unique words divided by the total number of words.

[Chart: unique words divided by total words, by speaker]
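The ratio itself is simple to compute. The sketch below uses the "words" and "unique words" columns from the 500-statement table; whether the chart was built from that table or from the full corpus is an assumption, and `lexical_diversity` is a made-up name.

```python
def lexical_diversity(unique_words, total_words):
    """Unique words divided by total words (both excluding stopwords)."""
    return unique_words / total_words if total_words else 0.0

# (unique words, words) copied from the 500-statement table in this post.
sample_counts = {
    "JUSTICE_ROBERTS": (1952, 6971),
    "JUSTICE_KAGAN": (2250, 10867),
    "JUSTICE_BREYER": (2329, 11315),
}
diversity = {name: lexical_diversity(u, w) for name, (u, w) in sample_counts.items()}

# Print speakers from most to least varied vocabulary by this measure.
for name, value in sorted(diversity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {value:.3f}")
```

This ratio tends to fall as the amount of text grows, so it is most meaningful when speakers are compared on similar-sized samples.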


Nothing too interesting or surprising here. Justices may use fewer words in their sentences for a variety of reasons. This data isn’t normalized to factor out introductions or interruptions. Nevertheless, the trend appears to be that Justices Kagan and Alito use longer sentences than the other justices, about equal in length to those of the petitioners and respondents.
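The sentence-length comparison can be checked against the 500-statement table. The sketch below assumes that the total number of words in a sentence is "words" plus "stopwords", since the table counts the two separately.

```python
# (sentences, words, stopwords) copied from the 500-statement table in this post.
sample = {
    "JUSTICE_ROBERTS": (1061, 6971, 7900),
    "JUSTICE_ALITO": (1086, 9563, 11618),
    "JUSTICE_SCALIA": (1146, 7435, 9354),
    "JUSTICE_KENNEDY": (1045, 7508, 9239),
    "JUSTICE_BREYER": (1729, 11315, 14582),
    "JUSTICE_GINSBURG": (1206, 9298, 11493),
    "JUSTICE_KAGAN": (1245, 10867, 12727),
    "JUSTICE_SOTOMAYOR": (1150, 8189, 10051),
    "other speakers": (1484, 13256, 15047),
}

def words_per_sentence(sentences, words, stopwords):
    """Average sentence length, counting stopwords as words."""
    return (words + stopwords) / sentences

lengths = {name: words_per_sentence(*row) for name, row in sample.items()}
longest_three = sorted(lengths, key=lengths.get, reverse=True)[:3]
```

With these figures, Alito, Kagan, and the other speakers come out at roughly 19 words per sentence, while the remaining justices fall between about 14 and 17.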

Long words in the oral transcripts

18 or 19 letters is the typical length for the longest words (that are in the dictionary) used by the various speakers. This includes non-sitting justices where the data was available.

  • Justice Alito – (18) misrepresentation
  • Justice Thomas – (16) unconstitutional
  • Justice Kagan – (18) misrepresentations
  • Justice Rehnquist – (18) telecommunications
  • Justice Sotomayor – (18) unconstitutionally
  • Justice Ginsburg – (18) misrepresentations / telecommunications / disproportionately
  • Justice Scalia – (19) unconstitutionality
  • Justice Breyer – (18) representativeness / telecommunications / disproportionately / unconstitutionally
  • Other speakers – (19) counterintelligence / unconstitutionality / extraterritoriality
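A list like the one above can be produced by keeping only tokens that appear in a dictionary and then taking the longest. In this sketch the `DICTIONARY` set is a hypothetical stand-in for a real word list, and the sample text is invented.

```python
import re

def longest_dictionary_words(text, dictionary):
    """Return (length, sorted words) for the longest tokens found in `dictionary`."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    known = tokens & dictionary
    if not known:
        return 0, []
    longest = max(len(w) for w in known)
    return longest, sorted(w for w in known if len(w) == longest)

# Hypothetical stand-in for a real word list.
DICTIONARY = {"the", "statute", "is", "harsh", "telecommunications", "disproportionately"}

# The last token imitates words glued together by PDF extraction.
text = "Telecommunications carriers say the statute is disproportionately harsh contractsthatarevoidshouldbehandled"
length, words = longest_dictionary_words(text, DICTIONARY)
```

The dictionary check matters because text extracted from PDFs can contain run-together words that would otherwise win on length alone.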

Sentiment Analysis

Sentiment analysis can yield interesting results for corpus data, though in this case good training material is lacking. One of the standard data sets used for this is movie reviews, which are widely available and carry clear positive and negative labels. For more information about sentiment analysis there is some good information here and in these two articles. Applying this to oral arguments? Well, let’s leave it as just one way to look at this data.

[Chart: sentiment classification of statements, by speaker]



Justice Thomas is not included in the above data set because there are only 8 statements from him in the corpus
“Other speakers” are petitioners and respondents (not justices)

Once the sentiment engine is trained, each statement registers as either positive or negative based on the words that match closest to the language in positive and negative movie reviews.
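To make the train-then-classify idea concrete, here is a toy sketch. The post does not say which classifier was used, so the bag-of-words Naive Bayes below is an assumption, and the four made-up "reviews" stand in for a real movie review corpus.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing (an assumed stand-in)."""

    def fit(self, labeled_texts):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training texts
        for text, label in labeled_texts:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training data, standing in for labeled movie reviews.
reviews = [
    ("a wonderful, moving and brilliant film", "pos"),
    ("great acting and a good story", "pos"),
    ("a dull, boring and awful mess", "neg"),
    ("bad plot and terrible acting", "neg"),
]
model = NaiveBayes().fit(reviews)
first = model.classify("That is a good and fair reading of the statute.")
second = model.classify("That is a terrible argument.")
```

Only words seen in training influence the result, which shows why movie-review vocabulary is a shaky basis for judging courtroom language.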

I’m sure the margin of error is high, and without training data for similar text I would hesitate to draw any conclusions. One tentative observation is that petitioners and respondents tend to use more positive language than the justices on the bench; the data is at least consistent with that hypothesis.

Laugh lines in the oral transcripts

Note: there are venturesome academic papers like this one from the Communication Law Review that address laughter in the SCOTUS courtroom. I make no attempt to go into that depth here though my results more or less agree with previous studies on this topic.

It’s not uncommon to get laughter after a statement from the bench; this is denoted in the transcripts as either [laughter] or (laughter). This chart displays the total number of laughter lines by sitting justice.

[Chart: total number of laughter lines, by sitting justice]


As usual, Justice Thomas is excluded because he generally does not speak from the bench.
Here is the same data, but with the total number of [laughter] lines divided by the number of statements for each justice.

[Chart: laughter lines divided by number of statements, by justice]


Justice Breyer appears to be funnier by this measure.
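Both laughter measures can be sketched in a few lines. The code assumes the laughter marker ends up inside the text of the statement that preceded it, and the pattern also tolerates a trailing period inside the brackets as a hedge. The speaker names and statements below are invented.

```python
import re
from collections import Counter

# Matches [laughter], (laughter), and variants such as (Laughter.)
LAUGHTER_RE = re.compile(r"[\[(]\s*laughter\.?\s*[\])]", re.IGNORECASE)

def laughter_stats(turns):
    """turns: list of (speaker, statement text) in transcript order.

    Returns {speaker: (laugh lines, statements, laugh lines per statement)}.
    """
    laughs, statements = Counter(), Counter()
    for speaker, text in turns:
        statements[speaker] += 1
        laughs[speaker] += len(LAUGHTER_RE.findall(text))
    return {s: (laughs[s], n, laughs[s] / n) for s, n in statements.items()}

# Invented turns with hypothetical speaker labels.
turns = [
    ("JUSTICE_A", "Suppose the cow wanders in. (Laughter.)"),
    ("MR. SMITH", "Yes, Your Honor."),
    ("JUSTICE_A", "And then? [laughter] I withdraw the question. (laughter)"),
    ("JUSTICE_B", "What is your best case?"),
]
result = laughter_stats(turns)
```

Dividing by the number of statements keeps a justice who simply talks more from ranking highest by default.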

Additional Notes

Unfortunately it’s impossible to cleanly extract the argument data from PDFs. Older transcripts have the Justice’s remarks labeled as “QUESTION”; without specific name references the data had to be discarded. Some transcripts have the Justice’s title spelled incorrectly, for example: JUST SCALIA or JUDGE SCALIA.
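One way to cope with label problems like these is to normalize each speaker label before counting. The sketch below is not the cleanup the real scripts perform: the list of title variants is an assumption based on the two examples above, and `normalize_speaker` is a made-up name.

```python
import re

# Assumed variants of the title seen in the extracted text.
TITLE_RE = re.compile(
    r"^(?:CHIEF\s+JUSTICE|JUSTICE|JUST|JUDGE)\.?\s+(?P<name>[A-Z][A-Z'\-]+)$"
)

def normalize_speaker(label):
    """Map justice labels to 'JUSTICE_NAME'; return None for unnamed 'QUESTION' turns.

    Labels that are not recognized as a justice are returned cleaned but unchanged.
    """
    label = " ".join(label.upper().split()).rstrip(":")
    if label == "QUESTION":
        return None  # older transcripts: no name, so the statement is discarded
    m = TITLE_RE.match(label)
    return f"JUSTICE_{m.group('name')}" if m else label

labels = ["JUST SCALIA:", "JUDGE SCALIA:", "JUSTICE SCALIA:", "QUESTION:", "MR. LANDAU:"]
normalized = [normalize_speaker(label) for label in labels]
```

This only handles misspelled titles; misspelled surnames would need a separate fuzzy match against the list of known justices.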

Here is an example of a wrong attribution:


MR. LANDAU:

JUSTICE O'CONNOR:

Your Honor, that is not -And I think it is
conceivable that the Florida court was correct that you
could draw the line some way and say contracts that are
void should be handled differently.

For this reason, don’t take these results too seriously, though hopefully the errors are down in the noise (I have no desire to go through and correct them).

The PDFs were parsed and the data was generated with two crufty Python scripts; they are on my GitHub if you want to look at this data yourself. If you make any improvements please let me know!


