Transcripts of the Sitting Justices
This post takes a look at Supreme Court transcripts as a corpus for natural language processing. Lately I’ve been playing around with the NLTK Python module and I thought this might be an interesting data set (given that I’m also an avid follower of SCOTUS).
Supreme Court transcripts are made available by supremecourt.gov and can be downloaded in PDF form. Extracting data from the PDFs is not an exact science since the format varies a bit. Of the transcripts available for download, I can break a majority of the cases up by speaker. The script does its best to compensate for transcription errors, typos, etc. Nothing is perfect, though, so it’s only useful to look at this data in the aggregate.
Click here for transcripts by person organized by name, or here for organized by case.
Stats from all available transcript data
| | statements | sentences | words | stopwords | unique words |
|---|---|---|---|---|---|
| JUSTICE_ROBERTS | 10338 | 21925 | 127500 | 167153 | 8683 |
| JUSTICE_ALITO | 3042 | 6706 | 51472 | 70438 | 5897 |
| JUSTICE_SCALIA | 11870 | 27347 | 149102 | 220758 | 9239 |
| JUSTICE_THOMAS | 8 | 19 | 114 | 169 | 86 |
| JUSTICE_KENNEDY | 6081 | 12686 | 76322 | 113082 | 7125 |
| JUSTICE_BREYER | 9105 | 31739 | 181260 | 276786 | 9202 |
| JUSTICE_GINSBURG | 7336 | 17898 | 118725 | 169126 | 8740 |
| JUSTICE_KAGAN | 912 | 2222 | 17048 | 22474 | 3026 |
| JUSTICE_SOTOMAYOR | 3263 | 7145 | 43937 | 63213 | 5277 |
| other speakers | 59531 | 185848 | 1552665 | 1952296 | 21500 |
“Other speakers” are petitioners and respondents (not justices)
“stopwords” are high frequency words like the, to, also, etc.
“words”, “unique words” do not include “stopwords”
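For anyone who wants to reproduce columns like these, here is a minimal sketch of how they could be computed for one speaker. It is not the script used for this post: the tiny `STOPWORDS` set, the regex tokenizer, and the naive sentence splitter are all assumptions made to keep the example self-contained (a real run would more likely lean on NLTK's tokenizers and stopword list).

```python
import re

# Tiny illustrative stopword list (an assumption); a real run would use a
# much fuller list.
STOPWORDS = {"the", "to", "also", "a", "an", "of", "and",
             "is", "that", "it", "in", "i", "you"}


def speaker_stats(statements):
    """Compute the table's columns from one speaker's list of statements."""
    sentence_count = 0
    stopword_count = 0
    words = []  # non-stopword tokens only, matching the notes above
    for statement in statements:
        # Naive sentence split: break after ., ? or ! followed by whitespace.
        parts = re.split(r"(?<=[.?!])\s+", statement.strip())
        sentence_count += len([p for p in parts if p])
        for token in re.findall(r"[a-z']+", statement.lower()):
            if token in STOPWORDS:
                stopword_count += 1
            else:
                words.append(token)
    return {
        "statements": len(statements),
        "sentences": sentence_count,
        "words": len(words),
        "stopwords": stopword_count,
        "unique words": len(set(words)),
    }


example = ["Is that the rule? I think it is.", "Counsel, you may proceed."]
print(speaker_stats(example))
```

Because "words" excludes stopwords, the stopword column can be larger than the word column, which is what the table shows.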
Stats from 500 randomly selected statements
| | sentences | words | stopwords | unique words |
|---|---|---|---|---|
| JUSTICE_ROBERTS | 1061 | 6971 | 7900 | 1952 |
| JUSTICE_ALITO | 1086 | 9563 | 11618 | 2443 |
| JUSTICE_SCALIA | 1146 | 7435 | 9354 | 1989 |
| JUSTICE_KENNEDY | 1045 | 7508 | 9239 | 2121 |
| JUSTICE_BREYER | 1729 | 11315 | 14582 | 2329 |
| JUSTICE_GINSBURG | 1206 | 9298 | 11493 | 2483 |
| JUSTICE_KAGAN | 1245 | 10867 | 12727 | 2250 |
| JUSTICE_SOTOMAYOR | 1150 | 8189 | 10051 | 2170 |
| other speakers | 1484 | 13256 | 15047 | 3178 |
Justice Thomas is not included in the above data set because there are only 8 statements from him in the generated corpus.
“Other speakers” are petitioners and respondents (not justices)
“stopwords” are high frequency words like the, to, also, etc.
“words”, “unique words” do not include “stopwords”
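Drawing the same number of statements from each speaker is what makes the rows in this second table comparable. A rough sketch of that step follows; the function name, the fixed seed, and sampling without replacement are my assumptions here, not a description of the actual script.

```python
import random


def sample_statements(statements_by_speaker, k=500, seed=0):
    """Pick k random statements per speaker, skipping speakers with fewer than k.

    statements_by_speaker maps a speaker label to a list of statement strings.
    Sampling is without replacement (an assumption).
    """
    rng = random.Random(seed)  # fixed seed so a rerun gives the same sample
    return {
        speaker: rng.sample(statements, k)
        for speaker, statements in statements_by_speaker.items()
        if len(statements) >= k  # a speaker with only 8 statements drops out
    }
```

With `k=500`, a speaker with only a handful of statements is left out automatically, which matches the note about Justice Thomas.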
Linguistic diversity is a coarse measure of a varied vocabulary. The chart below displays the total number of unique words divided by the total number of words.
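As a small illustration of that ratio, here is a sketch that takes a list of already-tokenized words (stopwords removed). The function name is mine, and returning 0.0 for an empty list is an arbitrary choice.

```python
def lexical_diversity(words):
    """Unique words divided by total words; 0.0 for an empty list."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Four words, three of them distinct.
print(lexical_diversity(["court", "rule", "court", "statute"]))  # 0.75
```

This ratio shrinks as a text gets longer, so it only makes sense to compare speakers over samples of similar size.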
Nothing too interesting or surprising here. Justices may use fewer words in their sentences for a variety of reasons. This data isn’t normalized to factor out introductions or interruptions. Nevertheless, the trend appears to be that Justices Kagan and Alito use longer sentences than the others, about equal in length to those of the petitioners and respondents.
Long words in the oral transcripts
18 or 19 letters is the typical length for the longest words (that are in the dictionary) used by the various speakers. This includes non-sitting justices where the data was available.
- Justice Alito – (18) misrepresentation
- Justice Thomas – (16) unconstitutional
- Justice Kagan – (18) misrepresentations
- Justice Rehnquist – (18) telecommunications
- Justice Sotomayor – (18) unconstitutionally
- Justice Ginsburg – (18) misrepresentations / telecommunications / disproportionately
- Justice Scalia – (19) unconstitutionality
- Justice Breyer – (18) representativeness / telecommunications / disproportionately / unconstitutionally
- Other speakers – (19) counterintelligence / unconstitutionality / extraterritoriality
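A list like the one above can be produced by filtering a speaker's words against a dictionary and keeping the longest survivors. The sketch below assumes the dictionary is simply a Python set of lowercase words; where that word list comes from (a system word file, an NLTK corpus, etc.) is left open.

```python
def longest_words(words, dictionary):
    """Return (length, sorted longest words) among words found in dictionary.

    The dictionary check filters out typos and run-together words from the
    PDF extraction. Returns (0, []) if nothing matches.
    """
    candidates = {w.lower() for w in words if w.lower() in dictionary}
    if not candidates:
        return 0, []
    longest = max(len(w) for w in candidates)
    return longest, sorted(w for w in candidates if len(w) == longest)


dictionary = {"unconstitutionality", "telecommunications", "court"}
spoken = ["Unconstitutionality", "telecommunications", "thecourtsaidthatitwas", "court"]
print(longest_words(spoken, dictionary))  # (19, ['unconstitutionality'])
```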
Sentiment Analysis
Sentiment analysis can yield interesting results for corpus data, though in this case there is not very good training material. One of the standard data sets used for this is movie reviews, which are widely available and come with clear negative and positive labels. For more information about sentiment analysis there is some good information here and in these two articles. Applying this to oral arguments? Well, let’s leave it as just one way to look at this data.
Justice Thomas is not included in the above data set because there are only 8 statements from him in the corpus.
“Other speakers” are petitioners and respondents (not justices)
Once the sentiment engine is trained, each statement registers as either positive or negative based on the words that match closest to the language in positive and negative movie reviews.
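To make the idea concrete, here is a toy word-count classifier in plain Python. It is a naive Bayes sketch with add-one smoothing, which is an assumption on my part about the general technique and not necessarily what the engine used here does. The four "reviews" are invented stand-ins for a real movie review corpus.

```python
import math
from collections import Counter


def train(labeled_texts):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(statement, model):
    """Return 'pos' or 'neg', whichever label scores the higher log-probability."""
    word_counts, doc_counts = model
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in statement.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (word, label) pairs from
                # producing log(0).
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented stand-ins for labeled movie reviews.
reviews = [
    ("a wonderful and moving film", "pos"),
    ("great acting and a clever plot", "pos"),
    ("a dull and terrible film", "neg"),
    ("bad acting and a boring plot", "neg"),
]
model = train(reviews)
print(classify("that is a clever argument", model))
print(classify("a boring and terrible answer", model))
```

Note how little of a courtroom statement overlaps with review vocabulary; most words are simply ignored, which is one reason to be skeptical of the results.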
I’m sure the margin of error is high, and without training data for similar text I would hesitate to draw any conclusions. One that I might make from this is that petitioners and respondents tend to use more positive language than the justices on the bench. If that hypothesis is valid, this data certainly seems to support it.
Laugh lines in the oral transcripts
Note: there are venturesome academic papers like this one from the Communication Law Review that address laughter in the SCOTUS courtroom. I make no attempt to go into that depth here though my results more or less agree with previous studies on this topic.
It’s not uncommon to get laughter after a statement from the bench. This is denoted in the transcripts as either [laughter] or (laughter). This chart displays the total number of laughter lines by sitting justice.
As usual, Justice Thomas is excluded because he generally does not speak when he sits on the bench.
Here is the same data, but the total number of [laughter] lines is divided by the number of statements for each justice.
Justice Breyer appears to be funnier by this measure.
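Both laughter measures are easy to sketch. The example below assumes the transcript has already been reduced to (speaker, text) pairs with the laughter marker kept inside the statement that triggered it; that layout, and the optional trailing period in the pattern, are my assumptions.

```python
import re
from collections import Counter

# Matches [laughter] or (laughter), in any case, with an optional period
# before the closing bracket.
LAUGHTER = re.compile(r"[\[(]laughter\.?[\])]", re.IGNORECASE)


def laughter_rates(turns):
    """Map each speaker to (laughter lines, laughter lines per statement).

    turns is an iterable of (speaker, text) pairs.
    """
    statements = Counter()
    laughs = Counter()
    for speaker, text in turns:
        statements[speaker] += 1
        laughs[speaker] += len(LAUGHTER.findall(text))
    return {s: (laughs[s], laughs[s] / statements[s]) for s in statements}


turns = [
    ("JUSTICE_BREYER", "That is a very long hypothetical. (Laughter.)"),
    ("JUSTICE_BREYER", "Go on."),
    ("JUSTICE_SCALIA", "I doubt it. [laughter]"),
]
print(laughter_rates(turns))
```

The per-statement rate is the fairer comparison, since a justice who speaks more has more chances to get a laugh.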
Additional Notes
Unfortunately it’s impossible to cleanly extract the argument data from PDFs. Older transcripts have the justices’ remarks labeled as “QUESTION”; without specific name references the data had to be discarded. Some transcripts have a justice’s title spelled incorrectly, for example: JUST SCALIA or JUDGE SCALIA.
Here is an example of where there is wrong attribution:
MR. LANDAU: JUSTICE O'CONNOR: Your Honor, that is not -And I think it is conceivable that the Florida court was correct that you could draw the line some way and say contracts that are void should be handled differently.
For this reason, don’t take these results too seriously, though hopefully the errors are down in the noise (I have no desire to go through and correct them).
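To give a flavor of the label cleanup described above, here is a hedged sketch of normalizing justice labels with a regular expression. The pattern and the `parse_line` helper are illustrative and not taken from the actual scripts. It only handles lines that start with a label, so it would not catch a doubled label like the one in the example above.

```python
import re

# Accepts "JUSTICE SCALIA:", the misspellings "JUST SCALIA:" and
# "JUDGE SCALIA:", "CHIEF JUSTICE ROBERTS:", and the anonymous "QUESTION:".
SPEAKER = re.compile(
    r"^(?:(?:CHIEF )?(?:JUSTICE|JUST|JUDGE) (?P<name>[A-Z']+)"
    r"|(?P<question>QUESTION)):\s*(?P<text>.*)$"
)


def parse_line(line):
    """Return (normalized justice label, text), or None.

    None means the line is not a justice's statement, or it is labeled
    QUESTION and so cannot be attributed to anyone.
    """
    match = SPEAKER.match(line.strip())
    if match is None or match.group("question"):
        return None
    return "JUSTICE_" + match.group("name"), match.group("text")


print(parse_line("JUST SCALIA: Why is that?"))
print(parse_line("QUESTION: What about the statute?"))
```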
The PDFs were parsed and the data was generated with two crufty Python scripts; they are on my GitHub if you want to look at this data yourself. If you make any improvements, please let me know!
It looks like the chart images are no longer showing up (SSL problem).
That’s odd. Maybe there is a problem accessing Google Charts from your browser.
Does this link work?