[Ed. Note: Please welcome guest blogger, Ravi Soni, data scientist from Casetext. I was introduced to Ravi by Casetext’s Vice-President, Pablo Arredondo, and asked to publish Ravi’s discussion on how he uses analytics at Casetext to determine if “the holding in a case is more procedural or more substantive,” and how to leverage that information to potentially predict outcomes. – GL]


One of the biggest constraints on innovation in legal research is how hard it is to classify and quantify information at scale without significant human intervention. At Casetext, we’ve made real progress using advanced analytics to better leverage the wealth of content within the law and predict certain outcomes with more precision. The applications range from practice management to case strategy to, in my case, legal research. One challenge I’m particularly interested in is how to quantifiably determine whether the holding in a case is more procedural or more substantive.

I started with a collection of 47,464 briefs written by top law firms in the country. Using the citations and nature of suit (NOS) code associated with each brief, I was able to determine how many unique NOS codes were associated with each case. I defined this as how “polytopic” a case is. In other words, I counted all the unique NOS codes from the briefs that cited to each case and assigned that number as the polytopic score for each case. Ultimately, my goal was to use polytopicness as a proxy to measure proceduralness.
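
To make the definition concrete, here is a minimal sketch of the counting step in Python. The brief records, NOS labels, and case names are hypothetical toy data, and counting each citing brief once is an assumption about how citations are tallied; this is an illustration, not the production pipeline.

```python
from collections import defaultdict

# Hypothetical toy data: each brief carries one NOS label and the cases it cites.
briefs = [
    {"nos": "Contract", "cites": ["procedural_case", "substantive_case"]},
    {"nos": "Civil Rights", "cites": ["procedural_case"]},
    {"nos": "Immigration", "cites": ["procedural_case"]},
    {"nos": "Contract", "cites": ["substantive_case"]},
]


def polytopic_scores(briefs):
    """Count the distinct NOS labels among the briefs that cite each case."""
    nos_by_case = defaultdict(set)
    for brief in briefs:
        for case in set(brief["cites"]):
            nos_by_case[case].add(brief["nos"])
    return {case: len(labels) for case, labels in nos_by_case.items()}


def citation_counts(briefs):
    """Count how many briefs cite each case (assumption: one count per brief)."""
    counts = defaultdict(int)
    for brief in briefs:
        for case in set(brief["cites"]):
            counts[case] += 1
    return dict(counts)


scores = polytopic_scores(briefs)
counts = citation_counts(briefs)
print(scores)
print(counts)
```

In this toy data, the case cited by briefs from three different practice areas gets a score of 3, while the case cited twice by contract briefs stays at 1.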

The idea behind using polytopicness to measure proceduralness comes from a simple concept. Let’s say we have a lawyer at an AmLaw 50 firm working on a massive M&A, a public defender in a small county appealing a death penalty verdict, and a boutique immigration firm working on a deportation case, and they all cite to the same case. What does this case have that all three of these attorneys found useful? The short answer is probably nothing substantive. What is more likely is that they are all citing to this case because it is a foundational case that sets the framework for some common motion that transcends practice area.

Let’s look at a concrete example. If I ask a roomful of lawyers if they know about A to Z Maintenance Corp. v. Dole 710 F. Supp. 853 (D.D.C. 1989), it’s quite unlikely that any of them would be able to tell me much, or anything at all. If I asked about a case like Bell Atl. Corp. v. Twombly 550 U.S. 544 (2007), any attorney in the room should be able to tell me how it changed the standards for dismissal. Looking at Figure 1, we can see how there is a difference in citation count and polytopic score between these two procedurally distinct cases.

In this example, comparing these two metrics clearly shows a difference between the procedural case and the substantive case, but does this hold for all cases in the data set?
To find the answer, I first looked at the average number of citations for each distinct polytopic score, as seen in Figure 2. To clarify what that means, I’ll use the point at roughly (50, 2500) as an example. This point translates to the following: cases that have a polytopic score of 50 are cited, on average, about 2,500 times in the briefs data set. The positive slope is intuitive and somewhat trivial, since a case that has a polytopic score of 5 must have been cited at least 5 times. The interesting piece here is the exponential growth, which means that cases with a higher polytopic score have a disproportionately higher citation count. This finding was the first bit of evidence supporting our initial assumptions.
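
The aggregation behind a plot like Figure 2 can be sketched as follows. The function name and the sample numbers are hypothetical; the only thing taken from the analysis is the idea of averaging citation counts within each polytopic score.

```python
from collections import defaultdict


def mean_citations_by_score(scores, citations):
    """Average the citation count of all cases sharing a polytopic score."""
    totals = defaultdict(lambda: [0, 0])  # score -> [citation sum, case count]
    for case, score in scores.items():
        totals[score][0] += citations[case]
        totals[score][1] += 1
    return {score: total / n for score, (total, n) in sorted(totals.items())}


# Hypothetical inputs: polytopic score and citation count per case.
scores = {"A": 1, "B": 1, "C": 2, "D": 3}
citations = {"A": 1, "B": 3, "C": 10, "D": 40}

averages = mean_citations_by_score(scores, citations)
print(averages)

# Sanity check from the text: a case with score k needs at least k citations.
assert all(citations[case] >= scores[case] for case in scores)
```

Each key of the result is one x-value of the plot, and each value is the corresponding average citation count.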

Next, I wanted to see what the distribution of polytopic scores looks like in order to better understand how many cases are monotopic, bi-topic, etc. To do this, I aggregated the count of cases by polytopic score (see Figure 3). We can easily see that most cases in our brief data set are mono- or bi-topic. However, on a closer look at the NOS codes (there are 102 in total), it seemed that some of them could be clustered into larger groups. For instance, codes like Personal Injury: Other, Personal Injury: Marine, Personal Injury: Automotive, etc. could be grouped together to make our groups more distinct from one another. After grouping, it seemed that any case with a polytopic score of 6 or more could be considered more procedural.
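
As a rough illustration of why grouping matters, the sketch below collapses labels of the form "Group: Subtype" into their group before counting. The splitting rule, the helper names, and the sample cases are assumptions made for the example; the actual clustering of the 102 NOS codes may well differ.

```python
from collections import Counter


def nos_group(label):
    """Hypothetical grouping rule: keep only the text before the first colon."""
    return label.split(":", 1)[0].strip()


def grouped_score(labels):
    """Polytopic score counted over NOS groups instead of raw NOS codes."""
    return len({nos_group(label) for label in labels})


# Hypothetical cases with the NOS labels of the briefs that cite them.
case_labels = {
    "Case A": {"Personal Injury: Other", "Personal Injury: Marine",
               "Personal Injury: Automotive"},
    "Case B": {"Personal Injury: Other", "Contract", "Civil Rights"},
    "Case C": {"Contract"},
}

raw = {case: len(labels) for case, labels in case_labels.items()}
grouped = {case: grouped_score(labels) for case, labels in case_labels.items()}

# Distribution of scores, i.e. how many cases fall at each score (as in Figure 3).
distribution = Counter(grouped.values())
print(raw, grouped, distribution)
```

Case A looks tri-topic on raw codes but becomes monotopic once the three personal injury codes are merged, while Case B keeps its score of 3.
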
Although the polytopic score is useful, there are some corner cases where this metric fails to measure proceduralness. For instance, if a case has a polytopic score of 7 and has only ever been cited 7 times, calling it procedural may not be correct, because so few citations may not be enough to give us an accurate polytopic score. As such, we need to account for how often cases are cited and adjust the polytopic score accordingly. Figure 4 shows the overall distribution of case citations, restricted to cases that have been cited at least once.
Here, we can see that roughly half of all cited cases are cited fewer than 20 times. (In the same vein, of the 8.99 million total cases that make up the common law, 5.65 million, or about 63%, have never been cited at all.) Using this citation information and the polytopic score for each case, I derived an updated polytopic score that accounts for the number of times a case is cited.
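
The exact adjustment is not spelled out here, so the following is only one possible way to discount scores that rest on few citations, not the formula used in this analysis. It shrinks the raw score toward zero when the citation count is small; the function name and the `prior_weight` of 20 (loosely echoing the 20-citation midpoint above) are assumptions chosen for illustration.

```python
def adjusted_score(score, citations, prior_weight=20):
    """Shrink a polytopic score toward zero when it rests on few citations.

    Assumption: simple multiplicative shrinkage, score * n / (n + prior_weight).
    """
    if citations <= 0:
        return 0.0
    return score * citations / (citations + prior_weight)


# The corner case from the text: score 7 built on only 7 citations.
thin = adjusted_score(7, 7)
# A heavily cited case: score 50 built on about 2,500 citations.
solid = adjusted_score(50, 2500)
print(round(thin, 2), round(solid, 2))
```

Under this assumption, the thinly cited case drops well below the threshold of 6 discussed earlier, while the heavily cited case is left almost unchanged.
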
With the help of lawyers, I manually went through the 10% of cases that were most procedural and the 10% of cases that were most substantive based on our polytopic scoring, to determine whether this measurement accurately identifies a case as procedural. Overall, our assumptions were verified, and we can say with some confidence that polytopicness is a reliable measure of proceduralness for a case. For reference, here are the 10 cases that were shown to be the most procedural.

ASHCROFT V. IQBAL 556 U.S. 662 (2009)
BELL ATL. CORP. V. TWOMBLY 550 U.S. 544 (2007)
CELOTEX CORP. V. CATRETT 477 U.S. 317 (1986)
CONLEY V. GIBSON 355 U.S. 41 (1957)
FOMAN V. DAVIS 371 U.S. 178 (1962)

While this analysis has shown a strong relationship between polytopicness and proceduralness, some fine-tuning is still needed to address the small subset of corner cases. The next step would be to see how clustering of NOS codes could be used to further refine the polytopic score. This analysis has also opened up other avenues to explore, including the different relationships between a brief and the cases it cites, how citation counts for cases differ between briefs and court opinions, and whether we can predict what a case is about using the substantive citations in the case documents.
If you have any questions, comments, or concerns, please feel free to send me an email at ravi@casetext.com.

Ravi Soni is a recent University of California, Berkeley graduate with a degree in Applied Mathematics. He is currently working as a Data Scientist at Casetext Inc., a legal technology company using AI to enhance legal research. Prior to joining Casetext, Ravi spent some time at other legal technology companies and worked as a legal assistant at a boutique IP firm where he focused on trademarks.