A litigation case management company aimed to stand out by predicting case outcomes—a key market differentiator. While they explored innovative technologies, they still needed help to evolve from basic analytics to impactful machine learning solutions.
In this article, we walk through the steps BlueCloud took to help this client build two machine learning models that predict case outcomes in the litigation space. We then examine the performance of both models, showing how advanced case management solutions powered by ML can help companies assess and predict their chances of success.
Challenge: Predicting Case Outcomes in the Litigation Space
Consumers harmed by defective products or pharmaceuticals often seek legal counsel for liability claims. Law firms need advanced solutions to handle these complex cases and assess their likelihood of success. Recognizing the critical role of case management systems, the client aimed to develop a machine learning model that not only predicts case outcomes but also tracks prediction changes throughout the case lifecycle.
BlueCloud partnered with the client to design a cutting-edge case management solution, leveraging machine learning to predict case success rates. These solutions are critical as they help the client drive revenue growth and enhance customer outcomes.
Solution: Building Advanced Case Management Solution with Machine Learning
BlueCloud built two complementary models: a pre-intake model that scores a case as soon as it enters the system, and a post-intake model that re-scores it as the case progresses. By analyzing historical data, both models identify patterns to predict case outcomes, enabling smarter decision-making and improved case prioritization.
Pre-Intake Model
The aim of the pre-intake model is to predict the outcome of a case at the moment it enters the system. Proven cases are labeled ‘success’, while canceled and disproven cases are labeled ‘failure’. A classification model was used for this purpose.
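As a rough illustration, the labeling step boils down to mapping case statuses onto a binary target. The status values and column names below are assumptions for illustration, not the client's actual schema.

```python
import pandas as pd

# Hypothetical case table; column and status names are illustrative assumptions.
cases = pd.DataFrame({
    "case_id": [101, 102, 103, 104],
    "case_status": ["Proven", "Canceled", "Disproven", "Proven"],
})

# Proven cases become the positive class; canceled and disproven cases the negative class.
label_map = {"Proven": 1, "Canceled": 0, "Disproven": 0}
cases["outcome"] = cases["case_status"].map(label_map)
```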
Digging into Data
The pre-intake ML model was trained on historical data that included various types of internal and external data. Data preparation involved joining tables containing the relevant data, cleaning data-quality issues, and filtering out cases that could have a negative effect on the dataset.
During the deeper data analysis, we considered the case status (outcome), the case category, and the distribution of cases across companies to understand the data and identify the important patterns.
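A minimal sketch of this kind of exploration might look like the following; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical extracts from the case management database.
cases = pd.read_csv("cases.csv")          # case_id, company_id, case_category, case_status
companies = pd.read_csv("companies.csv")  # company_id, company_name

# Join the relevant tables and drop rows with obvious data-quality issues.
df = cases.merge(companies, on="company_id", how="left")
df = df.dropna(subset=["case_status", "case_category"])

# Inspect outcome, category, and company distributions to spot important patterns.
print(df["case_status"].value_counts(normalize=True))
print(df.groupby(["company_name", "case_category"]).size().sort_values(ascending=False).head(10))
```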
Data and Feature Engineering
Model performance depends heavily on data quality, making feature engineering essential for selecting the relevant aspects of raw data based on the predictive task and model type. As described above, the pre-intake model predicts the outcome from the features available right after a case is ingested into the system. We built a feature set, including custom features, to help the model capture the relationship between the features and the label.
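For example, a custom feature available at ingestion might capture how long a claimant waited before the case entered the system. The feature below is purely hypothetical and only illustrates the idea.

```python
import pandas as pd

# Hypothetical intake snapshot; column names are illustrative only.
df = pd.DataFrame({
    "incident_date": pd.to_datetime(["2021-03-01", "2022-07-15"]),
    "intake_date": pd.to_datetime(["2021-06-10", "2022-08-01"]),
})

# Days between the alleged incident and case ingestion, one example of a custom feature.
df["days_incident_to_intake"] = (df["intake_date"] - df["incident_date"]).dt.days
```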
Creating Dataset
After analyzing the data and engineering the features, our next step was to build the final dataset that was clean, consistent, and representative of the overall process. Data cleaning steps included grouping rare examples into an "other" category, implementing filters, and removing outliers.
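A sketch of these cleaning steps, using assumed column names and thresholds, is shown below.

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame, rare_threshold: int = 50) -> pd.DataFrame:
    """Illustrative cleaning: group rare categories, apply filters, and trim outliers."""
    df = df.copy()

    # Group rare case categories into an "other" bucket.
    counts = df["case_category"].value_counts()
    rare = counts[counts < rare_threshold].index
    df.loc[df["case_category"].isin(rare), "case_category"] = "other"

    # Example filter: keep only cases with a known final outcome.
    df = df[df["outcome"].notna()]

    # Remove numeric outliers beyond the 1st-99th percentile range.
    lo, hi = df["days_incident_to_intake"].quantile([0.01, 0.99])
    return df[df["days_incident_to_intake"].between(lo, hi)]
```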
Data Modelling
After preparing the dataset, the next step was building the model pipeline. This involved converting all the data to a numerical format, as ML models require. The ML pipeline includes two main components: preprocessing and classification algorithm. The preprocessing step handles numerical, categorical, and binary data. Numerical features are processed with an imputer to fill missing values and a scaler to standardize them. Categorical data is imputed and then converted into a one-hot encoded format, while binary data remains unchanged. The preprocessed data is then fed into a classification algorithm.
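The preprocessing described above maps naturally onto a scikit-learn ColumnTransformer wrapped in a Pipeline; the feature lists below are placeholders rather than the client's actual columns.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["days_incident_to_intake", "num_calls"]   # assumed names
categorical_features = ["case_category", "company_name"]       # assumed names
binary_features = ["has_documents"]                             # assumed names

preprocessor = ColumnTransformer([
    # Numerical: impute missing values, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_features),
    # Categorical: impute, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
    # Binary: passed through unchanged.
    ("bin", "passthrough", binary_features),
])

model = Pipeline([("preprocess", preprocessor),
                  ("classifier", RandomForestClassifier(n_estimators=300, random_state=42))])
```

Keeping the preprocessing inside the pipeline ensures the exact same transformations are applied at training time and at scoring time.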
For our model, we tested various classification algorithms and found that tree-based models like Random Forest, XGBoost, and LightGBM performed best, as they excel in datasets with conditional relationships. These models split the data into smaller regions based on similar features, allowing them to capture patterns effectively.
To validate the model, we used K-fold cross-validation to identify the best algorithm and parameter combination.
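A condensed version of that search, comparing the tree-based candidates with 5-fold cross-validation, could look like this; the hyperparameter grids are illustrative, and X and y below are synthetic stand-ins for the prepared pre-intake features and labels.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Synthetic placeholder data standing in for the prepared pre-intake dataset.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.7], random_state=42)

candidates = {
    "random_forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [200, 500], "max_depth": [None, 10]}),
    "xgboost": (XGBClassifier(eval_metric="logloss", random_state=42),
                {"n_estimators": [200, 500], "max_depth": [4, 8]}),
    "lightgbm": (LGBMClassifier(random_state=42),
                 {"n_estimators": [200, 500], "num_leaves": [31, 63]}),
}

# K-fold cross-validation to find the best algorithm and parameter combination.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="f1", n_jobs=-1)
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)
```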
Building Machine Learning (ML) Pipeline
We built a sophisticated data pipeline within the Snowflake environment to optimize the pre-intake model training process. This pipeline integrates data from four different tables and incorporates specialized features to create the final dataset for model training. To automate the workflow, we implemented two core Snowflake Tasks: one for training the model and another for scoring the pre-intake data. After the execution of these tasks, all relevant artifacts, training data, and model metadata are securely stored in Snowflake.
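In Snowflake terms, the two tasks can be declared roughly as follows, issued here through the Snowpark Python API; the warehouse, schedule, and stored-procedure names are placeholders rather than the client's actual objects.

```python
from snowflake.snowpark import Session

# Connection parameters are placeholders; in practice they come from a secure config.
session = Session.builder.configs({"account": "...", "user": "...", "password": "..."}).create()

# Task 1: retrain the pre-intake model on a schedule.
session.sql("""
    CREATE OR REPLACE TASK TRAIN_PRE_INTAKE_MODEL
      WAREHOUSE = ML_WH
      SCHEDULE = 'USING CRON 0 2 * * * UTC'
    AS CALL TRAIN_PRE_INTAKE_PROC()
""").collect()

# Task 2: score incoming pre-intake cases after each training run.
session.sql("""
    CREATE OR REPLACE TASK SCORE_PRE_INTAKE_CASES
      WAREHOUSE = ML_WH
      AFTER TRAIN_PRE_INTAKE_MODEL
    AS CALL SCORE_PRE_INTAKE_PROC()
""").collect()
```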
Our prediction pipeline is designed to utilize the active model artifacts and generate a comprehensive data frame based on the most current data. It selectively generates scores for new or unscored cases, optimizing resource usage and keeping predictions timely. The pipeline then efficiently updates the prediction tables so insights stay up to date.
To further streamline development and deployment, we integrated a CI/CD workflow, which enhances operational efficiency, scalability, and accuracy for model training and data-driven decision-making.
Post-Intake Model
The post-intake model predicts case outcomes after the intake stage, using all available data up to the prediction point, such as case details, intake results, number of calls, and documents obtained.
Digging into Data
The model is trained on historical data, which includes past status changes in the database. State-based features and statistical insights from past states are also included to enrich the dataset. The model’s goal is to classify cases as either 'success' or 'failure,' where successful cases continue, and canceled or disproven cases are categorized as failures.
Feature Engineering
The key difference between the post-intake and pre-intake models is that the post-intake model can predict case outcomes at any state, not just before the intake. The dataset is created with available data at each state, incorporating as many features as possible to provide the model with rich information. For instance, intake-related features like longer-than-expected intake times can indicate a higher likelihood of case failure.
In addition to the intake data, we also included the features from earlier states, such as the number of documents obtained, or time spent in each state to enhance prediction accuracy.
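The sketch below derives two such state-level features, the time spent in each state and a running document count, from a hypothetical state-change log.

```python
import pandas as pd

# Hypothetical state-change log; column names are illustrative assumptions.
log = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2],
    "state": ["intake", "discovery", "review", "intake", "discovery"],
    "entered_at": pd.to_datetime(["2023-01-01", "2023-01-20", "2023-03-05",
                                  "2023-02-01", "2023-02-10"]),
    "documents_obtained": [2, 5, 1, 1, 3],
})

log = log.sort_values(["case_id", "entered_at"])
# Days spent in each state: the gap until the next state change (NaN for the current state).
log["days_in_state"] = (log.groupby("case_id")["entered_at"].shift(-1)
                        - log["entered_at"]).dt.days
# Cumulative documents obtained up to and including each state.
log["docs_so_far"] = log.groupby("case_id")["documents_obtained"].cumsum()
```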
Creating Dataset
The post-intake model predicts case outcomes at any state, which requires a state-wise dataset. To achieve this, we created a table that tracks state changes and produces summary data for each state. Key information such as phone calls, messages, documents, and events is summarized per state. This summary is then merged with case data from various tables, including campaigns, claimants, and intake data.
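Continuing the previous sketch, each per-state summary row can then be joined with case-level tables so the model sees one training row per case-state combination; the tables and columns remain hypothetical.

```python
import pandas as pd

# Per-state summary produced from the state-change log (see the previous sketch).
state_summary = log[["case_id", "state", "days_in_state", "docs_so_far"]]

# Hypothetical case-level tables.
cases = pd.DataFrame({"case_id": [1, 2], "case_category": ["pharma", "product"], "outcome": [1, 0]})
intake = pd.DataFrame({"case_id": [1, 2], "intake_duration_days": [12, 30], "num_calls": [4, 9]})

# One row per case and state, carrying everything known at that point plus the final label.
dataset = (state_summary
           .merge(cases, on="case_id", how="left")
           .merge(intake, on="case_id", how="left"))
```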
Data Modelling
We applied the same steps and processes to the post-intake model that we used in the pre-intake model. For validation, we used a custom cross-validation scheme to handle cases that contribute multiple rows to the dataset; this prevents the overrepresentation of individual cases and ensures a fair evaluation across cases. Because the data is imbalanced, we evaluated the model with the F1-score, which offers a more comprehensive assessment than accuracy alone, and the post-intake model achieved an impressive F1-score of 86%.
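The custom scheme keeps all rows of a case in the same fold; a standard way to approximate that in scikit-learn is a group-aware splitter such as StratifiedGroupKFold, shown below with synthetic stand-ins for the data and a placeholder classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score

# Synthetic stand-ins: X, y are post-intake features and labels; groups holds each row's case_id
# so that all rows belonging to the same case land in the same fold.
X, y = make_classification(n_samples=600, n_features=15, weights=[0.7], random_state=42)
groups = np.repeat(np.arange(200), 3)   # three state rows per case

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(model, X, y, groups=groups, cv=cv, scoring="f1", n_jobs=-1)
print(f"Mean F1 across case-aware folds: {scores.mean():.2f}")
```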
Building ML Pipeline
The post-intake model is a significant advancement in our data-driven initiatives, enhancing decision-making through a robust training and deployment pipeline. This model, like the pre-intake version, is built on a comprehensive dataset that integrates diverse data sources and features, all processed and managed in Snowflake.
The prediction pipeline leverages active model artifacts and preprocessing objects to create detailed data frames for scoring new inputs. Additionally, we have established tables such as POST_INTAKE_LATEST_DATA and POST_INTAKE_SCORES to manage and record post-prediction results.
The model scores only new or relevant data, optimizing computational resources.
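A simplified version of that incremental scoring step, using the Snowpark API and the table names mentioned above, might look like this; the column names, feature list, and model object are assumptions for illustration.

```python
# `session` is an existing snowflake.snowpark.Session; `model` is the active model artifact
# loaded from Snowflake. Table and column names beyond the two mentioned above are assumed.
feature_columns = ["DAYS_IN_STATE", "DOCS_SO_FAR", "NUM_CALLS"]   # placeholder feature list

latest = session.table("POST_INTAKE_LATEST_DATA").to_pandas()
scored = session.table("POST_INTAKE_SCORES").to_pandas()

# Score only case-state rows that have not been scored yet.
unscored = latest.merge(scored[["CASE_ID", "STATE"]], on=["CASE_ID", "STATE"],
                        how="left", indicator=True)
unscored = unscored[unscored["_merge"] == "left_only"].drop(columns="_merge")

if not unscored.empty:
    unscored["SCORE"] = model.predict_proba(unscored[feature_columns])[:, 1]
    session.write_pandas(unscored[["CASE_ID", "STATE", "SCORE"]],
                         "POST_INTAKE_SCORES", auto_create_table=False)
```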
Tying it All Together with UI
We built the user interface with three main components: data operations/helpers, main page content, and sidebar content. The UI manages data efficiently without redundant queries.
The main page functions as the landing page, featuring data getters and displaying querying progress.
The Pre-intake page offers a range of filters that dynamically update the main page content. It includes sections for Data, Exploratory Analyses, Model Information, and Model Data, each providing insights into prediction results, model performance, and dataset composition.
The post-intake page enhances the pre-intake features with new functionalities designed for the post-intake phase. It includes an additional sidebar filter for viewing data and plots on case status breakdowns, crucial for the multiple statuses encountered after intake. The Exploratory Analyses section now features a data table displaying predictions and scores for each state in a case's history. This addition allows users to track how the model’s predictions change with different statuses, offering a comprehensive view of each case's progress.
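The article does not name the UI framework; the sketch below assumes a Streamlit-style app, which matches the sidebar/main-page structure and cached data helpers described above. The data helper and its contents are placeholders.

```python
import pandas as pd
import streamlit as st

# Data operations / helpers: cached so pages avoid redundant queries.
@st.cache_data
def load_scores(model_name: str) -> pd.DataFrame:
    # Placeholder result set; the production version would query Snowflake here.
    return pd.DataFrame({"CASE_ID": [101, 102], "CASE_STATUS": ["open", "proven"],
                         "SCORE": [0.74, 0.31], "MODEL": [model_name] * 2})

# Sidebar content: filters that dynamically update the main page.
page = st.sidebar.radio("Model", ["Pre-intake", "Post-intake"])
status_filter = st.sidebar.multiselect("Case status", ["open", "proven", "canceled", "disproven"])

# Main page content: landing page showing querying progress and the filtered results.
st.title(f"{page} predictions")
with st.spinner("Querying data..."):
    scores = load_scores("PRE_INTAKE" if page == "Pre-intake" else "POST_INTAKE")
if status_filter:
    scores = scores[scores["CASE_STATUS"].isin(status_filter)]
st.dataframe(scores)
```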
Turning data chaos into actionable insights with Snowflake
BlueCloud’s expertise in the Snowflake Data Cloud was pivotal in helping the client unlock insights for advanced case management, smarter decisions, and improved litigation outcomes.
Impact: Data-Driven Decision Making with ML
This advanced case management solution enables the client to prioritize cases with the highest likelihood of success and to review areas for improvement in lower-likelihood cases, raising overall success rates.
Finally, by conducting a cost-benefit analysis and recommending Snowflake over AWS for its superior cost-effectiveness and functionality, BlueCloud helped the client save significant time and reduce costs.