All Eyes on Machine Learning: Driving Data-Driven Decisions with ML

From process automation to customer experience personalization, ML-powered solutions push the boundaries of what can be achieved across industries and organizations worldwide. One standout use case of machine learning is its ability to extract valuable insights from diverse datasets that drive predictive analytics, transform decision-making, and improve outcomes.

In this article, we will guide you through the steps BlueCloud took to build two machine learning models that predict the outcome of litigation cases. We will also examine the performance of both models, showing how ML-powered case management solutions can help companies assess and predict their chances of success.

Cracking the Case: Predicting Wins with ML

When consumers are harmed by defective products or pharmaceuticals, they often seek legal counsel to pursue liability claims. These cases are complex, and law firms need efficient ways to manage them and assess their chances of success. This is where tech-driven companies like BlueCloud step in, offering advanced case management solutions that predict whether a case will be successful. This capability is critical, as it directly affects a firm's revenue growth.

So, how can firms achieve this? Given the impact a strong case management system can have, it is vital to build an effective machine learning model that not only predicts the outcome of a case but also tracks how that prediction evolves throughout the case lifecycle. This is the problem we set out to solve for clients in the legal services sector.

Where Do We Get Our Experience From?

From the initial data analysis to model development, our ML-powered solutions help clients unlock the full power of AI.

MLOps holds the key to unlocking the enormous potential of AI. BlueCloud's machine learning consulting services are built on deep domain expertise and a strong history of successful ML implementations. We emphasize clear communication and a collaborative, problem-solving approach to deliver optimal outcomes for our clients. We help companies build solutions that enable them to make predictions, automate decision-making, create innovative products and services, and improve their bottom line fast.

Explore our AI and ML services to learn how we can help you transform insights into actions and make data-driven decisions to fuel your business growth.

Introducing ML Models

To predict case outcomes and prioritize the cases more likely to end positively, BlueCloud developed two machine learning models: a pre-intake model and a post-intake model.

These models use historical data to identify patterns between key features and predict case results based on those patterns.

The pre-intake model plays a crucial role in the early stages of case processing. After cases are ingested into the system, either through an API or manually, initial data becomes available. Before any further steps are taken, the pre-intake model uses this initial data to predict the outcome, streamlining the process from the start.

The post-intake model is activated once the intake process is complete. It predicts the potential outcome at this stage, helping teams manage cases efficiently through subsequent steps such as document requests, document gathering, quality control (QC), and reviews, all of which require significant effort. Efficient distribution of resources is crucial, and the model supports this process.

How We Did It

Building machine learning models involves many components and moving parts. Our team of ML professionals adopted an MLOps mindset to navigate these complexities and deliver the two ML models in 16 weeks. In this article, we cover all the stages of model development, including data analysis, data and feature engineering, dataset creation, modelling and fine-tuning, building the machine learning pipeline, and user interface (UI) implementation.

Pre-Intake Model

The aim of the pre-intake model is to predict the outcome of each case. Proven cases are treated as 'success', while canceled and disproven cases are treated as 'failure'. A binary classification model was used for this purpose.
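To make the labeling concrete, here is a minimal sketch of how raw case statuses could be mapped to a binary target. The status values and column names are purely illustrative, not the actual schema:

```python
import pandas as pd

# Hypothetical status values; the real system may use different codes.
LABEL_MAP = {"proven": 1, "canceled": 0, "disproven": 0}

def add_outcome_label(cases: pd.DataFrame) -> pd.DataFrame:
    """Map raw case statuses to a binary target: 1 = success, 0 = failure."""
    cases = cases.copy()
    cases["outcome"] = cases["status"].str.lower().map(LABEL_MAP)
    # Open or in-progress cases get no label and are excluded from training.
    return cases.dropna(subset=["outcome"]).astype({"outcome": int})
```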

Digging into Data

The pre-intake model was trained on historical data that included various types of legal records. Data analysis involved joining tables of relevant data, cleaning data quality issues, and filtering out cases that could negatively affect the dataset.

When conducting deep data analysis, we took the case status (outcome), the case category, and the case distribution across companies into consideration to understand the data and identify the important patterns.

Data and Feature Engineering

Model performance depends heavily on data quality, making feature engineering essential for selecting relevant aspects of the raw data based on the predictive task and model type. As described above, the pre-intake model predicts the outcome from features available right after cases are ingested into the system. We built a feature set, including custom features, to help the model capture the relationship between the features and the label.
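The article does not list the actual features, so the sketch below derives two plausible, purely illustrative pre-intake features; every column name (ingested_at, incident_date, law_firm_id) is a hypothetical stand-in:

```python
import pandas as pd

def build_pre_intake_features(cases: pd.DataFrame) -> pd.DataFrame:
    """Derive illustrative pre-intake features; all column names are hypothetical."""
    feats = cases.copy()
    # How quickly the case reached the system after the alleged incident.
    feats["days_incident_to_ingest"] = (
        feats["ingested_at"] - feats["incident_date"]
    ).dt.days
    # Volume of cases from the same referring firm, a rough proxy for source quality.
    feats["firm_case_count"] = feats.groupby("law_firm_id")["case_id"].transform("count")
    return feats
```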

Creating Dataset

After analyzing the data and engineering the features, our next step was to build the final dataset that was clean, consistent, and representative of the overall process. Data cleaning steps included grouping rare examples into an "other" category, implementing filters, and removing outliers.
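As one assumed realization of these cleaning steps, grouping rare category values under "other" and removing numeric outliers with the common IQR rule could look like this in pandas (the thresholds are illustrative):

```python
import pandas as pd

def group_rare_categories(s: pd.Series, min_count: int = 50) -> pd.Series:
    """Replace category values seen fewer than `min_count` times with 'other'."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), "other")

def drop_numeric_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows falling outside the Tukey (IQR) fences for `col`."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
```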

Data Modelling

After preparing the dataset, the next step was building the model pipeline. This involved converting all the data to a numerical format, as ML models require. The ML pipeline includes two main components: preprocessing and a classification algorithm. The preprocessing step handles numerical, categorical, and binary data. Numerical features are processed with an imputer to fill missing values and a scaler to standardize them. Categorical data is imputed and then converted into a one-hot encoded format, while binary data remains unchanged. The preprocessed data is then fed into the classification algorithm.
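This preprocessing-plus-classifier design maps naturally onto a scikit-learn Pipeline with a ColumnTransformer. The sketch below uses hypothetical column lists and a Random Forest as a placeholder classifier (the algorithm comparison is discussed next):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["days_incident_to_ingest", "firm_case_count"]  # hypothetical
categorical_cols = ["case_category", "company"]                # hypothetical
binary_cols = ["has_prior_claim"]                              # hypothetical

preprocessor = ColumnTransformer([
    # Numerical: fill missing values, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical: fill missing values, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
    # Binary flags pass through unchanged.
    ("bin", "passthrough", binary_cols),
])

model = Pipeline([("prep", preprocessor),
                  ("clf", RandomForestClassifier(n_estimators=300, random_state=42))])
```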

For our model, we tested various classification algorithms and found that tree-based models like Random Forest, XGBoost, and LightGBM performed best, as they excel in datasets with conditional relationships. These models split the data into smaller regions based on similar features, allowing them to capture patterns effectively.

To validate the model, we used K-fold cross-validation to identify the best algorithm and parameter combination.    
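Building on the pipeline sketch above, a K-fold search over parameters could look like the following; the grid values are illustrative, not the ones the team used:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "clf__n_estimators": [200, 400],
    "clf__max_depth": [None, 10, 20],
}

search = GridSearchCV(
    model,                 # the pipeline sketched above
    param_grid,
    scoring="f1",          # more informative than accuracy on imbalanced outcomes
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
# search.fit(X_train, y_train); inspect search.best_params_ and search.best_score_
```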

Building Machine Learning (ML) Pipeline

We built a sophisticated data pipeline within the Snowflake environment to optimize the pre-intake model training process. This pipeline integrates data from four different tables and incorporates specialized features to create the final dataset for model training. To automate the workflow, we implemented two core Snowflake Tasks: one for training the model and another for scoring the pre-intake data. After the execution of these tasks, all relevant artifacts, training data, and model metadata are securely stored in Snowflake.  
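The article does not show the task definitions; below is a minimal sketch of how such a training task could be created through the Snowflake Python connector. The stored procedure name, warehouse, schedule, and credentials are all hypothetical placeholders:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="...", user="...", password="...",  # credentials elided
    warehouse="ML_WH", database="LEGAL_DB", schema="ML",
)
# A scheduled Snowflake Task that calls a (hypothetical) training procedure.
conn.cursor().execute("""
    CREATE OR REPLACE TASK TRAIN_PRE_INTAKE_MODEL
      WAREHOUSE = ML_WH
      SCHEDULE = 'USING CRON 0 2 * * 0 UTC'  -- e.g. weekly retraining
    AS
      CALL TRAIN_PRE_INTAKE_SP()
""")
# Tasks are created suspended; resume to activate the schedule.
conn.cursor().execute("ALTER TASK TRAIN_PRE_INTAKE_MODEL RESUME")
```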

Our prediction pipeline is designed to utilize the active model artifacts and generate a comprehensive data frame based on the most current data. It selectively generates scores for new or unscored cases, optimizing resource usage and ensuring timely predictions. The pipeline efficiently updates prediction tables, ensuring up-to-date insights.
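A minimal pandas sketch of this selective-scoring idea, assuming hypothetical case_id and score columns and a fitted pipeline like the one sketched earlier:

```python
import pandas as pd

def score_new_cases(latest: pd.DataFrame, scores: pd.DataFrame, model) -> pd.DataFrame:
    """Generate scores only for cases that have none yet (anti-join on case_id)."""
    unscored = latest[~latest["case_id"].isin(scores["case_id"])]
    if unscored.empty:
        return scores
    new = unscored[["case_id"]].copy()
    # Probability of the 'success' class from the fitted pipeline.
    new["success_probability"] = model.predict_proba(
        unscored.drop(columns=["case_id"]))[:, 1]
    return pd.concat([scores, new], ignore_index=True)
```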

To further streamline development and deployment, we integrated a CI/CD workflow, which enhances operational efficiency, scalability, and accuracy for model training and data-driven decision-making.

Post-Intake Model

The post-intake model predicts case outcomes after the intake stage, using all available data up to the prediction point, such as case details, intake results, number of calls, and documents obtained.

Digging into Data

The model is trained on historical data, which includes past status changes recorded in the database. State-based features and statistical insights from past states are also included to enrich the dataset. As with the pre-intake model, the goal is to classify cases as either 'success' or 'failure': proven cases count as successes, while canceled or disproven cases count as failures.

Feature Engineering

The key difference between the post-intake and pre-intake models is that the post-intake model can predict case outcomes at any state, not just before the intake. The dataset is created with available data at each state, incorporating as many features as possible to provide the model with rich information. For instance, intake-related features like longer-than-expected intake times can indicate a higher likelihood of case failure.

In addition to the intake data, we also included features from earlier states, such as the number of documents obtained or the time spent in each state, to enhance prediction accuracy.
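To illustrate, cumulative state-history features of this kind can be derived with pandas group operations; column names such as state_entered_at and documents_in_state are hypothetical:

```python
import pandas as pd

def add_state_history_features(rows: pd.DataFrame) -> pd.DataFrame:
    """Per (case, state) row, add cumulative signals from the case's history.
    Assumes one row per case state, with a `state_entered_at` timestamp."""
    rows = rows.sort_values(["case_id", "state_entered_at"]).copy()
    g = rows.groupby("case_id")
    # How many states the case has already passed through.
    rows["states_passed"] = g.cumcount()
    # Running totals of documents and calls collected so far.
    rows["docs_so_far"] = g["documents_in_state"].cumsum()
    rows["calls_so_far"] = g["calls_in_state"].cumsum()
    # Time since the previous state change, in days.
    rows["prev_state_days"] = g["state_entered_at"].diff().dt.days
    return rows
```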

Creating Dataset

The post-intake model predicts case outcomes at any state, which requires a state-wise dataset. To achieve this, we created a table that tracks state changes and generates summary data for each state. Key information such as phone calls, messages, documents, and events is summarized per state. This summary is then merged with case data from various tables, including campaigns, claimants, and intake data.
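A simplified sketch of this state-wise assembly, assuming a hypothetical activity log with one row per call, message, or document:

```python
import pandas as pd

def build_state_dataset(activity_log: pd.DataFrame,
                        cases: pd.DataFrame,
                        intake: pd.DataFrame) -> pd.DataFrame:
    """One row per (case, state), summarizing activity and joining case data.
    All table and column names are hypothetical."""
    summary = (
        activity_log
        .groupby(["case_id", "state"])["activity_type"]
        .value_counts()
        .unstack(fill_value=0)   # columns: call, message, document, event, ...
        .reset_index()
    )
    # Merge with case-level data (campaigns, claimants) and intake results.
    return (summary
            .merge(cases, on="case_id", how="left")
            .merge(intake, on="case_id", how="left"))
```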

Data Modelling

We applied the same steps and processes to the post-intake model that we used for the pre-intake model. For validation, however, we used custom cross-validation to handle cases that appear in multiple rows of the dataset. This approach prevents the overrepresentation of individual cases and ensures a fair evaluation across cases. Because the data is imbalanced, we evaluated the model with the F1-score, which offers a more comprehensive assessment than accuracy alone; the post-intake model achieved an impressive F1-score of 86%.
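The article does not detail the custom scheme, but scikit-learn's GroupKFold captures the core idea of keeping every row of a case inside a single fold; this sketch is one way to implement it, not necessarily the team's exact mechanism:

```python
from sklearn.model_selection import GroupKFold, cross_val_score

def grouped_f1(model, X, y, case_ids, n_splits=5):
    """Cross-validate so every row of a case stays in a single fold,
    preventing the same case from appearing in both train and validation."""
    cv = GroupKFold(n_splits=n_splits)
    return cross_val_score(model, X, y, groups=case_ids, cv=cv,
                           scoring="f1", n_jobs=-1)
```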

Building ML Pipeline

The post-intake model is a significant advancement in our data-driven initiatives, enhancing decision-making through a robust training and deployment pipeline. This model, like the pre-intake version, is built on a comprehensive dataset that integrates diverse data sources and features, all processed and managed in Snowflake.

Key data management components include:

  • Post-intake training: Stores recent training data for model refinement.
  • Post-intake test: Holds the latest test data for model evaluation.
  • Model metadata: Captures essential metadata for model governance.
  • Post-intake scores model: Records scores for training and test datasets.

The prediction pipeline leverages active model artifacts and preprocessing objects to create detailed data frames for scoring new inputs. Additionally, we have established tables such as POST_INTAKE_LATEST_DATA and POST_INTAKE_SCORES to manage and record post-prediction results.

The model scores only new or relevant data, optimizing computational resources.

What Are the Key Differences Between the Pre-Intake and Post-Intake Models?

Here are the key differences between the post-intake and pre-intake pipelines:

  1. Feature Utilization: The post-intake model includes the feature that tracks the state of the case after intake, unlike the pre-intake model where this feature is excluded due to incomplete intake states. The post-intake model also incorporates features like the number of states passed and counts of events and documents from previous states.
  2. Scoring and Data Tables: The post-intake scores tables indicate the state at which scores are generated. This allows the post-intake model to produce different scores for each state of a case, unlike the pre-intake model which generates a single score per case.
  3. Data Splitting: The pre-intake model uses a random train-test split, while the post-intake model employs a case-based split to avoid having the same cases in both training and test sets. This approach ensures more reliable performance evaluation and allows flexibility in the splitting method (see the sketch after this list).
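For the case-based split in point 3, scikit-learn's GroupShuffleSplit is one way to guarantee that a case's rows never straddle the train/test boundary; a minimal sketch, assuming pandas inputs:

```python
from sklearn.model_selection import GroupShuffleSplit

def case_based_split(X, y, case_ids, test_size=0.2, seed=42):
    """Split so each case's rows land entirely in train or test, never both."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=case_ids))
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]
```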

Tying it All Together with UI

We built the user interface with three main components: data operations/helpers, main page content, and sidebar content. The UI manages data efficiently without redundant queries.

The main page functions as the landing page, featuring data getters and displaying querying progress.  

The Pre-intake Page offers filters, including case ID and law firm, that update the main page content dynamically. It includes sections for Data, Exploratory Analyses, Model Information, and Model Data, each providing insights into prediction results, model performance, and dataset composition.

The post-intake page enhances the pre-intake features with new functionalities designed for the post-intake phase. It includes an additional sidebar filter for viewing data and plots on case status breakdowns, crucial for the multiple statuses encountered after intake. The Exploratory Analyses section now features a data table displaying predictions and scores for each state in a case's history. This addition allows users to track how the model’s predictions change with different statuses, offering a comprehensive view of each case's progress.
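The article does not name the UI framework; assuming a Streamlit-style app (a common pairing with Snowflake), the cached data getters and sidebar filters could be sketched as below, with all table and column names hypothetical and an inline demo frame standing in for the real Snowflake query:

```python
import pandas as pd
import streamlit as st

@st.cache_data  # cache so the UI avoids redundant queries
def load_scores() -> pd.DataFrame:
    # In the real app this would query a scores table in Snowflake;
    # a small inline frame keeps the sketch self-contained.
    return pd.DataFrame({
        "case_id": [101, 101, 102],
        "state": ["intake", "document_review", "intake"],
        "state_order": [1, 2, 1],
        "success_probability": [0.62, 0.71, 0.35],
    })

scores = load_scores()

# Sidebar filter that updates the main page content dynamically.
case_id = st.sidebar.selectbox("Case ID", scores["case_id"].unique())
st.header(f"Prediction history for case {case_id}")

# One row per state: how the model's score evolved across the case lifecycle.
st.dataframe(scores[scores["case_id"] == case_id]
             .sort_values("state_order")[["state", "success_probability"]])
```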

What Do You Want to Achieve with ML?

Millions of businesses use ML for routine decisions, but when applied at scale, it can tackle big strategic questions. By leveraging data across an organization, ML models can predict the impact of major initiatives, such as reaching net-zero CO2 emissions or transitioning to electric fleets. This shifts decision-making from gut instinct to data-driven insight, turning uncertain leaps into calculated, well-informed strategies.

Building an advanced case management system is one such application, with the potential to revolutionize the litigation space, saving significant time and money while safeguarding people’s rights.
