Written by: Ismail Karaman
Introduction
Snowflake is a leading cloud-based data warehousing platform renowned for its scalability, performance, and ease of use. Unlike traditional data warehouses, Snowflake operates entirely in the cloud, offering a fully managed service that eliminates the need for hardware provisioning, maintenance, and tuning.
Integrating machine learning into data warehousing enhances insight extraction from vast datasets, automating predictive analytics and anomaly detection. Snowflake makes complex statistical and machine learning techniques accessible to Data/Business Analysts through SQL-based ML Functions. These include Forecasting, Anomaly Detection, and Contribution Explorer, which are explored in this article with use cases illustrating their capabilities. More information on ML-based Forecasting and Anomaly Detection in Snowflake Cortex can be found here.
Why and when to use Snowflake Cortex?
Snowflake Cortex provides fully managed ML-based functions for efficient data analysis and AI application development within the Snowflake platform. These functions use machine learning to generate insights and predictions from data, eliminating the need for specialized ML expertise.
Time-series functions within Snowflake Cortex leverage ML to analyze historical data trends, enabling functionalities like forecasting, anomaly detection, and Contribution Explorer. These capabilities empower organizations to gain deeper insights into their data without extensive ML expertise, streamlining the extraction of actionable insights and driving informed decision-making. By automating analytical tasks within the Snowflake platform, businesses can manage storage and compute costs associated with ML model training and deployment efficiently.
It's important to consider storage and computation costs, which vary based on data usage and feature selection, and can be optimized through managing model instances and monitoring usage views. More details on Snowflake compute costs are available here.
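For example, credit consumption can be reviewed through the shared ACCOUNT_USAGE schema. The sketch below is illustrative only: it assumes access to the SNOWFLAKE.ACCOUNT_USAGE.METERING_HISTORY view (typically available to the ACCOUNTADMIN role or roles granted access) and summarizes credits by service type over the last 30 days.
-- Sketch: summarize credit consumption by service type over the last 30 days
SELECT service_type,
       DATE_TRUNC('day', start_time) AS usage_day,
       SUM(credits_used) AS total_credits
FROM snowflake.account_usage.metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY service_type, usage_day
ORDER BY usage_day, total_credits DESC;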
Exploring Snowflake's ML-Powered Functions
Time-Series Forecasting
Time-Series Forecasting harnesses the power of machine learning to predict future data based on historical time series data patterns. Whether analyzing single-series or multi-series data, this algorithm, driven by a gradient boosting machine (GBM), incorporates sophisticated techniques such as ARIMA models. Moreover, it efficiently manages high-cardinality categorical data, ensuring robust predictive capabilities. By incorporating exogenous variables and providing prediction intervals set at a default confidence level of 0.95, Time-Series Forecasting within Snowflake Cortex offers a comprehensive approach to anticipating future trends with confidence.
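As a brief sketch of what this looks like in SQL (the model name, view, and column names below are hypothetical; the pattern mirrors the anomaly detection syntax covered later in this article):
-- Train a forecasting model on a hypothetical view of historical daily sales
CREATE SNOWFLAKE.ML.FORECAST sales_forecast_model(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'daily_sales_view'),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

-- Produce predictions, with prediction intervals, for the next 14 periods
CALL sales_forecast_model!FORECAST(FORECASTING_PERIODS => 14);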
Contribution Explorer
Contribution Explorer, a notable feature within Snowflake Cortex, plays a pivotal role in root cause analysis by identifying the specific data segments responsible for fluctuations in observed metrics over time. By delving into the progression of metric values and highlighting alterations in data trends, Contribution Explorer facilitates actionable insights and optimization strategies. This functionality relies on datasets with non-negative metrics, timestamps, and diverse segmenting dimensions, allowing for nuanced analysis and segmentation of the underlying data structure.
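Contribution Explorer is exposed through the SNOWFLAKE.ML.TOP_INSIGHTS table function. The sketch below is illustrative only: the table, dimension, and column names are hypothetical, and the exact argument list should be verified against Snowflake's documentation. The function compares a test period against a control period (marked by a boolean label) and surfaces the segments that contribute most to the change in the metric.
-- Hypothetical example: which segments drove the change in a sales metric
-- between a control period (May) and a test period (August)?
WITH input AS (
  SELECT
    OBJECT_CONSTRUCT('region', region, 'device', device) AS categorical_dimensions,
    OBJECT_CONSTRUCT() AS continuous_dimensions,
    CAST(sales AS FLOAT) AS metric,
    IFF(sale_date BETWEEN '2020-08-01' AND '2020-08-31', TRUE, FALSE) AS label
  FROM sales_data
  WHERE sale_date BETWEEN '2020-05-01' AND '2020-05-31'
     OR sale_date BETWEEN '2020-08-01' AND '2020-08-31'
)
SELECT res.*
FROM input, TABLE(
  SNOWFLAKE.ML.TOP_INSIGHTS(
    input.categorical_dimensions,
    input.continuous_dimensions,
    input.metric,
    input.label
  ) OVER (PARTITION BY 0)
) res
ORDER BY res.surprise DESC;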
Anomaly Detection
Anomaly Detection, a key component of Snowflake-Cortex's ML-powered functions, stands out for its ability to identify outliers within time series data. Utilizing a gradient boosting machine (GBM) algorithm coupled with ARIMA-like transformations, this feature excels at detecting anomalies and addressing potential data irregularities. By incorporating prediction intervals, which can be tailored based on specific use cases, Anomaly Detection offers a robust solution for ensuring data integrity and reliability. Snowflake's comprehensive documentation further elucidates the intricacies of this process, including model limitations, cost considerations, and training time.
The Syntax
The syntax for model operations is straightforward and similar to table operations. To create and train a model, the following syntax can be used:
CREATE SNOWFLAKE.ML.ANOMALY_DETECTION <model_name>(...);
‘CREATE’ is the same keyword used for creating models and tables. The statement creates an ‘ANOMALY_DETECTION’ object from the ‘SNOWFLAKE.ML’ schema. A model name must be provided for future reference, and model parameters are given inside the parentheses.
CALL <model_name>!DETECT_ANOMALIES(...);
Now, the model can be called to predict anomalies. The ‘SHOW’ keyword can also be used to list existing anomaly detection models.
SHOW SNOWFLAKE.ML.ANOMALY_DETECTION;
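A model instance that is no longer needed can be dropped much like a table, which also helps manage the storage costs mentioned earlier:
DROP SNOWFLAKE.ML.ANOMALY_DETECTION <model_name>;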
Use Case Application
Consider a scenario in which a software company relies on Snowflake for storing data and conducting all of its machine learning tasks. Various departments utilize its capabilities, and concurrent R&D and software projects aim to enhance company operations. However, the company faces challenges due to escalating cloud costs, prompting a search for a partner capable of optimizing these expenses.
BlueCloud, with its scalable, high-performing, and on-demand global workforce, offers premium-rate assistance in optimizing Snowflake, aiming to mitigate unexpected usage costs for clients. An engineer from the team identifies an anomaly detection algorithm as the optimal solution for keeping usage under control. Through this implementation, the client can proactively manage cost distribution and prevent sudden spikes. The project was executed using Snowflake's ML-powered functions. The following outlines the steps the BlueCloud team would undertake:
The first step is creating a view for the data to analyze.
CREATE OR REPLACE VIEW training_usage_data_view AS
SELECT date, num_of_active_rd_projects, num_of_active_sw_projects, total_projects_use_storage, usage
FROM cloud_usage_data
WHERE date >= '2020-01-01' AND service_provider = 'Snowflake';
The second step is creating the model object. To do that, ‘CREATE SNOWFLAKE.ML.ANOMALY_DETECTION’ is used, and the model is trained as part of this object creation. Several parameters must be provided. ‘INPUT_DATA’ specifies the dataset to use; in our case, it references the view created in the previous step. ‘TIMESTAMP_COLNAME’ indicates the date column, and ‘TARGET_COLNAME’ references the value to be analyzed. ‘LABEL_COLNAME’ can be left as an empty string if there are no labels to provide; in this case, the aim is to detect anomalies using only historical data. Lastly, ‘CONFIG_OBJECT’ can be used to tune the prediction interval, which is set to 0.95 here.
CREATE SNOWFLAKE.ML.ANOMALY_DETECTION usage_anomaly_model(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'training_usage_data_view'),
TIMESTAMP_COLNAME => 'date',
TARGET_COLNAME => 'usage',
LABEL_COLNAME => '',
CONFIG_OBJECT => {'prediction_interval':0.95}
);
Now the model object is created and trained on the given dataset. The ‘CALL’ command can then be used to detect anomalies in new, unseen data. The dataset to analyze should be supplied to produce predictions; the rest of the syntax mirrors model creation.
CALL usage_anomaly_model!DETECT_ANOMALIES(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'prediction_usage_data_view'),
TIMESTAMP_COLNAME => 'date',
TARGET_COLNAME => 'usage',
CONFIG_OBJECT => {'prediction_interval':0.95}
);
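The call returns a result set containing the original values, forecasts, prediction-interval bounds, and an anomaly flag. A convenient pattern, sketched below and assuming the documented is_anomaly output column, is to persist the output with RESULT_SCAN so it can be filtered, joined, or used to drive alerts:
-- Persist the results of the previous CALL for further analysis
CREATE OR REPLACE TABLE usage_anomalies AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- Review only the rows flagged as anomalous
SELECT * FROM usage_anomalies WHERE is_anomaly = TRUE;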
Conclusion
Snowflake's newly released ML-powered functions in its data warehousing platform present many benefits and opportunities for businesses. The utilization of SQL-based ML Functions, particularly in time-series forecasting, anomaly detection, and contribution explorer, provides organizations with enhanced capabilities to derive insights and make informed decisions from their data.
One of the significant advantages of Snowflake's ML-powered functions is that they streamline the process of transforming raw data into actionable intelligence, empowering businesses to extract maximum value from their data assets. This integration not only enhances operational efficiency but also fosters a data-driven culture within organizations, leading to improved decision-making and a competitive edge in the market. Furthermore, their accessibility to non-technical users enables automation of predictive analytics and anomaly detection tasks without requiring specialized ML knowledge. This democratization of advanced analytics facilitates the distribution of actionable insights to stakeholders across various departments.
However, alongside these benefits, organizations must carefully manage storage and computation costs associated with using Snowflake's ML functions. Optimizing costs involves efficient storage management and consideration of computational resources required for model training and deployment, while also addressing challenges related to data quality, model interpretation, and deployment scalability.
In summary, Snowflake's ML-powered functions offer a compelling solution for organizations seeking to harness the power of machine learning within their data warehousing environment. By leveraging these functions, businesses can unlock new insights, automate analytical tasks, and drive innovation, ultimately leading to improved business outcomes and sustained competitive advantage in today's data-driven landscape.