Big Data Analytics with Cloud Notebooks and Query Style ML

A mini-guide about combining machine learning with cloud services

Cloud services play a much larger role than they used to, thanks to their flexibility, sustainability, cost savings, and security. How to combine machine learning with cloud services is therefore becoming an important part of the cloud’s future. Today we can choose among several cloud vendors, and each offers similar components related to machine learning: for example, Microsoft Azure Notebooks, GCP Cloud AI Platform Notebooks, AWS EMR Notebooks, and Machine Learning Services in Azure SQL.

In this post we use GCP as our example. When I look at its offerings, such as AI Platform Notebooks, BigQuery ML, and AutoML, one big question comes to mind: how do we choose the right one to reach our goal? To explore that, we build a logistic regression model in two different ways on the data from “Titanic - Machine Learning from Disaster” (reference 1), then compare the pros and cons of Cloud Notebook and Query Style ML based on the results.

The data comes from Kaggle, and the main goal is to predict which passengers survived the Titanic shipwreck.

Agenda

  • The Architecture for generic Cloud Notebook
  • Predict data based on Notebook
  • Query Style ML
  • Predict data based on Query Style ML
  • Comparison between Cloud Notebook and Query Style ML
  • Recommendations

The Architecture for generic Cloud Notebook

Cloud Notebook — Architecture

As the chart above shows, the Jupyter Notebook runs on a Compute VM with a preconfigured instance. It can reach other services such as SQL and Storage. In effect, it is an enhanced Jupyter Notebook with the following benefits.

Benefits for Notebook

Predict data based on Notebook

We follow the steps below to build our notebook; the detailed code is in the notebook under reference 2:

  • Business Understanding (Objective and Description)
  • Data Understanding: statistical summaries, visualizations, and data quality checks
  • Data Preparation: missing value imputation, feature engineering, and data integration
  • Modeling: build, fit, and validate the model
  • Evaluation
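To make the Data Preparation step more concrete, here is a minimal sketch of median imputation plus a simple encoded feature. It is not the code from the reference 2 notebook: the sample rows are made up, the `prepare` helper and the `IsFemale` feature name are hypothetical, and only the `Age`, `Sex`, and `Survived` column names come from the Kaggle dataset.

```python
import statistics

# Made-up rows that mimic a few Kaggle Titanic columns; None marks a missing Age.
rows = [
    {"Age": 22.0, "Sex": "male", "Survived": 0},
    {"Age": None, "Sex": "female", "Survived": 1},
    {"Age": 38.0, "Sex": "female", "Survived": 1},
    {"Age": 30.0, "Sex": "male", "Survived": 0},
]


def prepare(rows):
    """Impute missing Age with the median and encode Sex as a 0/1 feature."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    median_age = statistics.median(known_ages)
    prepared = []
    for r in rows:
        prepared.append({
            # Keep the observed age, otherwise fall back to the median.
            "Age": r["Age"] if r["Age"] is not None else median_age,
            # Hypothetical engineered feature: 1 for female, 0 otherwise.
            "IsFemale": 1 if r["Sex"] == "female" else 0,
            "Survived": r["Survived"],
        })
    return prepared


prepared = prepare(rows)
for row in prepared:
    print(row)
```

In a real notebook this would more likely be done with a dataframe library; plain Python is used here only to keep the sketch free of dependencies.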

Query Style ML?

Query Style ML is a new area of machine learning. Its key idea is to create an ML model using only SQL. Even if we don’t know a programming language such as Python, or its ML libraries, we can still build ML models with SQL.

In this example we use BigQuery ML, whose workflow has four steps: get the dataset, create and train the model, evaluate the model, and predict the result.

Photo credit: Google Cloud Platform

So far, however, BigQuery ML supports only a limited set of models for regression, classification, and clustering; they are listed in the table below.

BigQueryML: Support Models

Predict data based on Query Style ML

Because statistical summaries, visualizations, data quality checks, and feature engineering are not easy to do in SQL, we take the feature-engineered data produced in the notebook part and load it into a BigQuery table (BigQuery is our example here).
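One simple way to hand the engineered data over is to export it as CSV, a format BigQuery can load into a table. The sketch below only shows the export side; the rows, the column names, and the `to_csv_text` helper are hypothetical, and the article does not say which loading method was actually used.

```python
import csv
import io


def to_csv_text(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up feature-engineered rows standing in for the notebook output.
engineered = [
    {"Age": 22.0, "IsFemale": 0, "Survived": 0},
    {"Age": 30.0, "IsFemale": 1, "Survived": 1},
]

csv_text = to_csv_text(engineered, ["Age", "IsFemale", "Survived"])
print(csv_text)
```

Writing `csv_text` to a file gives something that can be uploaded through the BigQuery console or a load job.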

After loading the data, we follow the four steps below for the BigQuery ML process:

Step 1: Create Model based on Training Data

Just as we would create a table, we can create the logistic regression model from the training set. There are many parameters we could set, but for now we use the default values.

BigQueryML — Create Model
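The original query is not reproduced in this text, so the following is a sketch of what such a statement could look like, built as a string in Python. The dataset, model, and table names (`titanic.survival_model`, `titanic.train`) and the `create_model_sql` helper are assumptions; `model_type` and `input_label_cols` are standard BigQuery ML options, and everything else is left at its default.

```python
def create_model_sql(model, train_table, label):
    """Return a BigQuery ML statement that trains a logistic regression model."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'logistic_reg', input_label_cols = ['{label}']) AS\n"
        f"SELECT * FROM `{train_table}`"
    )


# Hypothetical dataset, model, and table names; Survived is the Kaggle label column.
sql = create_model_sql("titanic.survival_model", "titanic.train", "Survived")
print(sql)
```

The printed statement can be pasted into the BigQuery console. With the `google-cloud-bigquery` package installed and credentials configured, it could also be submitted from a notebook with `bigquery.Client().query(sql).result()`.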

Step 2: Get Model Info

After creating the model, we can retrieve a lot of model information using SQL. The query below returns the weights for all features. Much more model metadata is available; see the page under reference 3.

BigQuery — Get Model Info
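As a rough sketch of this step, the helper below builds three inspection queries. `ML.WEIGHTS`, `ML.TRAINING_INFO`, and `ML.FEATURE_INFO` are BigQuery ML functions; the model name and the `model_info_sql` helper are hypothetical, and the article itself only mentions the weights query.

```python
def model_info_sql(model):
    """Return BigQuery ML queries that inspect a trained model."""
    return {
        # Learned weight per feature.
        "weights": f"SELECT * FROM ML.WEIGHTS(MODEL `{model}`)",
        # Loss per training iteration.
        "training_info": f"SELECT * FROM ML.TRAINING_INFO(MODEL `{model}`)",
        # Summary statistics of the input features.
        "feature_info": f"SELECT * FROM ML.FEATURE_INFO(MODEL `{model}`)",
    }


# Hypothetical model name.
queries = model_info_sql("titanic.survival_model")
for name, query in queries.items():
    print(f"-- {name}\n{query}")
```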

Step 3: Evaluate Model

Using the query below and the test dataset, we get six performance metrics for classification problems: precision, recall, accuracy, f1_score, log_loss, and roc_auc.

BigQuery — Evaluate Model
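An evaluation query along these lines could be generated as follows. This is a sketch under assumptions: the model and table names and the `evaluate_model_sql` helper are hypothetical, and the test table is assumed to contain the label column so that `ML.EVALUATE` can score the predictions.

```python
def evaluate_model_sql(model, test_table):
    """Return a BigQuery ML query that scores a classifier on labeled test data."""
    return (
        "SELECT precision, recall, accuracy, f1_score, log_loss, roc_auc\n"
        f"FROM ML.EVALUATE(MODEL `{model}`, (SELECT * FROM `{test_table}`))"
    )


# Hypothetical model and table names; the test table must include the label column.
eval_sql = evaluate_model_sql("titanic.survival_model", "titanic.test")
print(eval_sql)
```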

Step 4: Predict

The final step is to generate predictions from the model, using the query below.

BigQuery — Predict
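A prediction query might look like the sketch below. The model and table names and the `predict_sql` helper are assumptions. For a logistic regression model, `ML.PREDICT` adds output columns named after the label, so with a `Survived` label the result would include `predicted_Survived` and `predicted_Survived_probs` alongside the input columns.

```python
def predict_sql(model, input_table):
    """Return a BigQuery ML query that scores every row of an input table."""
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model}`, (SELECT * FROM `{input_table}`))"
    )


# Hypothetical model and table names.
pred_sql = predict_sql("titanic.survival_model", "titanic.test")
print(pred_sql)
```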

Comparison between Cloud Notebook and Query Style ML

We used two different ways to create a logistic regression on the same data. Let’s look at the performance of the two models. As the table below shows, the two models have very similar results.

| Metric    | Notebook    | BigQuery ML |
|-----------|-------------|-------------|
| precision | 0.798407557 | 0.790697674 |
| recall    | 0.746405229 | 0.790697674 |
| accuracy  | 0.842602041 | 0.850622407 |
| roc_auc   | 0.901349357 | 0.902874126 |
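For readers who want to relate these metrics to raw predictions, the sketch below computes precision, recall, accuracy, and F1 from confusion-matrix counts. The counts are an assumption: the article does not publish its confusion matrix, and these values were simply chosen so that precision, recall, and accuracy agree with the BigQuery ML column above to the printed precision. The `classification_metrics` helper is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)  # share of predicted survivors who survived
    recall = tp / (tp + fn)     # share of actual survivors who were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1,
    }


# Illustrative counts (an assumption, not data from the article).
metrics = classification_metrics(tp=68, fp=18, fn=18, tn=137)
for name, value in metrics.items():
    print(f"{name}: {value:.9f}")
```

When precision and recall are equal, as in the BigQuery ML column, the counts of false positives and false negatives are the same, and F1 takes that same value.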

Apart from model performance, let’s dive into the seven categories below for a more detailed comparison of Cloud Notebooks and Query Style ML. Scores range from low to high, and the final comparison table follows the category notes.

Data Cleaning Flexibility / Feature Engineering:

A programming language can handle data more flexibly, although SQL can do some basic operations.

Model Diversity:

BigQuery ML has limitations for model selection.

Model Creation and Saving:

BigQuery ML does well here because creating a model is just like creating a table.

Basic Model Performance:

Both methods deliver very similar performance for a basic model.

Auto Performance Tuning:

With BigQuery ML we need to set some parameters manually in the SQL statement, whereas in a programming language we can use tools such as scikit-learn’s GridSearchCV to tune the model automatically.

Visualizing:

BigQuery ML doesn’t support data visualization itself; it needs help from Looker or other tools. Visualization is much more flexible in the notebook.

Skill Diversity:

Learning SQL may be a little simpler than learning a programming language.

| Aspects                   | AI Platform Notebooks | BigQuery ML |
|---------------------------|-----------------------|-------------|
| Data Cleaning Flexibility | high                  | low         |
| Feature Engineering       | high                  | low         |
| Model Diversity           | high                  | low         |
| Model Creation and Saving | medium                | high        |
| Basic Model Performance   | high                  | high        |
| Auto Performance Tuning   | high                  | low         |
| Visualizing               | high                  | low         |
| Skill Diversity           | high                  | medium      |

Recommendations

Based on the previous section, Query Style ML does well on model creation and saving, basic model performance, and skill diversity. Let’s discuss when to use Query Style ML and when to use Cloud Notebook.

Query Style ML:

For simple regression, classification, and clustering on a cleaned dataset, we can use Query Style ML for the initial analysis.

Cloud Notebook:

We can use it at any time and for any situation, such as image classification or NLP.

In my view, when I need to create simple models, I will use both Notebook and Query Style ML: a notebook to preprocess the data and do feature engineering, and Query Style ML to create and store the model.

Reference

  1. https://www.kaggle.com/c/titanic
  2. https://github.com/jason-jz-zhu/Webinar_BigQueryML/blob/main/basic-notebook-for-titanic-demo.ipynb
  3. https://cloud.google.com/bigquery-ml/docs/getting-model-metadata

Big Data Analytics with Cloud Notebooks and Query Style ML was originally published in Walmart Global Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Article Link: Big Data Analytics with Cloud Notebooks and Query Style ML | by Jiazhen Zhu | Walmart Global Tech Blog | May, 2021 | Medium