The 7 Steps of Machine Learning

by 吳俊逸

2018-06-28 09:36:32

https://www.kdnuggets.com/2018/05/general-approaches-machine-learning-process.html

I actually came across Guo's article by way of first watching a video of his on YouTube, which came recommended after an afternoon of going down the Google I/O 2018 video playlist rabbit hole. The post has the same content as the video, so if you are interested, either of the two resources will suffice.

Guo laid out the steps as follows (with a little ad-libbing on my part):

1 - Data Collection

The quantity and quality of your data dictate how accurate your model can be
The outcome of this step is generally a representation of data (Guo simplifies to specifying a table) which we will use for training
Using pre-collected data, by way of datasets from Kaggle, UCI, etc., still fits into this step

2 - Data Preparation

Wrangle data and prepare it for training
Clean whatever requires it (remove duplicates, correct errors, deal with missing values, normalize, convert data types, etc.)
Randomize data, which erases the effects of the particular order in which we collected and/or otherwise prepared our data
Visualize data to help detect relevant relationships between variables or class imbalances (bias alert!), or perform other exploratory analysis
Split into training and evaluation sets
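
The preparation steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not Guo's code: the function name `prepare`, the use of `None` for a missing value, the toy rows, and the 20% evaluation fraction are all assumptions made for the example.

```python
import random

def prepare(rows, eval_fraction=0.2, seed=0):
    """Deduplicate, drop rows with missing values, shuffle, then split."""
    # Remove exact duplicates while preserving first-seen order.
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    # Drop rows that contain a missing value (represented here as None).
    clean = [row for row in unique if None not in row]
    # Randomize the order so collection order cannot influence training.
    random.Random(seed).shuffle(clean)
    # Hold out the final portion for evaluation.
    n_eval = int(len(clean) * eval_fraction)
    split = len(clean) - n_eval
    return clean[:split], clean[split:]

# Hypothetical (x, y) rows: one duplicate and one row with a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (2.0, 3.9), (3.0, None), (4.0, 8.2),
       (5.0, 9.8), (6.0, 12.1)]
train_rows, eval_rows = prepare(raw)
```

Fixing the seed keeps the shuffle reproducible, which makes later comparisons between models fair because they all see the same split.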

3 - Choose a Model

Different algorithms are for different tasks; choose the right one

4 - Train the Model

The goal of training is to answer a question or make a prediction correctly as often as possible
Linear regression example: for the line y = m*x + b, the algorithm would need to learn values for m (or W) and b (x is input, y is output)
Each iteration of the process is a training step
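
To make the linear regression example concrete, here is a minimal sketch of learning m and b by batch gradient descent on mean squared error. The function name `train_linear`, the toy data, and the chosen learning rate and step count are assumptions for illustration; the source does not prescribe a particular optimizer.

```python
def train_linear(xs, ys, learning_rate=0.01, steps=1000):
    """Fit y = m * x + b by batch gradient descent on mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):  # each pass through this loop is one training step
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to m and b.
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Toy data generated from y = 2x + 1, so training should recover m ≈ 2, b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, b = train_linear(xs, ys, learning_rate=0.05, steps=2000)
```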

5 - Evaluate the Model

Uses some metric or combination of metrics to "measure" objective performance of model
Test the model against previously unseen data
This unseen data is meant to be somewhat representative of model performance in the real world, but still helps tune the model (as opposed to test data, which does not)
Good train/eval split? 80/20, 70/30, or similar, depending on domain, data availability, dataset particulars, etc.
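
As a rough sketch of evaluation, the snippet below holds out 20% of the rows and scores an already-trained line on them with mean squared error. The helper names, the data, and the choice of MSE as the metric are assumptions for the example; a classification task would more likely use accuracy or a similar metric.

```python
def split_rows(rows, eval_fraction=0.2):
    """Hold out the last eval_fraction of rows (assumed shuffled beforehand)."""
    n_eval = max(1, round(len(rows) * eval_fraction))
    return rows[:-n_eval], rows[-n_eval:]

def mean_squared_error(m, b, rows):
    """Average squared gap between predictions m * x + b and the true y."""
    return sum((m * x + b - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical (x, y) rows, kept in order here only for readability.
rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0),
        (5.0, 11.0), (6.0, 13.0), (7.0, 15.0), (8.0, 17.5), (9.0, 18.5)]
train_rows, eval_rows = split_rows(rows)  # an 80/20 split

# Pretend training on train_rows produced this line.
m, b = 2.0, 1.0
eval_mse = mean_squared_error(m, b, eval_rows)
```

The key point is that `eval_rows` never took part in training, so `eval_mse` is an estimate of performance on unseen data rather than a measure of how well the model memorized its inputs.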

6 - Parameter Tuning

This step refers to hyperparameter tuning, which is an "artform" as opposed to a science
Tune model parameters for improved performance
Simple model hyperparameters may include: number of training steps, learning rate, initialization values and distribution, etc.
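
One simple way to tune the hyperparameters named above is a grid search: train once per combination and keep whichever scores best on the evaluation set. The sketch below assumes a tiny linear model, a hand-picked grid, and hypothetical helper names; it is one possible approach, not the one Guo demonstrates.

```python
from itertools import product

def train_linear(rows, learning_rate, steps):
    """Fit y = m * x + b by batch gradient descent on mean squared error."""
    m, b = 0.0, 0.0
    n = len(rows)
    for _ in range(steps):
        grad_m = (2.0 / n) * sum((m * x + b - y) * x for x, y in rows)
        grad_b = (2.0 / n) * sum((m * x + b - y) for x, y in rows)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

def mse(m, b, rows):
    """Mean squared error of the line y = m * x + b on the given rows."""
    return sum((m * x + b - y) ** 2 for x, y in rows) / len(rows)

# Toy data from y = 2x + 1, already split into training and evaluation rows.
train_rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
eval_rows = [(4.0, 9.0), (5.0, 11.0)]

# Try every combination of learning rate and number of training steps.
results = []
for learning_rate, steps in product([0.001, 0.01, 0.05], [10, 100, 1000]):
    m, b = train_linear(train_rows, learning_rate, steps)
    results.append({"learning_rate": learning_rate, "steps": steps,
                    "eval_mse": mse(m, b, eval_rows)})

# Keep the setting with the lowest error on the evaluation set.
best = min(results, key=lambda r: r["eval_mse"])
```

Because the evaluation set is used to pick `best`, its score is no longer an unbiased estimate, which is why a separate test set is held back for the final step.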

7 - Make Predictions

Further (test set) data which have, until this point, been withheld from the model (and for which class labels are known) are used to test the model; this gives a better approximation of how the model will perform in the real world
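
Keeping a test set withheld means splitting the data three ways up front. The sketch below is one way to do that; the function name `three_way_split` and the 70/20/10 proportions are assumptions for illustration, not figures from the source.

```python
import random

def three_way_split(rows, eval_fraction=0.2, test_fraction=0.1, seed=0):
    """Shuffle once, then carve out test, evaluation, and training sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_eval = int(len(shuffled) * eval_fraction)
    test = shuffled[:n_test]                 # touched only once, at the very end
    evaluation = shuffled[n_test:n_test + n_eval]  # used while tuning
    train = shuffled[n_test + n_eval:]       # used to fit the model
    return train, evaluation, test

rows = list(range(100))  # stand-in for 100 labeled examples
train, evaluation, test = three_way_split(rows)
```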

Universal Workflow of Machine Learning

In section 4.5 of his book, Chollet outlines a universal workflow of machine learning, which he describes as a blueprint for solving machine learning problems.

"The blueprint ties together the concepts we've learned about in this chapter: problem definition, evaluation, feature engineering, and fighting overfitting."

How does this compare with Guo's above framework? Let's have a look at the 7 steps of Chollet's treatment (keeping in mind that, while not explicitly stated as being specifically tailored for them, his blueprint is written for a book on neural networks):

1 - Defining the problem and assembling a dataset
2 - Choosing a measure of success
3 - Deciding on an evaluation protocol
4 - Preparing your data
5 - Developing a model that does better than a baseline
6 - Scaling up: developing a model that overfits
7 - Regularizing your model and tuning your parameters

[Image not preserved. Source: Andrew Ng's Machine Learning class at Stanford]
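
The "better than a baseline" step in Chollet's list can be checked with very little code. The sketch below assumes a classification task and uses a majority-class baseline, which is one common choice rather than something the source specifies; the labels and the model predictions are made up for the example.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return sum(1 for y in eval_labels if y == majority) / len(eval_labels)

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(1 for p, y in zip(predictions, labels) if p == y) / len(labels)

train_labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
eval_labels = ["ham", "spam", "ham", "ham", "spam"]
# Hypothetical output of a trained model on the evaluation examples.
model_predictions = ["ham", "spam", "ham", "spam", "spam"]

baseline_acc = majority_baseline_accuracy(train_labels, eval_labels)
model_acc = accuracy(model_predictions, eval_labels)
beats_baseline = model_acc > baseline_acc
```

If `beats_baseline` is false, the model has not yet learned anything a trivial rule could not, and it is worth revisiting the data or the model choice before tuning.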

Chollet's workflow is higher level, and focuses more on getting your model from good to great, as opposed to Guo's, which seems more concerned with going from zero to good. While it does not necessarily jettison any other important steps in order to do so, the blueprint places more emphasis on hyperparameter tuning and regularization in its pursuit of greatness. A simplification here seems to be:

good model → "too good" model → scaled back, "generalizable" model
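
The "scaled back" part of that progression is what regularization does. As a small illustration, here is L2-regularized least squares for a single weight, where the closed-form solution makes the effect easy to see. The no-intercept model, the function name `ridge_slope`, and the data are assumptions chosen to keep the arithmetic simple; this shows only the shrinking mechanism, not a full overfit-then-regularize experiment.

```python
def ridge_slope(xs, ys, l2_penalty):
    """Slope w minimizing sum((y - w * x)^2) + l2_penalty * w^2."""
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + penalty).
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + l2_penalty
    return numerator / denominator

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_slope(xs, ys, l2_penalty=0.0)   # fits the data exactly
regularized = ridge_slope(xs, ys, l2_penalty=14.0)    # weight pulled toward 0
```

A larger penalty trades some fit on the training data for a smaller weight; the penalty strength is itself a hyperparameter to tune against the evaluation set.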

Drafting A Simplified Framework

We can reasonably conclude that Guo's framework outlines a "beginner" approach to the machine learning process, more explicitly defining early steps, while Chollet's is a more advanced approach, emphasizing both the explicit decisions regarding model evaluation and the tweaking of machine learning models. Both approaches are equally valid, and do not prescribe anything fundamentally different from one another; you could superimpose Chollet's on top of Guo's and find that, while the 7 steps of the 2 models would not line up, they would end up covering the same tasks in sum.

Mapping Chollet's to Guo's, here is where I see the steps lining up (Guo's are numbered, while Chollet's are listed underneath the corresponding Guo step with their Chollet workflow step number in parentheses):

1 - Data collection
→ Defining the problem and assembling a dataset (1)
2 - Data preparation
→ Preparing your data (4)
3 - Choose model
4 - Train model
→ Developing a model that does better than a baseline (5)
5 - Evaluate model
→ Choosing a measure of success (2)
→ Deciding on an evaluation protocol (3)
6 - Parameter tuning
→ Scaling up: developing a model that overfits (6)
→ Regularizing your model and tuning your parameters (7)
7 - Predict

It's not perfect, but I stand by it.
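
The mapping can also be written down as a small data structure, which makes it easy to check that every Chollet step lands somewhere. One assumption in this sketch: Chollet's step 5 is attached only to "Train model", the Guo step it is listed directly beneath, leaving "Choose model" without a counterpart.

```python
# Guo's steps (keys, in order) mapped to Chollet workflow step numbers.
guo_to_chollet = {
    "Data collection": [1],
    "Data preparation": [4],
    "Choose model": [],
    "Train model": [5],
    "Evaluate model": [2, 3],
    "Parameter tuning": [6, 7],
    "Predict": [],
}

# Every Chollet step number that appears somewhere in the mapping.
covered = sorted(n for steps in guo_to_chollet.values() for n in steps)

# Guo steps with no Chollet counterpart under this reading.
unmatched = [name for name, steps in guo_to_chollet.items() if not steps]
```

Under this reading all seven of Chollet's steps are covered exactly once, while two of Guo's steps have no direct counterpart.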

In my view, this presents something important: both frameworks agree on, and together emphasize, particular points of the process. It should be clear that model evaluation and parameter tuning are important aspects of machine learning. Additional agreed-upon areas of importance are the assembly/preparation of data and original model selection/training.

Let's use the above to put together a simplified framework for machine learning, consisting of the 5 main areas of the machine learning process:

1 - Data collection and preparation: everything from choosing where to get the data, up to the point it is clean and ready for feature selection/engineering

2 - Feature selection and feature engineering: this includes all changes to the data from once it has been cleaned up to when it is ingested into the machine learning model

3 - Choosing the machine learning algorithm and training our first model: getting a "better than baseline" result upon which we can (hopefully) improve

4 - Evaluating our model: this includes the selection of the measure as well as the actual evaluation; seemingly a smaller step than others, but important to our end result

5 - Model tweaking, regularization, and hyperparameter tuning: this is where we iteratively go from a "good enough" model to our best effort
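
Of the five areas, feature engineering is the one not illustrated by either source framework, so here is a minimal sketch of one common transformation: standardizing a numeric feature to zero mean and unit variance. The function name `standardize` and the sample values are assumptions for the example, and standardization is only one of many possible feature transformations.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical raw feature values with mean 5.0 and standard deviation 2.0.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(feature)
```

In practice the mean and standard deviation would be computed on the training set only and then reused for the evaluation and test sets, so that no information leaks from held-out data.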

So, which framework should you use? Are there really any important differences? Do those presented by Guo and Chollet offer anything that was previously lacking? Does this simplified framework provide any real benefit? As long as the bases are covered, and the tasks which explicitly exist in the overlap of the frameworks are tended to, following either of the two models should lead to the same outcome. Your vantage point or level of experience may lead you to prefer one over the other.

As you may have guessed, this has really been less about deciding on or contrasting specific frameworks than it has been an investigation of what a reasonable machine learning process should look like.