Part 2 Logistic Regression
1 The hypothesis function of Logistic Regression
In Linear Regression, if we instead assume that the variable y to be predicted takes only a discrete set of values, we have a classification problem; if y can only take the values 0 or 1, it is a binary classification problem. We can still consider using a regression approach to solve the binary classification problem, but since we now know that y \in \{0, 1\} rather than y ranging over the whole real line R, we should change the form of the hypothesis function h_\theta(x). We can use the Logistic Function to map any real number into the interval (0, 1):

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
Here we first take a linear combination of all the features, \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots, and then feed this value into the Logistic Function (also called the sigmoid function) g(z) = \frac{1}{1 + e^{-z}}, which maps it to a value in (0, 1). The graph of the Logistic Function is shown below.

As z \to +\infty the function value tends to 1, and as z \to -\infty it tends to 0, so the new hypothesis function h_\theta(x) always lies in (0, 1). As before, we add an extra feature x_0 = 1 so that the hypothesis can be written in vector form. The derivative of the Logistic Function can be expressed in terms of the function itself:

g'(z) = g(z)\,\bigl(1 - g(z)\bigr)
This property will be used again later when we learn the parameters \theta.
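As a quick check (this short derivation is added here for reference; it is not spelled out in the original notes), the property follows directly from the chain rule:

g'(z) = \frac{d}{dz}\left(\frac{1}{1 + e^{-z}}\right)
      = \frac{e^{-z}}{(1 + e^{-z})^2}
      = \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right)
      = g(z)\,\bigl(1 - g(z)\bigr)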
2 Learning the model parameters \theta of Logistic Regression with maximum likelihood estimation and gradient ascent
Given the new hypothesis function h_\theta(x), how do we learn the parameters \theta from the training examples? We can take a probabilistic view and use maximum likelihood estimation (MLE) to fit the data (MLE is equivalent to minimizing the cost function, as in the LMS algorithm). We assume:

P(y = 1 \mid x; \theta) = h_\theta(x)
P(y = 0 \mid x; \theta) = 1 - h_\theta(x)
That is, the hypothesis h_\theta(x) gives the probability that y = 1, and 1 - h_\theta(x) gives the probability that y = 0. This assumption can be written in the more compact form

p(y \mid x; \theta) = \bigl(h_\theta(x)\bigr)^{y}\,\bigl(1 - h_\theta(x)\bigr)^{1 - y}
Suppose we observe m training examples generated independently and identically distributed. We can then write the likelihood function

L(\theta) = \prod_{i=1}^{m} p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \prod_{i=1}^{m} \bigl(h_\theta(x^{(i)})\bigr)^{y^{(i)}}\,\bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}
Taking the logarithm gives the log-likelihood

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right]
We now maximize the log-likelihood to find the parameters \theta. Viewed another way, the cost function is J(\theta) = -\ell(\theta), and we minimize this cost function.
Just as we used gradient descent to learn the parameters of Linear Regression, here we can use gradient ascent to maximize the log-likelihood. Suppose we have a single training example (x, y); then the derivative needed for the SGA (stochastic, or incremental, gradient ascent) update rule is

\frac{\partial \ell(\theta)}{\partial \theta_j}
= \left( y\,\frac{1}{g(\theta^T x)} - (1 - y)\,\frac{1}{1 - g(\theta^T x)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x)
= \left( y\,\frac{1}{g(\theta^T x)} - (1 - y)\,\frac{1}{1 - g(\theta^T x)} \right) g(\theta^T x)\bigl(1 - g(\theta^T x)\bigr)\,x_j
= \bigl(y - h_\theta(x)\bigr)\,x_j
Here we used the property of the derivative of the logistic function, g' = g(1 - g). This gives the parameter update rule

\theta_j := \theta_j + \alpha\,\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\,x_j^{(i)}
Note that we keep adding a quantity, because this is gradient ascent; \alpha is the learning rate. Formally this looks identical to the LMS update rule for the Linear Regression parameters, but it is different in substance, because the hypothesis h_\theta(x) is different. In Linear Regression, h_\theta(x) is just a linear combination of all the features; in Logistic Regression, the features are first combined linearly and then passed through the Logistic Function, which maps the result into (0, 1), so h_\theta(x) is no longer a linear function. Both algorithms are in fact special cases of Generalized Linear Models.
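To make the update rule concrete, here is a minimal MATLAB/Octave sketch of stochastic gradient ascent for logistic regression. It is only an illustration: the variable names (X, y, alpha, num_epochs) are chosen for this example and are not part of the course code, which uses fminunc instead (see section 3).

% Stochastic gradient ascent for logistic regression (illustrative sketch).
% X is an m-by-(n+1) design matrix whose first column is all ones;
% y is an m-by-1 vector of 0/1 labels.
alpha = 0.01;                    % learning rate
num_epochs = 100;                % passes over the training set
theta = zeros(size(X, 2), 1);
for epoch = 1:num_epochs
    for i = 1:size(X, 1)
        h = 1.0 / (1.0 + exp(-X(i, :) * theta));      % h_theta(x^(i))
        % ascend the log-likelihood: theta_j := theta_j + alpha * (y^(i) - h) * x_j^(i)
        theta = theta + alpha * (y(i) - h) * X(i, :)';
    end
end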
Alternatively, we can use Newton's method to derive the parameter update rule. Newton's method is a way of finding a root of an equation f(\theta) = 0, i.e. the point where the function f(\theta) crosses the x-axis. Starting from some initial \theta and repeatedly applying the update below, \theta gets closer and closer to the true root of the equation. The update rule is

\theta := \theta - \frac{f(\theta)}{f'(\theta)}
In this way we can solve for the root of an equation numerically. For more on Newton's method, see the Wikipedia article http://en.wikipedia.org/wiki/Newton%27s_method, which also contains a very intuitive illustration of the iteration.
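As a concrete illustration of the iteration, here is a tiny MATLAB/Octave sketch of Newton's method for a scalar root-finding problem; the particular function f(\theta) = \theta^2 - 2 and its derivative are made-up examples, not something from the notes.

% Newton's method for f(theta) = 0, illustrated on f(theta) = theta^2 - 2,
% whose positive root is sqrt(2). f and fprime are example choices.
f      = @(t) t.^2 - 2;
fprime = @(t) 2 .* t;
theta = 1.0;                                     % initial guess
for iter = 1:10
    theta = theta - f(theta) / fprime(theta);    % theta := theta - f(theta)/f'(theta)
end
fprintf('Approximate root: %f\n', theta);        % prints a value close to 1.414214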

Applied to learning the Logistic Regression parameters \theta, we want the root of the equation obtained by setting the first derivative of \ell(\theta) to zero, i.e. \ell'(\theta) = 0. According to Newton's method, the parameters \theta should then be updated by the rule

\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}
Since \theta is a vector (one component per feature, plus the intercept), differentiating with respect to the vector turns the update above into

\theta := \theta - H^{-1}\,\nabla_\theta \ell(\theta)
Here H is the Hessian matrix of second partial derivatives, an n \times n matrix ((n+1) \times (n+1) if we count the intercept term), whose (i, j) entry is

H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i\,\partial \theta_j}
and \nabla_\theta \ell(\theta) is the vector of partial derivatives of \ell(\theta) with respect to each \theta_j.
Newton's method usually converges faster than batch gradient descent, getting close to the cost-minimizing parameter values in fewer iterations. Each iteration is more expensive, however, because it requires inverting the n \times n Hessian matrix; as long as n is not too large, Newton's method is still faster overall. When Newton's method is used to maximize the log-likelihood of Logistic Regression, the resulting algorithm is also called Fisher scoring.
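Putting the pieces together, below is a minimal MATLAB/Octave sketch of this Newton update for logistic regression, using the standard closed forms of the gradient and Hessian of the log-likelihood. It is only a sketch under the same assumptions as before (X is the design matrix with an intercept column, y holds 0/1 labels); forming the diagonal weight matrix explicitly is fine for small m but wasteful for large datasets.

% Newton's method (Fisher scoring) for logistic regression -- illustrative sketch.
theta = zeros(size(X, 2), 1);
for iter = 1:10                                   % a handful of iterations usually suffices
    h = 1.0 ./ (1.0 + exp(-X * theta));           % h_theta(x) for all m examples
    grad = X' * (y - h);                          % gradient of the log-likelihood l(theta)
    S = diag(h .* (1 - h));                       % m-by-m diagonal matrix of weights h(1-h)
    H = -X' * S * X;                              % Hessian of the log-likelihood
    theta = theta - H \ grad;                     % theta := theta - H^{-1} * grad
end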
3 Programming exercises
(Note: all programming exercises in this part come from Andrew Ng's online Machine Learning course.)
3.1 A Matlab implementation of Logistic Regression
Suppose we are members of a university admissions committee. Given each applicant's scores on two exams and the historical record of whether they were admitted, we need to perform binary classification on whether a new applicant should be admitted. Each applicant is therefore described by two features, the two exam scores. We now train a Logistic Regression (decision boundary) model on the training examples and then classify new applicant test examples. The main program is as follows:
%% Initialization
clear ; close all; clc

%% Load Data
%  The first two columns contains the exam scores and the third column
%  contains the label.
data = load('ex2data1.txt');
X = data(:, [1, 2]); y = data(:, 3);

%% ==================== Part 1: Plotting ====================
%  We start the exercise by first plotting the data to understand the
%  the problem we are working with.
fprintf(['Plotting data with + indicating (y = 1) examples and o ' ...
         'indicating (y = 0) examples.\n']);
plotData(X, y);

% Put some labels
hold on;
% Labels and Legend
xlabel('Exam 1 score')
ylabel('Exam 2 score')
% Specified in plot order
legend('Admitted', 'Not admitted')
hold off;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============ Part 2: Compute Cost and Gradient ============
%  In this part of the exercise, you will implement the cost and gradient
%  for logistic regression. You need to complete the code in
%  costFunction.m

%  Setup the data matrix appropriately, and add ones for the intercept term
[m, n] = size(X);

% Add intercept term to x and X_test
X = [ones(m, 1) X];

% Initialize fitting parameters
initial_theta = zeros(n + 1, 1);

% Compute and display initial cost and gradient
[cost, grad] = costFunction(initial_theta, X, y);

fprintf('Cost at initial theta (zeros): %f\n', cost);
fprintf('Gradient at initial theta (zeros): \n');
fprintf(' %f \n', grad);

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============= Part 3: Optimizing using fminunc =============
%  In this exercise, you will use a built-in function (fminunc) to find the
%  optimal parameters theta.

%  Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 400);

%  Run fminunc to obtain the optimal theta
%  This function will return theta and the cost
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);

% Print theta to screen
fprintf('Cost at theta found by fminunc: %f\n', cost);
fprintf('theta: \n');
fprintf(' %f \n', theta);

% Plot Boundary
plotDecisionBoundary(theta, X, y);

% Put some labels
hold on;
% Labels and Legend
xlabel('Exam 1 score')
ylabel('Exam 2 score')
% Specified in plot order
legend('Admitted', 'Not admitted')
hold off;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============== Part 4: Predict and Accuracies ==============
%  After learning the parameters, you'll like to use it to predict the outcomes
%  on unseen data. In this part, you will use the logistic regression model
%  to predict the probability that a student with score 45 on exam 1 and
%  score 85 on exam 2 will be admitted.
%
%  Furthermore, you will compute the training and test set accuracies of
%  our model.
%
%  Your task is to complete the code in predict.m

%  Predict probability for a student with score 45 on exam 1
%  and score 85 on exam 2
prob = sigmoid([1 45 85] * theta);
fprintf(['For a student with scores 45 and 85, we predict an admission ' ...
         'probability of %f\n\n'], prob);

% Compute accuracy on our training set
p = predict(theta, X);
fprintf('Train Accuracy: %f\n', mean(double(p == y)) * 100);

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

First we can visualize the training set in the (x_1, x_2) feature plane, marking the positive and negative examples with different symbols:
In the figure, the horizontal and vertical axes are the two exam scores, and the two kinds of markers correspond to admitted and rejected applicants. The plotting code is as follows:
function plotData(X, y)
%PLOTDATA Plots the data points X and y into a new figure
%   PLOTDATA(x,y) plots the data points with + for the positive examples
%   and o for the negative examples. X is assumed to be a Mx2 matrix.

% Create New Figure
figure; hold on;

% ====================== YOUR CODE HERE ======================
% Instructions: Plot the positive and negative examples on a
%               2D plot, using the option 'k+' for the positive
%               examples and 'ko' for the negative examples.

% find all the indices of positive and negative training examples
pos = find(y == 1); neg = find(y == 0);
plot(X(pos,1), X(pos,2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(X(neg,1), X(neg,2), 'ko', 'LineWidth', 2, 'MarkerSize', 7, 'MarkerFaceColor', 'y');

% =========================================================================

hold off;

end

Next we implement the sigmoid function, which maps a linear (or nonlinear) combination of the features into (0, 1):
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.

% You need to return the following variables correctly
g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the sigmoid of each value of z (z can be a matrix,
%               vector or scalar).

g = 1.0 ./ (1.0 + exp(-z));

% =============================================================

end

In the sigmoid function, z and g(z) are positively correlated: the larger z is, the closer g(z) gets to 1, and the smaller z is, the closer g(z) gets to 0. Next we implement the cost function and the gradient of Logistic Regression, which are

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right]

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}

The cost function can be viewed as the negative log-likelihood of the training data (scaled by 1/m), so minimizing the cost function is equivalent to maximum likelihood estimation (MLE). The derivation of the gradient is given in Part 2 of this article. The batch form of the gradient is used here, i.e. updating each \theta_j uses all m training examples. The implementation is as follows:
function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
%
% Note: grad should have the same dimensions as theta

% define cost function and gradient; fminunc will use them to find the minimum
hx = sigmoid(X * theta);
J = (1.0/m) * sum(-y .* log(hx) - (1.0 - y) .* log(1.0 - hx));
grad = (1.0/m) .* X' * (hx - y);

% =============================================================

end

With the initial parameters (initial_theta = zeros(n + 1, 1)), the cost function and gradient evaluate to:
Cost at initial theta (zeros): 0.693147
Gradient at initial theta (zeros):
 -0.100000
 -12.009217
 -11.262842

Program paused. Press enter to continue.

Next we use Matlab's built-in fminunc function to find the minimum of the cost function and the corresponding parameter values \theta. This is an unconstrained optimization problem, so fminunc can solve it directly and we do not have to implement gradient descent ourselves (doing so would also be easy, but using this function is more convenient). Checking the documentation, it supports the following input/output form: [x,fval] = fminunc(fun,x0,options)
Among the inputs, fun is the function that defines the cost to be optimized and (optionally) its gradient, x0 is the initial value of the variables, and options holds the solver settings. In our Logistic Regression implementation the relevant statements are
% Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 400);

% Run fminunc to obtain the optimal theta
% This function will return theta and the cost
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);

optimset sets two options here: whether a gradient is supplied ('GradObj', 'on') and the maximum number of iterations ('MaxIter', 400). In the call to fminunc, the first argument @(t)(costFunction(t, X, y)) is an anonymous function handle wrapping costFunction; the actual implementation lives in the corresponding .m file.
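To spell out the anonymous-function pattern (this is only an illustrative sketch; the wrapper name costWrapper is made up for this example), the handle captures the current X and y from the workspace, so fminunc only ever passes in the parameter vector t:

% The anonymous function wraps costFunction into a function of theta alone;
% X and y are captured when the handle is created.
costWrapper = @(t) costFunction(t, X, y);
[J0, grad0] = costWrapper(initial_theta);   % same values as calling costFunction directly
[theta, cost] = fminunc(costWrapper, initial_theta, options);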
The outputs are the optimal parameter values x and the corresponding optimal value of the cost function. The program output is as follows
Local minimum possible.

fminunc stopped because the final change in function value relative to
its initial value is less than the default value of the function tolerance.

<stopping criteria details>

Cost at theta found by fminunc: 0.203506
theta:
 -24.932905
 0.204407
 0.199617

So we have found the optimal parameter values \theta and the corresponding value of the cost function. Plotting the decision boundary given by the optimal parameters yields
Next we can make predictions for new applicant test examples. We can also use the decision boundary to predict the admission outcome for every applicant in the training set and compare against the actual outcomes to compute the train accuracy. Given an applicant's two exam scores as features, we obtain the probability of admission, g(\theta^T x); if it is greater than or equal to 0.5 we predict admission, otherwise rejection. The implementation is as follows
function p = predict(theta, X)
%PREDICT Predict whether the label is 0 or 1 using learned logistic
%regression parameters theta
%   p = PREDICT(theta, X) computes the predictions for X using a
%   threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)

m = size(X, 1); % Number of training examples

% You need to return the following variables correctly
p = zeros(m, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters.
%               You should set p to a vector of 0's and 1's
%

p = sigmoid(X * theta);
index_1 = find(p >= 0.5);
index_0 = find(p < 0.5);
p(index_1) = ones(size(index_1));
p(index_0) = zeros(size(index_0));

% =========================================================================

end

The program output is as follows
For a student with scores 45 and 85, we predict an admission probability of 0.774322

Train Accuracy: 89.000000

So an applicant with scores of 45 and 85 is predicted to be admitted with probability 0.774322, and the trained model predicts the correct outcome for 89% of the training examples.

3.2 A Matlab implementation of Regularized Logistic Regression
Now we switch to a dataset whose distinctive property is that the training examples are no longer linearly separable. For example, suppose a company's microchips must go through two quality tests, and an inspector decides from the two test results whether a chip passes quality assurance. Given historical test results and the corresponding pass/fail decisions, we need to predict whether new test examples will pass. In this problem each chip is again described by two features, the two test results. As before, we can train a Logistic Regression model on the training data and then use the model to make predictions on the test examples. The main program is as follows
%% Initialization
clear ; close all; clc

%% Load Data
%  The first two columns contains the X values and the third column
%  contains the label (y).
data = load('ex2data2.txt');
X = data(:, [1, 2]); y = data(:, 3);

plotData(X, y);

% Put some labels
hold on;
% Labels and Legend
xlabel('Microchip Test 1')
ylabel('Microchip Test 2')
% Specified in plot order
legend('y = 1', 'y = 0')
hold off;

%% =========== Part 1: Regularized Logistic Regression ============
%  In this part, you are given a dataset with data points that are not
%  linearly separable. However, you would still like to use logistic
%  regression to classify the data points.
%
%  To do so, you introduce more features to use -- in particular, you add
%  polynomial features to our data matrix (similar to polynomial
%  regression).
%

% Add Polynomial Features
% Note that mapFeature also adds a column of ones for us, so the intercept
% term is handled
X = mapFeature(X(:,1), X(:,2));

% Initialize fitting parameters
initial_theta = zeros(size(X, 2), 1);

% Set regularization parameter lambda to 1
lambda = 1;

% Compute and display initial cost and gradient for regularized logistic
% regression
[cost, grad] = costFunctionReg(initial_theta, X, y, lambda);

fprintf('Cost at initial theta (zeros): %f\n', cost);

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============= Part 2: Regularization and Accuracies =============
%  Optional Exercise:
%  In this part, you will get to try different values of lambda and
%  see how regularization affects the decision boundary
%
%  Try the following values of lambda (0, 1, 10, 100).
%
%  How does the decision boundary change when you vary lambda? How does
%  the training set accuracy vary?
%

% Initialize fitting parameters
initial_theta = zeros(size(X, 2), 1);

% Set regularization parameter lambda to 1 (you should vary this)
lambda = 100;

% Set Options
options = optimset('GradObj', 'on', 'MaxIter', 400);

% Optimize
[theta, J, exit_flag] = ...
    fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);

% Plot Boundary
plotDecisionBoundary(theta, X, y);
hold on;
title(sprintf('lambda = %g', lambda))

% Labels and Legend
xlabel('Microchip Test 1')
ylabel('Microchip Test 2')

legend('y = 1', 'y = 0', 'Decision boundary')
hold off;

% Compute accuracy on our training set
p = predict(theta, X);

fprintf('Train Accuracy: %f\n', mean(double(p == y)) * 100);

As before, we first visualize the data:
We can see that the two classes are no longer linearly separable: there is no straight line that separates the positive and negative examples well, so applying Logistic Regression directly would perform poorly and the optimal value of the cost function would be relatively large. In this case we need to increase the feature dimension; the dimension of \theta grows accordingly, and more parameters can capture richer information about the examples. First we perform a feature mapping, e.g. generate from x_1 and x_2 all monomial terms of degree up to 6, giving 28 terms in total (1 + 2 + 3 + \dots + 7 = 28):

mapFeature(x) = \bigl[\,1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2,\; x_1^3,\; \dots,\; x_1 x_2^5,\; x_2^6\,\bigr]^T
The implementation is as follows
function out = mapFeature(X1, X2)
% MAPFEATURE Feature mapping function to polynomial features
%
%   MAPFEATURE(X1, X2) maps the two input features
%   to quadratic features used in the regularization exercise.
%
%   Returns a new feature array with more features, comprising of
%   X1, X2, X1.^2, X2.^2, X1*X2, X1*X2.^2, etc..
%
%   Inputs X1, X2 must be the same size
%

degree = 6;
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j);
    end
end

end

Through this feature mapping, the original 2-dimensional features are mapped into a new 28-dimensional feature space (the constant term x_0 = 1 is included in the 28, so \theta is also 28-dimensional, including \theta_0). A Logistic Regression classifier trained on this higher-dimensional feature vector can have a more complex decision boundary, which is no longer a straight line. However, more parameters and a more complex decision boundary also make the model prone to over-fitting, i.e. poor generalization ability. We therefore add a regularization term to the cost function to deal with over-fitting. The cost function of Logistic Regression with a regularization term is defined as

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
Compared with the original cost function we have added one extra term, the sum of squares of the parameters \theta_j divided by 2m, and the parameter \lambda controls the weight of the regularization term. The larger \lambda is, the heavier the regularization and the more likely the model is to under-fit; the smaller \lambda is, the lighter the regularization and the more likely the model is to over-fit. Note, however, that we should not regularize the parameter \theta_0. The gradient of the cost function is an (n + 1)-dimensional vector (including the j = 0 component), and each component is computed as

\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_0^{(i)} \qquad \text{for } j = 0

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)} + \frac{\lambda}{m}\,\theta_j \qquad \text{for } j \ge 1
Note that the only difference from the unregularized case is the extra term for j >= 1, which is the derivative of the regularization term with respect to \theta_j. The cost function and gradient of regularized Logistic Regression are therefore implemented as follows
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
%   J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta

hx = sigmoid(X * theta);
J = (1.0/m) * sum(-y .* log(hx) - (1.0 - y) .* log(1.0 - hx)) + lambda / (2 * m) * norm(theta([2:end]))^2;
reg = (lambda/m) .* theta;
reg(1) = 0;                        % do not regularize theta_0
grad = (1.0/m) .* X' * (hx - y) + reg;

% =============================================================

end

Here norm computes the Euclidean norm of theta(2:end), i.e. all components of \theta except \theta_0, which is used for the regularization term of the cost function. Minimizing the cost function is again done with fminunc, exactly as in section 3.1. With lambda = 1, the resulting decision boundary is
The train accuracy is 83.050847. If we now make lambda small, say 0.0001, i.e. reduce the weight of the regularization term, the classifier gets almost all of the training data right, but the decision boundary becomes very complicated; in other words, it over-fits:

The train accuracy rises to 86.440678, but the model generalizes worse. For example, a chip with x = (0.25, 1.5) would be predicted to pass, which clearly does not match the pattern in the training data. Conversely, if lambda is too large, say 100, the regularization term dominates and the model under-fits, as shown in the figure below; the train accuracy drops to only 61.016949.

Choosing a suitable weight for the regularization term is therefore important for obtaining a model that both fits the data and generalizes well.
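A simple way to explore this trade-off, following the optional exercise's suggestion to try lambda = 0, 1, 10, 100, is to retrain the model for several values of lambda and compare the resulting train accuracies. The sketch below assumes X has already been passed through mapFeature and that costFunctionReg and predict are the functions defined above; in practice a held-out validation set would be a better basis for choosing lambda than the training accuracy reported here.

% Sweep over several regularization strengths and report train accuracy (illustrative sketch).
lambdas = [0, 0.0001, 1, 10, 100];
options = optimset('GradObj', 'on', 'MaxIter', 400);
for k = 1:length(lambdas)
    lambda = lambdas(k);
    initial_theta = zeros(size(X, 2), 1);
    theta = fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
    p = predict(theta, X);
    fprintf('lambda = %g: train accuracy = %f\n', lambda, mean(double(p == y)) * 100);
end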