Classification algorithm - logistic regression

[also an iterative method: the weights w are updated step by step]

Linear input to the classification problem:

Input: the output of the linear regression formula is used as the input of logistic regression

sigmoid function:

Logistic regression formula: [that is, how sigmoid converts the input into a probability value]

h(x) = g(z) = 1 / (1 + e^(-z))

e: Euler's number, approximately 2.718

z: the result of the linear regression

Output: a probability value in the interval (0, 1)
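A minimal sketch of the steps above, assuming the standard sigmoid g(z) = 1 / (1 + e^(-z)). The weights `w`, bias `b`, sample `x` and the 0.5 threshold are made-up values for illustration only:

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias and one sample (illustration only)
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b     # linear regression output: 0.5*2 - 0.25*1 + 0.1
p = sigmoid(z)           # probability of the positive class
label = int(p >= 0.5)    # assumed decision threshold of 0.5

print(z, p, label)
```

The linear part can produce any real number; the sigmoid squeezes it into (0, 1) so it can be read as a probability.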

Loss function and optimization of logistic regression

The principle is the same as in linear regression, but because this is a classification problem the loss function is different (log-likelihood loss). It has no closed-form solution, so it is minimized iteratively with gradient descent.

[when the target value is 1 (the sample belongs to the class), the closer the predicted probability is to 1, the smaller the loss]

[in binary classification, whichever class has fewer samples is usually taken as the positive class, and the predicted probability refers to that class]

[when the target value is 0, the closer the predicted probability is to 1, the greater the loss]

[logistic regression only judges whether a sample belongs to one class or not, i.e. the probability of belonging to that class. h(x) always refers to that one class, so the two loss curves mirror each other, which gives the two cases above]

[the loss is similar to information entropy: the smaller, the better]
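The two cases above can be checked numerically. This is a small sketch assuming the usual per-sample log loss, -log(p) when the target is 1 and -log(1 - p) when the target is 0; the function name `log_loss_single` is made up for this example:

```python
import numpy as np


def log_loss_single(y, p):
    """Per-sample log loss; y is 0 or 1, p is the predicted P(y = 1) in (0, 1)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


# Loss for both target values as the predicted probability moves towards 1
for p in (0.1, 0.5, 0.9):
    print(p, log_loss_single(1, p), log_loss_single(0, p))
```

For a target of 1 the loss shrinks as p approaches 1; for a target of 0 it grows.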

[optimization again means updating the weights]
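A rough sketch of how the weights could be updated with plain batch gradient descent on the mean log loss. The toy data, the learning rate and the number of iterations are assumptions chosen for illustration, and this is not how scikit-learn solves the problem internally:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mean_log_loss(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# Hypothetical toy data: one feature, two classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(1)
b = 0.0
learning_rate = 0.1      # assumed value
loss_before = mean_log_loss(w, b, X, y)

for _ in range(2000):    # assumed number of iterations
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)   # gradient of the mean log loss w.r.t. w
    grad_b = np.mean(p - y)           # gradient w.r.t. the bias
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

loss_after = mean_log_loss(w, b, X, y)
predictions = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(loss_before, loss_after, predictions)
```

Each iteration moves `w` and `b` a small step against the gradient, so the loss goes down and the predicted probabilities move towards the targets.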

Comparative analysis of loss function:

Syntax:

sklearn.linear_model.LogisticRegression

sklearn.linear_model.LogisticRegression(penalty = 'l2', C = 1.0) [with regularization]

Logistic regression classifier

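coef_: learned regression coefficients (weights)

C: inverse of the regularization strength (smaller values mean stronger regularization)

penalty: type of regularization term ('l2' by default)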
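To see what C does, here is a small sketch on made-up data. The data and the two C values are assumptions for illustration; `penalty` is left at its default ('l2'):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features, two classes
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.2, 0.4],
              [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

weak = LogisticRegression(C=100.0).fit(X, y)    # large C: weak regularization
strong = LogisticRegression(C=0.01).fit(X, y)   # small C: strong regularization

print("coef_ with C=100 :", weak.coef_)
print("coef_ with C=0.01:", strong.coef_)
print("P(class 0), P(class 1) for the first sample:", weak.predict_proba(X[:1]))
```

With the smaller C the coefficients are pulled much closer to zero, which is the effect of the regularization term.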

Application: [binary classification only]

- Advertising click through rate
- Judge the user's gender
- Predict whether users will buy a given product category
- Judge whether a comment is positive or negative

[logistic regression is a powerful tool to solve binary classification problems]

Case:

Predicting whether a breast tumor is benign or malignant

Download address of original data: https://archive.ics.uci.edu/ml/machine-learning-databases/

Data Description:

(1) 699 samples with 11 columns in total. The first column is the sample id, the next 9 columns are medical characteristics of the tumor, and the last column is the tumor class (2 = benign, 4 = malignant).

(2) Contains 16 missing values, marked with "?".
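A small sketch of how the "?" marks can be handled with pandas. The miniature frame below is made up and only stands in for the real file; the column names follow the case study:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real data
df = pd.DataFrame({
    "Bare Nuclei": ["1", "10", "?", "4"],
    "Class": [2, 4, 2, 4],
})

# Turn "?" into NaN, then drop every row that contains a NaN
cleaned = df.replace(to_replace="?", value=np.nan).dropna()

# The column was read as text because of the "?" marks; make it numeric
cleaned = cleaned.astype({"Bare Nuclei": "int64"})

print(len(df), "rows before,", len(cleaned), "rows after")
```

Dropping is reasonable here because only a small share of the rows is affected; filling in a value would be the alternative.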

Target value:

Malignant tumors are taken as the positive class

Workflow:

- Fetch the data online (pandas)
- Missing value handling and standardization
- Fit and evaluate the LogisticRegression estimator

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import classification_report
    import pandas as pd
    import numpy as np


    def logistic():
        """
        Use logistic regression for binary classification: predict whether a
        tumor is benign or malignant from the attributes of the cells.
        :return: None
        """
        # Column names (the raw file has no header row)
        column = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                  'Uniformity of Cell Shape', 'Marginal Adhesion',
                  'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                  'Normal Nucleoli', 'Mitoses', 'Class']

        # Read the data
        data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=column)
        print(data)

        # Handle missing values: turn "?" into NaN and drop those rows
        data = data.replace(to_replace='?', value=np.nan)
        data = data.dropna()
        # 'Bare Nuclei' was read as text because of the "?" marks; make it numeric
        data = data.astype(int)

        # Split the data: columns 1-9 are the features, column 10 is the target
        x_train, x_test, y_train, y_test = train_test_split(
            data[column[1:10]], data[column[10]], test_size=0.25)

        # Standardize: fit on the training set only, then apply to the test set
        std = StandardScaler()
        x_train = std.fit_transform(x_train)
        x_test = std.transform(x_test)

        # Logistic regression prediction
        lg = LogisticRegression(C=1.0)
        lg.fit(x_train, y_train)
        print(lg.coef_)

        y_predict = lg.predict(x_test)
        print("Accuracy:", lg.score(x_test, y_test))
        # Remember to pass labels so that 2 maps to benign and 4 to malignant
        print("Precision and recall:")
        print(classification_report(y_test, y_predict, labels=[2, 4],
                                    target_names=["Benign", "Malignant"]))
        return None


    if __name__ == "__main__":
        logistic()

The results show that:

Result analysis:

When it comes to cancer, recall matters more than accuracy: missing a malignant tumor (a false negative) is far more costly than a false alarm.
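A short sketch of why recall is looked at separately. The labels below are made up (2 = benign, 4 = malignant, as in the data set) and are not results from the model above:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical true and predicted labels: 2 = benign, 4 = malignant
y_true = [2, 2, 2, 2, 2, 2, 4, 4, 4, 4]
y_pred = [2, 2, 2, 2, 2, 2, 4, 4, 2, 2]

accuracy = accuracy_score(y_true, y_pred)                      # 8 of 10 correct
recall_malignant = recall_score(y_true, y_pred, pos_label=4)   # 2 of 4 malignant found

print("Accuracy:", accuracy)
print("Recall (malignant):", recall_malignant)
```

In this made-up example the accuracy looks acceptable, yet half of the malignant tumors are missed, which only the recall reveals.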