Logistic Regression Training

Logistic regression is a widely used machine learning algorithm for classification that estimates the probability of an event. Despite the word "regression" in its name, it is a classification method, mainly used for binary classification problems, in which the output takes one of two values representing two categories. Examples include predicting whether a patient will recover or whether a customer will buy a product.
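
As a minimal illustration of how a logistic regression model turns a weighted sum of explanatory variables into a probability, and then into one of two categories, consider the following Python sketch. The weights, bias, sample values, and the 0.5 decision threshold are assumptions chosen for illustration; they are not produced by this tool.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Estimated probability of the positive category for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Binary decision: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical model with two explanatory variables.
weights = [0.8, -0.4]
bias = -0.2
sample = [1.5, 2.0]

probability = predict_probability(sample, weights, bias)
label = predict_class(sample, weights, bias)
```

Here the weighted sum is 0.2, so the estimated probability is slightly above 0.5 and the sample is assigned to category 1.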

The training procedure fits a model to the characteristics of the training data, and the resulting model is then used for prediction.

When creating a logistic regression training task, you need to set the following parameters:

Training Dataset: Required parameter, the connection information for the dataset to be trained, including data type, connection parameters, dataset name, etc. You can connect to HBase data, dsf data, and local data.
Data Query Conditions: Optional parameter, filters the data to be analyzed according to the query conditions; attribute conditions and spatial queries are supported, e.g. SmID<100 and BBOX(the_geom, 120,30,121,31).
Explanatory Fields: Required parameter, the fields of the explanatory variables. Enter one or more fields of the training dataset as the independent variables of the model, which are used to predict the result.
Modeling Field: Required parameter, the field used to train the model, that is, the dependent variable. This field holds the known values used for training; the model then predicts this variable at unknown locations.
Maximum Iterations: Optional parameter, value range > 0, default is 100.
Regularization Parameter: Optional parameter, value range ≥ 0, default value is 0.0. It is mainly used to prevent overfitting.
Selection of Regularization Mode: Optional parameter, value range [0.0,1.0], default value is 0.0. It selects the type of regularization used to alleviate overfitting of the model: 0.0 is L2 regularization and 1.0 is L1 regularization.
Distance Explanatory Variable Dataset: Optional parameter, supports point, line and region datasets. The closest distance between the elements of the given dataset and the elements in the training dataset is calculated, and a list of explanatory variables is created automatically from the results.
Model Save Directory: Optional parameter, the directory in which the trained model is saved. If it is empty, the model will not be saved.
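
To show how the regularization parameter and the regularization mode can work together, the sketch below computes an elastic-net style penalty that would be added to the training loss. It assumes the formulation reg_param * (mode * sum|w| + (1 - mode) / 2 * sum(w^2)), which is one common convention; the exact formula used by this tool is not stated here, and the function and argument names are hypothetical.

```python
def elastic_net_penalty(weights, reg_param=0.0, mode=0.0):
    """Penalty added to the loss (assumed formulation, see lead-in).

    reg_param: regularization strength, must be >= 0.
    mode: 0.0 gives a pure L2 penalty, 1.0 a pure L1 penalty,
          values in between blend the two.
    """
    if reg_param < 0.0:
        raise ValueError("reg_param must be >= 0")
    if not 0.0 <= mode <= 1.0:
        raise ValueError("mode must be in [0.0, 1.0]")
    l1_term = sum(abs(w) for w in weights)   # sum of absolute weights
    l2_term = sum(w * w for w in weights)    # sum of squared weights
    return reg_param * (mode * l1_term + (1.0 - mode) / 2.0 * l2_term)


# Hypothetical model weights.
weights = [3.0, -4.0]

no_penalty = elastic_net_penalty(weights)                      # defaults: no regularization
pure_l2 = elastic_net_penalty(weights, reg_param=0.1, mode=0.0)
pure_l1 = elastic_net_penalty(weights, reg_param=0.1, mode=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.1, mode=0.5)
```

With the default regularization parameter of 0.0 the penalty is zero regardless of the mode, so the mode only matters once the regularization parameter is set above 0.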

After executing the training task, the following result parameters are output:

IRCharacteristics: Attributes of a logistic regression model.
Variable: The array of fields used by the logistic regression model, that is, the independent variable fields of the trained model.
MSE: Mean square error, the mean of the squared errors between the predicted values and the true values.
RMSE: Root mean square error, the square root of the MSE.
Mae: Mean absolute error, the mean of the absolute errors between the predicted values and the true values.
R2: Coefficient of determination, used to judge the quality of the model. The value range is [0,1]. Generally speaking, the larger R2 is, the better the model fits. Because R2 tends to rise as the number of samples increases, it gives only a rough indication of accuracy rather than a precise measure.
Explained Variance: The variance of the dependent variable that is explained by the model.
NumIterations: The actual number of iterations performed during training.
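
The error metrics above can be sketched in Python using their standard textbook definitions. The tool's exact formulas are not stated here, so this is an illustration under that assumption; the function name and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n      # mean of squared errors
    rmse = math.sqrt(mse)                     # square root of the MSE
    mae = sum(abs(e) for e in errors) / n     # mean of absolute errors

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot   # undefined when all true values are identical

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}


# Hypothetical true labels and predicted probabilities.
y_true = [1.0, 0.0, 1.0, 1.0]
y_pred = [0.9, 0.2, 0.8, 0.6]

metrics = regression_metrics(y_true, y_pred)
```

Note that R2 computed this way can fall below 0 for a model that fits worse than always predicting the mean.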