This notebook uses supervised learning to fit and select the best model for prediction using the Heart Failure Prediction Dataset. As heart disease is the leading cause of death globally, we believed it would be of interest to examine this dataset with supervised learning methods.
The data was obtained via Kaggle and sourced from five different clinical data sets. It consists of 11 clinical features for predicting heart disease. These clinical features are:
This notebook seeks to fit models that predict HeartDisease. As this attribute is categorical, we proceeded with classification methods.
Supervised learning is a type of machine learning in which a variable or variables represent a response. The goal of supervised learning is to make inferences or predictions. Algorithms are used to fit and select models for classification or prediction; in our case, we are using classification.
One well-known method of supervised learning is the generalized linear model. As our response data is binary, we specifically used logistic regression. Other methods include tree-based approaches such as classification trees, random forests, and boosting. Tree models recursively split the predictor space; classification trees are used for predicting group membership, which works well with our binary data. Random forests average over many trees fit to bootstrap samples of the data (bagging), while boosting grows trees sequentially, with each tree fit to a modified version of the data.
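For orientation only, here is a minimal sketch of what these classifiers look like in PySpark MLlib, assuming an active Spark session and a DataFrame with the default "label" and "features" columns (which this notebook constructs later):

```python
# Sketch only: the tree-based classifiers mentioned above, as exposed by MLlib.
# Assumes an active SparkSession; the "label"/"features" columns are built later.
from pyspark.sql import SparkSession
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       RandomForestClassifier, GBTClassifier)

spark = SparkSession.builder.getOrCreate()

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")                 # single classification tree
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)   # bagging over bootstrap samples
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)             # boosting: trees fit sequentially
```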
As we seek to predict the class of an observation as either 0 (Normal) or 1 (Heart Disease), it is best practice not to predict on data that was used to fit the model. The reason is that novel data is necessary to accurately test predictions. If a model were fit using the full data set, there would be no novel observations left, which leads to overfitting and inflated model accuracy.
The solution to the problem of overfitting is to split the full data set into training and testing data sets. Models will be fit and selected based on the training data; fitted models will then be tested and predictions drawn from the test data set. Typically an 80% train / 20% test split is used. We selected a 75% train / 25% test split to allow an adequate sample size in the test set.
We begin by reading in the heart.csv
file into a pandas-on-Spark DataFrame. We chose to start with pandas-on-spark as it allows for easier manipulations for exploratory data analysis.
But first let's import the required libraries and set the notebook environment.
Later on, we will need a Spark session to work with RDDs and DataFrames, so we create a new Spark session here as well.
# Import packages
import os
import sys
import warnings
import pandas as pd
import numpy as np
import pyspark.pandas as ps
import matplotlib.pyplot as plt
import seaborn as sns
# Set environment
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
# Remove warnings from rendered output
warnings.filterwarnings("ignore")
# Set figure size
plt.rcParams["figure.figsize"] = (10,7)
# Spark Session builder
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.debug.maxToStringFields", "100").getOrCreate()
Here we are reading in the data as a pandas-on-Spark DataFrame.
# Read in data as pandas-on-Spark data frame
psdf_heart = ps.read_csv("heart.csv")
# Checking if import was successful
psdf_heart.head()
|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
Using psdf_heart.info() we determined the data types of the pandas-on-Spark variables. This allows us to decide whether we need to transform variables to match their true data type; for example, categorical data should be classified as such.
psdf_heart.info()
<class 'pyspark.pandas.frame.DataFrame'>
Int64Index: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             918 non-null    int32
 1   Sex             918 non-null    object
 2   ChestPainType   918 non-null    object
 3   RestingBP       918 non-null    int32
 4   Cholesterol     918 non-null    int32
 5   FastingBS       918 non-null    int32
 6   RestingECG      918 non-null    object
 7   MaxHR           918 non-null    int32
 8   ExerciseAngina  918 non-null    object
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object
 11  HeartDisease    918 non-null    int32
dtypes: float64(1), int32(6), object(5)
It looks like the categorical variables have all been stored as objects. Later, we will use transformations to get these variables into our desired format.
Another potential issue would be the presence of null values. Any null values should be processed to prevent errors.
# Check for null values
print(psdf_heart.isnull().sum())
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
As there are no null values, we elected to proceed with data splitting. We begin by converting our pandas-on-Spark DataFrame to a PySpark SQL DataFrame.
# Make a Pyspark SQL dataframe from pandas-on-Spark df
sql_heart = psdf_heart.to_spark()
To split the data into train and test data sets, we used randomSplit with a 75/25 training-to-test split. This was selected as a safe ratio to allow enough samples in both the training data and the test data.
train, test = sql_heart.randomSplit([0.75, 0.25], seed = 1234)
print(train.count(), test.count())
686 232
For some of the plotting, we need matplotlib. This requires a pandas DataFrame.
# Make a pandas DF from PySpark SQL df
pd_train = train.toPandas()
Other graphs and numeric summaries lent themselves better to a pandas-on-Spark DataFrame.
# Make a pandas-on-Spark DF from PySpark SQL df
psdf_train = train.to_pandas_on_spark()
Exploring the training data allows us to better understand the relationship of the predictor and the response variables. It also alerts us to any unusual values.
We generated eight-number summaries (count, mean, std, min, quartiles, max) for Age, RestingBP, Cholesterol, MaxHR, and Oldpeak.
psdf_train[['Age','RestingBP','Cholesterol','MaxHR','Oldpeak']].describe()
|   | Age | RestingBP | Cholesterol | MaxHR | Oldpeak |
|---|-----|-----------|-------------|-------|---------|
| count | 686.000000 | 686.000000 | 686.000000 | 686.000000 | 686.000000 |
| mean | 53.690962 | 132.091837 | 196.749271 | 136.677843 | 0.907289 |
| std | 9.282468 | 18.546025 | 111.301411 | 25.526051 | 1.074837 |
| min | 28.000000 | 0.000000 | 0.000000 | 60.000000 | -2.000000 |
| 25% | 48.000000 | 120.000000 | 170.000000 | 120.000000 | 0.000000 |
| 50% | 54.000000 | 130.000000 | 222.000000 | 138.000000 | 0.600000 |
| 75% | 60.000000 | 140.000000 | 265.000000 | 156.000000 | 1.500000 |
| max | 76.000000 | 200.000000 | 603.000000 | 202.000000 | 6.200000 |
From the summary table above, it is evident that there are potential outliers in Cholesterol and RestingBP. Cholesterol has a maximum of 603, more than three standard deviations above the mean. This observation is suspect, but more information about that particular observation would be needed before removing it. The minimum resting blood pressure and cholesterol are both 0; a Cholesterol or RestingBP of 0 is cause for concern. It could be that these values were not recorded, or they could be true measurements; we do not know for certain. Our concern is that these observations might be over-represented in either the test or training data and lead to inaccurate model predictions, or be erroneously selected as good predictors of heart disease.
To assess the proportion of observations with either a Cholesterol or a RestingBP of 0, we obtained counts by filtering the Spark SQL dataset accordingly.
First, let us look at the occurrences of a RestingBP of 0. The value below is the count from the training data set.
# Display counts for RestingBP of 0
print(train.filter(train.RestingBP == 0).count())
1
There is only a single observation with a RestingBP of 0. Because we cannot determine the reason for this value, we will continue to use this observation.
Now, let's examine counts for observations where Cholesterol is zero. The first value printed is the number of such observations in the training data; the second is the ratio of observations with heart disease and a Cholesterol of 0 to the total number of observations with heart disease in the training data.
# Display counts for Cholesterol of 0 and respective ratios for training and test data
print(train.filter(train.Cholesterol == 0).count(),
train.filter(train.Cholesterol == 0)
.filter(train.HeartDisease == 1).count()/
train.filter(train.HeartDisease == 1).count())
135 0.3151041666666667
We calculated the proportion of observations with heart disease and cholesterol measurement of 0. This proportion is approximately 0.315.
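As a quick check on the concern above, here is a minimal sketch (using the train and test DataFrames from randomSplit) comparing the zero-Cholesterol proportion in each split:

```python
# Sanity check (sketch): is the zero-Cholesterol group over-represented in either split?
for name, df in [("train", train), ("test", test)]:
    zero = df.filter(df.Cholesterol == 0).count()
    total = df.count()
    print(f"{name}: {zero}/{total} = {zero / total:.3f} zero-Cholesterol rows")
```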
There is a relatively weak correlation (-0.246) between heart disease and cholesterol. However, given the generally understood relationship between cholesterol and heart disease, we decided to proceed with the outlier observations included and make note of this for further work.
psdf_train.corr().style.background_gradient(cmap='coolwarm').set_precision(3)
|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|-----|-----------|-------------|-----------|-------|---------|--------------|
| Age | 1.000 | 0.263 | -0.114 | 0.217 | -0.418 | 0.240 | 0.298 |
| RestingBP | 0.263 | 1.000 | 0.093 | 0.044 | -0.113 | 0.151 | 0.103 |
| Cholesterol | -0.114 | 0.093 | 1.000 | -0.275 | 0.250 | 0.057 | -0.246 |
| FastingBS | 0.217 | 0.044 | -0.275 | 1.000 | -0.169 | 0.049 | 0.277 |
| MaxHR | -0.418 | -0.113 | 0.250 | -0.169 | 1.000 | -0.185 | -0.430 |
| Oldpeak | 0.240 | 0.151 | 0.057 | 0.049 | -0.185 | 1.000 | 0.406 |
| HeartDisease | 0.298 | 0.103 | -0.246 | 0.277 | -0.430 | 0.406 | 1.000 |
The table above shows that no numeric variable is more strongly correlated with heart disease than MaxHR (-0.430); Oldpeak follows at 0.406. These findings suggest weak to no correlation among the predictors, which is a good thing. However, the predictors are also only weakly correlated with heart disease.
Below is a graphical representation of the relationships between HeartDisease and the numeric variables. First, we collect the categorical columns so that we can separate them from the numeric ones.
cat_data = pd_train[['HeartDisease', 'Sex', 'ChestPainType', 'FastingBS','RestingECG', 'ExerciseAngina', 'ST_Slope']]
Now let's view the relationships among the numeric variables (Age, RestingBP, Cholesterol, MaxHR, Oldpeak) and HeartDisease in graphical form.
num_data = [col for col in pd_train.columns if col not in cat_data]
num_data
['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
sns.pairplot(pd_train[num_data+['HeartDisease']],
plot_kws={'alpha': 0.5});
Now we will find aggregate means and standard deviations for each of the numeric variables, grouped by HeartDisease.
psdf_summ_group = psdf_train.groupby('HeartDisease').agg(
{'Age': ['mean', 'std'], 'RestingBP': ['mean', 'std'], 'Cholesterol': ['mean', 'std'],
'MaxHR': ['mean', 'std'], 'Oldpeak': ['mean', 'std']})
psdf_summ_group
| HeartDisease | Age mean | Age std | RestingBP mean | RestingBP std | Cholesterol mean | Cholesterol std | MaxHR mean | MaxHR std | Oldpeak mean | Oldpeak std |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 56.140625 | 8.400299 | 133.783854 | 20.245559 | 172.460938 | 128.289952 | 126.945312 | 23.208998 | 1.294271 | 1.161353 |
| 0 | 50.576159 | 9.429181 | 129.940397 | 15.903251 | 227.632450 | 74.434305 | 149.052980 | 22.867752 | 0.415232 | 0.692076 |
There are visible differences in the means and standard deviations across HeartDisease. With the exception of Cholesterol and MaxHR, all means and standard deviations are greater in observations flagged as having heart disease. With respect to Cholesterol, the difference in means might be linked to the previously mentioned extreme maximum and minimum values.
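To gauge how much the zero readings drive this difference, here is a minimal follow-up sketch recomputing the group means with the zero-Cholesterol rows excluded (using the psdf_train pandas-on-Spark DataFrame from above):

```python
# Sketch: Cholesterol means by HeartDisease with zero readings excluded.
nonzero = psdf_train[psdf_train['Cholesterol'] > 0]
print(nonzero.groupby('HeartDisease')['Cholesterol'].mean())
```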
Let's repeat the task above to get the mean and standard deviation of the numeric variables grouped by Sex.
psdf_summ_sex = psdf_train.groupby('Sex').agg(
{'Age': ['mean', 'std'], 'RestingBP': ['mean', 'std'], 'Cholesterol': ['mean', 'std'],
'MaxHR': ['mean', 'std'], 'Oldpeak': ['mean', 'std']})
psdf_summ_sex
| Sex | Age mean | Age std | RestingBP mean | RestingBP std | Cholesterol mean | Cholesterol std | MaxHR mean | MaxHR std | Oldpeak mean | Oldpeak std |
|---|---|---|---|---|---|---|---|---|---|---|
| F | 52.514085 | 9.956455 | 131.873239 | 19.182904 | 238.049296 | 92.569570 | 145.556338 | 22.463374 | 0.687324 | 0.978119 |
| M | 53.998162 | 9.082900 | 132.148897 | 18.393837 | 185.968750 | 113.313598 | 134.360294 | 25.786608 | 0.964706 | 1.092248 |
When comparing means between male and female observations, we noted no strong differences between the sexes. However, the female Cholesterol mean is approximately 238 while the male mean is ~186. It is possible that this difference is due to women tending to have higher levels of HDL cholesterol than men.
Examining the center and spread of our numeric variables allowed us to identify unusual points in RestingBP and Cholesterol. To visualize the shape and spread of these variables' distributions, we created histograms for each of the numeric variables.
First, let us look at the distribution of Age.
pd_train.Age.hist(bins = 10)
plt.xlabel("Age")
plt.title("Histogram of Age of Participants")
plt.show()
The histogram for Age shows a roughly bell-shaped, slightly left-skewed distribution. This is likely because older individuals are the ones most commonly assessed for heart disease.
Next, we have a histogram of RestingBP.
pd_train.RestingBP.hist(bins = 20)
plt.xlabel("RestingBP")
plt.title("Histogram of Participant Resting Blood Pressure")
plt.show()
Systolic blood pressure is typically greater than 100, which is what we see from this plot. There is one observation at zero which matches our previous finding.
Next is the histogram for Cholesterol.
pd_train.Cholesterol.hist(bins = 20)
plt.xlabel("Cholesterol")
plt.title("Histogram of Participant Cholesterol Measurement")
plt.show()
Once again we see that many observations have a Cholesterol of zero. At the upper end of the scale, there are several values beyond 450, while most other observations fall between 100 and 400. This matches our previous statistics.
Next, we have a histogram for MaxHR.
pd_train.MaxHR.hist(bins = 20)
plt.xlabel("MaxHR")
plt.title("Histogram of Participant Maximum Heart Rate")
plt.show()
MaxHR has a large spread. This histogram matches the fact that MaxHR has a mean of about 137 and a standard deviation of about 25.
Our last histogram is for Oldpeak.
pd_train.Oldpeak.hist(bins = 50)
plt.xlabel("Oldpeak")
plt.title("Histogram of Participant Oldpeak")
plt.show()
Oldpeak has a rather unusual histogram, with numerous observations at zero and no clear shape; the distribution almost looks as if this variable is more discrete than continuous. This histogram matches the fact that Oldpeak values are generally small.
By now we have a pretty good sense of the shape, spread, and center of the numeric predictor variables. Let's now examine how these components change when grouped by HeartDisease. To do this we will look at boxplots of each numeric variable plotted over HeartDisease.
pd_train.boxplot(column = ['Age'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()
As shown by the eight-number summaries and the aggregate mean and standard deviation tables, individuals with heart disease tend to be older.
pd_train.boxplot(column = ['RestingBP'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()
A normal blood pressure is less than 120/80 mmHg. As we observed above, the heart disease group has a minimum RestingBP of 0, which is an outlier, and the data appears slightly skewed. Patients with heart disease tend to have higher RestingBP than those without.
pd_train.boxplot(column = ['Cholesterol'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()
The boxplot above displays the effect of Cholesterol for heart disease carriers and non-carriers. There are many outliers. Cholesterol contains many zero readings for patients with heart disease, so we observe negative (bottom) skewness in that group. We may need to analyze the zero readings further, deciding whether to adjust them or keep them as-is, to come up with correct predictions involving Cholesterol.
pd_train.boxplot(column = ['MaxHR'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()
"Calculate your resting heart rate by counting how many times your heart beats per minute when you are at rest, such as first thing in the morning. It's usually somewhere between 60 and 100 beats per minute for the average adult." Source
The average MaxHR is about 137. Higher maximum heart rates are associated with fewer cases of heart disease. Many of the tested patients with heart disease have a MaxHR around 120.
pd_train.boxplot(column = ['Oldpeak'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()
Oldpeak denotes ST depression induced by exercise relative to rest. Most cases with zero Oldpeak do not carry heart disease.
In statistics, a contingency table is a table in matrix format that displays the frequency distribution of categorical variables (source). pandas is a more mature library for contingency tables than pyspark.pandas, offering options such as row names, column names, and margins. Since we are using a PySpark DataFrame below, we are limited to the available parameters.
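As a brief illustration of those extra pandas options, here is a minimal sketch using the pd_train pandas DataFrame created earlier (margins adds an "All" row and column of totals):

```python
# Illustration only: pandas.crosstab exposes rownames, colnames, and margins,
# which the PySpark crosstab used below does not.
pd.crosstab(pd_train['HeartDisease'], pd_train['Sex'],
            rownames=['HeartDisease'], colnames=['Sex'], margins=True)
```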
Let's continue by examining the relationships between our categorical variables in tabular format below.
train.crosstab('HeartDisease', 'Sex').show()
+----------------+---+---+
|HeartDisease_Sex|  F|  M|
+----------------+---+---+
|               1| 38|346|
|               0|104|198|
+----------------+---+---+
The contingency table above for HeartDisease and Sex shows that males have more cases of heart disease: about 0.64 of all males versus about 0.27 of all females have heart disease. As a share of the entire training set, roughly 0.06 are females with heart disease and 0.50 are males with heart disease, a significant difference.
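Those within-group rates can be read off directly with a row-normalized crosstab; a minimal sketch using pd_train:

```python
# Sketch: normalize='index' divides each row by its total, giving the
# proportion of each sex with and without heart disease.
pd.crosstab(pd_train['Sex'], pd_train['HeartDisease'], normalize='index')
```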
train.crosstab('HeartDisease', 'ChestPainType').show()
+--------------------------+---+---+---+---+
|HeartDisease_ChestPainType|ASY|ATA|NAP| TA|
+--------------------------+---+---+---+---+
|                         1|299| 18| 53| 14|
|                         0| 78|106| 97| 21|
+--------------------------+---+---+---+---+
The contingency table for HeartDisease and ChestPainType shows that ASY has the most cases of heart disease, whereas TA has the fewest. Within each type, about 0.79 of ASY, 0.15 of ATA, 0.35 of NAP, and 0.40 of TA observations have heart disease.
train.crosstab('HeartDisease', 'FastingBS').show()
+----------------------+---+---+
|HeartDisease_FastingBS|  0|  1|
+----------------------+---+---+
|                     1|250|134|
|                     0|269| 33|
+----------------------+---+---+
The table above shows the counts for fasting blood sugar versus heart disease. About 0.2 of all observations have both a FastingBS of 1 and heart disease.
train.crosstab('HeartDisease', 'RestingECG').show()
+-----------------------+---+------+---+
|HeartDisease_RestingECG|LVH|Normal| ST|
+-----------------------+---+------+---+
|                      1| 69|   219| 96|
|                      0| 65|   194| 43|
+-----------------------+---+------+---+
The contingency table for HeartDisease and RestingECG shows that ST has a comparatively low number of non-heart-disease cases. Looking deeper: out of all observations, about 0.32 are Normal with heart disease, 0.10 are LVH with heart disease, and 0.14 are ST with heart disease. Within each RestingECG type, about 0.51 of LVH, 0.53 of Normal, and 0.69 of ST observations have heart disease. So heart disease is most frequent within ST, even though its counts are lower than those of the other two categories.
train.crosstab('HeartDisease', 'ExerciseAngina').show()
+---------------------------+---+---+
|HeartDisease_ExerciseAngina|  N|  Y|
+---------------------------+---+---+
|                          1|144|240|
|                          0|262| 40|
+---------------------------+---+---+
The table for HeartDisease and ExerciseAngina shows 240 patients with both heart disease and exercise-induced angina, i.e. about 0.35 of all records in the training data.
train.crosstab('HeartDisease', 'ST_Slope').show()
+---------------------+----+----+---+
|HeartDisease_ST_Slope|Down|Flat| Up|
+---------------------+----+----+---+
|                    1|  36| 282| 66|
|                    0|   9|  59|234|
+---------------------+----+----+---+
The contingency table for HeartDisease and ST_Slope shows that the Flat category has more heart disease carriers than the Down and Up categories.
So far we have seen the frequencies between HeartDisease and the other categorical variables in tabular form. Let's visualize them with bar plots, which help us differentiate these results quickly.
table = pd.crosstab(cat_data.HeartDisease, cat_data.Sex)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Sex")
plt.show()
As with the contingency table for HeartDisease and Sex above, we observe that more men than women carry heart disease.
table = pd.crosstab(cat_data.HeartDisease, cat_data.ChestPainType)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Chest Pain Type")
plt.show()
The bar plot for HeartDisease vs ChestPainType shows that type ASY has the most cases of heart disease.
table = pd.crosstab(cat_data.HeartDisease, cat_data.FastingBS)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Fasting BS")
plt.show()
The bar plot for HeartDisease vs FastingBS suggests a weak relationship between heart disease and fasting blood sugar; in absolute counts, observations with FastingBS of 1 account for fewer heart disease cases than those with FastingBS of 0.
table = pd.crosstab(cat_data.HeartDisease, cat_data.RestingECG)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Resting ECG")
plt.show()
The bar plot for HeartDisease vs RestingECG shows that the LVH counts are similar for disease carriers and non-carriers.
plt.style.use('fivethirtyeight')
table = pd.crosstab(cat_data.HeartDisease, cat_data.ExerciseAngina)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Exercise Angina")
plt.legend(loc="upper center")
plt.show()
The bar plot of HeartDisease vs ExerciseAngina shows more heart disease carriers when exercise-induced angina is present and fewer when it is absent.
plt.style.use('fivethirtyeight')
table = pd.crosstab(cat_data.HeartDisease, cat_data.ST_Slope)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by ST Slope")
plt.show()
The bar plot for HeartDisease vs ST_Slope shows that the Flat category has the most disease carriers, while the Up category is dominated by non-carriers. We observed the same in the contingency table.
Let's look at a couple of scatter plot visualizations of our data.
sns.scatterplot(x=pd_train['Age'], y=pd_train['Oldpeak'], hue=pd_train['HeartDisease'])
<AxesSubplot:xlabel='Age', ylabel='Oldpeak'>
The scatter plot above shows that older patients with heart disease tend to have higher Oldpeak values. Oldpeak mostly falls in the range 0 to 2.
sns.scatterplot(x=pd_train['MaxHR'], y=pd_train['ChestPainType'], hue=pd_train['HeartDisease'])
<AxesSubplot:xlabel='MaxHR', ylabel='ChestPainType'>
As we already observed, the ASY chest pain type has more heart disease carriers than the other chest pain types.
sns.violinplot(x=cat_data["ChestPainType"],y=pd_train["MaxHR"],hue=cat_data["HeartDisease"],palette="viridis")
plt.xlabel("Chest Pain Type")
plt.ylabel("Maximum heart rate achieved")
plt.title("Maximum heart rate achieved vs Chest Pain Type vs Heart Disease Carrier")
plt.legend(loc=4);
The violin plot above, from the seaborn library, visualizes MaxHR, ChestPainType, and HeartDisease together. The ATA chest pain type spans a range of MaxHR but appears to have a low number of heart disease carriers. Most heart disease carriers have chest pain type ASY, with MaxHR ranging between roughly 50 and 200.
Above, we explored and analyzed the available data. Now it's time to dive into prediction.
The goal of statistical modeling is to summarize results in such a way that researchers can observe patterns in the data and draw conclusions to make efficient business decisions.
Preventing heart disease saves lives. A good data-driven system for predicting heart disease can improve research and prevention, which is where machine learning comes into the picture. We will proceed to use several machine learning models to predict heart disease.
We will also be using the MLlib Pipeline, an API for combining multiple ML algorithms into a single pipeline or workflow.
This report uses the heart disease data to predict whether a patient has heart disease, our response variable. Since the response is a binary categorical variable, we will use classification algorithms, as below.
Let's import required libraries.
#import required library
from pyspark.ml.feature import StringIndexer, OneHotEncoder, SQLTransformer, VectorAssembler
Data Preprocessing
Machine learning models require numeric data, but our data contains categorical variables such as Sex and ChestPainType. We therefore need to convert the categorical variables into numeric form.
StringIndexer maps a string column of labels to an ML column of label indices. For example, our Sex input column contains Male (M) and Female (F), which will be converted to the indices 0.0 and 1.0.
One Hot Encoder - Categorical features are turned into binary features that are "one-hot" encoded: the position corresponding to a feature's value receives a 1, and every other position receives a 0.
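To make the two steps concrete, here is a minimal toy sketch (a hypothetical three-row DataFrame, not the heart data; StringIndexer and OneHotEncoder are imported above):

```python
# Toy example: index a string column, then one-hot encode the index.
toy = spark.createDataFrame([("M",), ("F",), ("M",)], ["Sex"])
indexed = StringIndexer(inputCol="Sex", outputCol="SexIndex").fit(toy).transform(toy)
encoded = (OneHotEncoder(inputCols=["SexIndex"], outputCols=["Sex_encoded"])
           .fit(indexed).transform(indexed))
encoded.show()
# The most frequent label ("M") gets index 0.0; Sex_encoded is a sparse one-hot vector.
```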
Log transformation - Our numeric variables span different ranges, which can heavily impact prediction. To reduce skewness, we transform RestingBP, Cholesterol, and MaxHR to the log scale. RestingBP and Cholesterol contain 0 values, which cause problems when taking the log and running the models, so we add 1 to each value of RestingBP and Cholesterol before taking the log.
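A quick numeric illustration of why the +1 shift is needed before taking logs:

```python
import numpy as np

print(np.log(0.0))      # -inf: the log of zero is undefined on the original scale
print(np.log(0.0 + 1))  # 0.0: adding 1 first keeps zero readings finite
```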
sex_indexer = StringIndexer(inputCol = "Sex", outputCol="SexIndex")
chestPain_indexer = StringIndexer(inputCol = "ChestPainType", outputCol="ChestPainTypeIndex")
RestingECG_indexer = StringIndexer(inputCol = "RestingECG", outputCol="RestingECGIndex")
ExerciseAngina_indexer = StringIndexer(inputCol = "ExerciseAngina", outputCol="ExerciseAnginaIndex")
ST_Slope_indexer = StringIndexer(inputCol = "ST_Slope", outputCol="ST_SlopeIndex")
encoder = OneHotEncoder().setInputCols(["SexIndex", "ChestPainTypeIndex", "RestingECGIndex",
"ExerciseAnginaIndex", "ST_SlopeIndex"])\
.setOutputCols(["Sex_encoded", "ChestPainType_encoded",
"RestingECG_encoded", "ExerciseAngina_encoded",
"ST_Slope_encoded"])
SQLTransformer - After preprocessing, we use a transformer. SQLTransformer implements the transformations defined by the SQL statement below and keeps HeartDisease as the label to use for modeling.
sqlTrans1 = SQLTransformer(
statement = "SELECT Age, Sex_encoded, ChestPainType_encoded,"+
"log(RestingBP+1) as log_RestingBP," +
"log(Cholesterol+1) as log_Cholesterol, FastingBS, RestingECG_encoded," +
"log(MaxHR) as log_MaxHR, ExerciseAngina_encoded, Oldpeak," +
"ST_Slope_encoded," +
"HeartDisease as label FROM __THIS__"
)
VectorAssembler - VectorAssembler merges all predictors into one vector to use as the features column while modeling.
assembler = VectorAssembler(inputCols = ["Age","Sex_encoded", "ChestPainType_encoded","FastingBS","RestingECG_encoded",
"ExerciseAngina_encoded","Oldpeak","ST_Slope_encoded","log_MaxHR","log_RestingBP","log_Cholesterol"],
outputCol = "features",
handleInvalid = 'keep')
Logistic regression is a type of generalized linear model that can have both numeric and categorical predictors, and it is used for a binary response variable, so it is a natural fit for our data. It models the mean of HeartDisease, the success probability, i.e. the probability that a patient has heart disease. Basic logistic regression models this success probability using the logistic function, so the fitted values are bounded between 0 and 1, and the model is fit by maximum likelihood estimation. If the predicted probability is greater than 0.5, the observation is classified as 1; otherwise, 0 is assigned.
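A minimal sketch of the logistic function and the 0.5 decision threshold (plain NumPy, independent of Spark):

```python
import numpy as np

def logistic(z):
    """Map a linear predictor z = x'beta to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])    # example linear-predictor values
p = logistic(z)
print(p)                          # approx. [0.119, 0.5, 0.818]
print((p > 0.5).astype(int))      # classify as 1 when the probability exceeds 0.5
```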
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()
As mentioned above, the MLlib Pipeline is an API that makes it easy to combine multiple algorithms into a single workflow. A Pipeline runs as a sequence of stages; ours contains the string indexers, the one-hot encoder, the SQL transformer, the vector assembler, and the logistic regression estimator, as shown below.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
ExerciseAngina_indexer, ST_Slope_indexer, encoder,
sqlTrans1, assembler, lr])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show()
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[40.0,1.0,0.0,0.0...|
|    1|[49.0,0.0,0.0,1.0...|
|    0|(15,[0,1,4,8,11,1...|
|    1|(15,[0,2,6,9,10,1...|
|    0|[54.0,1.0,0.0,1.0...|
|    0|[39.0,1.0,0.0,1.0...|
|    0|(15,[0,4,6,8,11,1...|
|    0|[54.0,1.0,0.0,0.0...|
|    1|[37.0,1.0,1.0,0.0...|
|    0|(15,[0,4,6,8,11,1...|
|    0|(15,[0,3,6,8,11,1...|
|    1|(15,[0,1,4,9,10,1...|
|    0|[39.0,1.0,0.0,0.0...|
|    1|[49.0,1.0,1.0,0.0...|
|    0|(15,[0,3,8,11,12,...|
|    0|[54.0,0.0,0.0,0.0...|
|    1|[38.0,1.0,1.0,0.0...|
|    0|(15,[0,4,6,8,11,1...|
|    1|[60.0,1.0,1.0,0.0...|
|    1|[36.0,1.0,0.0,0.0...|
+-----+--------------------+
only showing top 20 rows
With only a single random split of the data into train and test sets, categories with few observations may end up concentrated in either the training or the test set, and the model may not learn or be evaluated properly. To avoid this, we can split the data multiple ways and average over the results.
We are using 5-fold cross-validation. For our binary response variable, we use the BinaryClassificationEvaluator, whose metric is 'areaUnderROC'.
The area under the ROC curve (AUC) is a performance measure for classification problems across all threshold settings. The ROC curve is a probability curve, and the AUC represents the degree of separability: how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0 and 1s as 1; by analogy, the better it is at distinguishing patients with the disease from those without.
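As a toy illustration of the metric (scikit-learn is used here only for the example, not for the modeling):

```python
# AUC from labels and predicted scores; 1.0 is a perfect ranking, ~0.5 is random.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]        # predicted probabilities of class 1
print(roc_auc_score(y_true, y_score))  # 0.75 for this example
```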
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
paramGrid = (ParamGridBuilder()
.addGrid(lr.regParam, [0.01, 0.1, 0.5, 1.0, 2.0])
.addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0])
.addGrid(lr.maxIter, [1, 5, 10, 20, 50])
.build())
# Evaluate model
lrevaluator = BinaryClassificationEvaluator()
# Create 5-fold CrossValidator
crossval = CrossValidator(estimator = pipeline,
estimatorParamMaps = paramGrid,
evaluator = lrevaluator,
numFolds = 5)
cvmodel = crossval.fit(train)
# check which model is best
list(zip(cvmodel.avgMetrics, paramGrid))
[(0.921698236655667, {regParam: 0.01, elasticNetParam: 0.0, maxIter: 1}),
 (0.921876533412266, {regParam: 0.01, elasticNetParam: 0.0, maxIter: 5}),
 (0.9253820866178969, {regParam: 0.01, elasticNetParam: 0.0, maxIter: 10}),
 (0.9231798740796417, {regParam: 0.01, elasticNetParam: 0.0, maxIter: 20}),
 (0.9228313885886124, {regParam: 0.01, elasticNetParam: 0.0, maxIter: 50}),
 ...
 (0.9256304374225555, {regParam: 0.1, elasticNetParam: 0.25, maxIter: 20}),
 ...
 (0.5, {regParam: 0.5, elasticNetParam: 0.75, maxIter: 1}),
 ...]
(long output truncated: one (avgMetric, params) pair per parameter combination in the grid; the repeated Param doc strings are condensed for readability)
For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.921698236655667, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.9212270623180709, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.9230339581265148, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.923257114480347, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.9241171933977016, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.8400116447064009, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.8400116447064009, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.8400116447064009, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.8374548265245827, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.8400116447064009, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. 
For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.921698236655667, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.9217135653603661, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.9222601536649523, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.9219214918776323, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.9228879198999813, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. 
For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}), (0.5, {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0, Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0, Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50})]
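Scanning the zipped list by eye is error-prone. A small sketch, reusing the `cvmodel` and `paramGrid` objects already in scope, that pulls out the best combination programmatically:

import numpy as np

# avgMetrics holds one cross-validated areaUnderROC per grid entry, in grid order
best_idx = int(np.argmax(cvmodel.avgMetrics))
print("Best average areaUnderROC:", cvmodel.avgMetrics[best_idx])

# paramGrid[best_idx] maps each Param to its winning value
for param, value in paramGrid[best_idx].items():
    print(f"{param.name} = {value}")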
# use the best model
cvmodel.transform(test).show(5)
Age | Sex_encoded | ChestPainType_encoded | log_RestingBP | log_Cholesterol | FastingBS | RestingECG_encoded | log_MaxHR | ExerciseAngina_encoded | Oldpeak | ST_Slope_encoded | label | features | rawPrediction | probability | prediction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | (1,[0],[1.0]) | (3,[2],[1.0]) | 4.795790545596741 | 5.497168225293202 | 0 | (2,[0],[1.0]) | 5.075173815233827 | (1,[0],[1.0]) | 0.0 | (2,[1],[1.0]) | 0 | [29.0,1.0,0.0,0.0... | [2.08694552674690... | [0.88962786326173... | 0.0 |
30 | (1,[],[]) | (3,[],[]) | 5.14166355650266 | 5.472270673671475 | 0 | (2,[1],[1.0]) | 5.135798437050262 | (1,[0],[1.0]) | 0.0 | (2,[1],[1.0]) | 0 | (15,[0,7,8,11,12,... | [2.36858881296343... | [0.91440046835339... | 0.0 |
35 | (1,[],[]) | (3,[0],[1.0]) | 4.9344739331306915 | 5.214935757608986 | 0 | (2,[0],[1.0]) | 5.204006687076795 | (1,[0],[1.0]) | 1.4 | (2,[1],[1.0]) | 0 | [35.0,0.0,1.0,0.0... | [1.26141899229498... | [0.77927028268269... | 0.0 |
35 | (1,[0],[1.0]) | (3,[0],[1.0]) | 4.795790545596741 | 0.0 | 1 | (2,[0],[1.0]) | 4.867534450455582 | (1,[],[]) | 1.2 | (2,[0],[1.0]) | 1 | [35.0,1.0,1.0,0.0... | [-2.5756088964456... | [0.07072478302241... | 1.0 |
35 | (1,[0],[1.0]) | (3,[0],[1.0]) | 4.795790545596741 | 5.293304824724492 | 0 | (2,[0],[1.0]) | 4.867534450455582 | (1,[],[]) | 1.6 | (2,[0],[1.0]) | 1 | [35.0,1.0,1.0,0.0... | [-1.4607612263304... | [0.18835092524094... | 1.0 |

only showing top 5 rows
lrROC = lrevaluator.evaluate(cvmodel.transform(test))
print(lrROC)
0.9363052568697732
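AUC summarizes ranking quality only; as a quick complement, here is a small sketch (reusing `cvmodel` and `test` from above) that tabulates the confusion counts and overall accuracy of these test predictions:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Confusion counts: how often each true label receives each prediction
preds = cvmodel.transform(test)
preds.groupBy("label", "prediction").count().show()

# Overall test accuracy (the default labelCol/predictionCol match our columns)
acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(preds)
print("Test accuracy:", acc)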
### Random Forest

Random forest is a tree-based ensemble algorithm. It fits many classification trees, each on a bootstrap sample of the training data, and averages across the fitted trees. Rather than using all predictors for each bootstrap fit, it randomly selects a subset of predictors at each split; this decorrelates the trees and keeps a few strong predictors from dominating every tree, which generally improves accuracy.
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
ExerciseAngina_indexer, ST_Slope_indexer, encoder,
sqlTrans1, assembler, rf])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show(2)
label | features |
---|---|
0 | [40.0,1.0,0.0,0.0... |
1 | [49.0,0.0,0.0,1.0... |

only showing top 2 rows
# Hyperparameter grid: numTrees in {10, 30, 50}, maxDepth in {5, 15, 25}
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [int(x) for x in np.linspace(start = 10, stop = 50, num = 3)]) \
    .addGrid(rf.maxDepth, [int(x) for x in np.linspace(start = 5, stop = 25, num = 3)]) \
    .build()
# Binary classification evaluator (default metric: areaUnderROC)
rfevaluator = BinaryClassificationEvaluator()
# Create 5-fold CrossValidator
crossval = CrossValidator(estimator = pipeline,
estimatorParamMaps = paramGrid,
evaluator = rfevaluator,
numFolds = 5)
cvmodel = crossval.fit(train)
# check which model is best
list(zip(cvmodel.avgMetrics, paramGrid))
Average cross-validated areaUnderROC by numTrees and maxDepth (four decimals):

numTrees | maxDepth=5 | maxDepth=15 | maxDepth=25 |
---|---|---|---|
10 | 0.9112 | 0.8973 | 0.8970 |
30 | 0.9214 | 0.9195 | 0.9195 |
50 | 0.9227 | 0.9247 | 0.9248 |
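As a sanity check, the winning forest's settings and feature importances can be read off the fitted best model. A sketch, assuming (as in this pipeline) that the classifier is the last stage:

# The best pipeline's final stage is the fitted RandomForestClassificationModel
best_rf = cvmodel.bestModel.stages[-1]
print("numTrees:", best_rf.getOrDefault("numTrees"))
print("maxDepth:", best_rf.getOrDefault("maxDepth"))

# Sparse vector of impurity-based importance scores, one per assembled feature
print(best_rf.featureImportances)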
# use the best model
cvmodel.transform(test).show(5)
Age | Sex_encoded | ChestPainType_encoded | log_RestingBP | log_Cholesterol | FastingBS | RestingECG_encoded | log_MaxHR | ExerciseAngina_encoded | Oldpeak | ST_Slope_encoded | label | features | rawPrediction | probability | prediction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | (1,[0],[1.0]) | (3,[2],[1.0]) | 4.795790545596741 | 5.497168225293202 | 0 | (2,[0],[1.0]) | 5.075173815233827 | (1,[0],[1.0]) | 0.0 | (2,[1],[1.0]) | 0 | [29.0,1.0,0.0,0.0... | [49.8957446062709... | [0.99791489212541... | 0.0 |
30 | (1,[],[]) | (3,[],[]) | 5.14166355650266 | 5.472270673671475 | 0 | (2,[1],[1.0]) | 5.135798437050262 | (1,[0],[1.0]) | 0.0 | (2,[1],[1.0]) | 0 | (15,[0,7,8,11,12,... | [43.9101496768557... | [0.87820299353711... | 0.0 |
35 | (1,[],[]) | (3,[0],[1.0]) | 4.9344739331306915 | 5.214935757608986 | 0 | (2,[0],[1.0]) | 5.204006687076795 | (1,[0],[1.0]) | 1.4 | (2,[1],[1.0]) | 0 | [35.0,0.0,1.0,0.0... | [31.9546509845149... | [0.63909301969029... | 0.0 |
35 | (1,[0],[1.0]) | (3,[0],[1.0]) | 4.795790545596741 | 0.0 | 1 | (2,[0],[1.0]) | 4.867534450455582 | (1,[],[]) | 1.2 | (2,[0],[1.0]) | 1 | [35.0,1.0,1.0,0.0... | [3.58490108022818... | [0.07169802160456... | 1.0 |
35 | (1,[0],[1.0]) | (3,[0],[1.0]) | 4.795790545596741 | 5.293304824724492 | 0 | (2,[0],[1.0]) | 4.867534450455582 | (1,[],[]) | 1.6 | (2,[0],[1.0]) | 1 | [35.0,1.0,1.0,0.0... | [3.07678419711129... | [0.06153568394222... | 1.0 |

only showing top 5 rows
rfROC = rfevaluator.evaluate(cvmodel.transform(test))
print(rfROC)
0.9318623058542413
### Gradient Boosting

Gradient boosting is a slower-learning, more complex tree-based classifier that combines gradient descent with boosting. Trees are fit sequentially: each new model is fit to reduce the loss function left by its predecessor, stepping in the direction of the negative gradient. The procedure continues until a sufficiently good estimate of the target variable has been reached (or the maximum number of iterations is hit).
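As a deliberately tiny, plain-Python illustration of that idea (not the Spark API), assume squared-error loss, so the negative gradient at each stage is simply the current residual:

# One observation, squared-error loss: each stage fits the current residual
y = 1.0              # true target
F = 0.0              # ensemble prediction, initialized to a constant
learning_rate = 0.5  # shrinkage applied to every new "tree"

for stage in range(4):
    residual = y - F          # negative gradient of 0.5*(y - F)^2 w.r.t. F
    F += learning_rate * residual
    print(f"stage {stage}: prediction = {F:.4f}")

Each pass moves the ensemble prediction a fraction of the way toward the target (0.5, 0.75, 0.875, 0.9375, ...), which is exactly the accumulate-and-shrink behavior the GBTClassifier applies with trees instead of raw residuals.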
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(maxIter=10)
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
ExerciseAngina_indexer, ST_Slope_indexer, encoder,
sqlTrans1, assembler, gbt])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show(2)
label | features |
---|---|
0 | [40.0,1.0,0.0,0.0... |
1 | [49.0,0.0,0.0,1.0... |

only showing top 2 rows
# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
.addGrid(gbt.maxDepth, [2, 4, 6])
.addGrid(gbt.maxBins, [20, 60])
.addGrid(gbt.maxIter, [10, 20])
.build())
# Create 5-fold CrossValidator
crossval = CrossValidator(estimator = pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = evaluator,
                          numFolds = 5)
cvmodel = crossval.fit(train)
# check which model is best
list(zip(cvmodel.avgMetrics, paramGrid))
Average cross-validated areaUnderROC by maxDepth, maxBins, and maxIter (four decimals):

maxDepth | maxBins | maxIter=10 | maxIter=20 |
---|---|---|---|
2 | 20 | 0.9097 | 0.9156 |
2 | 60 | 0.9086 | 0.9136 |
4 | 20 | 0.9115 | 0.9172 |
4 | 60 | 0.9047 | 0.9099 |
6 | 20 | 0.8852 | 0.8865 |
6 | 60 | 0.8704 | 0.8803 |
# use the best model
cvmodel.transform(test).show(5)
Age | Sex_encoded | ChestPainType_encoded | log_RestingBP | log_Cholesterol | FastingBS | RestingECG_encoded | log_MaxHR | ExerciseAngina_encoded | Oldpeak | ST_Slope_encoded | label | features | rawPrediction | probability | prediction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | (1,[0],[1.0]) | (3,[2],[1.0]) | 4.795790545596741 | 5.497168225293202 | 0 | (2,[0],[1.0]) | 5.075173815233827 | (1,[0],[1.0]) | 0.0 | (2,[1],[1.0]) | 0 | [29.0,1.0,0.0,0.0... | [1.29550740802887... | [0.93028106501865... | 0.0 |
30 | (1,[],[]) | (3,[],[]) | 5.14166355650266 | 5.472270673671475 | 0 | (2,[1],[1.0]) | 5.135798437050262 | (1,[0],[1.0]) | 0.0 | (2,[1],[1.0]) | 0 | (15,[0,7,8,11,12,... | [0.84109146598563... | [0.84319337096572... | 0.0 |
35 | (1,[],[]) | (3,[0],[1.0]) | 4.9344739331306915 | 5.214935757608986 | 0 | (2,[0],[1.0]) | 5.204006687076795 | (1,[0],[1.0]) | 1.4 | (2,[1],[1.0]) | 0 | [35.0,0.0,1.0,0.0... | [0.96705524005898... | [0.87370369024086... | 0.0 |
35 | (1,[0],[1.0]) | (3,[0],[1.0]) | 4.795790545596741 | 0.0 | 1 | (2,[0],[1.0]) | 4.867534450455582 | (1,[],[]) | 1.2 | (2,[0],[1.0]) | 1 | [35.0,1.0,1.0,0.0... | [-1.1716780737366... | [0.08759531203936... | 1.0 |
35 | (1,[0],[1.0]) | (3,[0],[1.0]) | 4.795790545596741 | 5.293304824724492 | 0 | (2,[0],[1.0]) | 4.867534450455582 | (1,[],[]) | 1.6 | (2,[0],[1.0]) | 1 | [35.0,1.0,1.0,0.0... | [-1.2415565571794... | [0.07705052522522... | 1.0 |

only showing top 5 rows
gbtROC = evaluator.evaluate(cvmodel.transform(test))
print(gbtROC)
0.9171146953405022
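Boosted trees can overfit; a quick hedge, reusing the objects above, is to compare the train and test AUC:

# A large train/test gap in areaUnderROC would suggest overfitting
train_auc = evaluator.evaluate(cvmodel.transform(train))
test_auc = evaluator.evaluate(cvmodel.transform(test))
print(f"train AUC = {train_auc:.4f}, test AUC = {test_auc:.4f}")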
### Decision Tree

A decision tree classifies observations by recursively splitting the data on yes/no questions until the observations are separated appropriately into classes. The algorithm is easy to understand and its output is easy to interpret, and predictors do not need to be scaled; however, a small change in the data can have a large effect on the fitted tree and its predictions.
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
ExerciseAngina_indexer, ST_Slope_indexer, encoder,
sqlTrans1, assembler, dt])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show(2)
label | features |
---|---|
0 | [40.0,1.0,0.0,0.0... |
1 | [49.0,0.0,0.0,1.0... |

only showing top 2 rows
dtevaluator = BinaryClassificationEvaluator()
# Create ParamGrid for Cross Validation
dtparamGrid = (ParamGridBuilder()
.addGrid(dt.maxDepth, [2, 5, 10])
.addGrid(dt.maxBins, [10, 20, 40, 80, 100])
.build())
# Create 5-fold CrossValidator
crossval = CrossValidator(estimator = pipeline,
estimatorParamMaps = dtparamGrid,
evaluator = dtevaluator,
numFolds = 5)
cvmodel = crossval.fit(train)
# check which model is best; the metrics must be zipped against dtparamGrid
# (the grid actually searched here), not the earlier GBT paramGrid
list(zip(cvmodel.avgMetrics, dtparamGrid))
When this cell was first run it zipped the decision-tree metrics against the stale GBT `paramGrid`, so the printed labels named the wrong estimator and the twelve GBT grid entries truncated the fifteen decision-tree metrics. Re-keyed to `dtparamGrid` order (assuming `ParamGridBuilder` enumerates maxDepth as the outer loop and maxBins as the inner loop), the twelve metrics that were printed are:

maxDepth | maxBins | areaUnderROC |
---|---|---|
2 | 10 | 0.7792 |
2 | 20 | 0.7792 |
2 | 40 | 0.7792 |
2 | 80 | 0.7792 |
2 | 100 | 0.7792 |
5 | 10 | 0.8435 |
5 | 20 | 0.8204 |
5 | 40 | 0.8073 |
5 | 80 | 0.7986 |
5 | 100 | 0.7945 |
10 | 10 | 0.8308 |
10 | 20 | 0.8256 |
# Use the best model from cross-validation to predict on the test set
dtpred = cvmodel.transform(test)
dtpred.show(5)
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
|Age|  Sex_encoded|ChestPainType_encoded|     log_RestingBP|  log_Cholesterol|FastingBS|RestingECG_encoded|        log_MaxHR|ExerciseAngina_encoded|Oldpeak|ST_Slope_encoded|label|            features|       rawPrediction|         probability|prediction|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
| 29|(1,[0],[1.0])|        (3,[2],[1.0])| 4.795790545596741|5.497168225293202|        0|     (2,[0],[1.0])|5.075173815233827|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|[29.0,1.0,0.0,0.0...|[49.8957446062709...|[0.99791489212541...|       0.0|
| 30|    (1,[],[])|            (3,[],[])|  5.14166355650266|5.472270673671475|        0|     (2,[1],[1.0])|5.135798437050262|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|(15,[0,7,8,11,12,...|[43.9101496768557...|[0.87820299353711...|       0.0|
| 35|    (1,[],[])|        (3,[0],[1.0])|4.9344739331306915|5.214935757608986|        0|     (2,[0],[1.0])|5.204006687076795|         (1,[0],[1.0])|    1.4|   (2,[1],[1.0])|    0|[35.0,0.0,1.0,0.0...|[31.9546509845149...|[0.63909301969029...|       0.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|              0.0|        1|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.2|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[3.58490108022818...|[0.07169802160456...|       1.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|5.293304824724492|        0|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.6|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[3.07678419711129...|[0.06153568394222...|       1.0|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
only showing top 5 rows
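Before evaluating, it can be useful to confirm which hyperparameters won the cross-validation. A minimal sketch, assuming `cvmodel` is the CrossValidatorModel above and its estimator was the bare classifier (if it was wrapped in a Pipeline, read from `cvmodel.bestModel.stages[-1]` instead):

# Sketch: inspect the winning hyperparameters of the cross-validated model
best = cvmodel.bestModel
print("maxDepth:", best.getMaxDepth())
print("maxBins: ", best.getMaxBins())
print("maxIter: ", best.getMaxIter())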
# Evaluate AreaUnderROC on the test-set predictions
dtROC = dtevaluator.evaluate(dtpred)
print(dtROC)
0.8903076463560335
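For readers jumping in at this point, `dtevaluator` is the BinaryClassificationEvaluator defined earlier in the notebook. A typical construction looks like the sketch below, shown here only as a reminder of which columns and metric it reads:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Sketch of a typical evaluator definition (the actual one appears earlier
# in the notebook): area under the ROC curve, read from the rawPrediction
# and label columns
dtevaluator = BinaryClassificationEvaluator(labelCol="label",
                                            rawPredictionCol="rawPrediction",
                                            metricName="areaUnderROC")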
# Compare test-set AreaUnderROC across the four algorithms
sns.set_style("whitegrid")
plt.figure(figsize=(16, 5))
sns.lineplot(x=["Logistic", "RandomForest", "GradientBoosting", "DecisionTree"],
             y=[0.9363052568697732, 0.9318623058542413, 0.9171146953405022, 0.8903076463560335])
plt.yticks(np.arange(0, 1.1, 0.1))
plt.ylabel("AreaUnderROC")
plt.xlabel("Algorithm")
plt.show()
Based on the graph above, logistic regression achieves the highest AreaUnderROC, at 93.63%.
Algorithm | AreaUnderROC |
---|---|
Logistic Regression | 93.63% |
Random Forest | 93.18% |
Gradient Boosting | 91.71% |
Decision Tree | 89.03% |
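The same comparison can be assembled programmatically from the four test-set scores, which makes it easy to re-sort or extend if more algorithms are added. A minimal sketch using the pandas import from the top of the notebook:

# Sketch: build the algorithm comparison table from the test-set AUCs above
results = pd.DataFrame({
    "Algorithm": ["Logistic Regression", "Random Forest",
                  "Gradient Boosting", "Decision Tree"],
    "AreaUnderROC": [0.9363, 0.9319, 0.9171, 0.8903],
}).sort_values("AreaUnderROC", ascending=False)
print(results.to_string(index=False))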
In this report, we fit four classification algorithms, performed a comparative analysis, and achieved promising results on the available data. Based on the binary classification evaluator, its associated AreaUnderROC metric, and the hyperparameters supplied to each algorithm, the logistic regression model performed best at classifying whether a patient has heart disease. Logistic regression took comparatively more time to fit and to predict, but it produced the most promising predictions. Random forest delivered nearly the same performance in separating heart disease carriers from non-carriers, so tuning its hyperparameters further may also yield strong results.
Key Findings
We also observed that careful statistical analysis remains a necessary companion to modern tooling such as the MLlib Pipeline: a dataset that has been properly analyzed can be handled far more effectively. In this dataset, `Oldpeak` and `MaxHR` emerged as the two strongest predictors. Outlier detection is likewise important for understanding how the data may be skewed, and data preprocessing is necessary to supply well-formed inputs to the models and achieve better prediction results.
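One way to sanity-check the observation about `Oldpeak` and `MaxHR` is to read the feature importances off one of the fitted tree ensembles. The sketch below uses hypothetical names: `rfmodel` for a fitted RandomForestClassificationModel and `feature_names` for the list of assembled feature columns; neither is defined under those names in this notebook.

# Sketch: rank features by importance from a fitted random forest.
# `rfmodel` and `feature_names` are hypothetical placeholders here.
importances = sorted(zip(feature_names, rfmodel.featureImportances.toArray()),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances[:5]:
    print(f"{name}: {score:.3f}")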
Future Scope
This model is not yet sufficient for real-world prediction. To produce an even more accurate heart disease prediction model, it would help to obtain a larger as well as more recent dataset, and to take proper steps to handle outliers.
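As one concrete option for the outlier handling mentioned above, values outside the usual 1.5 × IQR fences could be capped before modeling. A minimal sketch on the pandas-on-Spark frame read in at the start of the notebook, using `RestingBP` as the example column:

# Sketch: cap outliers in one numeric column at the 1.5 * IQR fences
q1 = psdf_heart["RestingBP"].quantile(0.25)
q3 = psdf_heart["RestingBP"].quantile(0.75)
iqr = q3 - q1
psdf_heart["RestingBP"] = psdf_heart["RestingBP"].clip(lower=q1 - 1.5 * iqr,
                                                       upper=q3 + 1.5 * iqr)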