Supervised Learning Model Selection
¶

MLlib Pipline and Cross-Validation Utilizing Heart Failure Data
¶

Pramodini Karwande and Ashley Ko
¶

Introduction¶

This notebook uses supervised learning to fit and select the best model for prediction using the Heart Failure Prediction Dataset. As heart disease is the leading cause of death globally, we believed it would interest to examine this dataset with supervised learning methods.

The data was obtained via Kaggle and sourced from five different clinical data sets. It consists of 11 clinical features for predicting heart disease. These clinical features are:

  • Age: age of the patient [years]
  • Sex: sex of the patient [M: Male, F: Female]
  • hestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  • RestingBP: resting blood pressure [mm Hg]
  • Cholesterol: serum cholesterol [mm/dl]
  • FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  • RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
  • MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
  • ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
  • Oldpeak: oldpeak = ST [Numeric value measured in depression]
  • ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
  • HeartDisease: output class [1: heart disease, 0: Normal]

This notebook seeks to fit models that predict for HeartDisease. As this attribute is categorical, we proceeded with classification methods.

Supervised Learning Idea and Data Split¶

Supervised learning a type of machine learning model were a variable or variables is to represent a response. The goal of supervised learning is to make inference or make predictions . Algorithms are used to fit and select models for classification or prediction. In our case, we are using classification.

One well known method of supervised learning is the generalized linear model. As our response data is binary, we specifically used logistic regression. Other methods include Tree based methods such as classification trees, random forest, or boosting. Tree models use recursive splitting of the predictor region. Classification trees are used for predicting group membership. This works well with our binary data. Random forest and boosting are tree based methods which averaged overall trees using bootstrap aggregation.

As we seek to predict the class of an observation as either 0 - Normal or 1 - Heart Disease, it is best practice to not predict on data that was used to fit the model. The reason for this is that novel data is necessary to accurately test your predictions. If a model was fit using the full data set, there aren't any novel observations. This leads to overfitting and inflated model accuracy.

The solution to the problem of overfitting is to split the full data set into training and testing data sets. Models will be fit and selected based on training data. Fitted models will be tested and predictions drawn from the test data set. Typically, a 80% train 20% test data split is used. We selected a 75% train and 25% test split to allow for adequate sample size.

Data Splitting¶

We begin by reading in the heart.csv file into a pandas-on-Spark DataFrame. We chose to start with pandas-on-spark as it allows for easier manipulations for exploratory data analysis.

But first let's import the required libraries and set the notebook environment.
In the Later part, we will need spark session to work with RDD, Dataframe. So we are creating new spark session as well.

In [292]:
# Import packages
import os
import sys
import warnings
import pandas as pd
import numpy as np
import pyspark.pandas as ps
import matplotlib.pyplot as plt
import seaborn as sns

# Set environment
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# Remove warnings from rendered output
warnings.filterwarnings("ignore")


# Set figure size
plt.rcParams["figure.figsize"] = (10,7)

# Spark Session builder
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.debug.maxToStringFields", "100").getOrCreate()

Here we are reading in the data as a pandas-on-Spark DataFrame.

In [293]:
# Read in data as pandas-on-Spark data frame
psdf_heart = ps.read_csv("heart.csv")
# Checking if import was successful
psdf_heart.head()
Out[293]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0

Using psdf_heart.info() we determined learned the data type of the pandas-on-Spark variables. This allows us to determine if we need to transform variables to match their true data type. For example categorical data should be classified as such.

In [294]:
psdf_heart.info()
<class 'pyspark.pandas.frame.DataFrame'>
Int64Index: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int32  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int32  
 4   Cholesterol     918 non-null    int32  
 5   FastingBS       918 non-null    int32  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int32  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int32  
dtypes: float64(1), int32(6), object(5)

It looks like the categorical variables have all been stored as objects. Later, we will use transformations get these variables in our desired format.

Another potential issue would be the presence of null variables. Any null values should be processed to prevent errors.

In [295]:
# Check for null values
print(psdf_heart.isnull().sum())
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

As there are no null values, we elected to proceed with data splitting. We begin by converting our pandas-on-Spark DataFrame to a Pyspark SQL DataFrame.

In [296]:
# Make a Pyspark SQL dataframe from pandas-on-Spark df
sql_heart = psdf_heart.to_spark()

To the split the data into train and test data sets, we used randomSplit with a split of 75/25 training to test. This was selected as a safe ratio allow for enough samples in both the training data and the test data.

In [297]:
train, test = sql_heart.randomSplit([0.75, 0.25], seed = 1234)
print(train.count(), test.count())
686 232

For some of the plotting, we need matplotlib. This requires a pandas DataFrame.

In [298]:
# Make a pandas DF from PySpark SQL df
pd_train = train.toPandas()

Other graphs and numeric summaries lent themselves better to using a pandas-on-Spark data set best fit our needs.

In [299]:
# Make a pandas-on-Spark DF from PySpark SQL df
psdf_train = train.to_pandas_on_spark()

Exploratory Data Analysis¶

Exploring the training data allows us to better understand the relationship of the predictor and the response variables. It also alerts us to any unusual values.

We generated 8 number summaries for the Age, RestingBP, Cholesterol, MaxHR, Oldpeak.

In [300]:
psdf_train[['Age','RestingBP','Cholesterol','MaxHR','Oldpeak']].describe()
Out[300]:
Age RestingBP Cholesterol MaxHR Oldpeak
count 686.000000 686.000000 686.000000 686.000000 686.000000
mean 53.690962 132.091837 196.749271 136.677843 0.907289
std 9.282468 18.546025 111.301411 25.526051 1.074837
min 28.000000 0.000000 0.000000 60.000000 -2.000000
25% 48.000000 120.000000 170.000000 120.000000 0.000000
50% 54.000000 130.000000 222.000000 138.000000 0.600000
75% 60.000000 140.000000 265.000000 156.000000 1.500000
max 76.000000 200.000000 603.000000 202.000000 6.200000

From the summary table above, it is evident that there are potential outliers for Cholesterol and RestingBP. Cholesterolhas a maximum of 603 which is more than two standard deviations above the mean. This observation is suspect, but more information of that particular observation would be needed before removing it. The minimum resting blood pressure and cholesterol are 0. Cholesterol and RestingBP of 0 is cause for concern. It could be that these values were not recorded or that these values are true measurements. We do not know for certain. Our concerns is that these observations might be over represented in either the test or training data and lead inaccurate model predictions or erroneously be selected as a good predictor for heart disease.

To assess the proportion of observations with either Cholesterol of 0 or RestingBP we obtained counts by filtering the SPARK SQL dataset accordingly.

First, let us look at the occurrences of a RestingBP of 0. The value below is the count from the training data set.

In [301]:
# Display counts for RestingBP of 0
print(train.filter(train.RestingBP == 0).count())
1

There is only a single observation with a RestingBP of 0. Because we cannot determine the reason for this value, we will continue to use this observation.

Now, let's examine counts for observations where Cholesterol is zero. The first value printed is the number of observations in the training and the second number is ratio of observations with heart disease and Cholesterol of 0 out of the total number of observations with heart disease in the training data.

In [302]:
# Display counts for Cholesterol of 0 and respective ratios for training and test data
print(train.filter(train.Cholesterol == 0).count(),
      train.filter(train.Cholesterol == 0)
      .filter(train.HeartDisease == 1).count()/
      train.filter(train.HeartDisease == 1).count())
135 0.3151041666666667

We calculated the proportion of observations with heart disease and cholesterol measurement of 0. This proportion is approximately 0.315.

There is a relatively weak correlation -0.246 between heart disease and cholesterol. However, given the general understanding of the relationship between heart disease, we decided to proceed with outlier observations and make note of this for further work.

In [303]:
psdf_train.corr().style.background_gradient(cmap='coolwarm').set_precision(3)
Out[303]:
  Age RestingBP Cholesterol FastingBS MaxHR Oldpeak HeartDisease
Age 1.000 0.263 -0.114 0.217 -0.418 0.240 0.298
RestingBP 0.263 1.000 0.093 0.044 -0.113 0.151 0.103
Cholesterol -0.114 0.093 1.000 -0.275 0.250 0.057 -0.246
FastingBS 0.217 0.044 -0.275 1.000 -0.169 0.049 0.277
MaxHR -0.418 -0.113 0.250 -0.169 1.000 -0.185 -0.430
Oldpeak 0.240 0.151 0.057 0.049 -0.185 1.000 0.406
HeartDisease 0.298 0.103 -0.246 0.277 -0.430 0.406 1.000

The above table shows that no numeric variable has a correlation more extreme than -0.430 which is the correlation between heart disease and max heart rate. Oldpeak has a correlation of 0.406. These finding suggest that while there is some weak correlations to no correlation among the predictors. This is a good thing. However, the predictors are also only just weakly correlated with heart disease.
Below is the graphical representation of correlation between HeartDisease and Numerical__ variables.

In [471]:
cat_data = pd_train[['HeartDisease', 'Sex', 'ChestPainType', 'FastingBS','RestingECG', 'ExerciseAngina', 'ST_Slope']]

Let's see below correlation in graphical format for Numeric variables 'Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak' with Heart Disease.

In [472]:
num_data = [col for col in pd_train.columns if col not in cat_data]
num_data
Out[472]:
['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
In [452]:
sns.pairplot(pd_train[num_data+['HeartDisease']],
             plot_kws={'alpha': 0.5});

Now we will find aggregate means for each of the numeric variables based on Heart Disease.

In [304]:
psdf_summ_group = psdf_train.groupby('HeartDisease').agg(
    {'Age': ['mean', 'std'], 'RestingBP': ['mean', 'std'], 'Cholesterol': ['mean', 'std'],
     'MaxHR': ['mean', 'std'], 'Oldpeak': ['mean', 'std']})
psdf_summ_group
Out[304]:
Age RestingBP Cholesterol MaxHR Oldpeak
mean std mean std mean std mean std mean std
HeartDisease
1 56.140625 8.400299 133.783854 20.245559 172.460938 128.289952 126.945312 23.208998 1.294271 1.161353
0 50.576159 9.429181 129.940397 15.903251 227.632450 74.434305 149.052980 22.867752 0.415232 0.692076

There are visible difference in the mean and standard deviations over HeartDisease. With the exception o Cholesterol and MaxHR all means and standard deviations are greater in observations flagged as having heart disease. With respect to Cholesterol, the difference in mean might be linked to previously mentioned the extreme maximum and minimum values.

Let's repeat above task to get mean and std for the given numeric variables based on sex.

In [305]:
psdf_summ_sex = psdf_train.groupby('Sex').agg(
    {'Age': ['mean', 'std'], 'RestingBP': ['mean', 'std'], 'Cholesterol': ['mean', 'std'],
     'MaxHR': ['mean', 'std'], 'Oldpeak': ['mean', 'std']})
psdf_summ_sex
Out[305]:
Age RestingBP Cholesterol MaxHR Oldpeak
mean std mean std mean std mean std mean std
Sex
F 52.514085 9.956455 131.873239 19.182904 238.049296 92.569570 145.556338 22.463374 0.687324 0.978119
M 53.998162 9.082900 132.148897 18.393837 185.968750 113.313598 134.360294 25.786608 0.964706 1.092248

When comparing means between male and female observations, we noted that there are not strong differences between sexes. However, the female Cholesterol mean is approximately 238 and the male mean is ~186. It is possible that this difference is due to the fact that women tend to have higher levels of HDL cholesterol than men.

Examining the center and spread of our numeric variables allowed us to identify unique points in RestingBP and Cholesterol. To visualize the shape and spread of these variables' distributions, when created histograms for each of the numeric variables.

Histogram***¶

First, let us look at the distribution of Age.

In [306]:
pd_train.Age.hist(bins = 10)
plt.xlabel("Age")
plt.title("Histogram of Age of Participants")
plt.show()

The histogram for Age shows a bell shaped left-skewed distribution. This is likely due to the fact that older individuals the ones most commonly assessed for heart disease.

Next, we have a histogram of RestingBP.

In [307]:
pd_train.RestingBP.hist(bins = 20)
plt.xlabel("RestingBP")
plt.title("Histogram of Participant Resting Blood Pressure")
plt.show()

Systolic blood pressure is typically greater than 100, which is what we see from this plot. There is one observation at zero which matches our previous finding.

Histogram for Cholesterol.

In [308]:
pd_train.Cholesterol.hist(bins = 20)
plt.xlabel("Cholesterol")
plt.title("Histogram of Participant Cholesterol Measurement")
plt.show()

Once again we see that many observations have Cholesteral of zero. At the upper end of the scale, you can see several values beyond 450. Most other observations are between 100 and 400. This matches our previous statistics.

Next, we have a histogram for MaxHR.

In [309]:
pd_train.MaxHR.hist(bins = 20)
plt.xlabel("MaxHR")
plt.title("Histogram of Participant Maximum Heart Rate")
plt.show()
Exception ignored in: <function JavaWrapper.__del__ at 0x0000028BD71B7D30>
Traceback (most recent call last):
  File "C:\Users\kar_d\anaconda3\lib\site-packages\pyspark\ml\wrapper.py", line 39, in __del__
    if SparkContext._active_spark_context and self._java_obj is not None:
AttributeError: 'OneHotEncoder' object has no attribute '_java_obj'

MaxHR has a large spread. This histogram matches with the fact that MaxHR has a mean and standard deviation of ~25.

Our last histogram is for Oldpeak.

In [310]:
pd_train.Oldpeak.hist(bins = 50)
plt.xlabel("Oldpeak")
plt.title("Histogram of Participant Oldpeak")
plt.show()

Oldpeak has a rather unusual histogram with numerous observations at zero and no clear shape. The distribution almost looks as if this variable is more discrete than continuous. This histogram matches with the fact that Oldspeak values are generally small.

By now we have a pretty good sense of shape, spread, and center of the numeric predictor variables. Let's now examine how these components change when grouped by HeartDisease. To do this we will look at boxplots for each numeric variable plotted over HeartDisease.

Box Plot**¶

In [395]:
pd_train.boxplot(column = ['Age'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()

As shown by the 8 number summaries and aggregate mean and standard deviation tables, individual's with heart disease tend to be older.

In [396]:
pd_train.boxplot(column = ['RestingBP'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()

A normal BP level is less than 120/80 mmHg..As we observed above, patients having heart disease has minimum value of RestingBP as 0 which is an outlier and data seems to be littlebit skewed. Patients having heart disease tends to have more RestingBP than non having heart disease persons.

In [397]:
pd_train.boxplot(column = ['Cholesterol'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()

The above boxplot displays Cholesterol effect on heart disease carrier and non-carriers. There are many outliers. Cholesterol contains many zero readings for patients having heart disease. so we observe here negative skewness or we can say bottom skewed data for having heart disease. we may need to analyze more on zero readings if we ignore/adjust values accordingly or just keep as it is to comeup with correct predictions for Cholesterol.

In [398]:
pd_train.boxplot(column = ['MaxHR'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()

"Calculate your resting heart rate by counting how many times your heart beats per minute when you are at rest, such as first thing in the morning. It's usually somewhere between 60 and 100 beats per minute for the average adult." Source

Average MaxHR is 136. Higher Maximum heart rate achieved records refers having less cases of heard disease carriers vs non-heart disease carriers. Many of tested patients with heart disease have around 120 maxHR

In [399]:
pd_train.boxplot(column = ['Oldpeak'], by = 'HeartDisease').legend(('0=heart Disease Non-Carriers',"1=heart Disease Carriers"),loc='upper center')
plt.show()

Oldpeak denotes ST depression induced by exercise relative to rest. Most cases with zero Oldpeak do not carry heart disease.

In statistics, Contingency table is a type of table in matrix format which display frequency distribution of the categorical variables.Source Pandas is matured library to display more visualization on contingency tables than pyspark pandas like rownames, column names, margins,... Since we are using below pyspark pandas dataframe, we are limited to use the params.
Let's continue to see the tabular format relationship between our categorical variables below.

Contingency Tables¶

In [316]:
train.crosstab('HeartDisease', 'Sex').show()
+----------------+---+---+
|HeartDisease_Sex|  F|  M|
+----------------+---+---+
|               1| 38|346|
|               0|104|198|
+----------------+---+---+

Above Contingency table for HearDisease and Sex displays males are having more cases of heart disease. 0.6 males out of total males having heart disease vs 0.2 out of total females having heart disease. Based on sex, 0.05 females out of total male and female and 0.5 males out of total male and female having heart disease which is significant difference.

In [317]:
train.crosstab('HeartDisease', 'ChestPainType').show()
+--------------------------+---+---+---+---+
|HeartDisease_ChestPainType|ASY|ATA|NAP| TA|
+--------------------------+---+---+---+---+
|                         1|299| 18| 53| 14|
|                         0| 78|106| 97| 21|
+--------------------------+---+---+---+---+

Contingency table for HeartDisease and ChestPainType displays ASY is having most cases of heart disease where as TA is having minimum cases of heart disease. Based on the each type, 0.79 ASY out of total ASY, 0.14 ATA out of total ATA, 0.35 NAP out of total NAP and 0.4 TA out of total TA is having heart disease.

In [411]:
train.crosstab('HeartDisease', 'FastingBS').show()
+----------------------+---+---+
|HeartDisease_FastingBS|  0|  1|
+----------------------+---+---+
|                     1|250|134|
|                     0|269| 33|
+----------------------+---+---+

The above table shows the counts for Fasting Blood Suger vs heart disease. ~0.2 out of total is having heart disease.

In [319]:
train.crosstab('HeartDisease', 'RestingECG').show()
+-----------------------+---+------+---+
|HeartDisease_RestingECG|LVH|Normal| ST|
+-----------------------+---+------+---+
|                      1| 69|   219| 96|
|                      0| 65|   194| 43|
+-----------------------+---+------+---+

Contingency table for HeartDisease and RestingECG display that ST is having lower number of non-heart disease cases. Let's go deep into it to see more in details. Out of total, 0.3 Normal, 0.10 LHV, 0.13 ST carry heart disease. If we observe for each restingECG type, 0.51% LVH out of total LVH, 0.53% Noraml out of total Normal and 0.69% ST out of total ST contains heart disease. So most frequency having heart disease is ST even though we see less counts than other 2 categories.

In [320]:
train.crosstab('HeartDisease', 'ExerciseAngina').show()
+---------------------------+---+---+
|HeartDisease_ExerciseAngina|  N|  Y|
+---------------------------+---+---+
|                          1|144|240|
|                          0|262| 40|
+---------------------------+---+---+

HeartDisease and ExerciseAngina shows patients having heart disease and also Exercise induced angina is 240 i.e. 0.35% out of total ExerciseAngina records from available data.

In [321]:
train.crosstab('HeartDisease', 'ST_Slope').show()
+---------------------+----+----+---+
|HeartDisease_ST_Slope|Down|Flat| Up|
+---------------------+----+----+---+
|                    1|  36| 282| 66|
|                    0|   9|  59|234|
+---------------------+----+----+---+

Contingency table forHeartDisease and ST_Slope display Flat category having more heart disease carriers than Down and Up categories.

Till now we see tabular form of the frequencies between HeartDisease and other categorical variables. Let's visualize these using Bar Plots which will help us to differenciate these results quickly. Let's have a quick look below.

Bar Plot**¶

In [496]:
table = pd.crosstab(cat_data.HeartDisease, cat_data.Sex)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Sex")
plt.show()

Same as above contingency table for HeartDisease and Sex, we obseve that mostly men carries heart disease than females.

In [494]:
table = pd.crosstab(cat_data.HeartDisease, cat_data.ChestPainType)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Chest Pain Type")
plt.show()

Bar plot for HeartDisease vs ChestPainType displays Type ASY is having most cases of heart disease.

In [497]:
table = pd.crosstab(cat_data.HeartDisease, cat_data.FastingBS)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Fasting BS")
plt.show()

Bar plot for HeartDisease vs FastingBS displays less correlation between heart disease and having Fasting Blood Sugar. Having FastingBS is having relatively less number of heart disease than non-having FastingBS. .

In [498]:
table = pd.crosstab(cat_data.HeartDisease, cat_data.RestingECG)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Resting ECG")
plt.show()

Barplot for HeartDisease vs RestingECG shows there is not much impact on LVH for disease carrier vs non-carrier.

In [488]:
plt.style.use('fivethirtyeight')
table = pd.crosstab(cat_data.HeartDisease, cat_data.ExerciseAngina)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by Exercise Angina")
plt.legend(loc="upper center")
plt.show()

Bar plot HeartDisease vs ExerciseAngina display more number of heart disease carrier when there is ExerciseAngina is true vs less disease carriers where ExerciseAngina is false.

In [413]:
plt.style.use('fivethirtyeight')
table = pd.crosstab(cat_data.HeartDisease, cat_data.ST_Slope)
table.plot.bar()
plt.title("Bar Plot of Heart Disease by ST Slope")
plt.show()

Bar plot for HeartDisease vs ST_Slope shows Flat category is having more disease carriers vs Up category is having less number of non-disease carriers. We observed the same in contingency table.

Scatter Plot**¶

Let's see couple of scatter plot visualization for our data

In [490]:
import seaborn as sns
sns.scatterplot(pd_train['Age'],pd_train['Oldpeak'], hue=pd_train['HeartDisease'])
Out[490]:
<AxesSubplot:xlabel='Age', ylabel='Oldpeak'>

The above scatter plot displays the older patients having heart disease, are having higher oldpeak. Oldpeak is appearing mostly in the range of 0 to 2.

In [491]:
sns.scatterplot(pd_train['MaxHR'],pd_train['ChestPainType'], hue=pd_train['HeartDisease']) 
Out[491]:
<AxesSubplot:xlabel='MaxHR', ylabel='ChestPainType'>

As we already observed, ASY chest pain type is having more heart disease carriers than other chest pain types.

In [476]:
sns.violinplot(x=cat_data["ChestPainType"],y=pd_train["MaxHR"],hue=cat_data["HeartDisease"],palette="viridis")
plt.xlabel("Chest Pain Type")
plt.ylabel("Maximum heart rate achieved")
plt.title("Maximum heart rate achieved vs Chest Pain Type vs Heart Disease Carrier")
plt.legend(loc=4);

The above violinplot from seaborn library displays visualization between MaxHR, ChaistPainType and HeartDisease. There is MaxHR for ATA chaistpaintype but seems like low number of hear disease carriers. The most heart disease carriers are chestpaintype is ASY having maxHR between 50 to 200.

Above we looked into available data and analyze it accordingly. Now its time to deep dive into prediction.

Modeling¶

The goal of Statistical Modeling is to summarize test results such a way that researcher can observe the data patterns and draw conclusion to take efficient business decisions.

Preventing heart disease is very necessary to save lives of people. For healthy lives, good data driven system to predict heart disease can improve the research and prevent this disease where machine learning comes into picture. We will be proceeding to use some machine learning models to predict the heart disease.

Also we will be using MLlib pipeline which is API for ML algorithms to combine multiple algorithms into single pipeline or workflow.

This report contains the Heart Disease Data to predict if patient is having heart disease or not which is our response variable. Since our response variable is binary categorical variable, we will use classification algorithms as below.

  • Logistic Regression
  • Random Forest Classifier
  • Gradient Boosted Tree Classifier
  • Decision Tree Classifier

Let's import required libraries.

In [329]:
#import required library
from pyspark.ml.feature import StringIndexer, OneHotEncoder, SQLTransformer, VectorAssembler

Data PreProcessing
Machine learning models require numerical data. We do have categorical variables in our data like Sex, ChestPainType, ... So we need to convert our categorical variables into numeric forms.

  • StringIndexer maps string column of labels to an ML column of label indices. Let's say our Sex input column is having Male and Female which will convert to 0.0 and 1.0 indices.

  • One Hot Encoder Categorical features are turned into binary features that are “one-hot” encoded, meaning that if a feature is represented by that column, it receives a 1. Otherwise, it receives a 0

  • Log Transormation - We have numerical variables in different ranges. This will highly impact our prediction. To reduce this skewness, we will transform our RestingBP, Cholesterol and MaxHR into log form. RestingBP and Cholesterol is having 0 value which is causing issue when we transform to log and run the models. so we have added 1 to each column value of RestingBP and Cholesterol and then have taken the log.

In [330]:
sex_indexer = StringIndexer(inputCol = "Sex", outputCol="SexIndex")

chestPain_indexer = StringIndexer(inputCol = "ChestPainType", outputCol="ChestPainTypeIndex")

RestingECG_indexer = StringIndexer(inputCol = "RestingECG", outputCol="RestingECGIndex")

ExerciseAngina_indexer = StringIndexer(inputCol = "ExerciseAngina", outputCol="ExerciseAnginaIndex")

ST_Slope_indexer  = StringIndexer(inputCol = "ST_Slope", outputCol="ST_SlopeIndex")
In [331]:
encoder = OneHotEncoder().setInputCols(["SexIndex", "ChestPainTypeIndex", "RestingECGIndex",
                                        "ExerciseAnginaIndex", "ST_SlopeIndex"])\
                                 .setOutputCols(["Sex_encoded", "ChestPainType_encoded",
                                                 "RestingECG_encoded", "ExerciseAngina_encoded",
                                                 "ST_Slope_encoded"])

SQLTransformer - After Data Processing, we are using transformer. SQLTransformer implements transformations defined by below sql statement and keep HeartDisease as label to use for modeling.

In [332]:
sqlTrans1 = SQLTransformer(
    statement = "SELECT Age, Sex_encoded, ChestPainType_encoded,"+
                "log(RestingBP+1) as log_RestingBP," + 
                "log(Cholesterol+1) as log_Cholesterol, FastingBS, RestingECG_encoded," +
                "log(MaxHR) as log_MaxHR, ExerciseAngina_encoded, Oldpeak," +
                "ST_Slope_encoded," +
                "HeartDisease as label FROM __THIS__"
)

VectorAssembler - VectorAssembler help to merge all predictors into one vector to use as features variable while modeling.

In [333]:
assembler = VectorAssembler(inputCols = ["Age","Sex_encoded", "ChestPainType_encoded","FastingBS","RestingECG_encoded",
                                        "ExerciseAngina_encoded","Oldpeak","ST_Slope_encoded","log_MaxHR","log_RestingBP","log_Cholesterol"],
                            outputCol = "features",
                            handleInvalid = 'keep')

Logistic Regression¶

Logistic Regression is a type of Generalized Linear model which can have both numerical and categorical predictors. Logistic Regression is used for binary response variable. So this is perfect to model our data. It models average of HeartDisease success represent the probability of patients having heart disease. Basic logistic Regression models success probability using the logistic function. Logistic Regression's range is bounded between 0 and 1. Logistic Regression uses a loss function as MLE which is a conditional probability. If the probability is greater than 0.5, the predictions will be classified as 0. Otherwise, 1 will be assigned.

In [561]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()

MLlib Pipeline¶

As mentioned above, MLlib Pipeline is an API , make easier to combine multiple algorithms into a single workflow. Pipeline works in the sequence of stages. It contains below stages.

  • Transformer - Transformer will take dataframe, read column, apply necessary transformations like we did earlier to as sqlTransform, VectorAssembler, log-transform and provide output as features and target variable as label to estimator to use for modeling. Basically Transformer perform transform and outputs transformed Dataframe for estimator.
  • Estimator - Estimator will take the transformed dataset and learn the algorithm. so here we have Logistic Regression as estimator which will fit data and produce model.
In [562]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
                              ExerciseAngina_indexer, ST_Slope_indexer, encoder,
                              sqlTrans1,  assembler, lr])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show()
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[40.0,1.0,0.0,0.0...|
|    1|[49.0,0.0,0.0,1.0...|
|    0|(15,[0,1,4,8,11,1...|
|    1|(15,[0,2,6,9,10,1...|
|    0|[54.0,1.0,0.0,1.0...|
|    0|[39.0,1.0,0.0,1.0...|
|    0|(15,[0,4,6,8,11,1...|
|    0|[54.0,1.0,0.0,0.0...|
|    1|[37.0,1.0,1.0,0.0...|
|    0|(15,[0,4,6,8,11,1...|
|    0|(15,[0,3,6,8,11,1...|
|    1|(15,[0,1,4,9,10,1...|
|    0|[39.0,1.0,0.0,0.0...|
|    1|[49.0,1.0,1.0,0.0...|
|    0|(15,[0,3,8,11,12,...|
|    0|[54.0,0.0,0.0,0.0...|
|    1|[38.0,1.0,1.0,0.0...|
|    0|(15,[0,4,6,8,11,1...|
|    1|[60.0,1.0,1.0,0.0...|
|    1|[36.0,1.0,0.0,0.0...|
+-----+--------------------+
only showing top 20 rows

Cross Validation¶

Using only splitting of data into train and test set, it may split randomly. So if any variables having less type of categories, more data points having those categories may endup exists for training set or testing set. As a result, our model may not learn or predict properly. To avoid such issue, we could consider the data split multiple ways and averaging over the results.
We are using 5-Fold Cross Validation

Evaluator¶

For our binary response variable, we are using Binary Classification Evaluator as evaluator which uses metric as 'areaUnderROC'
AreaUnderROC curve is a performance measurement for the classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the Higher the AUC, the better the model is at distinguishing between patients with the disease and no disease.

In [563]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.1, 0.5, 1.0, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10, 20, 50])
             .build())


# Evaluate model
lrevaluator = BinaryClassificationEvaluator()

# Create 5-fold CrossValidator
crossval = CrossValidator(estimator = pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = lrevaluator,
                          numFolds = 5)
In [564]:
cvmodel = crossval.fit(train)
In [565]:
# check which model is best
list(zip(cvmodel.avgMetrics, paramGrid))
Out[565]:
[(0.921698236655667,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.921876533412266,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9253820866178969,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9231798740796417,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9228313885886124,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9217520666549949,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9199346225920013,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9206359851919063,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9229306982354585,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9221178894247135,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9215982780450014,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9200155520121336,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9207239794636595,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9235616160296888,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9227068378545418,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9216927707575282,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9205062484893496,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9206764960828429,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9217284815436905,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9224004483985779,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9216990345602316,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9213131743075629,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.920880294362944,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9238808776379728,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9241391446709429,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.921698236655667,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9218508175035522,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9244825533299477,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9243307067255726,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9249160927743874,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9220537329751081,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.919317068002891,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.92019071555804,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9256304374225555,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9241867701942522,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9207711003828134,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9177818330553108,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9187030441093909,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9219464268113003,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9203431651041946,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.919006094267317,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9149693355269335,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.915611839718522,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9155929586882892,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9175641589905652,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9157544718287107,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.909549044425528,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9052989932359772,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9035818069180428,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9055935834347858,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.921698236655667,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9214301633822402,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9235642257233458,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9237412199002757,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9248698194438002,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.9084701515090448,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9032914986959106,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.901107403245561,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.8995250455244649,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9008647731590198,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.8400116447064009,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.8373774271437777,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.8391005740423633,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.8241198861175963,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.8241198861175963,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.921698236655667,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9212270623180709,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9230339581265148,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.923257114480347,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9241171933977016,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.8400116447064009,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.8400116447064009,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.8400116447064009,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.8374548265245827,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.8400116447064009,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.921698236655667,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.9217135653603661,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.9222601536649523,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9219214918776323,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9228879198999813,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.25,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.75,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 1}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 5}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.5,
  {Param(parent='LogisticRegression_09daa73064c4', name='regParam', doc='regularization parameter (>= 0).'): 2.0,
   Param(parent='LogisticRegression_09daa73064c4', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 1.0,
   Param(parent='LogisticRegression_09daa73064c4', name='maxIter', doc='max number of iterations (>= 0).'): 50})]
In [566]:
# use the best model

cvmodel.transform(test).show(5)
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
|Age|  Sex_encoded|ChestPainType_encoded|     log_RestingBP|  log_Cholesterol|FastingBS|RestingECG_encoded|        log_MaxHR|ExerciseAngina_encoded|Oldpeak|ST_Slope_encoded|label|            features|       rawPrediction|         probability|prediction|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
| 29|(1,[0],[1.0])|        (3,[2],[1.0])| 4.795790545596741|5.497168225293202|        0|     (2,[0],[1.0])|5.075173815233827|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|[29.0,1.0,0.0,0.0...|[2.08694552674690...|[0.88962786326173...|       0.0|
| 30|    (1,[],[])|            (3,[],[])|  5.14166355650266|5.472270673671475|        0|     (2,[1],[1.0])|5.135798437050262|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|(15,[0,7,8,11,12,...|[2.36858881296343...|[0.91440046835339...|       0.0|
| 35|    (1,[],[])|        (3,[0],[1.0])|4.9344739331306915|5.214935757608986|        0|     (2,[0],[1.0])|5.204006687076795|         (1,[0],[1.0])|    1.4|   (2,[1],[1.0])|    0|[35.0,0.0,1.0,0.0...|[1.26141899229498...|[0.77927028268269...|       0.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|              0.0|        1|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.2|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[-2.5756088964456...|[0.07072478302241...|       1.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|5.293304824724492|        0|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.6|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[-1.4607612263304...|[0.18835092524094...|       1.0|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
only showing top 5 rows

In [567]:
lrROC = lrevaluator.evaluate(cvmodel.transform(test))
print(lrROC)
0.9363052568697732

Random Forest Classifier¶

RandomForest is Tree based algorithm. It average across many fitted trees. It create multiple trees from bootstrap samples. It don't use all predictors for each bootstrap sample fit. RF randomly select subset of predictors and provides output with strong predictor subset having more accuracy.

In [568]:
### Random Forest
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
                              ExerciseAngina_indexer, ST_Slope_indexer, encoder,
                              sqlTrans1,  assembler, rf])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show(2)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[40.0,1.0,0.0,0.0...|
|    1|[49.0,0.0,0.0,1.0...|
+-----+--------------------+
only showing top 2 rows

In [569]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [int(x) for x in np.linspace(start = 10, stop = 50, num = 3)]) \
    .addGrid(rf.maxDepth, [int(x) for x in np.linspace(start = 5, stop = 25, num = 3)]) \
    .build()
# Evaluate model
rfevaluator = BinaryClassificationEvaluator()

# Create 5-fold CrossValidator
crossval = CrossValidator(estimator = pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = rfevaluator,
                          numFolds = 5)

cvmodel = crossval.fit(train)
In [570]:
# check which model is best
list(zip(cvmodel.avgMetrics, paramGrid))
Out[570]:
[(0.9111788690211999,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 10,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5}),
 (0.8972880540427625,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 10,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15}),
 (0.8969978063647439,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 10,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 25}),
 (0.921390158382225,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 30,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5}),
 (0.9194762757480442,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 30,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15}),
 (0.9194762757480442,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 30,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 25}),
 (0.9226541832127487,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 50,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5}),
 (0.9246970601500079,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 50,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15}),
 (0.9248342566413711,
  {Param(parent='RandomForestClassifier_1f369599bd84', name='numTrees', doc='Number of trees to train (>= 1).'): 50,
   Param(parent='RandomForestClassifier_1f369599bd84', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 25})]
In [571]:
# use the best model

cvmodel.transform(test).show(5)
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
|Age|  Sex_encoded|ChestPainType_encoded|     log_RestingBP|  log_Cholesterol|FastingBS|RestingECG_encoded|        log_MaxHR|ExerciseAngina_encoded|Oldpeak|ST_Slope_encoded|label|            features|       rawPrediction|         probability|prediction|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
| 29|(1,[0],[1.0])|        (3,[2],[1.0])| 4.795790545596741|5.497168225293202|        0|     (2,[0],[1.0])|5.075173815233827|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|[29.0,1.0,0.0,0.0...|[49.8957446062709...|[0.99791489212541...|       0.0|
| 30|    (1,[],[])|            (3,[],[])|  5.14166355650266|5.472270673671475|        0|     (2,[1],[1.0])|5.135798437050262|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|(15,[0,7,8,11,12,...|[43.9101496768557...|[0.87820299353711...|       0.0|
| 35|    (1,[],[])|        (3,[0],[1.0])|4.9344739331306915|5.214935757608986|        0|     (2,[0],[1.0])|5.204006687076795|         (1,[0],[1.0])|    1.4|   (2,[1],[1.0])|    0|[35.0,0.0,1.0,0.0...|[31.9546509845149...|[0.63909301969029...|       0.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|              0.0|        1|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.2|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[3.58490108022818...|[0.07169802160456...|       1.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|5.293304824724492|        0|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.6|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[3.07678419711129...|[0.06153568394222...|       1.0|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
only showing top 5 rows

In [572]:
rfROC = lrevaluator.evaluate(cvmodel.transform(test))
print(rfROC)
0.9318623058542413

Gradient-Boosted Tree Classifier¶

Gradient Boosting is a complex and slow learner algorithm of Tree Classifier. It is a grouping of Gradient descent and Boosting. In gradient boosting, each new model minimizes the loss function from its predecessor using the Gradient Descent Method. This procedure continues until a more optimal estimate of the target variable has been achieved

In [549]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(maxIter=10)
In [550]:
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
                              ExerciseAngina_indexer, ST_Slope_indexer, encoder,
                              sqlTrans1,  assembler, gbt])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show(2)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[40.0,1.0,0.0,0.0...|
|    1|[49.0,0.0,0.0,1.0...|
+-----+--------------------+
only showing top 2 rows

In [551]:
# Create 5-fold CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())

crossval = CrossValidator(estimator = pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = evaluator,
                          numFolds = 5)

cvmodel = crossval.fit(train)
In [552]:
# check which model is best
list(zip(cvmodel.avgMetrics, paramGrid))
Out[552]:
[(0.9096798679414457,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9156012613588078,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9085977987092215,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9135681150201598,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9114822513993694,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9171626983143366,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.9047018979423522,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9098797841912523,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.8851504202139524,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.886531155652834,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.8704223486989228,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.8803257386042476,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20})]
In [553]:
# use the best model

cvmodel.transform(test).show(5)
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
|Age|  Sex_encoded|ChestPainType_encoded|     log_RestingBP|  log_Cholesterol|FastingBS|RestingECG_encoded|        log_MaxHR|ExerciseAngina_encoded|Oldpeak|ST_Slope_encoded|label|            features|       rawPrediction|         probability|prediction|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
| 29|(1,[0],[1.0])|        (3,[2],[1.0])| 4.795790545596741|5.497168225293202|        0|     (2,[0],[1.0])|5.075173815233827|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|[29.0,1.0,0.0,0.0...|[1.29550740802887...|[0.93028106501865...|       0.0|
| 30|    (1,[],[])|            (3,[],[])|  5.14166355650266|5.472270673671475|        0|     (2,[1],[1.0])|5.135798437050262|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|(15,[0,7,8,11,12,...|[0.84109146598563...|[0.84319337096572...|       0.0|
| 35|    (1,[],[])|        (3,[0],[1.0])|4.9344739331306915|5.214935757608986|        0|     (2,[0],[1.0])|5.204006687076795|         (1,[0],[1.0])|    1.4|   (2,[1],[1.0])|    0|[35.0,0.0,1.0,0.0...|[0.96705524005898...|[0.87370369024086...|       0.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|              0.0|        1|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.2|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[-1.1716780737366...|[0.08759531203936...|       1.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|5.293304824724492|        0|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.6|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[-1.2415565571794...|[0.07705052522522...|       1.0|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
only showing top 5 rows

In [554]:
gbtROC = evaluator.evaluate(cvmodel.transform(test))
print(gbtROC)
0.9171146953405022

Decision Tree Classifier¶

It classify the group based on Tree classifier algorithm. It splits the data based on question yes/no till it split data appropriately into each classes. This algorithm is easy to understand and interpret output. Predictors don't need to be scaled but small change may lead to large affect on prediction.

In [555]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label',  maxDepth = 3)
In [556]:
pipeline = Pipeline(stages = [sex_indexer, chestPain_indexer, RestingECG_indexer,
                              ExerciseAngina_indexer, ST_Slope_indexer, encoder,
                              sqlTrans1,  assembler, dt])
model = pipeline.fit(sql_heart).transform(sql_heart)
model.select("label", "features").show(2)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[40.0,1.0,0.0,0.0...|
|    1|[49.0,0.0,0.0,1.0...|
+-----+--------------------+
only showing top 2 rows

In [557]:
dtevaluator = BinaryClassificationEvaluator()

# Create ParamGrid for Cross Validation
dtparamGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [2, 5, 10])
             .addGrid(dt.maxBins, [10, 20, 40, 80, 100])
             .build())


# Create 5-fold CrossValidator
crossval = CrossValidator(estimator = pipeline,
                          estimatorParamMaps = dtparamGrid,
                          evaluator = dtevaluator,
                          numFolds = 5)

cvmodel = crossval.fit(train)
In [558]:
# check which model is best
list(zip(cvmodel.avgMetrics, paramGrid))
Out[558]:
[(0.7792443486530496,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.7792443486530496,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.7792443486530496,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.7792443486530496,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.7792443486530496,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.8434679871865438,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.8204128078958767,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.8073469584329684,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 4,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.7986465938728258,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.7945293721569198,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20}),
 (0.8307560098307648,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.8255715819650933,
  {Param(parent='GBTClassifier_5c51dd42dc44', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 6,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 60,
   Param(parent='GBTClassifier_5c51dd42dc44', name='maxIter', doc='max number of iterations (>= 0).'): 20})]
In [576]:
# use the best model

dtpred = cvmodel.transform(test)
dtpred.show(5)
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
|Age|  Sex_encoded|ChestPainType_encoded|     log_RestingBP|  log_Cholesterol|FastingBS|RestingECG_encoded|        log_MaxHR|ExerciseAngina_encoded|Oldpeak|ST_Slope_encoded|label|            features|       rawPrediction|         probability|prediction|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
| 29|(1,[0],[1.0])|        (3,[2],[1.0])| 4.795790545596741|5.497168225293202|        0|     (2,[0],[1.0])|5.075173815233827|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|[29.0,1.0,0.0,0.0...|[49.8957446062709...|[0.99791489212541...|       0.0|
| 30|    (1,[],[])|            (3,[],[])|  5.14166355650266|5.472270673671475|        0|     (2,[1],[1.0])|5.135798437050262|         (1,[0],[1.0])|    0.0|   (2,[1],[1.0])|    0|(15,[0,7,8,11,12,...|[43.9101496768557...|[0.87820299353711...|       0.0|
| 35|    (1,[],[])|        (3,[0],[1.0])|4.9344739331306915|5.214935757608986|        0|     (2,[0],[1.0])|5.204006687076795|         (1,[0],[1.0])|    1.4|   (2,[1],[1.0])|    0|[35.0,0.0,1.0,0.0...|[31.9546509845149...|[0.63909301969029...|       0.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|              0.0|        1|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.2|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[3.58490108022818...|[0.07169802160456...|       1.0|
| 35|(1,[0],[1.0])|        (3,[0],[1.0])| 4.795790545596741|5.293304824724492|        0|     (2,[0],[1.0])|4.867534450455582|             (1,[],[])|    1.6|   (2,[0],[1.0])|    1|[35.0,1.0,1.0,0.0...|[3.07678419711129...|[0.06153568394222...|       1.0|
+---+-------------+---------------------+------------------+-----------------+---------+------------------+-----------------+----------------------+-------+----------------+-----+--------------------+--------------------+--------------------+----------+
only showing top 5 rows

In [560]:
dtROC = dtevaluator.evaluate(cvmodel.transform(test))
print(dtROC)
0.8903076463560335
In [598]:
sns.set_style("whitegrid")
plt.figure(figsize=(16,5))
plt.yticks(np.arange(0,100,10))
plt.ylabel("AreaUnderROC")

plt.xlabel("Algorithms")
sns.lineplot(data=range(1,100), x=["Logistic",'RandomForest','GradientBosting','DecisionTree'], 
             y=[0.9363052568697732,0.9318623058542413,0.9171146953405022,0.8903076463560335])
plt.show()

Based on above graph, we observe the AreaUnderROC is covered mostly with Logistic Regression i.e. 93.63%

Algorithm AreaUnderROC
Logistic Regression 93.63%
Random Forest 93.18%
Gradient Bosting 91.71%
Decision Tree 89.03%

Conclusion¶

Whether the patient is having heart disease or not?¶

In this report, we proposed 4 classification algorithms in which comparative analysis is done and promising results are achieved based on available data. The conclusion we found is that based on our binary classification evaluator , associated metric AreaUnderROC and parameters provided for each algorithm, logistic regression model performed at best to classify if patient is having heart disease or not. Logistic Regression took comparatively more time to learn the model and predict the results but able to get more promising prediction. Random Forest is also having about to same prediction result to classify heart disease carrier vs non-carriers. So we should be able to achieve good results by tuning hyper parameters for Random Forest Model.

Key Findings
It is also observed that the statistical analysis is a necessary task in the combination of latest technologies like MLlib Pipeline. When a dataset is analyzed, it can handle better way. We observed Oldpeak and MaxHR are two strongest predictors for this dataset. Then the outlier’s detection is also important to understand how data may behave/skewed. Data preprocessing is necessary to provide proper data for modeling to achieve better prediction results.

Future Scope
This prediction is not sufficient to predict real world case. To produce an even more accurate heart disease prediction model, it would be helpful to obtain a larger dataset as well as a more recent dataset. Also proper actions to be taken to handle outliers.