Data Science and ML interview questions

Question 1) How to choose the value of the regularisation parameter (λ)?

Selecting the regularisation parameter is a tricky business. If the value of λ is too high, it will lead to extremely small values of the regression coefficient, which will lead to the model underfitting (high bias — low variance). On the other hand, if the value of λ is 0 (very small), the model will tend to overfit the training data (low bias — high variance).

There is no proper way to select the value of λ. What you can do is have sub-samples of data and run the algorithm multiple times on different sets with values of λ. Here, the person has to decide how much variance can be tolerated. Once the user is satisfied with the variance, that value of λ can be chosen for the full dataset

One thing to be noted is that the value of λ selected here was optimal for that subset, not for the entire training data.

Question 2) Can we use linear regression for time series analysis?

One can use linear regression for time series analysis, but the results are not promising. So, it is generally not advisable to do so. The reasons behind this are —

1. Time series data is mostly used for the prediction of the future, but linear regression seldom gives good results for future prediction as it is not meant for extrapolation.

2. Mostly, time series data have a pattern, such as during peak hours, festive seasons, etc., which would most likely be treated as outliers in the linear regression analysis.

Question 3) What value is the sum of the residuals of a linear regression close to? Justify.

The sum of the residuals of a linear regression is 0. Linear regression works on the assumption that the errors (residuals) are normally distributed with a mean of 0, i.e.

Question 4) How does multicollinearity affect the linear regression?

Multicollinearity occurs when some of the independent variables are highly correlated (positively or negatively) with each other. Multicollinearity causes a problem as it is against the basic assumption of linear regression. The presence of multicollinearity does not affect the predictive capability of the model. So, if you just want predictions, the presence of multicollinearity does not affect your output. However, if you want to draw some insights from the model and apply them in, let’s say, some business model, it may cause problems.

One of the major problems caused by multicollinearity is that it leads to incorrect interpretations and provides wrong insights. The coefficients of linear regression suggest the mean change in the target value if a feature is changed by one unit. So, if multicollinearity exists, this does not hold true as changing one feature will lead to changes in the correlated variable and consequent changes in the target variable. This leads to wrong insights and can produce hazardous results for a business.

A highly effective way of dealing with multicollinearity is the use of VIF (Variance Inflation Factor). Higher the value of VIF for a feature, more linearly correlated is that feature. Simply remove the feature with very high VIF value and re-train the model on the remaining dataset.

Question 5) What is the normal form (equation) of linear regression? When should it be preferred to the gradient descent method?

The normal equation for linear regression is —

Question 6) You run your regression on different subsets of your data, and in each subset, the beta value for a certain variable varies wildly. What could be the issue here?

This case implies that the dataset is heterogeneous. So, to overcome this problem, the dataset should be clustered into different subsets, and then separate models should be built for each cluster. Another way to deal with this problem is to use non-parametric models, such as decision trees, which can deal with heterogeneous data quite efficiently.

Question 7) Your linear regression doesn’t run and communicates that there is an infinite number of best estimates for the regression coefficients. What could be wrong?

This condition arises when there is a perfect correlation (positive or negative) between some variables. In this case, there is no unique value for the coefficients, and hence, the given condition arises.

- Part 3

Question 1) How do you interpret the residual vs fitted value curve?

The residual vs fitted value plot is used to see whether the predicted values and residuals have a correlation or not. If the residuals are distributed normally, with a mean around the fitted value and a constant variance, our model is working fine; otherwise, there is some issue with the model.

The most common problem that can be found when training the model over a large range of a dataset is heteroscedasticity (this is explained in the answer below). The presence of heteroscedasticity can be easily seen by plotting the residual vs fitted value curve.

Question 2) What is heteroscedasticity? What are the consequences, and how can you overcome it?

A random variable is said to be heteroscedastic when different sub-populations have different variabilities (standard deviation).

The existence of heteroscedasticity gives rise to certain problems in the regression analysis as the assumption says that error terms are uncorrelated and, hence, the variance is constant. The presence of heteroscedasticity can often be seen in the form of a cone-like scatter plot for residual vs fitted values.

One of the basic assumptions of linear regression is that heteroscedasticity is not present in the data. Due to violation of the assumptions, the Ordinary Least Squares (OLS) estimators are not the Best Linear Unbiased Estimators (BLUE). Hence, they do not give the least variance than other Linear Unbiased Estimators (LUEs).

There is no fixed procedure to overcome heteroscedasticity. However, there are some ways that may lead to the reduction of heteroscedasticity. They are —

1. Logarithmising the data: A series that is increasing exponentially often results in increased variability. This can be overcome using the log transformation.

2. Using weighted linear regression: Here, the OLS method is applied to the weighted values of X and Y. One way is to attach weights directly related to the magnitude of the dependent variable.

Question 3) What is VIF? How do you calculate it?

Variance Inflation Factor (VIF) is used to check the presence of multicollinearity in a dataset. It is calculated as —

Question 4) How do you know that linear regression is suitable for any given data?

To see if linear regression is suitable for any given data, a scatter plot can be used. If the relationship looks linear, we can go for a linear model. But if it is not the case, we have to apply some transformations to make the relationship linear. Plotting the scatter plots is easy in case of simple or univariate linear regression. But in case of multivariate linear regression, two dimensional pair-wise scatter plots, rotating plots, and dynamic graphs can be plotted.

Question 5) How is hypothesis testing used in linear regression?

Hypothesis testing can be carried out in linear regression for the following purposes:

1. To check whether a predictor is significant for the prediction of the target variable. Two common methods for this are — ○ By the use of p-values: If the p-value of a variable is greater than a certain limit (usually 0.05), the variable is insignificant in the prediction of the target variable.

By checking the values of the regression coefficient: If the value of regression coefficient corresponding to a predictor is zero, that variable is insignificant in the prediction of the target variable and has no linear relationship with it.

2. To check whether the calculated regression coefficients are good estimators of the actual coefficients.

Question 6) Explain gradient descent with respect to linear regression.

Gradient descent is an optimisation algorithm. In linear regression, it is used to optimise the cost function and find the values of the βs (estimators) corresponding to the optimised value of the cost function.

Gradient descent works like a ball rolling down a graph (ignoring the inertia). The ball moves along the direction of the greatest gradient and comes to rest at the flat surface (minima).

Question 2) What is robust regression?

A regression model should be robust in nature. This means that with changes in a few observations, the model should not change drastically. Also, it should not be much affected by the outliers.

A regression model with OLS (Ordinary Least Squares) is quite sensitive to the outliers. To overcome this problem, we can use the WLS (Weighted Least Squares) method to determine the estimators of the regression coefficients. Here, less weights are given to the outliers or high leverage points in the fitting, making these points less impactful.

Question 3) Which graphs are suggested to be observed before model fitting?

Before fitting the model, one must be well aware of the data, such as what the trends, distribution, skewness, etc. in the variables are. Graphs such as histograms, box plots, and dot plots can be used to observe the distribution of the variables. Apart from this, one must also analyse what the relationship between dependent and independent variables is. This can be done by scatter plots (in case of univariate problems), rotating plots, dynamic plots, etc.

Question 5) What is the generalized linear model?

The generalized linear model is the derivative of the ordinary linear regression model. GLM is more flexible in terms of residuals and can be used where linear regression does not seem appropriate. GLM allows the distribution of residuals to be other than a normal distribution. It generalizes the linear regression by allowing the linear model to link to the target variable using the linking function. Model estimation is done using the method of maximum likelihood estimation.

Question 6) Explain the bias-variance trade-off.

Bias refers to the difference between the values predicted by the model and the real values. It is an error. One of the goals of an ML algorithm is to have low bias.

Variance refers to the sensitivity of the model to small fluctuations in the training dataset. Another goal of an ML algorithm is to have low variance.

For a dataset that is not exactly linear, it is not possible to have both bias and variance low at the same time. A straight line model will have low variance but

high bias, whereas a high-degree polynomial will have low bias but high variance.

There is no escaping the relationship between bias and variance in machine learning.

1. Decreasing the bias increases the variance.

2. Decreasing the variance increases the bias.

So, there is a trade-off between the two; the ML specialist has to decide, based on the assigned problem, how much bias and variance can be tolerated. Based on this, the final model is built.

Question 7) How can learning curves help create a better model?

Learning curves give the indication of the presence of overfitting or underfitting.

In a learning curve, the training error and cross-validating error are plotted against the number of training data points. A typical learning curve looks like this:

If the training error and true error (cross-validating error) converge to the same value and the corresponding value of error is high, it indicates that the model is underfitting and is suffering from high bias.

If there is a significant gap between the converging values of the training and cross-validating errors, i.e. the cross-validating error is significantly higher than the training error, it suggests that the model is overfitting the training data and is suffering from high variance.

Question 8) Your model is not working well. Is getting more data always a solution?

No, getting more data is not always a solution. If the model is suffering from high bias, getting more data will not help after a point of time. Also, the more the training data, the more is the storage required. This leads to more

computational power being needed for training the model. So, before trying to get more data, the cost associated with it must also be considered.

Question 9) What are parametric and non-parametric machine learning algorithms?

Assumptions can greatly simplify the learning process but can also limit what can be learned. Algorithms that simplify a function to a known form are called parametric machine learning algorithms. Algorithms that do not make strong assumptions about the form of a mapping function are called non-parametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.

Examples of parametric machine learning algorithms are —

● Logistic regression

● Linear discriminant analysis

● Naive Bayes

● Simple neural networks

Examples of non-parametric machine learning algorithms are —

● K-Nearest Neighbors

● Decision trees, such as CART and C4.5

● Support vector machines

Practice Questions

Answer: a

Answer : d

Answer: a

Logistic Regression

- Part 1

Question 7) How to interpret the results of a logistic regression model? Or, what are the meanings of alpha and beta in a logistic regression model?

Alpha is the baseline in a logistic regression model. It is the log odds for an instance when all the attributes (X1, X2,………….Xk) are zero. In practical scenarios, the probability of all the attributes being zero is very low. In another interpretation, Alpha is the log odds for an instance when none of the attributes is taken into consideration.

Beta is the value by which the log odds change by a unit change in a particular attribute by keeping all other attributes fixed or unchanged (control variables).

Question 8) What is odds ratio?

Odds ratio is the ratio of odds between two groups. For example, let’s assume that we are trying to ascertain the effectiveness of a medicine. We administered this medicine to the ‘intervention’ group and a placebo to the ‘control’ group.

Odds ratio (OR) = odds of the intervention group/odds of the control group

Interpretation

If odds ratio = 1, then there is no difference between the intervention group and the control group

If odds ratio is greater than 1, then the intervention group is better than the control group

If odds ratio is less than 1, then the control group is better than the intervention group.

Question 9) What is the formula for calculating odds ratio?

In the formula above, X1 and X0 stand for two different groups for which odds ratio needs to be calculated. X1i stands for the instance ‘i’ in group X1. X0i stands for the instance ‘i’ in group X0. stands for the coefficient of the logistic regression model. Note that the baseline is not included in this formula.

Question 10) Why can’t linear regression be used in place of logistic regression for binary classification?

The reasons why linear regressions cannot be used in case of binary classification are as follows:

Distribution of error terms: The distribution of data in case of linear and logistic regression is different. Linear regression assumes that error terms are normally distributed. In case of binary classification, this assumption does not hold true.

Model output: In linear regression, the output is continuous. In case of binary classification, an output of a continuous value does not make sense. For binary classification problems, linear regression may predict values that can go beyond 0 and 1. If we want the output in the form of probabilities, which can be mapped to two different classes, then its range should be restricted to 0 and 1. As the logistic regression model can output probabilities with logistic/sigmoid function, it is preferred over linear regression.

Variance of Residual errors: Linear regression assumes that the variance of random errors is constant. This assumption is also violated in case of logistic regression.

- Part 2

Question 1) Is the decision boundary linear or nonlinear in the case of a logistic regression model?

The decision boundary is a line that separates the target variables into different classes. The decision boundary can either be linear or nonlinear. In case of a logistic regression model, the decision boundary is a straight line.

Logistic regression model formula = α+1X1+2X2+….+kXk. This clearly represents a straight line. Logistic regression is only suitable in such cases where a straight line is able to separate the different classes. If a straight line is not able to do it, then nonlinear algorithms should be used to achieve better results.

Question 2) What is the likelihood function?

Ans The likelihood function is the joint probability of observing the data. For example, let’s assume that a coin is tossed 100 times and we want to know the probability of getting 60 heads from the tosses. This example follows the binomial distribution formula.

p = Probability of heads from a single coin toss

n = 100 (the number of coin tosses)

x = 60 (the number of heads — success)

n-x = 30 (the number of tails)

Pr(X=60 |n = 100, p)

The likelihood function is the probability that the number of heads received is 60 in a trail of 100 coin tosses, where the probability of heads received in each coin toss is p. Here the coin tosses follow the binomial distribution.

This can be reframed as follows:

Pr(X=60|n=100,p) = c x p60x(1-p)100–60 = L()

c = constant

p = unknown parameter

The likelihood function gives the probability of observing the results using unknown parameters.

Question 3) What is the Maximum Likelihood Estimator (MLE)?

The MLE chooses those sets of unknown parameters (estimator) that maximise the likelihood function. The method to find the MLE is to use calculus and setting the derivative of the logistic function with respect to an unknown parameter to zero, and solving it will give the MLE. For a binomial

model, this will be easy, but for a logistic model, the calculations are complex. Computer programs are used for deriving MLE for logistic models.

(Here’s another approach to answering the question.)

MLE is a statistical approach to estimating the parameters of a mathematical model. MLE and ordinary square estimation give the same results for linear regression if the dependent variable is assumed to be normally distributed. MLE does not assume anything about independent variables.

Question 4) What are the different methods of MLE and when is each method preferred?

In case of logistics regression, there are two approaches of MLE. They are conditional and unconditional methods. Conditional and unconditional methods are algorithms that use different likelihood functions. The unconditional formula employs joint probability of positives (for example, churn) and negatives (for example, non-churn). The conditional formula is the ratio of the probability of observed data to the probability of all possible configurations.

The unconditional method is preferred if the number of parameters is lower compared to the number of instances. If the number of parameters is high compared to the number of instances, then conditional MLE is to be preferred. Statisticians suggest that conditional MLE is to be used when in doubt. Conditional MLE will always provide unbiased results.

Question 5) What are the advantages and disadvantages of conditional and unconditional methods of MLE?

Conditional methods do not estimate unwanted parameters. Unconditional methods estimate the values of unwanted parameters also. Unconditional formulas can directly be developed with joint probabilities. This cannot be done with conditional probability. If the number of parameters is high relative to the number of instances, then the unconditional method will give biased results. Conditional results will be unbiased in such cases.

Question 6) What is the output of a standard MLE program?

The output of a standard MLE program is as follows:

Maximised likelihood value: This is the numerical value obtained by replacing the unknown parameter values in the likelihood function with the MLE parameter estimator.

Estimated variance-covariance matrix: The diagonal of this matrix consists of estimated variances of the ML estimates. The off-diagonal consists of the covariances of the pairs of the ML estimates.

Question 7) Why can’t we use Mean Square Error (MSE) as a cost function for logistic regression?

In logistic regression, we use the sigmoid function and perform a non-linear transformation to obtain the probabilities. Squaring this non-linear transformation will lead to non-convexity with local minimums. Finding the global minimum in such cases using gradient descent is not possible. Due to this reason, MSE is not suitable for logistic regression. Cross-entropy or log loss is used as a cost function for logistic regression. In the cost function for logistic regression, the confident wrong predictions are penalised heavily. The confident right predictions are rewarded less. By optimising this cost function, convergence is achieved.

- Part 3

Question 1) Why is accuracy not a good measure for classification problems?

Accuracy is not a good measure for classification problems because it gives equal importance to both false positives and false negatives. However, this may not be the case in most business problems. For example, in case of cancer prediction, declaring a cancer as benign is more serious than wrongly informing the patient that he is suffering from cancer. Accuracy gives equal importance to both cases and cannot differentiate between them.

Question 2) What is the importance of a baseline in a classification problem?

Most classification problems deal with imbalanced datasets. Examples include telecom churn, employee attrition, cancer prediction, fraud detection, online advertisement targeting, and so on. In all these problems, the number of the positive classes will be very low when compared to the negative classes. In some cases, it is common to have positive classes that are less than 1% of the total sample. In such cases, an accuracy of 99% may sound very good but, in reality, it may not be. Here, the negatives are 99%, and hence, the baseline will remain the same. If the algorithms predict all the instances as negative, then also the accuracy will be 99%. In this case, all the positives will be predicted wrongly, which is very important for any business. Even though all

the positives are predicted wrongly, an accuracy of 99% is achieved. So, the baseline is very important, and the algorithm needs to be evaluated relative to the baseline.

Question 3) What are false positives and false negatives?

False positives are those cases in which the negatives are wrongly predicted as positives. For example, predicting that a customer will churn when, in fact, he is not churning.

False negatives are those cases in which the positives are wrongly predicted as negatives. For example, predicting that a customer will not churn when, in fact, he churns.

Question 4) What are the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR)?

TPR refers to the ratio of positives correctly predicted from all the true labels. In simple words, it is the frequency of correctly predicted true labels.

TPR = TP/TP+FN

TNR refers to the ratio of negatives correctly predicted from all the false labels. It is the frequency of correctly predicted false labels.

TNR = TN/TN+FP

FPR refers to the ratio of positives incorrectly predicted from all the true labels. It is the frequency of incorrectly predicted false labels.

FPR = FP/TN+FP

FNR refers to the ratio of negatives incorrectly predicted from all the false labels. It is the frequency of incorrectly predicted true labels.

FNR = FN/TP+FN

Question 5) What are precision and recall?

Precision is the proportion of true positives out of predicted positives. To put it in another way, it is the accuracy of the prediction. It is also known as the ‘positive predictive value’.

Precision = TP/TP+FP

Recall is same as the true positive rate (TPR).

Question 6) What is F-measure?

It is the harmonic mean of precision and recall. In some cases, there will be a trade-off between the precision and the recall. In such cases, the F-measure will drop. It will be high when both the precision and the recall are high. Depending on the business case at hand and the goal of data analytics, an appropriate metric should be selected.

F-measure = 2* (Precision * Recall / Precision+Recall)

Question 7) What is accuracy?

It is the number of correct predictions out of all predictions made.

Accuracy = TP+TN / The total number of Predictions

Question 8) What are sensitivity and specificity?

Specificity is the same as true negative rate, or it is equal to 1 — false positive rate.

Specificity = TN / TN + FP.

Sensitivity is the true positive rate.

Sensitivity = TP / TP + FN

Question 9) How to choose a cutoff point in case of a logistic regression model?

The cutoff point depends on the business objective. Depending on the goals of your business, the cutoff point needs to be selected. For example, let’s consider loan defaults. If the business objective is to reduce the loss, then the specificity needs to be high. If the aim is to increase the profits, then it is an entirely different matter. It may not be the case that profits will increase by avoiding giving loans to all predicted default cases. But it may be the case that the business has to disburse loans to default cases that are slightly less risky to increase the profits. In such a case, a different cutoff point, which maximises profit, will be required. In most of the instances, businesses will operate around many constraints. The cutoff point that satisfies the business objective will not be the same with and without limitations. The cutoff point needs to be selected considering all these points. As a thumb rule, choose a cutoff value that is equivalent to the proportion of positives in a dataset.

Question 10) How does logistic regression handle categorical variables?

The inputs to a logistic regression model need to be numeric. The algorithm cannot handle categorical variables directly. So, they need to be converted into a format that is suitable for the algorithm to process. The various levels of a categorical variable will be assigned a unique numeric value known as the dummy variable. These dummy variables are handled by the logistic regression model as any other numeric value.

- Part 4

Question 1) What are lift curves?

The lift is the improvement in model performance (increase in true positive rate) when compared to random performance. Random performance means if 50% of the instances is targeted, then it is expected that it will detect 50% of the positives. Lift is in comparison to the random performance of a model. If a model’s performance is better than its random performance, then its lift will be greater than 1.

In a lift curve, lift is plotted on the Y-axis and the percentage of the population (sorted in descending order) on the X-axis. At a given percentage of the target population, a model with a high lift is preferred.

Question 2) Which algorithm is better at handling outliers logistic regression or SVM?

Logistic regression will find a linear boundary if it exists to accommodate the outliers. Logistic regression will shift the linear boundary in order to accommodate the outliers. SVM is insensitive to individual samples. There will not be a major shift in the linear boundary to accommodate an outlier. SVM comes with inbuilt complexity controls, which take care of overfitting. This is not true in case of logistic regression.

Question 3) How will you deal with the multiclass classification problem using logistic regression?

The most famous method of dealing with multiclass classification using logistic regression is using the one-vs-all approach. Under this approach, a number of models are trained, which is equal to the number of classes. The models work in a specific way. For example, the first model classifies the datapoint depending on whether it belongs to class 1 or some other class; the second model classifies the datapoint into class 2 or some other class. This way, each data point can be checked over all the classes.

Question 4) Explain the use of ROC curves and the AUC of an ROC Curve.

An ROC (Receiver Operating Characteristic) curve illustrates the performance of a binary classification model. It is basically a TPR versus FPR (true positive rate versus false positive rate) curve for all the threshold values ranging from 0 to 1. In an ROC curve, each point in the ROC space will be associated with a different confusion matrix. A diagonal line from the bottom-left to the top-right on the ROC graph represents random guessing. The Area Under the Curve (AUC) signifies how good the classifier model is. If the value for AUC is high (near 1), then the model is working satisfactorily, whereas if the value is low (around 0.5), then the model is not working properly and just guessing randomly.

Question 5) How can you use the concept of ROC in a multiclass classification?

The concept of ROC curves can easily be used for multiclass classification by using the one-vs-all approach. For example, let’s say that we have three classes ‘a’, ’b’, and ‘c’. Then, the first class comprises class ‘a’ (true class) and the second class comprises both class ‘b’ and class ‘c’ together (false class). Thus, the ROC curve is plotted. Similarly, for all the three classes, we will plot three ROC curves and perform our analysis of AUC.

Question 6) What is a cumulative response curve (CRV)?

In order to convey the results of an analysis to management, a ‘cumulative response curve’ is used, which is more intuitive than the ROC curve. An ROC curve is very difficult to understand for someone outside the field of data science. A CRV consists of the true positive rate or the percentage of positives correctly classified on the Y-axis and the percentage of the population targeted on the X-axis. It is important to note that the percentage of the population will be ranked by the model in descending order (either the probabilities or the expected values). If the model is good, then by targeting a top portion of the ranked list, all high percentages of positives will be captured. As with the ROC curve, there will be a diagonal line which represents random performance. Let’s understand this random performance with an example. Assuming that 50% of the list is targeted, it is expected that it will capture 50% of the positives. This expectation is captured by the diagonal line, which is similar to the ROC curve.

Clustering:

- Part 1

Question 1) What is clustering and why is it used?

Clustering is an unsupervised technique used to group data objects. The groups are created in such a way that the points in a group are more similar to each other and different from the points in another group. The purpose of clustering can be divided into two broad categories, understanding and utility.

Clustering is done to understand the natural clusters or patterns that exist in data. For example, a business that uses clustering to segment its customer base needs to understand the characteristics of different segments.

Clustering is also done for utility. In this case, a prototype is created to act as a representation of the data objects in a particular cluster. This prototype is used for the purpose of analysis. For example, summarisation. If the dataset is very large, then it becomes very difficult to apply machine learning algorithms to it. In such cases, clustering can be used to create prototypes. Machine learning algorithms are used on the reduced dataset, which only consists of cluster prototypes. Other examples of using prototypes for utility include compressing images, finding nearest neighbours efficiently, etc.

Question 2) What are the different types of clustering?

Hierarchical versus partitional: Hierarchical clustering forms nested clusters. In this type, the subclusters exist in other clusters. In the partitional type, subclusters do not exist, and all the data objects belong to non-overlapping clusters.

Exclusive, overlapping, versus fuzzy: In the exclusive type cluster, a data object will exclusively belong to a single cluster. In case of an overlapping cluster, a data object will belong to multiple clusters. For example, an UpGrad employee who is also a student of one of UpGrad’s programs will belong to both the ‘employee’ and ‘student’ categories. In a fuzzy cluster, a data object will belong to all the clusters with a probability. The probability of a point belonging to different clusters will be different, and it will sum up to 1.

Complete versus partial: Complete is a type of cluster in which all the data objects are assigned to a cluster. In the partial type, data objects are left out without getting assigned to any cluster. For example, leaving out outliers without assigning them to any cluster.

Question 3) What are the different types of clusters?

Well-separated: These are clusters in which data objects exclusively belong to a single cluster.

Prototype: The data objects in this type of cluster are more similar to the prototype of the cluster it belongs to than to the prototype of a different cluster.

Graph: In this type, the data is represented in the form of a graph. The nodes will be the data objects, and the edges will represent the connections between the data objects.

Density: Denser regions are clustered separately from the less dense regions of data objects.

Conceptual: The data points in a cluster of this type share the same property. In this type of clustering, the algorithm searches for a specific concept to group the data objects together.

Question 4) What is K-means clustering? How does its algorithm work?

It is a prototype-based partition algorithm. The K-means algorithm works by finding out a centroid, which acts as a prototype for its cluster. The centroid for the K-means algorithm will be the mean of all the data points in a

particular cluster. As the centroid will be the mean, it may not correspond to any data point in the cluster.

The algorithm for K-means algorithm is as follows:

Select initial centroids. The input regarding the number of centroids should be given by the user.

● Assign the data points to the closest centroid

Recalculate the centroid for each cluster and assign the data objects again

Follow the same procedure until convergence. Convergence is achieved when there is no more assignment of data objects from one cluster to another, or when there is no change in the centroid of clusters.

Question 5) What are the different proximity functions or distance metrics used for the K-means algorithm?

Euclidean, Manhattan, Cosine, and Bregman divergence are some distance metrics used for the K-means algorithm. Euclidean is the squared distance from a data point to the centroid. Manhattan is the absolute distance from a data point to the centroid. Cosine is the cosine distance from a data point to the cluster centroid. Bregman divergence is a class of distance metrics that

includes Euclidean, Mahalanobis, and Cosine. Basically, Bregman divergence includes all those distance metrics for which the mean is a centroid.

- Part 2

Question 1) What are the issues with random initialisation of centroids in K-means algorithm and how to overcome it?

Initiation of the centroids in a cluster is one of the most important steps of the K-means algorithm. Many times, random selection of initial centroid does not lead to an optimal solution. In order to overcome this problem, the algorithm is run multiple times with different random initialisations. The sum of squared errors (SSE) are calculated for different initial centroids. The set of centroids with the minimum SSE is selected. Even though this is a very simple method, it is not foolproof. The results of multiple random cluster initialisations will depend on the dataset and the number of clusters selected, however, that still will not give an optimum output every time.

The other method involves first selecting the centroid of all the data points. Then, for each successive initial centroid, select the point which is the farthest from the already selected centroid. This procedure will ensure that the selection is random, and the centroids are far apart. The disadvantage of

this method is that calculating the farthest point will be expensive. In order to avoid this problem, initialisation is carried out on a subset of the dataset.

The other methods of handling this are bisecting K-means (this is covered in another question below) and taking care of the issues once clustering is done post processing.

Question 2) How are outliers handled by the K-means algorithm?

Handling of outliers differs from case to case. In some cases, it will provide very useful information, and in some cases, it will severely affect the results of the analysis. Having said that, let’s learn about some of the issues that arise due to outliers in the K-means algorithm below.

The centroids will not be a true representation of a cluster in the presence of outliers. The sum of squared errors (SSE) will also be very high in case of outliers. Small clusters will bond with outliers, which may not be the true representation of the natural patterns of clusters in data. Due to these reasons, outliers need to be removed before proceeding with clustering on the data.

Question 3) What is the objective function for measuring the quality of clustering in case of the K-means algorithm with Euclidean distance?

Sum of squared errors (SSE) is used as the objective function for K-means clustering with Euclidean distance. The Euclidean distance is calculated from each data point to its nearest centroid. These distances are squared and summed to obtain the SSE. The aim of the algorithm is to minimise the SSE. Note that SSE considers all the clusters formed using the K-means algorithm.

Question 4) What is Bisecting K-means?

In Bisecting K-means, all the data objects are split into two clusters. One of these clusters is selected and divided into two groups. This will continue until a pre-decided K number of clusters is formed. The selection of the cluster for splitting will depend on a number of factors such as the size of the cluster, the largeness of the SSE, etc. The choice of the splitting criteria will decide the type of clusters formed. The centroids of Bisecting K-means is used by the K-means algorithm as initial centroids for further refining the clusters. K-means algorithm will be able to find the local minimum by using the SSE as the objective function. In case of Bisecting K-means, the local optimum is related to the clusters being split and not the complete dataset. So, the final results obtained will not be ‘local’ to the dataset with respect to the total SSE.

Bisecting K-means does not have issues with initialisation because when different splits are tried out, a split with a low SSE is preferred. Additionally, Bisecting K-means only deals with two centroids at a time. Using Bisecting K-means as a base for K-means improves the performance of the latter.

Question 5) Is K-means clustering suitable for all shapes and sizes of clusters?

K-means is not suitable for all shapes, sizes, and densities of clusters. If the natural clusters of a dataset are vastly different from a spherical shape, then K-means will face great difficulties in detecting it. K-means will also fail if the sizes and densities of the clusters are different by a large margin. This is mostly due to using SSE as the objective function, which is more suited for spherical shapes. SSE is not suited for clusters with non-spherical shapes, varied cluster sizes, and densities.

- Part 3

Question 1) What are the advantages and disadvantages of K-means clustering?

K-means is a simple and efficient algorithm though it requires multiple runs. Bisecting K-means, a flavour of K-means, is more efficient and handles

initialisation well. The disadvantages of K-means is that they can handle all shapes and sizes of clusters, but it is not good at handling outliers. This algorithm is only applicable to the data for which a centroid exists.

Question 2) What is hierarchical clustering?

There are two types of hierarchical clustering. They are agglomerative clustering and divisive clustering.

Agglomerative clustering: In this algorithm, initially every data object will be treated as a cluster. In each step, the nearest clusters will fuse together and form a bigger cluster. Ultimately, all the clusters will merge together. Finally, a single cluster, which encompasses all the data points, will remain.

Divisive clustering: This is the opposite of the agglomerative clustering. In this type, all the data objects will be considered as single clusters. In each step, the algorithm will split the cluster. This will repeat until only single data points remain, which will be considered as singleton clusters.

Question 3) What are the proximity measures between clusters and when is it to be used?

Single link, complete link, group average, and ward’s are some of the proximity measures for hierarchical clustering.

Single link: Cluster proximity in case of single linkage is the distance of two nearest points in two different clusters. This method is good at handling non-elliptical shapes. However, it is susceptible to noise and outliers.

Complete link: It is the distance between the two farthest points of two different clusters. It favours globular shapes and is not susceptible to noise and outliers. This method of proximity measurement breaks large clusters.

Group average: This cluster proximity is the average of all pairwise distances (pairwise distance is the distance between two points in two different clusters; all combinations of pairs are considered for calculating the average). This is an intermediate approach between the single and complete linkages.

Ward’s method: The proximity between two clusters is the increase in SSE that is the result of merging both the clusters.

Question 4) What are the disadvantages of agglomerative hierarchical clustering?

Objective function: SSE is the objective function for K-means. Likewise, there exists no global objective function for hierarchical clustering. It considers proximity locally before merging two clusters.

Time and space complexity: The time and space complexity of agglomerative clustering is more than K-means clustering, and in some cases, it is prohibitive.

Final merging decisions: The merging decisions, once given by the algorithm, cannot be undone at a later point in time. Due to this, a local optimisation criteria cannot become global criteria. Note that there are some advanced approaches available to overcome this problem.

Question 5) What are the strengths and weaknesses of the agglomerative hierarchical algorithm?

This algorithm is helpful where a hierarchy is required for resolving a business problem. Research suggests that this algorithm gives better results. But it has some disadvantages such as high storage and time costs. For this particular algorithm, as merges once created are final, it is not suitable for high-dimensional and noisy data.

Question 6) Is validation required for clustering? If yes, then why is it required?

Clustering algorithms have a tendency to cluster even when the data is random. It is essential to validate if a non-random structure is present in the data. It is also required to validate whether the number of clusters formed is appropriate or not. Evaluation of clusters is done with or without external reference to check the fitness of the data. Evaluation is also done to compare clusters and decide the better among them.

Question 7) What are the different types of validation or evaluation measures?

Two different types of cluster validation are unsupervised and supervised methods.

Unsupervised: In the unsupervised method, there is no reference to the external information. Due to this, unsupervised methods are also known as internal indices. Unsupervised methods are further divided into two methods, cluster cohesion and cluster separation. Cluster cohesion is the measure of tightness between the elements of a cluster. Cluster separation is the measure of how well separated one cluster is from the other clusters.

Supervised: In supervised methods, the cluster is evaluated with external information provided. Due to this, they are known as external indices. Entropy is one such measure. It compares how the cluster labels fare with externally supplied label information.

Relative measures are those which make use of either supervised or unsupervised methods to compare different clusters. For example, two K-mean clusters compared with the help of SSE.

Question 8) How are prototype clusters evaluated using cohesion and separation?

In prototype clustering, cohesion is defined as the sum of proximities between the data objects and the centroid. Separation is of two kinds. It is either the proximity between the two cluster centroids or the proximity between the cluster centroid and the overall centroid of all the data objects.

Cohesion = Sum of proximities between data objects and the centroid

Separation = proximity between the two cluster centroids (or)

Separation = proximity between the cluster centroid and the overall centroid.

An overall validity of cluster validation is calculated with a weighted sum of cohesion and separation measures. Different kinds of weights can be used for calculating the overall measure. Typically, these weights are a measure of cluster sizes.

Question 9) How to decide the number of clusters in K-means clustering?

A suitable K needs to be decided by trial and error. This can be decided by plotting SSE for various values of number of clusters. An optimal number of K is where there is an elbow or dip in the graph. Silhouette coefficient (explained in the question below) can also be used instead of SSE to find the optimal number of clusters.

Question 10) What is silhouette coefficient in clustering?

Silhouette coefficient is a cluster evaluation method that combines cohesion and separation measures.

Advanced Regression

Q2. When do we need to create new features to build a linear regression model?

When there are cases where the target variable cannot be expressed as a linear combination of predictor variables. In such cases, we create some function of the explanatory variables to best explain the data points. These functions

capture the non-linearity in the data The derived features could be combinations of two or more attributes and/or transformations of individual attributes. These combinations and transformations could be linear or non-linear.

Q3. Given a dataset with 80 features and 1 million data points, will you create a linear model considering all 80 features?

Since creating a linear model with 80 features is computationally intensive and infeasible, and at the same time, it will make the model extremely complex and difficult to comprehend. Moreover, as the model complexity grows, it will become less generalisable and difficult to apply.

Q4. What can be a possible solution to the above problem?

One of the methods to solve it is to create a model that takes into account only the most important of the features thus resulting in a reasonably simple and robust model.

Q5. How do we select the appropriate features to build a model upon?

Features’ subset selection can be performed using two different methods:

1. Best subset selection

2. Stepwise selection

3. Lasso regression

In the Best Subset Selection algorithm, we start with 0 features, i.e. a null model M0 with no features. Now, as we increase the number of features, we consider every model that has all combinations of a certain number of features. and select a model which results in the least RSS (or largest R2). This gives us a model Md with d features. We continue this iteration by increasing the value of d by one till you reach d is equal to the number of features in the dataset and find the models M0, M1, M2,….., Mp. Out of all these models M0, M1, M2,….., Mp, select the best one, as measured by measures such as Cp, AIC, BIC, Adjusted R2 or mean cross-validated error.

The other alternative is Lasso regression — it reduces the coefficients of many of the insignificant variables to zero and thus, in effect, performs feature selection.

Q6. What is the advantage of using measures such as Cp, AIC, BIC, Adjusted R2 or mean cross-validated error?

They penalise the model for being too complex (i.e. for overfitting), and thus are more representative of the unseen ‘test error’.

Q7. Are there any drawbacks of using best subset selection? How can they be corrected?

We can see that the total number of models that need to be analysed for Best Subset Selection is 2^p where p is the total number of predictors. If we have 20 predictors, the total number of models is 220 = 1048576 which is over a million. Hence, it becomes computationally infeasible to perform best subset selection for the number of predictors greater than 40. We can solve this by using stepwise selection.

Q8. What do you mean by stepwise selection? Explain in detail.

Contrary to subset selection, where we create all possible models and then select the one that is the best, in the stepwise selection, you start with the null model and build upon it gradually. This can be accomplished in two ways, forward and backward stepwise selection. In forward stepwise selection, we start with the null model that has been built upon zero features and then build

all possible models with one feature. Choose the nest one of them. Now we build upon this model, that is to try all other features and select one model with 2 features. Now you build (p-2) models on this model and choose the best till you have created a model with p features. In backward stepwise selection, you start with a model with all features and gradually move to a model with zero features. Both processes will give you p+1 models, on which we then measure Cp, AIC, BIC, Adjusted R2 or mean cross-validated error and choose the best one possible.

- II

Q1. How will you deal with the problem of overfitting in linear regression?

Apart from feature selection, another way to deal with overfitting is regularisation. There are two ways in which regularisation can be applied in regression, Lasso and Ridge regression. The two algorithms put a penalty on the complexity of the model produced in two different methods. While lasso puts the penalty as the sum of the absolute values if the coefficients, ridge considers the sum of the squares of the coefficients.

Q2. Can we use regularized regression for the purpose of selecting the features for model building?

Yes. Lasso regression can be used for the purpose of feature selection because if you run lasso on a dataset, it just so happens that the coefficients it returns are zero for variables that are colinear or generally unimportant for the purpose of model selection. Hence while building a model, these variables can be neglected and the model is built on all other variables.

Q3. What is the hyperparameter applicable in regularized regression?

The hyperparameter applicable in generalized regression is denoted by lambda. It is the wight given to the sum of the absolute values of the coefficients in case of lasso regression and the sum of the squares of coefficients in case of ridge regression in the overall cost function. A higher value of lambda implies that the model is highly regularized while a low value means that the model is essentially unregularized.

Q4. As we know that Lasso regression can be used for feature selection while ridge cannot be, why should we even consider using ridge?

Though lasso regression can be used for feature selection while ridge regression cannot be, it comes at a huge computational cost. Since it does not convert into a nice invertible function, it is to be solved using an iterative process which has significantly more computational requirements compared to ridge regression which demands a simple tweak to the simple linear regression solution and can be converted to an invertible matrix and can thus be solved using matrix operations and thus has significantly lower computational costs associated with it.

Decision Tree:

- Part 1

Question 1) What are the merits and demerits of the decision tree model over the linear regression model?

Merits:

Decision tree is a non-parametric model (no assumptions about the data), while the regression model is parametric

The decision tree model can learn non-linear decision boundaries also, which is not possible in regression models such as linear and logistic regressions

● It can be used for both regression and classification problems

● It is easy to visualise (in case of deep trees, visualisation is difficult)

Demerits:

● Getting business insights from the decision tree model is very difficult

Decision trees are weak classifiers and have to be used with bagging, random forests, and other algorithms to perform efficiently

Decision trees cannot explain the marginal effects of a variable on the target, i.e. how the target will change if a variable is changed by 1 unit

● Hyperparameter tuning is required in decision trees

Question 2) There are 50 predictors in a dataset. You have built two models on this dataset: bagged decision tree and random forest. Let the number of predictors used on a single split in a bagged decision tree be X and random forest be Y. What can you say about X and Y?

X>=Y. Random forest uses a random subset of the predictors, while the bagged decision trees always use all the predictors on a single split. Also, there may be cases when the random forest uses all the features. In such a case, X and Y will be equal, so overall, X>=Y.

Question 3) Explain the random forests algorithm.

Random forest is an ensemble learning method, which tries to build a multitude of decision tree (CART) models to predict the target variable. It builds many decision trees with random sample and decision variables. If used for regression purposes, it outputs the mean of all the decision trees. In case of a classification problem, the mode of the outputs of the decision trees is the final output.

Note: For details, you can refer to the main content and revise the concepts.

Question 4) Do random forests overfit?

Random forests do not generally overfit. Random forests, being an ensemble learning, try to reduce the variance. But at certain times, due to some useless features present in the data and a large number of trees, it might overfit by a small margin.

Question 5) What is ‘random’ in the random forest algorithm?

Random forest tries to build many decision trees with different sample and decision variables. The selection of sample and decision variables in a decision tree is made randomly. Hence, it’s called random forest.

Question 6) What is meant by bagging?

Bagging refers to Bootstrap Aggregation, which is the technique used to reduce the variance of decision tree models. In the Bagging algorithm, several subsets of the data are randomly chosen with replacement and are fed into separate decision tree models, creating an ensemble of decision tree models. The output is the average of the predictions made by all the decision trees, making it more robust and reducing the variance.

Question 7) What do you mean by boosting?

The idea behind boosting is to create a ‘powerful’ committee of many weak classifiers. Here, samples are collected from the training data without replacement and are fed into the tree models. The point to note here is that

this sampling is not parallel, it is done sequentially; and in each step, more weightage is given to the wrongly classified labels. The final output is the weighted majority vote of all the models trained in this ensemble learning technique. The main aim of the boosting algorithm is to increase the predictive accuracy of the weak classifiers.

Note: To get in-depth information about boosting models, refer to this optional content.

- Part 2

Question 1) What is the difference between the random forest and Bagging algorithms?

The random forest algorithm is an extension of the Bagging algorithm. In the Bagging algorithm, the data is sampled randomly, and many trees are trained on it using all the features, whereas in the random forest algorithm, the features are also chosen randomly, i.e. not all the features are used in all the trees.

Question 2) Describe the decision trees algorithm.

As the name suggests, decision trees are tree-based models that are used for decision-making purposes. Each leaf of a decision tree represents a class label, and every internal node represents an attribute. The algorithm pseudocode is —

1. Place the best feature of the given data at the root of the decision tree

2. Split the tree into branches (subsets) so that the data in each branch (subset) has the same value of the root feature

3. Repeat steps 1 and 2 for each subset until you reach the leaf node with the predicted class value

The values of the features are preferred to be categorical in nature. If the values are numerical, then they are discretized.

The main task is to know how to select the feature at the root of the tree. For this, the two most common methods are —

1. Information gain

2. Gini index

Question 3) Describe information gain for feature selection.

Information gain is the amount by which the entropy has decreased after the split of a particular feature. The more the entropy decreases, the higher is the information gain and the more important the feature. The entropy of a dataset is calculated using the following method:

Question 6) What is pruning? How is it done?

Pruning is the process of trimming a tree. The part of the tree that is not contributing to the decision-making of the tree is pruned. The benefits of pruning include less complexity and prevention of overfitting. One of the most common pruning techniques is reduced error pruning. In this method, a subtree is replaced by a single leaf constituting the majority of the output at that node. If the accuracy of the prediction is not reduced significantly on the cross-validation data after pruning, then the pruning action is retained; otherwise, it is rejected.

For more details, go through this PDF and this Wikipedia page.

Question 7) How to perform a regression analysis using decision trees?

When a single regression line is unable to give a good fit for the entire data, a decision tree regressor is used. Here, the tree is built as usual in the classification problem, but the difference lies in the leaves. In classification trees, the leaves contain labels, but in regression trees, the leaves contain a regression model. So, basically, decision trees help in creating a segmentation in the dataset, and then on each segment, a separate linear regression model is trained, and continuous output is given.

Question 8) Explain the gradient boosting algorithm.

The gradient boosting algorithm is an ensemble learning algorithm based on weak learners such as decision trees. Weak learners are grown in the boosting fashion to improve the performance of the model. Here also, like any other supervised learning algorithm, a loss function is defined, and the algorithm tries to minimise it.

Practice Questions

The following questions are non-graded and are solely for practice.

Decision Trees

Which of the following statements is true about gradient boosting models?

1. At each stage, a new tree is grown to overcome the error of the previous tree

2. At each stage, a new tree is grown, which is independent of the previous tree

3. Gradient descent is used for optimising the cost function

Only 3

2 and 3

1 and 3

Only 1

Answer : 1and 3

Decision Trees

Which of the following is more time-consuming

Gini index

Information gain

Both take an equal amount of time

Answer : 2

Decision Trees

A decision tree-based model is grown in order to achieve the minimum entropy, which may lead the model to

Overfit

Underfit

Depends upon the data

Answer : 1

Decision Trees

A decision tree is built in which fashion

Top-down fashion

Bottom-up fashion

Both

Answee: 1

Support Vector Machines

Practice Questions

Maximum Margin Classifier

Maximum Margin Classifier

Consider the above graph plotted on a database. Triangle and Pentagon stand for two different classes in which a data point can lie. Answer the Questions that follow.

Q: Maximum margin classifier

Which of the following points are used to create a decision boundary?

Circled in Red

Circled in Green

All except circled in Red

All

Answer:

Circled in Red

Feedback :

Since the points in red would be nearest to any hyperplane dividing the plane into two classes. Any change to these two points will lead to a change in the hyperplane.

Q: Maximum margin classifier

How many hyperplanes can be used to classify the data points into the two classes?

One

Infinite

Eight

Five

Answer :

Infinite

Feedback :

A line that is used to classify one class from another is called a hyperplane. And in this case, there can be infinite hyperplanes.

Soft Margin Classifier

Soft Margin Classifier

Consider the above graph plotted on a database. Green and Orange stand for two different classes in which a data point can lie. Answer the Questions that follow.

Q: Soft margin classifier

In the case of Maximum Margin Classifier, which of the following hyperplanes would you choose to classify the data into the two classes?

Brown classifier

Pink Classifier

You can choose either. It would make no difference.

None of the two is desirable.

Answer : 1

Q: Soft margin classifier

Which of the two classifiers will you choose if you are using the Soft Margin Classifier?

Brown classifier

Pink Classifier

You can choose either. It would make no difference.

None of the two is desirable.

Answer: 2

SVM

Subjective Questions

Introduction

Q1. How is Soft Margin Classifier different from Maximum Margin Classifier?

A: Where Maximum Margin Classifier does not allow for any misclassification of data points and might lead to overfitting in certain circumstances, Soft Margin Classifiers allows for certain points to be misclassified to arrive at the best classifier and not succumb to the perils of overfitting.

Q2. What does the slack variable Epsilon (ε) represent?

A:Epsilon stands for the permissible error in the SVM. If the value is equal to zero, it means that the point is correctly classified and it lies outside the margin. If the value lies between zero and one it means that though the point is correctly classified, it lies inside the margin. However if the epsilon value is greater than one, the data point is wrongly classified.

Note that each data point has a slack value associated with it. The value ranging from zero to infinity. And the purpose of the analysis is to determine a classifier while keeping in check the sum of the slack variables.

Q3. How do you measure the cost function in SVM? What does the value of C signify?

A:The cost function C is the sum of the slack variables for all the data points.

When C is large, the slack variables can be large, i.e. it allows a larger number of data points to be misclassified or to violate the margin. So you get a hyperplane where the margin is wide and misclassifications are allowed. In this case, the model is flexible, more generalisable, and less likely to overfit. In other words, it has a high bias.

When C is small, you force the individual slack variables to be small, i.e. you do not allow many data points to fall on the wrong side of the margin or the hyperplane. So the margin is narrow and there are few misclassifications. In this case, the model is less flexible, less generalisable, and more likely to overfit. In other words, it has a high variance.

Kernels

Q1. Given the above dataset where red and blue points represent the two classes, how will you use SVM to classify the data?

A:Since the dataset cannot be classified into two classes (red and blue) using a linear hyperplane, a maximum or a soft margin classifier is not possible in this case directly. Firstly, the data will have to be transformed from the 2-D attribute space to a linear data in 3-D feature space. Once that Data is linear, we can apply the Maximum Margin Classification or the Soft Margin Classification as applicable.

Q2. What do you mean by feature transformation?

A:The process of transforming the original attributes into a new feature space is called feature transformation. As the number of attributes increases, there is an exponential increase in the number of dimensions in the transformed feature space. Suppose you have four variables in your data set, then considering only a polynomial transformation with degree 2, you end up making 15 features in the new feature space.

Q3. What is the role of the kernel in the transformation of data from non-linear attribute space to linear feature space?

A:To resolve the complexity of feature transformation, a kernel is used. It basically transforms the attribute space to feature space which is then fed into the SVM algorithm. To find the best fit model, the learning algorithm uses the inner product that is the dot product of the observations and never the individual observations in isolation. Kernels use this knowledge to bypass the explicit transformation and rather perform it implicitly. Major kernel types are Linear, Polynomial and the RBF kernel.

Q4. What is the difference between a Linear and a Polynomial kernel?

A:Linear kernel is the same as a Simple Margin Classifier that creates a linear hyperplane without any feature transformation. Whereas a polynomial kernel creates a non-linear classifier using a transformation.

Q5. What is the only necessary precondition for applying either the Linear or a polynomial kernel?

A:The data classes should be separable using a single curve which could be a straight line or a complex polynomial

Q6. When do you use an RBF kernel?

A:We use a Radial basis function to transform highly non-linear feature space to linear attribute space. Such as the one in the above example (refer to

question 1of this section). It is used when the classifier need not be a line (straight or curved) but in cases a closed curve(ellipse).

Q7. How do you control the non-linearity of an RBF kernel?

A:Gamma is used to control the non-linearity of the RBF kernel. A high value of Gamma means high non-linearity leading to overfitting. On the contrast, if the Gamma value is small, it implies low non-linearity which might lead to gross underfitting.

Boosting:

- Part 1

Question 1:

What’s the difference between a random forest algorithm and a gradient boosted tree algorithm?

Answer 1:

Both of these algorithms involve creating multiple decision trees. However, a random forest algorithm is based on the concept called bagging whereas a gradient boosted tree is based on boosting.

In a random forest, multiple samples of the data are created. A decision tree is fit on each of these samples parallelly. Finally, the result is calculated by voting or averaging the results from each of those fits. A random forest generally reduces bias.

On the other hand, a gradient boosted tree model involves creating sequential tree models one after the other. Each tree, known as a weak learner, improves the predictions of the previous learner. We can mention a stopping criterion

such as a stop after creating n number of trees. A gradient boosted tree generally reduces the bias and the variance.

Question 2:

What’s the difference between boosted trees and XGBoost?

Answer 2:

XGBoost is an extension to boosting algorithm. The underlying principles are the same. There are a few more features that make XGBoost one of the most famous algorithms on machine learning competitive platforms such as Kaggle. These features are listed as follows:

1. Regularisation: Regularisation reduces overfitting by tweaking the loss function in a particular way. Normal boosted trees don’t have regularization and hence are prone to overfitting. XGBoost performs better on the test data.

2. Handles missing values: XGBoost handles missing values better than how a boosting algorithm does.

3. Handles sparse data: XGBoost handles one-hot-encoded and sparse matrices better than boosting algorithm.

4. Runs on multiple cores: XGBoost can make use of parallel computing by using multiple cores on the system. It doesn’t create multiple trees in parallel because boosting is a sequential algorithm. Rather, it creates multiple branches in a tree parallelly.

Questions 3:

Explain ensemble modelling and the concept of a weak learner?

Answer 3:

Creating an ensemble model involves creating multiple machine learning models and combining them to get better results than anyone algorithm.

A good analogy to think of ensemble modelling is to think of a project with multiple people working on it. When there’s an important decision to make about the project, instead of taking opinion from a single person, all the

members of the project are asked about their opinion. Each person can be thought of as an independent machine learning model in this scenario.

Now, it’s not that blindly combining multiple models will give you better results. Consider this hypothetical situation. Suppose you’ve created ten classifiers for a binary prediction task. Now, suppose the predictions of all of these models are exactly the same. Combining the predictions of these ten models will not improve the results as all the models have the exact same predictions. In ensemble modelling like Gradient Boosting, we use a weak learner as the basic model. A weak learner is one which is better than a model with random predictions, that is, it is indeed making sensible predictions. The hypothesis of ensemble modelling is that by combining the predictions from multiple weak learners, we can create a strong learner. However, there’s a condition. There should be minimum correlated between all the weak learners. In case, the correlation between all the weak learners is strong, the ensemble model is less likely to perform better than any individual weak learner.

- Part 2

Question 1:

Explain the different hyperparameters of a boosted tree model.

Answer 1:

There are three main hyperparameters in a boosted tree algorithm. These are:

1. Learning rate: Learning rate decides how fast or slow you want your algorithm to fit or learn the training data. Using a high learning rate (more than 0.5) would quickly overfit the training set without requiring a large number of trees. Using a small learning rate (Around 0.01 to 0.03) would result in slow learning and you’d have to use a large number of trees. This parameter is also called shrinkage.

2. Number of trees: The number of trees that you want to fit on the training set. Using a high value would result in an overfitted model. Whereas using a low value would result in an underfitted model. A good way to choose this number of by cross-validation and keeping an eye on the training and testing error.

3. Subsample: This is an optional parameter which allows you to enable bagging in a boosted tree. If you want to fit a tree on a sample of the dataset, you can use this parameter and specify a fraction. For example, setting subsample as 0.7 would mean that a tree is created on a random sample with 70% rows from the training set.

Question 2:

Which model is generally used as a weak learner in an ensemble model and why?

Answer 2:

Generally, a decision tree is used as a weak learner. The purpose of using a weak learner is to have multiple uncorrelated models. The lower the correlation among models, the better ensemble you’ll get. Decision trees are unstable models, that is, the variance in decision trees is very high. If you fit two decision tree models in a training set, they’ll give you two very different results. That’s what we exactly want in a weak learner — we want them to be uncorrelated. Therefore, a decision tree is a good choice and the most popular choice to build a boosted ensemble model.

Question 3:

What’s the difference between GBM and AdaBoost?

Answer 3:

AdaBoost stands for adaptive boosting. In adaptive boosting, weak classifiers, generally, decision trees, are created sequentially. After fitting a tree, the performance is evaluated and weights assigned to each observation is adjusted (adapted) accordingly. The misclassified points are assigned large weights and the correctly classified points are assigned smaller weights. As a result, the next classifier focuses on the misclassified observations of the previous classifier.

GBM stands for gradient boosted machine. When you fit a GBM on training data, sequential decision trees are created. After the first tree gets created, the algorithm calculates the error and the next classifier is fit on the error, called the residual, and tries to minimise the loss function using gradient descent. It does not explicitly assign weights to each of the data points but it happens intrinsically during the loss minimization process.

Lead Data Scientist

Lead Data Scientist