
Today we are going to talk about how to regularize linear regression, that is, how to make our model generalize better by reducing overfitting during training.

Linear regression is a simple yet powerful method because it gives us fast predictions once trained. Prediction time is one of the most important features to consider when developing a machine learning model: in the real world there are customers waiting for predictions, and the longer they wait, the worse their experience becomes.

When linear regression is underfitting, there is no other way (given you can’t add more data) than to increase the complexity of the model, making it…


Photo by Emily Morter on Unsplash

Today we are going to go through the breast_cancer dataset from Sklearn to understand the different types of performance metrics for classification problems and why one is sometimes preferred over another.

Even though it is a simple topic, the answer does not immediately come to mind when someone asks “what is precision/recall?”. By going over examples, I hope everyone, including myself, gets a firm grasp on this topic.

Most people use “Accuracy” as the performance metric in classification problems. Sometimes that is okay; however, when we have imbalanced labels we should avoid it.
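To see why accuracy can mislead on imbalanced labels, here is a minimal sketch. The labels are made up for illustration (they are not the breast_cancer data), and the "model" is a hypothetical one that always predicts the majority class.

```python
# Hypothetical, made-up labels: 95 negatives and 5 positives (imbalanced).
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The accuracy looks great even though the model never finds a single positive case, which is exactly the situation where another metric should be preferred.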

Metrics are what we use to compare different…


Photo by Maksym Kaharlytskyi on Unsplash

In multiple Data Scientist interviews there are a few questions that I have been asked frequently, and one of them is “explain how Logistic Regression works”. I understood the concept at a high level and knew how to implement it; however, I got stuck when I was asked “what is the sigmoid function?”, “what cost function does Logistic Regression use and why?”, etc. I could not answer them. …
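For reference, the two things named in those questions can be sketched in a few lines of plain Python. This is only a rough illustration and not the post's own code: the function names and the `eps` clipping value are my assumptions. The sigmoid squashes any real number into (0, 1), and log loss (binary cross-entropy) is the cost function usually paired with it.

```python
import math


def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy; eps (an assumed value) keeps log() finite."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


print(sigmoid(0))                       # 0.5
print(log_loss([1, 0], [0.9, 0.1]))     # small: confident and correct
print(log_loss([1, 0], [0.1, 0.9]))     # large: confident and wrong
```

The loss grows quickly when the model is confident and wrong, which is one reason it suits a model that outputs probabilities.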


Photo by Stephen H on Unsplash

Today I am going to talk about one of the topics that has confused me the most, the Central Limit Theorem (CLT).

I had always convinced myself that having lots of data will produce a normal distribution, which is nonsense if you take the time to think about it: why would collecting a large number of data points lead to a normal distribution (unless the data itself is normally distributed)?

Before learning about the CLT, it is important to clearly understand the difference between a data distribution and a sampling distribution.
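A quick simulation makes the difference concrete. This is a minimal sketch under my own assumptions: the data is drawn from an exponential distribution (chosen only because it is clearly not normal), and the sample size of 50 and the helper `skewness` are illustrative choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def skewness(values):
    """Population skewness: about 0 for a symmetric, normal-looking shape."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Data distribution: many raw draws from a skewed (exponential) distribution.
data = [random.expovariate(1.0) for _ in range(20000)]

# Sampling distribution of the mean: the means of many samples of size 50.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(50)])
    for _ in range(2000)
]

print("skewness of raw data:    ", round(skewness(data), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
```

Collecting more raw data does not remove the skew, while the distribution of the sample means is already much closer to symmetric.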

Data Distribution: A function or listing showing all possible values of data, how often each data…


Image by Dmitry Gladkikh from Unsplash

Hi everyone👏.

Today we will go over a bagging method called Random Forest to predict one of the air pollutants in China. The model is called a “forest” because it is built from multiple decision trees, and “random” because it selects a subset of rows for each tree and uses a subset of features (columns) at each node split.

I assume readers know what Bagging and decision trees are. If not, I recommend reading my previous blogs for a brief overview: Decision Tree part I, Decision Tree part II and Ensemble Learning methods.

Before we get started I’ve noticed a lot of…
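The two sources of randomness can be sketched with the standard library alone. This is not how any particular library implements it: the helper names, the toy sizes, and the choice to sample rows with replacement (as bagging usually does) are my assumptions.

```python
import random


def bootstrap_rows(n_rows, rng):
    """Row indices for one tree, sampled with replacement (assumed, as in bagging)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, max_features, rng):
    """Feature indices considered at one node split, sampled without replacement."""
    return rng.sample(range(n_features), max_features)


rng = random.Random(42)
n_rows, n_features = 10, 6  # toy sizes, for illustration only

for tree_id in range(3):
    rows = bootstrap_rows(n_rows, rng)
    feats = candidate_features(n_features, max_features=2, rng=rng)
    print(f"tree {tree_id}: rows={sorted(rows)} split features={sorted(feats)}")
```

Each tree sees a different mix of rows, and each split only gets to choose among a few columns, which is what keeps the trees from all looking the same.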



Hi everyone🙌.

Today we will go over a very powerful machine learning approach that has been dominating Kaggle competitions: Ensemble Learning.

There are multiple methods of Ensemble Learning, so today we will go over each of them at a high level so that non-technical people can understand them as well.

So, what is Ensemble Learning?

Ensemble — a group of things or people acting or taken together as a whole, especially a group of musicians who regularly play together.

The above definition is taken straight from the Cambridge Dictionary.

Just replace musicians with machine learning models and that is Ensemble learning. So…



Hi everyone🙌.

Previously in part I, we built a decision tree model to classify flower type.

We will go over the model created previously and try to understand the questions the model asked and why it asks them.

For those of you who didn’t read part I, I recommend reading it first since this blog is a continuation.

Here are some keywords that will be covered:

  • Information Gain
  • Impurity measures: Gini Impurity, Entropy, Classification error
  • Tree Pruning
  • Overfitting
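As a preview of the first two keywords, here is a minimal sketch of the three impurity measures and of information gain in plain Python. The function names and the tiny label lists are made up for illustration; they are not the flower data from part I.

```python
import math
from collections import Counter


def class_probs(labels):
    """Proportion of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    return 1.0 - sum(p * p for p in class_probs(labels))


def entropy(labels):
    return sum(-p * math.log2(p) for p in class_probs(labels))


def classification_error(labels):
    return 1.0 - max(class_probs(labels))


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["a", "a", "b", "b"]        # a perfectly mixed node
children = [["a", "a"], ["b", "b"]]  # a split that separates the classes
print(gini(parent), entropy(parent), classification_error(parent))  # 0.5 1.0 0.5
print(information_gain(parent, children))                           # 1.0
```

A split that leaves every child node pure removes all of the parent's impurity, so its information gain equals the parent's entropy.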

Let’s visualize our decision tree.

The tree module in sklearn has a plot_tree function which allows us to visualize a decision tree via matplotlib.


From data preprocessing to building a Decision Tree model with Python.


Hi everyone🙌.

Today I am going to talk about Decision Trees; the topic will be broken down into two parts.

Part I : Decision tree concepts, data preprocessing, model building.

Part II : Math behind the scenes.

Developed in 1984 by Leo Breiman and others, the decision tree is a machine learning algorithm that falls under supervised learning, where the goal is to separate data into classes (classification) or bins of numbers (regression) by asking a series of questions.

Think of it as an upside-down tree where the root (root node) is at the top and the leaves (leaf nodes) are…

Haneul Kim

Data Scientist passionate about helping the environment.
