Thursday, 8 December 2016

Guide to Trees

Decision Tree Everything

Everything You Need to Know About Decision Trees

A decision tree is a graphical representation of possible solutions to a decision based on certain conditions. It's called a decision tree because it starts with a single box (or root), which then branches off into a number of solutions, just like a tree. Decision trees are helpful, not only because they are graphics that help you 'see' what you are thinking, but also because making a decision tree requires a systematic, documented thought process. Often, the biggest limitation of our decision making is that we can only select from the known alternatives. Decision trees help formalize the brainstorming process so we can identify more potential solutions.

Broadly, there are two types of decision tree, distinguished by their splitting criterion:

  • Decision Tree based on the Gini Index
  • Decision Tree based on Information Gain

Gini Index

The Gini coefficient is best known as a measure of statistical dispersion used to represent income inequality. In decision trees, the closely related Gini impurity measures how often a randomly chosen sample from a node would be mislabeled if it were labeled at random according to the class distribution in that node; splits are chosen to minimise the weighted Gini impurity of the resulting children.
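Concretely, the Gini impurity of a set of labels is 1 minus the sum of squared class proportions. A minimal sketch in NumPy (the function name gini_impurity is my own):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 split is maximally impure for two classes
print(gini_impurity([1, 1, 1, 1]))  # 0.0
print(gini_impurity([0, 0, 1, 1]))  # 0.5
```

A tree built with criterion='gini' (the scikit-learn default used below) picks, at each node, the split that most reduces this quantity.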

Information Gain

In decision tree learning, information gain is the reduction in entropy achieved by splitting on an attribute. The related information gain ratio divides the gain by the intrinsic information of the split, which reduces the bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.
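Information gain is the entropy of the parent node minus the size-weighted entropy of the child nodes. A small sketch (the helper names entropy and information_gain are my own):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
# A perfect split separates the classes completely, so the gain
# equals the parent's entropy (1 bit for a balanced binary node).
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0
```

Passing criterion='entropy' to scikit-learn's DecisionTreeClassifier, as done later in this post, selects splits by this measure.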

Import Libraries. Import the necessary libraries such as pandas, numpy and scikit-learn.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
%config IPCompleter.greedy = True
%matplotlib inline

Data

So, the data. I downloaded it from the UCI repository (open source). It describes white wines, with physicochemical specifications such as acidity, residual sugar, chloride content, density, pH level and alcohol content, along with a surveyed quality score for each wine. What we are going to do is try to classify the quality from the given specifications.

In [4]:
quality= pd.read_csv("/home/kirtiman/winequality_white.csv")
data_wine=pd.read_csv("/home/kirtiman/winequality.csv")
quality_test=pd.read_csv("/home/kirtiman/quality_test.csv")
wine_test=pd.read_csv("/home/kirtiman/wine_test.csv")

Let's describe and visualize the data.

In [5]:
data_wine.describe()
Out[5]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
count 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000
mean 6.957212 0.275818 0.341097 6.364063 0.045631 35.574000 140.488625 0.994188 3.191968 0.489238 10.461225
std 0.844625 0.100184 0.122548 5.091631 0.021960 16.910471 43.264248 0.002993 0.152897 0.114696 1.215563
min 4.200000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987130 2.720000 0.220000 8.000000
25% 6.400000 0.210000 0.270000 1.700000 0.036000 24.000000 109.000000 0.991880 3.090000 0.410000 9.400000
50% 6.900000 0.260000 0.320000 5.200000 0.043000 34.000000 137.000000 0.993900 3.180000 0.470000 10.300000
75% 7.400000 0.320000 0.400000 9.800000 0.050000 46.000000 170.000000 0.996258 3.290000 0.550000 11.300000
max 14.200000 1.005000 1.660000 65.800000 0.346000 146.500000 366.500000 1.038980 3.820000 1.060000 14.200000
In [6]:
# Plot a histogram of every feature in one grid, rather than one cell per column
data_wine.hist(figsize=(12, 10))
plt.show()
(Histograms of fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol are displayed.)

Assume you have X (predictors) and Y (target) for the training set, and x_test (predictors) for the test set. Create the tree object:

In [7]:
model = tree.DecisionTreeClassifier(criterion='gini')
  • For classification, you can set the criterion to 'gini' or 'entropy' (information gain); the default is 'gini'.
  • Use model = tree.DecisionTreeRegressor() for regression.
  • Next, train the model on the training set and check its score.
In [8]:
model.fit(data_wine, quality)
model.score(data_wine, quality)
Out[8]:
1.0

Note that 1.0 is the accuracy on the training data itself; an unpruned tree can memorise the training set, so this number says nothing about how well the model generalises.

Predict Output

predicted = model.predict(x_test)

In [9]:
predicted= model.predict(wine_test)
In [10]:
acc = round(accuracy_score(predicted,quality_test['quality']),2)
accuracy = acc * 100
print (str(accuracy)+"%")
40.0%
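A perfect training score paired with 40% test accuracy is a classic sign of overfitting: the unpruned tree has memorised the training set. A quick sketch on synthetic stand-in data (the wine CSVs above are local files, so this uses random data) showing how capping max_depth narrows the train/test gap:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends (noisily) on the first feature only
rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] + 0.1 * rng.randn(500) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# The unpruned tree scores 1.0 on training data; limiting depth
# usually brings the train and test scores closer together.
print("deep:    train %.2f  test %.2f" % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
print("shallow: train %.2f  test %.2f" % (shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)))
```

Other pruning knobs such as min_samples_leaf and max_leaf_nodes work similarly.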
In [11]:
model2 = tree.DecisionTreeClassifier(criterion='entropy') 
In [12]:
model2.fit(data_wine, quality)
Out[12]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [13]:
predicted2= model2.predict(wine_test)
In [14]:
acc = round(accuracy_score(predicted2,quality_test['quality']),2)
accuracy = acc * 100
print (str(accuracy)+"%")
48.0%

Random Forest

In Random Forest, we grow multiple trees, as opposed to the single tree of the CART model. To classify a new object based on its attributes, each tree gives a classification and we say the tree "votes" for that class. The forest chooses the classification with the most votes over all the trees in the forest; in the case of regression, it takes the average of the outputs of the different trees.
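The voting and averaging described above can be sketched without scikit-learn (the function names are my own):

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Classification: majority vote across the per-tree class predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_average(tree_predictions):
    """Regression: mean of the per-tree numeric outputs."""
    return sum(tree_predictions) / len(tree_predictions)

print(forest_vote(["good", "bad", "good"]))  # good
print(forest_average([5.0, 6.0, 7.0]))       # 6.0
```

This is what RandomForestClassifier and RandomForestRegressor do internally across their n_estimators trees (the classifier actually averages per-class probabilities, which reduces to majority voting for hard predictions).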

In [15]:
from sklearn.ensemble import RandomForestClassifier 
  • Use RandomForestRegressor for regression problems.
  • Assume you have X (predictors) and Y (target) for the training set, and x_test (predictors) for the test set.
  • Create Random Forest object
In [16]:
model3= RandomForestClassifier(n_estimators=1000)
  • Train the model using the training sets
In [17]:
model3.fit(data_wine, quality.values.ravel())  # ravel() flattens y and avoids a DataConversionWarning
Out[17]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
  • Predict Output
In [18]:
predicted3= model3.predict(wine_test)
In [19]:
acc = round(accuracy_score(predicted3,quality_test['quality']),2)
accuracy = acc * 100
print (str(accuracy)+"%")
54.0%
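Beyond accuracy, a fitted RandomForestClassifier exposes feature_importances_, which can hint at which wine attributes drive the predictions. A sketch on synthetic data (the column names f0, f1, f2 are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column is informative
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are normalised to sum to 1; the informative column dominates
for name, imp in zip(["f0", "f1", "f2"], forest.feature_importances_):
    print("%s: %.3f" % (name, imp))
```

On the wine data, the same attribute would list which of acidity, sugar, alcohol, etc. the forest leans on most.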

Conclusion

So these are the ways we can use tree algorithms to classify. More on this in the next blog; stay tuned!
