Thursday, 8 December 2016

Guide to Trees

Decision Tree Everything

Everything You Need to Know About Decision Trees

A decision tree is a graphical representation of possible solutions to a decision based on certain conditions. It's called a decision tree because it starts with a single box (or root), which then branches off into a number of solutions, just like a tree. Decision trees are helpful, not only because they are graphics that help you 'see' what you are thinking, but also because making a decision tree requires a systematic, documented thought process. Often, the biggest limitation of our decision making is that we can only select from the known alternatives. Decision trees help formalize the brainstorming process so we can identify more potential solutions.

Broadly, there are two types of decision tree, distinguished by their splitting criterion:

  • Decision Tree based on the Gini Index
  • Decision Tree based on Information Gain

Gini Index

The Gini coefficient is best known as a measure of statistical dispersion used to represent income inequality. In decision trees, the closely related Gini impurity measures how often a randomly chosen sample from a node would be mislabeled if it were labeled at random according to the class distribution in that node; splits are chosen to minimise the weighted Gini impurity of the resulting children.
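Concretely, the Gini impurity of a set of labels is 1 minus the sum of squared class proportions. A minimal sketch in NumPy (the function name gini_impurity is my own):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 split is maximally impure for two classes
print(gini_impurity([1, 1, 1, 1]))  # 0.0
print(gini_impurity([0, 0, 1, 1]))  # 0.5
```

A tree built with criterion='gini' (the scikit-learn default used below) picks, at each node, the split that most reduces this quantity.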

Information Gain

In decision tree learning, information gain is the reduction in entropy achieved by splitting on an attribute. The related information gain ratio divides the gain by the intrinsic information of the split, which reduces the bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.
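Information gain is the entropy of the parent node minus the size-weighted entropy of the child nodes. A small sketch (the helper names entropy and information_gain are my own):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
# A perfect split separates the classes completely, so the gain
# equals the parent's entropy (1 bit for a balanced binary node).
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0
```

Passing criterion='entropy' to scikit-learn's DecisionTreeClassifier, as done later in this post, selects splits by this measure.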

Import Libraries. Import the necessary libraries such as pandas, numpy and scikit-learn.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
%config IPCompleter.greedy = True
%matplotlib inline

Data

So, the data. I downloaded it from the UCI repository (open source). It describes white wines, with physicochemical specifications such as acidity, residual sugar, chloride content, density, pH level and alcohol content, along with a surveyed quality score for each wine. What we are going to do is try to classify the quality from the given specifications.

In [4]:
quality= pd.read_csv("/home/kirtiman/winequality_white.csv")
data_wine=pd.read_csv("/home/kirtiman/winequality.csv")
quality_test=pd.read_csv("/home/kirtiman/quality_test.csv")
wine_test=pd.read_csv("/home/kirtiman/wine_test.csv")

Let's describe and visualize the data.

In [5]:
data_wine.describe()
Out[5]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
count 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000
mean 6.957212 0.275818 0.341097 6.364063 0.045631 35.574000 140.488625 0.994188 3.191968 0.489238 10.461225
std 0.844625 0.100184 0.122548 5.091631 0.021960 16.910471 43.264248 0.002993 0.152897 0.114696 1.215563
min 4.200000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987130 2.720000 0.220000 8.000000
25% 6.400000 0.210000 0.270000 1.700000 0.036000 24.000000 109.000000 0.991880 3.090000 0.410000 9.400000
50% 6.900000 0.260000 0.320000 5.200000 0.043000 34.000000 137.000000 0.993900 3.180000 0.470000 10.300000
75% 7.400000 0.320000 0.400000 9.800000 0.050000 46.000000 170.000000 0.996258 3.290000 0.550000 11.300000
max 14.200000 1.005000 1.660000 65.800000 0.346000 146.500000 366.500000 1.038980 3.820000 1.060000 14.200000
In [6]:
# Plot a histogram of every feature in one grid, rather than one cell per column
data_wine.hist(figsize=(12, 10))
plt.show()
(Histograms of fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol are displayed.)

Assume you have X (predictors) and Y (target) for the training set, and x_test (predictors) for the test set. Create the tree object:

In [7]:
model = tree.DecisionTreeClassifier(criterion='gini')
  • For classification, you can set the criterion to 'gini' or 'entropy' (information gain); the default is 'gini'.
  • Use model = tree.DecisionTreeRegressor() for regression.
  • Next, train the model on the training set and check its score.
In [8]:
model.fit(data_wine, quality)
model.score(data_wine, quality)
Out[8]:
1.0

Note that 1.0 is the accuracy on the training data itself; an unpruned tree can memorise the training set, so this number says nothing about how well the model generalises.

Predict Output

predicted = model.predict(x_test)

In [9]:
predicted= model.predict(wine_test)
In [10]:
acc = round(accuracy_score(predicted,quality_test['quality']),2)
accuracy = acc * 100
print (str(accuracy)+"%")
40.0%
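A perfect training score paired with 40% test accuracy is a classic sign of overfitting: the unpruned tree has memorised the training set. A quick sketch on synthetic stand-in data (the wine CSVs above are local files, so this uses random data) showing how capping max_depth narrows the train/test gap:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends (noisily) on the first feature only
rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] + 0.1 * rng.randn(500) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# The unpruned tree scores 1.0 on training data; limiting depth
# usually brings the train and test scores closer together.
print("deep:    train %.2f  test %.2f" % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
print("shallow: train %.2f  test %.2f" % (shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)))
```

Other pruning knobs such as min_samples_leaf and max_leaf_nodes work similarly.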
In [11]:
model2 = tree.DecisionTreeClassifier(criterion='entropy') 
In [12]:
model2.fit(data_wine, quality)
Out[12]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [13]:
predicted2= model2.predict(wine_test)
In [14]:
acc = round(accuracy_score(predicted2,quality_test['quality']),2)
accuracy = acc * 100
print (str(accuracy)+"%")
48.0%

Random Forest

In Random Forest, we grow multiple trees, as opposed to the single tree of the CART model. To classify a new object based on its attributes, each tree gives a classification and we say the tree "votes" for that class. The forest chooses the classification with the most votes over all the trees in the forest; in the case of regression, it takes the average of the outputs of the different trees.
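The voting and averaging described above can be sketched without scikit-learn (the function names are my own):

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Classification: majority vote across the per-tree class predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_average(tree_predictions):
    """Regression: mean of the per-tree numeric outputs."""
    return sum(tree_predictions) / len(tree_predictions)

print(forest_vote(["good", "bad", "good"]))  # good
print(forest_average([5.0, 6.0, 7.0]))       # 6.0
```

This is what RandomForestClassifier and RandomForestRegressor do internally across their n_estimators trees (the classifier actually averages per-class probabilities, which reduces to majority voting for hard predictions).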

In [15]:
from sklearn.ensemble import RandomForestClassifier 
  • Use RandomForestRegressor for regression problems.
  • Assume you have X (predictors) and Y (target) for the training set, and x_test (predictors) for the test set.
  • Create Random Forest object
In [16]:
model3= RandomForestClassifier(n_estimators=1000)
  • Train the model using the training sets
In [17]:
model3.fit(data_wine, quality.values.ravel())  # ravel() flattens y and avoids a DataConversionWarning
Out[17]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
  • Predict Output
In [18]:
predicted3= model3.predict(wine_test)
In [19]:
acc = round(accuracy_score(predicted3,quality_test['quality']),2)
accuracy = acc * 100
print (str(accuracy)+"%")
54.0%
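Beyond accuracy, a fitted RandomForestClassifier exposes feature_importances_, which can hint at which wine attributes drive the predictions. A sketch on synthetic data (the column names f0, f1, f2 are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column is informative
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are normalised to sum to 1; the informative column dominates
for name, imp in zip(["f0", "f1", "f2"], forest.feature_importances_):
    print("%s: %.3f" % (name, imp))
```

On the wine data, the same attribute would list which of acidity, sugar, alcohol, etc. the forest leans on most.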

Conclusion

So these are the ways we can use tree algorithms to classify. More on this in the next blog; stay tuned!
