Everything You Need to Know About Decision Trees¶
A decision tree is a graphical representation of possible solutions to a decision based on certain conditions. It's called a decision tree because it starts with a single box (or root), which then branches off into a number of solutions, just like a tree. Decision trees are helpful, not only because they are graphics that help you 'see' what you are thinking, but also because making a decision tree requires a systematic, documented thought process. Often, the biggest limitation of our decision making is that we can only select from the known alternatives. Decision trees help formalize the brainstorming process so we can identify more potential solutions.
Broadly, there are two types of decision trees¶
- Decision Tree based on Gini Index
- Decision Tree based on Information Gain
Gini Index¶
In decision tree learning, the Gini index (or Gini impurity) measures how often a randomly chosen element from a node would be misclassified if it were labeled randomly according to the distribution of class labels in that node. A split that minimizes the Gini impurity of the resulting child nodes is preferred. (The name comes from the Gini coefficient in economics, a measure of statistical dispersion used to represent income inequality, but the decision-tree usage is about class purity, not income.)
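As a concrete illustration, Gini impurity can be computed by hand in a few lines of NumPy (a minimal sketch; `gini_impurity` is a hypothetical helper written for this post, not part of scikit-learn):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 two-class node has impurity 0.5.
print(gini_impurity([1, 1, 1, 1]))  # 0.0
print(gini_impurity([0, 0, 1, 1]))  # 0.5
```

The tree picks the split whose children have the lowest weighted Gini impurity.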
Information Gain¶
In decision tree learning, information gain is the reduction in entropy achieved by splitting on an attribute. A related measure, the information gain ratio, is the ratio of information gain to the intrinsic information of the split; it is used to reduce the bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.
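Information gain can likewise be sketched directly from its definition: the entropy of the parent node minus the weighted entropy of the child nodes (`entropy` and `information_gain` below are hypothetical helpers for illustration):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
print(information_gain(parent, [0, 0], [1, 1]))  # 1.0, a perfect split
```

A split that separates the classes completely recovers all of the parent's entropy, giving the maximum gain.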
Import Libraries¶
Import the necessary libraries such as pandas, NumPy, and scikit-learn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree, metrics
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
%config IPCompleter.greedy = True
%matplotlib inline
Data¶
So, the data. I downloaded it from the UCI repository (open source). It is a dataset of premium white wines, with general specifications such as acidity, residual sugar, chloride content, density, pH level, and alcohol content, along with a survey-based quality rating for each wine. What we are going to do is try to classify the quality from the given specifications.
quality= pd.read_csv("/home/kirtiman/winequality_white.csv")
data_wine=pd.read_csv("/home/kirtiman/winequality.csv")
quality_test=pd.read_csv("/home/kirtiman/quality_test.csv")
wine_test=pd.read_csv("/home/kirtiman/wine_test.csv")
Let's describe and visualize the data
data_wine.describe()
for col in ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'pH',
            'density', 'sulphates', 'alcohol']:
    print(data_wine[col].unique())
    data_wine[col].hist()
    plt.show()
Assuming you have X (predictors) and Y (target) for the training dataset and x_test (predictors) for the test dataset, create the tree object:
model = tree.DecisionTreeClassifier(criterion='gini')
- For classification, you can set the splitting criterion to 'gini' or 'entropy' (information gain); the default is 'gini'
- Use model = tree.DecisionTreeRegressor() for regression
- Train the model on the training set and check its score:
model.fit(data_wine, quality)
model.score(data_wine, quality)
Predict Output¶
predicted = model.predict(wine_test)
acc = round(accuracy_score(quality_test['quality'], predicted), 2)
accuracy = acc * 100
print(str(accuracy) + "%")
model2 = tree.DecisionTreeClassifier(criterion='entropy')
model2.fit(data_wine, quality)
predicted2= model2.predict(wine_test)
acc = round(accuracy_score(quality_test['quality'], predicted2), 2)
accuracy = acc * 100
print(str(accuracy) + "%")
Random Forest¶
In a Random Forest, we grow multiple trees, as opposed to the single tree in the CART model. To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification with the most votes over all the trees in the forest; in the case of regression, it takes the average of the outputs of the different trees.
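The voting scheme can be seen by hand using the individual trees a fitted forest exposes through its `estimators_` attribute (a minimal sketch on toy data, not the wine dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: two well-separated clusters of points, one per class.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each tree casts a vote for the first two samples; take the majority class.
votes = np.array([t.predict(X[:2]) for t in forest.estimators_]).astype(int)
majority = np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])
print(majority)
print(forest.predict(X[:2]))
```

On clearly separated data like this, the hand-counted majority vote agrees with the forest's own prediction (scikit-learn actually averages the trees' class probabilities rather than counting hard votes, but the two coincide except in near-tie cases).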
from sklearn.ensemble import RandomForestClassifier
- Use RandomForestRegressor for regression problems
- Assuming you have X (predictors) and Y (target) for the training dataset and x_test (predictors) for the test dataset
- Create the Random Forest object:
model3= RandomForestClassifier(n_estimators=1000)
- Train the model using the training sets
model3.fit(data_wine, quality)
- Predict Output
predicted3= model3.predict(wine_test)
acc = round(accuracy_score(quality_test['quality'], predicted3), 2)
accuracy = acc * 100
print(str(accuracy) + "%")
Conclusion¶
So these are the ways we can use tree algorithms for classification. More on this in the next blog. Stay tuned!