
Posted: 2016-05-09 13:33


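It has been a while since the previous post on decision trees, so this time let's talk about CART and random forest. The only reason for putting the two together is that Leo Breiman used CART as the base tree in the original random forest paper. In fact any kind of decision tree can serve as the base tree; Orange, a Python-based machine learning toolkit, uses C4.5, for example. Since the original author used CART as the base tree, we put the implementations of CART and RF together here as well.

Let's look at CART first. The name stands for classification and regression tree, and it is a powerful tool that handles both classification and regression. The biggest difference between CART and ID3 or C4.5 is that CART is a binary tree: every node splits into exactly two subtrees, left and right. The splitting criterion differs too. ID3 and C4.5 choose the best feature by information gain and information gain ratio respectively, whereas CART splits nodes on impurity. The impurity measure depends on whether the target variable is discrete or continuous.

Gini index: for a discrete target variable, the Gini index $g(t)$ of node $t$ is defined as

$$g(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t)$$

where $i$ and $j$ range over the values of the target variable, and

$$p(j \mid t) = \frac{p(j, t)}{p(t)}, \qquad p(j, t) = \frac{\pi(j)\, N_j(t)}{N_j}, \qquad p(t) = \sum_j p(j, t)$$

where:
$\pi(j)$: the prior probability of target value $j$;
$N_j(t)$: the number of samples in node $t$ whose target value is $j$;
$N_j$: the number of samples in the root node whose target value is $j$.

The Gini index can also be written as

$$g(t) = 1 - \sum_j p^2(j \mid t)$$

From this form we can see that when the target values are uniformly distributed in a node, the Gini index reaches its maximum for that node, $1 - 1/k$, where $k$ is the number of possible target values. When all samples share the same target value, the Gini index reaches its minimum, 0.

On this basis we can define the split criterion for a discrete target variable.

Formula 1:

$$\Phi(s, t) = g(t) - p_L\, g(t_L) - p_R\, g(t_R)$$

where:
$p_L$: the proportion of the samples in node $t$ sent to the left child $t_L$ by split $s$;
$p_R$: the proportion of the samples in node $t$ sent to the right child $t_R$ by split $s$.

The rule is therefore to choose the split that maximizes formula 1, that is, the split that reduces the impurity measure the most.

Least squared deviation: for a continuous target variable, the least squared deviation (LSD) impurity measure is used to split nodes:

$$R(t) = \frac{1}{N_w(t)} \sum_{i \in t} w_i f_i \left(y_i - \bar{y}(t)\right)^2$$

where:
$N_w(t) = \sum_{i \in t} w_i f_i$: the weighted number of samples in node $t$;
$w_i$: the weight of sample $i$;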
$f_i$: the frequency of sample $i$ (how many times the record is repeated);
$y_i$: the target value of sample $i$;
$\bar{y}(t) = \frac{1}{N_w(t)} \sum_{i \in t} w_i f_i y_i$: the weighted mean of the target variable over the samples in node $t$.

On this basis we can define the split criterion for a continuous target variable.

Formula 2:

$$\Phi(s, t) = R(t) - p_L\, R(t_L) - p_R\, R(t_R)$$

where $R(t)$ is the LSD impurity of node $t$, and $p_L$ and $p_R$ are the proportions of the samples in $t$ sent to the left child $t_L$ and the right child $t_R$.

How should we read the words purity and impurity? Purity can be understood as how similar the target values of the samples are. In a completely pure tree node, all samples have the same target value.

CART builds a tree as follows:

1. For every feature, find its best split point.
   Continuous feature: sort all values of the feature in ascending order, take each value in turn as the split point, and compute the impurity of the resulting left and right child nodes with the applicable criterion (formula 1 for a discrete target, formula 2 for a continuous target). The value that gives the largest drop in impurity is the best split point for that feature.
   Discrete feature: list the distinct values of the feature and take each one in turn as the split point, treating all remaining values as "others". In other words, the samples are divided into two subsets along this feature. Compute the impurity of the left and right child nodes for each candidate value, and pick the value that gives the largest drop in impurity as the best split point for that feature.
2. Determine the best split for the node: among the best split points of all features, choose the one with the largest drop in impurity.
3. Check the stopping conditions: if none is triggered, split the node into two child nodes using the split found in step 2.
4. Recurse: apply steps 1 to 3 to each child node.

Stopping conditions: these decide when node splitting stops. CART keeps growing the tree until every leaf node satisfies at least one of them:

- the node is 100% pure, i.e. all samples in it have the same target value;
- all samples in the node have identical values for every candidate feature;
- the node's depth has reached the specified maximum tree depth;
- the node holds fewer samples than the specified minimum parent node size;
- a child node produced by the split would hold fewer samples than the specified minimum child node size;
- the drop in impurity produced by the split is smaller than the specified minimum impurity decrease.

Code (Python):

```python
# Import the modules we need. NumPy is a third-party Python package that
# provides efficient linear-algebra routines. The original post also imports
# the C45 module from the previous post, mainly for its functions that
# generate training and test sets.
import copy
import operator

import numpy as np
# import C45  # module from the previous post; not used in the code shown here


class CART:
    """A CART implementation."""

    # Initialise the parameters. For discrete features, storing the unique
    # values up front makes later computation easier.
    def __init__(self, target_type, features):
        self.targetType = target_type  # "cla" (classification) or "reg" (regression)
        self.N = 0  # literal lost in the source; 0 is assumed
        # for classification only: unique values of all features, including the class
        self.uniqueDiscValue = {}
        # description of the features used for constructing the tree;
        # element [1] of each entry is "continuous" or "discrete"
        self.features = features
        # features used for constructing the tree
        self.featurUniqueDiscValue = {}

    # Load the raw data from a text file, handling continuous and discrete
    # features separately.
    def loadFromFile(self, filename):
        """Return the data set: continuous features as floats, discrete
        features (and the class) encoded as ints."""
        dataSet = []
        contiSet = [i for i in range(len(self.features))
                    if self.features[i][1] == "continuous"]
        # {column: [A, B, C]}: each discrete value is replaced by its index in
        # this list. The class column's index is len(self.features).
        dis_dic = {idx: [] for idx in range(len(self.features) + 1)}
        with open(filename, "r") as f:
            rows = [line.strip(".\n").split(",") for line in f]
        for t in rows:
            for idx in range(len(t)):
                # build the list of unique values for every column
                if t[idx] != "?" and t[idx] not in dis_dic[idx]:
                    dis_dic[idx].append(t[idx])
        self.uniqueDiscValue = copy.deepcopy(dis_dic)
        for t in rows:
            # Skip rows with missing values. The length literal was lost in
            # the source; 1 is assumed, which skips blank lines.
            if "?" not in t and len(t) != 1:
                # Continuous columns are parsed as floats, as the docstring
                # states (the extracted source index-encoded every column).
                d = [float(t[idx]) if idx in contiSet else dis_dic[idx].index(t[idx])
                     for idx in range(len(t))]
                dataSet.append(d)
        return dataSet

    # Split the samples under the current node on a feature value, producing
    # the data sets of the left and right child nodes.
    def binSplitDataSet(self, dataSet, feature, value, feature_type):
        """Assumes dataSet is a 2-D NumPy matrix (np.mat)."""
        if feature_type == "continuous":
            # the comparison operators were lost in the source; > and <= are assumed
            mat0 = dataSet[np.nonzero(dataSet[:, feature] > value)[0], :]
            mat1 = dataSet[np.nonzero(dataSet[:, feature] <= value)[0], :]
        elif feature_type == "discrete":
            mat0 = dataSet[np.nonzero(dataSet[:, feature] == value)[0], :]
            mat1 = dataSet[np.nonzero(dataSet[:, feature] != value)[0], :]
        else:
            print("illegal feature type!!")
            return None, None
        return mat0, mat1

    # For a discrete target, every node should carry a class decision, in line
    # with the minimum description length (MDL) principle. We therefore output
    # the most common class label in the node.
    def majorityCnt(self, classList):
        classCount = {}
        for vote in classList:
            if vote not in classCount:
                classCount[vote] = 0
            classCount[vote] += 1
        sortedClassCount = sorted(classCount.items(),
                                  key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

    def claLeaf(self, dataSet):
        """Return the majority class label."""
        return self.majorityCnt(dataSet[:, -1].T.tolist()[0])

    # For a continuous target, output the mean target value of the node.
    def regLeaf(self, dataSet):
        """The mean of the last column."""
        return np.mean(dataSet[:, -1])

    # For a discrete target, impurity is measured with the Gini index.
    def GiniIndex(self, dataSet):
        """Calculate the Gini index."""
        m, n = np.shape(dataSet)
        classes = set(dataSet[:, -1].T.tolist()[0])
        if len(classes) == 1:
            return 0
        sigma_p = 0
        for c in classes:
            N_t = len(np.nonzero(dataSet[:, -1] == c)[0])  # samples of class c
            p_c_t = float(N_t) / m
            sigma_p += p_c_t * p_c_t
        return 1 - sigma_p

    # For a continuous target, impurity is the total squared deviation (for
    # simplicity the weights and frequencies of formula 2 are all set to 1).
    def regErr(self, dataSet):
        """Total squared deviation of the last column."""
        return np.var(dataSet[:, -1]) * np.shape(dataSet)[0]

    # Find the best split. The default values of ops were partly lost in the
    # source; (0.1, 4) is an assumed placeholder. tolS is read but not used
    # in the part of the code that survives.
    def chooseBestSplit(self, dataSet, leafType, impurity, ops=(0.1, 4)):
        tolS = ops[0]
        tolN = ops[1]  # minimum number of samples in a child node
        if len(set(dataSet[:, -1].T.tolist()[0])) == 1:
            return None, leafType(dataSet)
        m, n = np.shape(dataSet)
        if n == 1:
            print("no more feature to split!")
            return None, leafType(dataSet)
        g = impurity(dataSet)
        bestS = -np.inf
        bestIndex = 0
        bestValue = 0
        for featIndex in range(n - 1):
            for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):
                mat0, mat1 = self.binSplitDataSet(
                    dataSet, featIndex, splitVal, self.features[featIndex][1])
                if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN):
                    continue  # a child would be too small; skip this split
                p_l = float(np.shape(mat0)[0]) / m
                p_r = float(np.shape(mat1)[0]) / m
                phi = g - p_l * impurity(mat0) - p_r * impurity(mat1)
                if phi > bestS:
                    bestIndex = featIndex
                    bestValue = splitVal
                    bestS = phi
        mat0, mat1 = self.binSplitDataSet(
            dataSet, bestIndex, bestValue, self.features[bestIndex][1])
        if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN):
            return None, leafType(dataSet)
        return bestIndex, bestValue

    # Build the tree recursively.
    def createTree(self, dataSet, ops=(0.1, 4)):
        if self.targetType == "reg":
            leafType = self.regLeaf
            impurity = self.regErr
        elif self.targetType == "cla":
            leafType = self.claLeaf
            impurity = self.GiniIndex
        else:
            raise Exception("illegal target type exception!")
        classList = [str(c) for c in dataSet[:, -1].T.tolist()[0]]
        cases = np.shape(dataSet)[0]
        feat, val = self.chooseBestSplit(dataSet, leafType, impurity, ops)
        # The source text is cut off here; the rest of createTree is not available.
```
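To make formula 1 concrete, here is a minimal, self-contained sketch that is separate from the CART class. It assumes the class probabilities p(j|t) are simply the class frequencies in the node, and that a continuous feature is split as "x <= v goes left, x > v goes right". The function names `gini`, `split_gain` and `best_threshold` and the toy data are illustrative assumptions, not part of the original post.

```python
from collections import Counter


def gini(labels):
    """Gini index 1 - sum_j p(j|t)^2, with p(j|t) taken as class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Impurity decrease g(t) - p_L * g(t_L) - p_R * g(t_R), as in formula 1."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def best_threshold(values, labels):
    """Try every distinct value v as a split point and keep the best one."""
    best_v, best_gain = None, float("-inf")
    for v in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        if not left or not right:
            continue  # a split must produce two non-empty children
        gain = split_gain(labels, left, right)
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain


# Toy node: one continuous feature and a binary target (made-up values).
ages = [22, 25, 30, 35, 40, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]

print(gini(buys))                  # 0.5, the maximum 1 - 1/k for k = 2
print(best_threshold(ages, buys))  # (30, 0.5): both children become pure
```

Splitting at 30 separates the two classes completely, so both children have a Gini index of 0 and the whole parent impurity of 0.5 is removed. Any other split point leaves a mixed child and a smaller gain.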
Source: MLwizard (机器学习深度学习实战原创交流). MLwizard is dedicated to sharing machine learning algorithm research and hands-on experience, and says no to simple copy & paste.