
Decision Trees


Types

  • ID3
  • C4.5
  • CART (classification and regression trees)

Pros and Cons

Pros: low computational complexity; output that is easy to interpret; insensitive to missing values; can handle irrelevant features.

Cons: prone to overfitting (matching the training data too closely).

Applicable data types: numeric and boolean.

Shannon Entropy (Information Entropy)

Entropy is defined as the expected value of information; it measures how disordered the information is.

The greater a variable's uncertainty, the greater its entropy, and the more information it takes to pin the variable down.
$$
l(x_i)=-\log_2{p(x_i)}
$$

$$
H=-\sum_{i=1}^np(x_i)\log_2p(x_i)
$$

l(x_i) is the information content of x_i and p(x_i) is the probability that x_i occurs; $-p(x_i)\log_2p(x_i)$ is the contribution of $x_i$ (its expected information content), and H is the entropy of the whole distribution.
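As a quick check on the two formulas above, here is a minimal sketch over an empirical label distribution (the function name `shannon_entropy` is my own):

```python
from math import log2
from collections import Counter

def shannon_entropy(labels):
    """H = -sum_i p(x_i) * log2 p(x_i), with p estimated from counts."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(shannon_entropy(['fish', 'not fish']))  # a 50/50 split carries 1.0 bit
```

A pure set (all labels equal) has zero entropy, the minimum.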

Empirical Entropy

  • An example: deciding what counts as a fish

    We have many records: survives without surfacing, has flippers, is a fish; survives without surfacing, no flippers, not a fish; and so on.

    "Survives without surfacing" and "has flippers" are the decision features; "is a fish" is the target (boolean).

The empirical entropy is simply the Shannon entropy computed on the target.

Conditional Entropy

$$
H(Y|X)=\sum_{i=1}^np_iH(Y|X=x_i),\qquad p_i=P(X=x_i)
$$

H(Y|X) is the entropy of Y given X. As above, X has n categories; p_i is the proportion of the data falling in the i-th category, and H(Y|X=x_i) is the entropy of the data grouped under that category.

An example: the data is split into parts 1 and 2, with proportions p_1, p_2 and entropies h_1, h_2. The conditional entropy of this split is $p_1h_1+p_2h_2$.
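The worked example above can be sketched directly; each group holds the target labels for one value of X (helper names are my own):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(groups):
    """H(Y|X) = sum_i p_i * H(Y|X=x_i); p_i is each group's share of the data."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

# Two pure halves: h1 = h2 = 0, so the conditional entropy is 0.
print(conditional_entropy([[0, 0], [1, 1]]))  # 0.0
```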

Information Gain

$$
g(D,A)=H(D)-H(D|A)
$$

The goal is to maximize g(D,A), that is, to reduce the entropy as much as possible: after splitting, the weighted sum of the subset entropies should be as small as possible.

Note that every H here is an empirical entropy.
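A small sketch of g(D,A) under these definitions, with the partition induced by feature A passed in explicitly (function names are my own):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    """g(D, A) = H(D) - H(D|A); `groups` partitions `labels` by feature A."""
    total = len(labels)
    h_cond = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - h_cond

# A split that separates the classes perfectly recovers the full entropy:
print(info_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
```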

Information Gain Ratio

$$
split\_info_A(S)=-\sum_{j=1}^v{|S_j|\over |S|}\log_2{\left({|S_j|\over |S|}\right)}
$$

split_info is the "split information": the entropy of partitioning S by feature A. $S_j\ (1\le j\le v)$ are the subsets of S induced by the v distinct values of A. Notice that this too is an entropy, except that instead of grouping by the target it is computed over feature A's own value distribution.

The information gain ratio is then
$$
Gain\_ratio(A)={Gain(A)\over split\_info_A(S)}
$$
where Gain(A) is g(D,A), the empirical-entropy gain obtained by splitting on A.

Put simply, the gain ratio is (the drop in the target's empirical entropy after the split) divided by (feature A's own entropy).
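That summary can be written out as a short sketch (names are my own; the partition induced by A is passed in explicitly):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, groups):
    """Gain_ratio(A) = Gain(A) / split_info_A(S). split_info is the entropy
    of the partition sizes themselves, which grows with the branch count."""
    total = len(labels)
    h_cond = sum(len(g) / total * entropy(g) for g in groups)
    gain = entropy(labels) - h_cond
    split_info = -sum(len(g) / total * log2(len(g) / total) for g in groups)
    return gain / split_info

# Two even, pure branches: gain = 1.0 and split_info = 1.0, so ratio = 1.0.
print(gain_ratio([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
```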

Information Gain vs. Gain Ratio


Information gain measures how much entropy is removed by splitting on a feature A; it looks at the entropy reduction alone. (If every sample were given its own branch the entropy would certainly be minimal, so the gain does not account for the number of branches.)

The gain ratio does account for the branching: the branches enter the computation through the denominator, which penalizes splits that create too many branches.

The split information, call it H(A), is the part of the gain ratio that accounts for the branches; as the denominator it can be read as amortizing the entropy over the branches.

Gini Impurity

  • Definition: the expected error rate if a result drawn at random from the set were randomly assigned to one of the items in the set

$$
I_G(f)=\sum_{i=1}^nf_i(1-f_i)=\sum_{i=1}^n(f_i-f_i^2)=\sum_{i=1}^nf_i-\sum_{i=1}^nf_i^2=1-\sum_{i=1}^nf_i^2
$$

It too measures the disorder of the information; CART uses it for classification.
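A minimal sketch of the Gini formula (the function name is my own):

```python
from collections import Counter

def gini_impurity(labels):
    """I_G = 1 - sum_i f_i^2: the chance that a randomly drawn label,
    applied to a randomly drawn item, is wrong."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity([0, 1]))     # 0.5, maximally mixed for two classes
print(gini_impurity([1, 1, 1]))  # 0.0, a pure set
```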

Squared Error

$$
\sum_{x_i\in R_m}(y_i-f(x_i))^2
$$

y_i is the true value and f(x_i) the learned prediction; f is usually taken to be the mean of the values in the region R_m.
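A sketch of the squared error with f taken as the region mean, as in CART regression (the function name is my own):

```python
def region_sq_error(ys):
    """sum over the region of (y_i - f(x_i))^2, with f = mean of the region."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

print(region_sq_error([1.0, 2.0, 3.0]))  # 2.0
```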

Algorithm Framework

Decision Tree Main Function

The main function is recursive: it grows the branch nodes of the tree according to some rule and stops when a termination condition is met.

Recursion base cases:

  • 1. All data at the current leaf share the same target class; it does not matter if several candidate features remain
  • 2. Only one feature is left to split on; return the most frequent target class

If the information gain (gain ratio, Gini impurity reduction, ...) is below a threshold, likewise return the most frequent target class.

Choosing the Best Feature

The three kinds of decision tree use three different criteria:

  • 1. ID3 computes information gain
  • 2. C4.5 computes the information gain ratio
  • 3. CART computes Gini impurity for classification and the minimum residual squared error for regression

Classifier

Traverse the decision tree with each test-set sample until it reaches a leaf, then assign the sample the class label of that leaf.

Flow: follow the branches according to the sample's feature values down to a leaf, and return that leaf's label.
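The traversal can be sketched on a toy nested-dict tree. The dict layout here is illustrative only, not the one used by the code later in these notes:

```python
def classify(tree, sample):
    """Walk from the root down to a leaf and return the leaf's label.
    Leaves are plain labels; internal nodes map feature values to subtrees."""
    while isinstance(tree, dict):
        tree = tree['children'][sample[tree['feat']]]
    return tree

# The fish example: feature 0 = survives without surfacing, feature 1 = has flippers
tree = {'feat': 0, 'children': {
    'no': 'not fish',
    'yes': {'feat': 1, 'children': {'yes': 'fish', 'no': 'not fish'}},
}}
print(classify(tree, ['yes', 'yes']))  # fish
```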

Comparison

  • ID3: information gain = entropy before the split - entropy after the split. The larger the gain, the larger the "purity boost" obtained by splitting on attribute a; in other words, splitting the training set on a yields purer subsets.

  • C4.5: C4.5 overcomes ID3's restriction to discrete attributes, as well as information gain's bias toward features with many values (an attribute with many values gets a large gain simply by creating an n-way branch), by selecting features with the gain ratio. Gain ratio = information gain / split information; the feature with the largest ratio is chosen as the best feature.

    C4.5 and ID3 both build univariate trees (each split uses a single variable to partition the data and compute the entropy)

  • CART uses Gini impurity, and the tree it builds is always binary

That is the standard account online.

In my own implementation, however,

all three trees can be multiway, can handle both continuous and discrete values, and can do multi-class classification: whenever a continuous value appears, a binary split handles it. That is not ideal, but continuous values can also be binned into intervals and treated as discrete.

C4.5's Handling of Continuous Attributes

  • 1. Sort the values taken by the feature
  • 2. Take the midpoint between each pair of adjacent values as a candidate split point; each candidate splits the data into two parts, and the information gain of each candidate is computed
  • 3. Choose the candidate with the largest corrected information gain as the feature's best split point
  • 4. Compute the gain ratio of the best split point. Note that its information gain must first be corrected by subtracting log2(N-1)/|D| (N is the number of distinct values of the continuous feature, |D| the number of training samples). The correction is needed because, when discrete and continuous attributes coexist, C4.5 otherwise tends to choose continuous features as split points

Continuous attributes take many values, which is why ID3 is not used for them.
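Steps 1 and 2 above (the candidate midpoints) can be sketched as (the function name is my own):

```python
def candidate_splits(values):
    """Sort the distinct feature values and take the midpoint of every
    adjacent pair as a candidate binary split point."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_splits([3.0, 1.0, 2.0, 2.0]))  # [1.5, 2.5]
```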

Tree Regression

When the data has many features whose relationships are complex, building a single global model becomes too hard, and many real-world problems are nonlinear anyway: no global linear model can fit arbitrary data.

So we can cut the dataset into many pieces that are each easy to model, and fit each piece with linear regression. If a piece is still hard to fit after cutting, cut it again.

Pros and Cons

  • Pros: can model complex, nonlinear data
  • Cons: the result is hard to interpret
  • Applicable data types: numeric and nominal data

The leaves of a regression tree hold continuous values rather than discrete classes.

General Flow of Tree Regression

  • Collect data: any method
  • Prepare data: numeric data is required; nominal data should be mapped to binary values
  • Analyze data: draw a 2-D visualization of the data; the tree is built as a dict
  • Train the algorithm: most of the time is spent building the leaf-node tree models
  • Test the algorithm: analyze the model with the R² value on test data
  • Use the algorithm: make predictions with the trained tree; the predictions can in turn feed many applications

Tree Pruning

As I understand it, pruning falls into two kinds: pruning aimed at the tree's shape and pruning aimed at accuracy. (Shape-oriented pruning often affects accuracy strongly, because it improves generalization; ordinary post-pruning usually improves accuracy.)

Pre-pruning

Pre-pruning: while generating a node, if the information gain is below a threshold or the node already holds few samples, make the current node a leaf.

The recursion base cases are also a form of pre-pruning.

Another kind of pre-pruning checks the accuracy after every candidate split while growing the tree, and skips the split if accuracy drops. The reduced-error pruning described below works better than this.

Post-pruning

Reduced-Error Pruning (REP)

  • Build the complete tree
  • Work bottom-up (implemented with a reverse DFS order): for each internal node, check whether accuracy improves when its subtree is removed, and prune it if so
```python
# Reduced-error pruning
def REPrunning(self):
    for i in range(tsize[0], 0, -1):   # reverse DFS order: bottom-up
        if (son[i] == 1): continue     # already a leaf
        pre = self.access()            # accuracy before pruning node i
        az[i] = 1                      # tentatively prune it
        now = self.access()            # accuracy after pruning
        if (now > pre):
            continue                   # improvement: keep it pruned
        else: az[i] = 0                # otherwise restore the subtree
    return self.access()
```

az marks whether a node has been pruned; classification never descends into a pruned node. This is the easiest post-pruning to implement.

Pessimistic Error Pruning (PEP)

Idea
  • normal approximation of the binomial distribution

It sounds fancy, but the actual idea is simple: treat the binomial distribution as if it were normal, with a mean $\mu$ and standard deviation $\sigma$.

Then use $[\mu-\sigma,\mu+\sigma]$ as the confidence interval (this can be varied; one side of the interval may also be left open).

  • For a subtree:

$$
p=\frac{err+\alpha\cdot son}{sum}
$$

where err is the number of misclassified samples, α is a user-chosen parameter, son is the number of leaves, and sum is the number of samples at the node.

$$
\text{mean}=p\cdot sum
$$

$$
\text{and, since this is a binomial, the standard deviation is } val=\sqrt{sum\cdot p(1-p)}
$$

The α term is introduced to adjust the probability model toward the normal approximation; it also puts a quantified cost on each leaf node.

  • Steps (top-down; bottom-up seems to give slightly lower accuracy)
    • compute the subtree's meanpre and valpre
    • compute meannow and valnow with the current node treated as a leaf
    • if meannow falls inside the confidence interval, the node can be pruned, since pruning changes little
```python
# Pessimistic pruning, bottom-up: prune the children first, then test this node
def PEPrunningCalcDU(self, tree, alpha):
    if ('feat' not in tree):
        return tree
    if (tree['feat'] in self.strkey):      # discrete split: one child per value
        val = tree['val']
        for i in val:
            tree[i] = self.PEPrunningCalcDU(tree[i], alpha)
    else:                                  # continuous split: two children
        tree['left'] = self.PEPrunningCalcDU(tree['left'], alpha)
        tree['right'] = self.PEPrunningCalcDU(tree['right'], alpha)
    pnow = (tree['err'] + alpha) / tree['sum']                 # error rate as a leaf
    ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']  # error rate of the subtree
    meannow = tree['sum'] * pnow
    meanpre = tree['sum'] * ppre
    valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
    valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
    if (meanpre + valpre > meannow) and (meanpre - valpre < meannow):
        # meannow lies inside the confidence interval: collapse into a leaf
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                del tree[i]
        else:
            del tree['left']
            del tree['right']
        del tree['feat']
        tree['val'] = tree['mx']
    return tree

# Pessimistic pruning, top-down: test this node first, recurse only if it survives
def PEPrunningCalcUD(self, tree, alpha):
    if ('feat' not in tree):
        return tree
    pnow = (tree['err'] + alpha) / tree['sum']
    ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']
    meannow = tree['sum'] * pnow
    meanpre = tree['sum'] * ppre
    valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
    valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
    if (meanpre + valpre > meannow) and (meanpre - valpre < meannow):
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                del tree[i]
        else:
            del tree['left']
            del tree['right']
        del tree['feat']
        tree['val'] = tree['mx']
        return tree
    if (tree['feat'] in self.strkey):
        val = tree['val']
        for i in val:
            tree[i] = self.PEPrunningCalcUD(tree[i], alpha)
    else:
        tree['left'] = self.PEPrunningCalcUD(tree['left'], alpha)
        tree['right'] = self.PEPrunningCalcUD(tree['right'], alpha)
    return tree

def PEPrunning(self, alpha=0.1):
    self.tree = self.PEPrunningCalcUD(self.tree, alpha)
    return self.access()
```

Cost-Complexity Pruning (modified)

$$
f(x)=sum\cdot Ent+\alpha
$$

for each leaf node, where sum is the number of samples in the leaf and Ent is the leaf's entropy (or Gini impurity, ...).

The goal is then to minimize $\sum f(x)$,

which can be done with dynamic programming:

```python
# Cost-complexity post-pruning
# f[x][0]: minimum cost of node x if it stays an internal node
# f[x][1]: cost of node x if it is collapsed into a leaf (None for real leaves)
def costPrunningCalc(self, tree, alpha):
    x = tree['num']
    if ('feat' not in tree):
        f[x][0] = tree['sum'] * tree['Ent'] + alpha
        f[x][1] = None
        return
    if (tree['feat'] in self.strkey):
        f[x][0] = 0
        val = tree['val']
        for i in val:
            son = tree[i]['num']
            self.costPrunningCalc(tree[i], alpha)
            o = f[son][0]
            if (f[son][1] != None): o = min(o, f[son][1])
            f[x][0] += o
        f[x][1] = tree['Ent'] * tree['sum'] + alpha
        return
    self.costPrunningCalc(tree['left'], alpha)
    self.costPrunningCalc(tree['right'], alpha)
    le, ri = tree['left']['num'], tree['right']['num']
    f[x][0] = f[le][0]
    if (f[le][1] != None): f[x][0] = min(f[x][0], f[le][1])
    o = f[ri][0]
    if (f[ri][1] != None): o = min(o, f[ri][1])
    f[x][0] += o
    f[x][1] = tree['Ent'] * tree['sum'] + alpha

# walk the optimum back down, deleting the subtrees the DP chose to collapse
def costPrunningDel(self, tree, bz):
    x = tree['num']
    tree['Ent'] = f[x][bz]
    if ('feat' not in tree):
        return tree
    feat = tree['feat']
    if (bz == 1):                       # collapse this node into a leaf
        if (feat in self.strkey):
            val = tree['val']
            for i in val:
                del tree[i]
        else:
            del tree['left']
            del tree['right']
        del tree['feat']
        tree['val'] = tree['mx']
        return tree
    if (feat in self.strkey):
        # find the combination of per-child choices that produced f[x][bz]
        val = tree['val']
        lenx = len(val)
        for bin in range(1 << lenx):
            tot = -1
            o = 0
            for i in val:
                tot += 1
                l = 0
                if ((bin & (1 << tot)) > 0): l = 1
                son = tree[i]['num']
                if (f[son][l] == None):
                    tot = -2
                    break
                o += f[son][l]
            if (o != f[x][bz]): tot = -2
            if (tot != -2):
                tot = -1
                for i in val:
                    tot += 1
                    l = 0
                    if ((bin & (1 << tot)) > 0): l = 1
                    self.costPrunningDel(tree[i], l)
                return tree
        return tree
    le, ri = tree['left']['num'], tree['right']['num']
    yi = er = 0
    for i in range(2):
        if (f[le][i] == None): continue
        for j in range(2):
            if (f[ri][j] == None): continue
            if (f[le][i] + f[ri][j] == f[x][bz]):
                yi, er = i, j
    self.costPrunningDel(tree['left'], yi)
    self.costPrunningDel(tree['right'], er)
    return tree

def costPrunning(self, alpha=10):
    self.costPrunningCalc(self.tree, alpha)
    bz = 0
    if (f[1][0] > f[1][1]): bz = 1
    self.tree = self.costPrunningDel(self.tree, bz)
    return self.access()
```

Random Forest

As the name says: many decision trees, built on randomness.

Sources of randomness

  • The dataset: each tree's data is drawn with the bootstrap method
  • The features each tree uses: each tree uses only k randomly chosen features; a common choice is k = log2(total number of features)

Then, for each test sample, take a majority vote (classification) or an average (regression) over the trees.
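The two sources of randomness plus the vote can be sketched as follows (names are my own; in a real forest the plain labels would come from trained trees):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw n rows with replacement; rows never drawn serve as the test set."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def majority_vote(predictions):
    """Classification: the forest returns the most common tree prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
print(len(bootstrap_sample(list(range(10)), rng)))  # 10
print(majority_vote(['fish', 'not fish', 'fish']))  # fish
```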

Random forests really do reach high accuracy!

Visualization

Using graphviz

Python (the employee-attrition dataset from our assessment)

```python
from pandas import *
from numpy import *
import matplotlib.pyplot as plt
from math import log
import operator
import time
from graphviz import Digraph

# Decision tree
class DecisionTree:
    dataSet = []; testSet = []; dataO = []
    labelSet = []; testLabel = []; key = {}
    strkey = []; valkey = []
    replaceTo = {}; replaceBack = {}
    algor = 'None'
    Thresh = (0, 0)
    treshVal = 0.5
    tree = {}

    # load the data
    def loadDataSet(self, filename):
        fr = read_csv(filename)
        fr.replace('?', nan, inplace=True)
        return fr

    # drop columns with too many missing values
    def delColumn(self, dataSet):
        dataIndex = dataSet.keys()
        n = dataSet.shape[0]
        delVal = n // 3
        for inx in dataIndex:
            if (dataSet[inx].isnull().sum() >= delVal):
                dataSet = dataSet.drop(inx, axis=1)
        dataSet = dataSet.drop('EmployeeNumber', axis=1)
        dataSet = dataSet.drop('EmployeeCount', axis=1)
        return dataSet

    # find outliers with a box plot
    def boxDiagram(self, dataSet):
        nIndex = dataSet.keys()
        n = dataSet.shape[0]
        throw = set([])
        tot = -1
        for inx in nIndex:
            tot += 1
            if (tot in self.strkey): continue
            if (inx == 'Attrition'): continue
            data = dataSet[inx]
            # exclude missing values
            data = data[data.notnull()]
            # box-plot fences
            Q1, Q3 = percentile(data, 25), percentile(data, 75)
            IQR = Q3 - Q1
            throw = throw | set(data[data > Q3 + 3 * IQR].keys())
            throw = throw | set(data[data < Q1 - 3 * IQR].keys())
            # mild outliers are treated as missing values
            dataSet.loc[(dataSet[inx] > Q3 + 1.5 * IQR), inx] = None
            dataSet.loc[(dataSet[inx] < Q1 - 1.5 * IQR), inx] = None
        for i in throw:
            dataSet = dataSet.drop(i, axis=0)
        dataSet.dropna(axis=0, how="all", inplace=True)
        dataSet = dataSet.reindex(range(dataSet.shape[0]))
        return dataSet

    # Euclidean distance between two rows
    def OjiDist(self, dataSet, x, y, m):
        z = 0
        for i in range(m):
            if (str(dataSet.iloc[x, i]) == 'nan') or (str(dataSet.iloc[y, i]) == 'nan'): continue
            z += (dataSet.iloc[x, i] - dataSet.iloc[y, i]) ** 2
        return sqrt(z)

    # handle missing values
    def solveMissing(self, dataSet):
        nIndex = dataSet.keys()
        n, m = dataSet.shape[0], dataSet.shape[1]
        """nearest-neighbour imputation, disabled:
        for i in range(n):
            if (dataSet.iloc[i].isnull().sum() == 0): continue
            mini = 0
            distMin = inf
            for j in range(n):
                if (i == j): continue
                if (dataSet.iloc[j].isnull().sum() > 0): continue
                distx = self.OjiDist(dataSet, i, j, m)
                if (distx < distMin):
                    distMin = distx
                    mini = j
            for j in range(m):
                if (str(dataSet.iloc[i, j]) == 'nan'): dataSet.iloc[i, j] = dataSet.iloc[mini, j]"""
        tot = -1
        for i in nIndex:
            tot += 1
            data = dataSet[i]
            if (i == 'Attrition'): meanx = median(data[data.notnull()])
            elif (tot in self.strkey): meanx = dataSet[i].mode()[0]
            else: meanx = mean(data[data.notnull()])
            for j in range(n):
                if (str(dataSet.loc[j, i]) == 'nan'):
                    dataSet.loc[j, i] = meanx
        return dataSet

    # record which keys are discrete (string) and which are continuous
    def splitStrVal(self):
        tot = -1
        for inx in self.dataO.keys():
            tot += 1
            if (inx == 'Attrition'): continue
            x = self.dataO.loc[1, inx]
            if not (isinstance(x, str)):
                self.valkey.append(tot)
            else:
                self.strkey.append(tot)

    # number the discrete values
    def replaceStr(self, dataSet):
        nIndex = dataSet.keys()
        n, m = dataSet.shape[0], dataSet.shape[1]
        tot = 0
        for inx in nIndex:
            x = self.dataO.loc[1, inx]
            if not (isinstance(x, str)): continue
            for j in range(n):
                y = dataSet.loc[j, inx]
                if (y not in self.replaceTo):
                    tot += 1
                    self.replaceTo[y] = tot
                    self.replaceBack[tot] = y
                self.dataO.loc[j, inx] = self.replaceTo[y]
        return dataSet

    # feature standardization
    def standardize(self, dataSet):
        n, m = dataSet.shape[0], dataSet.shape[1]
        keyN = dataSet.keys()
        for inx in keyN:
            if (inx == 'Outcome'): continue
            data = dataSet[inx]
            meanx = data.mean()
            stdx = data.std()
            for j in range(n):
                dataSet.loc[j, inx] = (dataSet.loc[j, inx] - meanx) / stdx
        return dataSet

    # preprocessing
    def PreProcess(self):
        self.dataO = self.delColumn(self.dataO)
        self.dataO['Attrition'].replace('Yes', 1, inplace=True)
        self.dataO['Attrition'].replace('No', 0, inplace=True)
        self.splitStrVal()
        self.dataO = self.boxDiagram(self.dataO)
        self.dataO = self.solveMissing(self.dataO)
        #self.dataO = self.standardize(self.dataO)

    # hold-out split (stratified)
    def dataSplit(self, P=0.8):
        dataSet = []; testSet = []
        labelSet = []; testLabel = []
        key = {}; tot = 0
        for i in self.dataO.keys():
            key[tot] = i
            tot += 1
        n = self.dataO.shape[0]
        S = 0
        for i in self.dataO['Attrition']:
            if (i == 1.0): S += 1
        N = n - S; S = P * S
        N = P * N; SS = NN = 0
        for i in range(n):
            #p = random.random()
            p = 0.1 * (i % 10)
            data = list(self.dataO.iloc[i])
            if (p <= P):
                if (data[1] == 1.0) and (SS >= S): p = 1.0 - p
                if (data[1] == 0.0) and (NN >= N): p = 1.0 - p
            if (p <= P):
                if (data[1] == 1.0): SS += 1
                else: NN += 1
                dataSet.append(data[:1] + data[2:]), labelSet.append(data[1])
            else:
                testSet.append(data[:1] + data[2:]), testLabel.append(data[1])
        self.dataSet = array(dataSet)
        self.labelSet = array(labelSet)
        self.testSet = array(testSet)
        self.testLabel = array(testLabel)
        self.key = key

    # bootstrap split
    def BSsplit(self):
        n = self.dataO.shape[0]
        dataSet = []; testSet = []
        labelSet = []; testLabel = []
        dataIndex = []
        key = {}; tot = 0
        for i in self.dataO.keys():
            key[tot] = i
            tot += 1
        for i in range(n):
            x = random.randint(0, n - 1)
            dataIndex.append(x)
        setIndex = set(dataIndex)
        for i in range(n):
            data = list(self.dataO.iloc[dataIndex[i]])
            dataSet.append(data[:1] + data[2:]), labelSet.append(data[1])
            data = self.dataO.iloc[i]
            if (i not in setIndex): testSet.append(data[:1] + data[2:]), testLabel.append(data[1])
        self.dataSet = array(dataSet)
        self.labelSet = array(labelSet)
        self.testSet = array(testSet)
        self.testLabel = array(testLabel)
        self.key = key

    # choose k features at random
    def keyRandom(self, p=0.7):
        m = self.dataSet.shape[1]
        for i in range(m):
            keybz[i] = 1
        upx = m * p
        arr = arange(m)
        random.shuffle(arr)
        for i in range(round(upx)):
            keybz[arr[i]] = 0

    # Shannon entropy
    def calcShannonEnt(self, labelSet):
        classX = self.calcClass(labelSet)
        n = labelSet.shape[0]
        Ent = 0
        for key in classX.keys():
            prob = float(classX[key] / n)
            Ent -= prob * log(prob, 2)
        return Ent

    # Gini impurity
    def calcGink(self, labelSet):
        classX = self.calcClass(labelSet)
        n = labelSet.shape[0]
        Gink = 1
        for key in classX.keys():
            prob = float(classX[key] / n)
            Gink -= prob * prob
        return Gink

    # mean of the labels
    def calcMean(self, labelSet):
        return mean(labelSet)

    # squared error (variance times length)
    def calcSqrError(self, labelSet):
        meanx = mean(labelSet)
        Dvalue = labelSet - tile(meanx, (labelSet.shape[0], 1))
        Dvalue = Dvalue ** 2
        sumx = sum(Dvalue)
        return sumx
        #return 1

    calc = {'ID3': calcShannonEnt, 'C45': calcShannonEnt, 'CARTC': calcGink,
            'CARTR': calcSqrError}

    # count the occurrences of each label value
    def calcClass(self, labelSet):
        classX = {}
        for i in range(labelSet.shape[0]):
            lab = labelSet[i]
            classX[lab] = classX.get(lab, 0) + 1
        return classX

    # split the data on feat and val (continuous feature)
    def splitToFeat(self, dataSet, labelSet, feat, val):
        datA = dataSet[nonzero(dataSet[:, feat] < val)[0]]
        labA = labelSet[nonzero(dataSet[:, feat] < val)[0]]
        datB = dataSet[nonzero(dataSet[:, feat] >= val)[0]]
        labB = labelSet[nonzero(dataSet[:, feat] >= val)[0]]
        return datA, labA, datB, labB

    # select the rows where feat == val (discrete feature)
    def splitToSame(self, dataSet, labelSet, feat, val):
        dat = dataSet[nonzero(dataSet[:, feat] == val)[0]]
        lab = labelSet[nonzero(dataSet[:, feat] == val)[0]]
        return dat, lab

    # compute the label of a leaf node
    def calcLeave(self, labelSet, algor):
        treshVal = self.treshVal
        # regression: mean of the labels
        if (algor == 'CARTR'):
            o = self.calcMean(labelSet)
            return o
        # classification: most frequent label
        classX = self.calcClass(labelSet)
        sortX = sorted(classX.items(), key=operator.itemgetter(1), reverse=True)
        return sortX[0][0]

    # choose the current best split point
    def chooseBest(self, dataSet, labelSet, algor, Thresh):
        n, m = shape(dataSet)
        if (sum(labelSet == labelSet[0]) == n):
            return None, labelSet[0]
        if (m == 1):
            return None, self.calcLeave(labelSet, algor)
        EntO = self.calc[algor](self, labelSet)
        BEnt = inf; Bfeat = Bval = 0
        for feat in range(m):               # continuous features
            if (feat in self.strkey): continue
            a = dataSet[:, feat]
            if (max(a) == min(a)): continue
            setX = list(sorted(set(a)))
            val = mean(a)
            # disabled alternatives: try every midpoint / every value as split
            #for j in range(len(setX)-1):
            #for val in setX:
            #    if (val == max(setX)) or (val == min(setX)): continue
            #    val = (setX[j] + setX[j+1]) / 2.0
            datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, feat, val)
            probA = float((datA.shape[0]) / n)
            probB = float((datB.shape[0]) / n)
            if (algor == 'CARTR'):
                probA = probB = 1.0
            Ent = probA * self.calc[algor](self, labA) + probB * self.calc[algor](self, labB)
            if (algor == 'C45'):
                Ent = Ent / (self.calc[algor](self, a) + 0.1)
            if (Ent < BEnt):
                BEnt, Bfeat, Bval = Ent, feat, val
        for feat in range(m):               # discrete features
            if (feat not in self.strkey): continue
            a = dataSet[:, feat]
            setX = list(sorted(set(a)))
            Ent = 0
            for val in setX:
                dat, lab = self.splitToSame(dataSet, labelSet, feat, val)
                numx = lab.shape[0]
                prob = numx * 1.0 / n
                if (algor == 'CARTR'): prob = 1.0
                Ent += prob * self.calc[algor](self, lab)
            if (algor == 'C45'):
                Ent = Ent / (self.calc[algor](self, a) + 0.1)
            if (Ent < BEnt):
                BEnt, Bfeat = Ent, feat
        if (EntO - BEnt < Thresh[0]):       # gain below threshold: make a leaf
            return None, self.calcLeave(labelSet, algor)
        datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, Bfeat, Bval)
        if (datA.shape[0] < Thresh[1]) and (datB.shape[0] < Thresh[1]):
            return None, self.calcLeave(labelSet, algor)
        return Bfeat, Bval

    # build the tree
    def buildTree(self, dataSet, labelSet):
        algor = self.algor
        Thresh = self.Thresh
        Bfeat, Bval = self.chooseBest(dataSet, labelSet, algor, Thresh)
        if (Bfeat == None):                 # leaf node
            tsize[0] += 1
            son.append(1)
            tree = {'val': Bval, 'num': tsize[0], 'Ent': self.calc[algor](self, labelSet),
                    'sum': dataSet.shape[0], 'mx': self.calcLeave(labelSet, algor),
                    'err': sum(labelSet != self.calcLeave(labelSet, algor)) * 1.0,
                    'son': 1, 'errs': sum(labelSet != self.calcLeave(labelSet, algor)) * 1.0}
            if (algor == 'CARTR'):
                tree['errs'] = 0
                for i in range(labelSet.shape[0]):
                    if (fabs(labelSet[i] - tree['mx']) > self.treshVal): tree['errs'] += 1
                tree['err'] = tree['errs']
            return tree
        tsize[0] += 1
        son.append(0)
        tree = {}
        tree['feat'] = Bfeat
        tree['val'] = Bval
        tree['num'] = tsize[0]
        tree['Ent'] = self.calc[algor](self, labelSet)
        tree['mx'] = self.calcLeave(labelSet, algor)
        tree['err'] = sum(labelSet != tree['mx']) * 1.0
        if (algor == 'CARTR'):
            tree['err'] = 0
            for i in range(labelSet.shape[0]):
                if (fabs(labelSet[i] - tree['mx']) > self.treshVal): tree['err'] += 1
        tree['sum'] = dataSet.shape[0]
        if (Bfeat in self.strkey):          # discrete split: one child per value
            a = dataSet[:, Bfeat]
            setX = list(sorted(set(a)))
            Ent = 0
            tree['val'] = setX
            tree['son'] = tree['errs'] = 0
            for val in setX:
                dat, lab = self.splitToSame(dataSet, labelSet, Bfeat, val)
                if (lab.shape[0] == 0):
                    ans = 1
                tree[val] = self.buildTree(dat, lab)
                tree['son'] += tree[val]['son']
                tree['errs'] += tree[val]['errs']
            return tree
        datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, Bfeat, Bval)
        tree['left'] = self.buildTree(datA, labA)
        tree['right'] = self.buildTree(datB, labB)
        tree['son'] = tree['left']['son'] + tree['right']['son']
        tree['errs'] = tree['left']['errs'] + tree['right']['errs']
        return tree

    # classify a test sample
    def classifyTest(self, tree, test):
        x = tree['num']
        if (az[x] == 1):                    # pruned: answer with this node's label
            return tree['mx']
        if ('feat' not in tree):
            return tree['val']
        feat = tree['feat']
        val = tree['val']
        if (feat in self.strkey):
            for i in val:
                if (i == test[feat]):
                    return self.classifyTest(tree[i], test)
            return tree['mx']
        if (test[feat] < val): return self.classifyTest(tree['left'], test)
        else: return self.classifyTest(tree['right'], test)

    # evaluate accuracy on the test set
    def access(self):
        n = self.testLabel.shape[0]
        err = 0; yi = 0
        for i in range(n):
            lab = self.classifyTest(self.tree, self.testSet[i])
            if (self.algor == 'CARTR'):
                if (fabs(lab - self.testLabel[i]) > self.treshVal): err += 1
            else:
                if (lab != self.testLabel[i]): err += 1
        return (n - err) * 1.0 / n

    # evaluate a forest
    def accessForest(self, forest):
        n = self.testLabel.shape[0]
        m = len(forest)
        err = 0
        yi = 0
        for i in range(n):
            lab = 0
            for j in range(m):
                lab += self.classifyTest(forest[j], self.testSet[i])
            if (self.algor == 'CARTR'):
                if (fabs(lab / m - self.testLabel[i]) > self.treshVal): err += 1
            else:
                if (lab > m - lab): lab = 1.0
                else: lab = 0.0
                if (lab != self.testLabel[i]): err += 1
        return (n - err) * 1.0 / n

    # cost-complexity post-pruning
    def costPrunningCalc(self, tree, alpha):
        x = tree['num']
        if ('feat' not in tree):
            f[x][0] = tree['sum'] * tree['Ent'] + alpha
            f[x][1] = None
            return
        if (tree['feat'] in self.strkey):
            f[x][0] = 0
            val = tree['val']
            for i in val:
                son = tree[i]['num']
                self.costPrunningCalc(tree[i], alpha)
                o = f[son][0]
                if (f[son][1] != None): o = min(o, f[son][1])
                f[x][0] += o
            f[x][1] = tree['Ent'] * tree['sum'] + alpha
            return
        self.costPrunningCalc(tree['left'], alpha)
        self.costPrunningCalc(tree['right'], alpha)
        le, ri = tree['left']['num'], tree['right']['num']
        f[x][0] = f[le][0]
        if (f[le][1] != None): f[x][0] = min(f[x][0], f[le][1])
        o = f[ri][0]
        if (f[ri][1] != None): o = min(o, f[ri][1])
        f[x][0] += o
        f[x][1] = tree['Ent'] * tree['sum'] + alpha

    def costPrunningDel(self, tree, bz):
        x = tree['num']
        tree['Ent'] = f[x][bz]
        if ('feat' not in tree):
            return tree
        feat = tree['feat']
        if (bz == 1):                       # collapse this node into a leaf
            if (feat in self.strkey):
                val = tree['val']
                for i in val:
                    del tree[i]
            else:
                del tree['left']
                del tree['right']
            del tree['feat']
            tree['val'] = tree['mx']
            return tree
        if (feat in self.strkey):
            # find the combination of per-child choices that produced f[x][bz]
            val = tree['val']
            lenx = len(val)
            for bin in range(1 << lenx):
                tot = -1
                o = 0
                for i in val:
                    tot += 1
                    l = 0
                    if ((bin & (1 << tot)) > 0): l = 1
                    son = tree[i]['num']
                    if (f[son][l] == None):
                        tot = -2
                        break
                    o += f[son][l]
                if (o != f[x][bz]): tot = -2
                if (tot != -2):
                    tot = -1
                    for i in val:
                        tot += 1
                        l = 0
                        if ((bin & (1 << tot)) > 0): l = 1
                        self.costPrunningDel(tree[i], l)
                    return tree
            return tree
        le, ri = tree['left']['num'], tree['right']['num']
        yi = er = 0
        for i in range(2):
            if (f[le][i] == None): continue
            for j in range(2):
                if (f[ri][j] == None): continue
                if (f[le][i] + f[ri][j] == f[x][bz]):
                    yi, er = i, j
        self.costPrunningDel(tree['left'], yi)
        self.costPrunningDel(tree['right'], er)
        return tree

    def costPrunning(self, alpha=10):
        self.costPrunningCalc(self.tree, alpha)
        bz = 0
        if (f[1][0] > f[1][1]): bz = 1
        self.tree = self.costPrunningDel(self.tree, bz)
        return self.access()

    # reduced-error pruning
    def REPrunning(self):
        for i in range(tsize[0], 0, -1):
            if (son[i] == 1): continue
            pre = self.access()
            az[i] = 1
            now = self.access()
            if (now > pre):
                continue
            else: az[i] = 0
        return self.access()

    # pessimistic pruning, bottom-up
    def PEPrunningCalcDU(self, tree, alpha):
        if ('feat' not in tree):
            return tree
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                tree[i] = self.PEPrunningCalcDU(tree[i], alpha)
        else:
            tree['left'] = self.PEPrunningCalcDU(tree['left'], alpha)
            tree['right'] = self.PEPrunningCalcDU(tree['right'], alpha)
        pnow = (tree['err'] + alpha) / tree['sum']
        ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']
        meannow = tree['sum'] * pnow
        meanpre = tree['sum'] * ppre
        valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
        valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
        if (meanpre + valpre > meannow) and (meanpre - valpre < meannow):
            if (tree['feat'] in self.strkey):
                val = tree['val']
                for i in val:
                    del tree[i]
            else:
                del tree['left']
                del tree['right']
            del tree['feat']
            tree['val'] = tree['mx']
        return tree

    # pessimistic pruning, top-down
    def PEPrunningCalcUD(self, tree, alpha):
        if ('feat' not in tree):
            return tree
        pnow = (tree['err'] + alpha) / tree['sum']
        ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']
        meannow = tree['sum'] * pnow
        meanpre = tree['sum'] * ppre
        valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
        valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
        if (meanpre + valpre > meannow) and (meanpre - valpre < meannow):
            if (tree['feat'] in self.strkey):
                val = tree['val']
                for i in val:
                    del tree[i]
            else:
                del tree['left']
                del tree['right']
            del tree['feat']
            tree['val'] = tree['mx']
            return tree
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                tree[i] = self.PEPrunningCalcUD(tree[i], alpha)
        else:
            tree['left'] = self.PEPrunningCalcUD(tree['left'], alpha)
            tree['right'] = self.PEPrunningCalcUD(tree['right'], alpha)
        return tree

    def PEPrunning(self, alpha=0.1):
        self.tree = self.PEPrunningCalcUD(self.tree, alpha)
        return self.access()

    # decision tree visualization
    def treePaint(self, tree, dotx):
        keyx = self.key
        x = tree['num']
        if ('feat' not in tree):
            dotx.node(name=str(x), label=str(tree['val']), color='green')
            return
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                self.treePaint(tree[i], dotx)
        else:
            self.treePaint(tree['left'], dotx)
            self.treePaint(tree['right'], dotx)
        dotx.node(name=str(x), label='split to ' + str(keyx[tree['feat']]), color='red', shape='box')
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                son = tree[i]['num']
                dotx.edge(str(x), str(son), label='is ' + str(self.replaceBack[i]), color='black')
        else:
            le, ri = tree['left']['num'], tree['right']['num']
            dotx.edge(str(x), str(le), label='<' + str(tree['val'].round(3)), color='black')
            dotx.edge(str(x), str(ri), label='>=' + str(tree['val'].round(3)), color='black')

    def treeVisualize(self, nameadd):
        dotx = Digraph(name='Decision' + '(' + nameadd + ')', format="png")
        self.treePaint(self.tree, dotx)
        dotx.node(name='a', label='acc: ' + str(self.access()) + '\n', color='blue', shape='box')
        dotx.view()


# ID3 decision tree
class ID3(DecisionTree):
    algor = 'ID3'
    Thresh = (0.0001, 8)
    treshVal = 0.5

# C4.5 decision tree
class C45(DecisionTree):
    algor = 'C45'
    Thresh = (0.01, 10)
    treshVal = 0.5

# CART decision tree; pass 'C' for classification, 'R' for regression
class CART(DecisionTree):
    algor = ''
    def __init__(self, attr):
        if (attr == 'C'): self.algor = 'CARTC'
        else: self.algor = 'CARTR'
    Thresh = (0.001, 8)
    treshVal = 0.5

# random forest
def randomForest(num, filename):
    forest = []
    treex = CART('C')
    treex.dataO = treex.loadDataSet(filename)
    treex.splitStrVal()
    treex.dataO = treex.replaceStr(treex.dataO)
    sum = 0
    for i in range(num):
        tsize[0] = 0
        son[0] = 0
        treex.BSsplit()
        pp = log(treex.dataSet.shape[1], 2) / treex.dataSet.shape[1]
        treex.keyRandom(p=pp)
        treex.tree = treex.buildTree(treex.dataSet, treex.labelSet)
        for j in range(tsize[0] + 1):
            f[j][0] = f[j][1] = az[j] = 0
        treex.PEPrunning()
        forest.append(treex.tree)
    treex.dataSplit()
    return treex.accessForest(forest)

if __name__ == '__main__':
    tsize = [0]
    son = [0]
    DCTree = CART('R')
    #DCTree.dataO = DCTree.loadDataSet('diabetesN.csv')
    DCTree.dataO = DCTree.loadDataSet('employeeFiredN.csv')
    f = [[0, 0] for i in range(DCTree.dataO.shape[0] + 1)]   # costPrunning
    az = [0 for i in range(DCTree.dataO.shape[0] + 1)]       # REPrunning
    keybz = [0 for i in range(DCTree.dataO.shape[0] + 1)]    # randomForest
    #DCTree.PreProcess()
    #DCTree.dataO.to_csv('employeeFiredN.csv', index=False)
    DCTree.splitStrVal()
    DCTree.dataO = DCTree.replaceStr(DCTree.dataO)
    DCTree.dataSplit()
    for i in range(len(DCTree.strkey)): DCTree.strkey[i] -= 1
    for i in range(len(DCTree.valkey)): DCTree.valkey[i] -= 1
    DCTree.tree = DCTree.buildTree(DCTree.dataSet, DCTree.labelSet)
    print(DCTree.access())
    print(DCTree.PEPrunning())
    #DCTree.treeVisualize('PEP')
    print(randomForest(20, 'employeeFiredN.csv'))
```

Python (the diabetes dataset)

from pandas import *
from numpy import *
import matplotlib.pyplot as plt
from math import log
import operator
from graphviz import Digraph

class DecisionTree:
    dataSet = []; testSet = []; dataO = []
    labelSet = []; testLabel = []; key = {}
    algor = 'None'
    Thresh = (0, 0)
    treshVal = 0.5
    tree = {}

    # Load the data set
    def loadDataSet(self, filename):
        fr = read_csv(filename)
        return fr

    # Drop columns with too many missing values
    def delColumn(self, dataSet):
        dataIndex = dataSet.keys()
        n = dataSet.shape[0]
        delVal = n // 3
        for inx in dataIndex:
            if (inx == 'Outcome') or (inx == 'Pregnancies'): continue
            dataSet[inx].replace(0, nan, inplace=True)
            if (dataSet[inx].isnull().sum() >= delVal):
                dataSet = dataSet.drop(inx, axis=1)
        return dataSet

    # Detect outliers with a box plot
    def boxDiagram(self, dataSet):
        nIndex = dataSet.keys()
        n = dataSet.shape[0]
        throw = set([])
        for inx in nIndex:
            if (inx == 'Outcome'): continue
            data = dataSet[inx]
            # Exclude missing values
            data = data[data.notnull()]
            # Box plot statistics
            Q1, Q3 = percentile(data, 25), percentile(data, 75)
            IQR = Q3 - Q1
            throw = throw | set(data[data > Q3 + 3 * IQR].keys())
            throw = throw | set(data[data < Q1 - 3 * IQR].keys())
            # Mild outliers are treated as missing values
            dataSet.loc[(dataSet[inx] > Q3 + 1.5 * IQR), inx] = None
            dataSet.loc[(dataSet[inx] < Q1 - 1.5 * IQR), inx] = None
        for i in throw:
            dataSet = dataSet.drop(i, axis=0)
        dataSet.dropna(axis=0, how="all", inplace=True)
        dataSet = dataSet.reindex(range(dataSet.shape[0]))
        return dataSet

    # Euclidean distance between two rows
    def OjiDist(self, dataSet, x, y, m):
        z = 0
        for i in range(m):
            if (str(dataSet.iloc[x, i]) == 'nan') or (str(dataSet.iloc[y, i]) == 'nan'): continue
            z += (dataSet.iloc[x, i] - dataSet.iloc[y, i]) ** 2
        return sqrt(z)

    # Fill in missing values
    def solveMissing(self, dataSet):
        nIndex = dataSet.keys()
        n, m = dataSet.shape[0], dataSet.shape[1]
        """for i in range(n):
            if (dataSet.iloc[i].isnull().sum() == 0): continue
            mini = 0
            distMin = inf
            for j in range(n):
                if (i == j): continue
                if (dataSet.iloc[j].isnull().sum() > 0): continue
                distx = self.OjiDist(dataSet, i, j, m)
                if (distx < distMin):
                    distMin = distx
                    mini = j
            for j in range(m):
                if (str(dataSet.iloc[i, j]) == 'nan'): dataSet.iloc[i, j] = dataSet.iloc[mini, j]"""
        for i in nIndex:
            data = dataSet[i]
            if (i == 'Outcome'): meanx = median(data[data.notnull()])
            else: meanx = mean(data[data.notnull()])
            for j in range(n):
                if (str(dataSet.loc[j, i]) == 'nan'):
                    dataSet.loc[j, i] = meanx
        return dataSet

    # Feature standardization
    def standardize(self, dataSet):
        n, m = dataSet.shape[0], dataSet.shape[1]
        keyN = dataSet.keys()
        for inx in keyN:
            if (inx == 'Outcome'): continue
            data = dataSet[inx]
            meanx = data.mean()
            stdx = data.std()
            for j in range(n):
                dataSet.loc[j, inx] = (dataSet.loc[j, inx] - meanx) / stdx
        return dataSet

    # Preprocessing pipeline
    def PreProcess(self):
        self.dataO = self.delColumn(self.dataO)
        self.dataO = self.boxDiagram(self.dataO)
        self.dataO = self.solveMissing(self.dataO)
        #self.dataO = self.standardize(self.dataO)

    # Hold-out split (stratified by class)
    def dataSplit(self, P=0.7):
        dataSet = []; testSet = []
        labelSet = []; testLabel = []
        key = {}; tot = 0
        for i in self.dataO.keys():
            key[tot] = i
            tot += 1
        n = self.dataO.shape[0]
        S = 0
        for i in self.dataO['Outcome']:
            if (i == 1.0): S += 1
        N = n - S; S = P * S
        N = P * N; SS = NN = 0
        for i in range(n):
            p = random.random()
            #p = 0.1*(i%10)
            data = self.dataO.iloc[i]
            if (p <= P):
                if (data[7] == 1.0) and (SS >= S): p = 1.0 - p
                if (data[7] == 0.0) and (NN >= N): p = 1.0 - p
            if (p <= P):
                if (data[7] == 1.0): SS += 1
                else: NN += 1
                dataSet.append(data[:7]), labelSet.append(data[7])
            else:
                testSet.append(data[:7]), testLabel.append(data[7])
        self.dataSet = array(dataSet)
        self.labelSet = array(labelSet)
        self.testSet = array(testSet)
        self.testLabel = array(testLabel)
        self.key = key

    # Bootstrap split
    def BSsplit(self):
        n = self.dataO.shape[0]
        dataSet = []; testSet = []
        labelSet = []; testLabel = []
        dataIndex = []
        key = {}; tot = 0
        for i in self.dataO.keys():
            key[tot] = i
            tot += 1
        for i in range(n):
            x = random.randint(0, n - 1)
            dataIndex.append(x)
        setIndex = set(dataIndex)
        for i in range(n):
            data = self.dataO.iloc[dataIndex[i]]
            dataSet.append(data[:7]), labelSet.append(data[7])
            data = self.dataO.iloc[i]
            if (i not in setIndex): testSet.append(data[:7]), testLabel.append(data[7])
        self.dataSet = array(dataSet)
        self.labelSet = array(labelSet)
        self.testSet = array(testSet)
        self.testLabel = array(testLabel)
        self.key = key

    # Randomly keep a fraction p of the features
    def keyRandom(self, p=0.7):
        m = self.dataSet.shape[1]
        for i in range(m):
            keybz[i] = 1
        upx = m * p
        arr = arange(m)
        random.shuffle(arr)
        for i in range(round(upx)):
            keybz[arr[i]] = 0

    # Shannon entropy
    def calcShannonEnt(self, labelSet):
        classX = self.calcClass(labelSet)
        n = labelSet.shape[0]
        Ent = 0
        for key in classX.keys():
            prob = float(classX[key] / n)
            Ent -= prob * log(prob, 2)
        return Ent

    # Gini impurity
    def calcGink(self, labelSet):
        classX = self.calcClass(labelSet)
        n = labelSet.shape[0]
        Gink = 1
        for key in classX.keys():
            prob = float(classX[key] / n)
            Gink -= prob * prob
        return Gink

    # Sum of squared deviations from the mean
    def calcSqrError(self, labelSet):
        meanx = mean(labelSet)
        Dvalue = labelSet - tile(meanx, (labelSet.shape[0], 1))
        Dvalue = Dvalue ** 2
        sumx = sum(Dvalue)
        return sumx
        #return 1

    # Mean of the labels
    def calcMean(self, labelSet):
        return mean(labelSet)

    calc = {'ID3': calcShannonEnt, 'C45': calcShannonEnt, 'CARTC': calcGink
            , 'CARTR': calcSqrError}

    # Count occurrences of each label value
    def calcClass(self, labelSet):
        classX = {}
        for i in range(labelSet.shape[0]):
            lab = labelSet[i]
            classX[lab] = classX.get(lab, 0) + 1
        return classX

    # Split the data on feature feat at value val
    def splitToFeat(self, dataSet, labelSet, feat, val):
        datA = dataSet[nonzero(dataSet[:, feat] < val)[0]]
        labA = labelSet[nonzero(dataSet[:, feat] < val)[0]]
        datB = dataSet[nonzero(dataSet[:, feat] >= val)[0]]
        labB = labelSet[nonzero(dataSet[:, feat] >= val)[0]]
        return datA, labA, datB, labB

    # Label for a leaf node
    def calcLeave(self, labelSet, algor):
        treshVal = self.treshVal
        # Regression: the mean of the labels
        if (algor == 'CARTR'):
            o = self.calcMean(labelSet)
            return o
        # Classification: the most frequent label
        classX = self.calcClass(labelSet)
        sortX = sorted(classX.items(), key=operator.itemgetter(1), reverse=True)
        return sortX[0][0]

    # Choose the best split point
    def chooseBest(self, dataSet, labelSet, algor, Thresh):
        n, m = shape(dataSet)
        if (sum(labelSet == labelSet[0]) == n):
            return None, labelSet[0]
        if (m == 1):
            return None, self.calcLeave(labelSet, algor)
        EntO = self.calc[algor](self, labelSet)
        BEnt = inf; Bfeat = Bval = 0
        for feat in range(m):
            if (keybz[feat] == 1): continue
            a = dataSet[:, feat]
            if (max(a) == min(a)): continue
            setX = list(sorted(set(a)))
            val = mean(a)
            #for j in range(len(setX)-1):
            #for val in setX:
                #if(val==max(setX))or(val==min(setX)): continue
                #val = (setX[j] + setX[j+1]) / 2.0
            datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, feat, val)
            probA = float((datA.shape[0]) / n)
            probB = float((datB.shape[0]) / n)
            if (algor == 'CARTR'):
                probA = probB = 1.0
            Ent = probA * self.calc[algor](self, labA) + probB * self.calc[algor](self, labB)
            if (algor == 'C45'):
                Ent = Ent / (self.calc[algor](self, a) + 0.1)
            if (Ent < BEnt):
                BEnt, Bfeat, Bval = Ent, feat, val
        if (EntO - BEnt < Thresh[0]):
            return None, self.calcLeave(labelSet, algor)
        datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, Bfeat, Bval)
        if (datA.shape[0] < Thresh[1]) and (datB.shape[0] < Thresh[1]):
            return None, self.calcLeave(labelSet, algor)
        return Bfeat, Bval

    # Build the tree recursively
    def buildTree(self, dataSet, labelSet):
        algor = self.algor
        Thresh = self.Thresh
        Bfeat, Bval = self.chooseBest(dataSet, labelSet, algor, Thresh)
        if (Bfeat == None):
            tsize[0] += 1
            son.append(1)
            tree = {'val': Bval, 'num': tsize[0], 'Ent': self.calc[algor](self, labelSet)
                    , 'sum': dataSet.shape[0], 'mx': self.calcLeave(labelSet, algor),
                    'err': sum(labelSet != self.calcLeave(labelSet, algor)) * 1.0,
                    'son': 1, 'errs': sum(labelSet != self.calcLeave(labelSet, algor)) * 1.0}
            if (algor == 'CARTR'):
                tree['errs'] = 0
                for i in range(labelSet.shape[0]):
                    if (fabs(labelSet[i] - tree['mx']) > self.treshVal): tree['errs'] += 1
                tree['err'] = tree['errs']
            return tree
        tsize[0] += 1
        son.append(0)
        tree = {}
        tree['feat'] = Bfeat
        tree['val'] = Bval
        tree['num'] = tsize[0]
        tree['Ent'] = self.calc[algor](self, labelSet)
        tree['mx'] = self.calcLeave(labelSet, algor)
        tree['err'] = sum(labelSet != tree['mx']) * 1.0
        if (algor == 'CARTR'):
            tree['err'] = 0
            for i in range(labelSet.shape[0]):
                if (fabs(labelSet[i] - tree['mx']) > self.treshVal): tree['err'] += 1
        tree['sum'] = dataSet.shape[0]
        datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, Bfeat, Bval)
        tree['left'] = self.buildTree(datA, labA)
        tree['right'] = self.buildTree(datB, labB)
        tree['son'] = tree['left']['son'] + tree['right']['son']
        tree['errs'] = tree['left']['errs'] + tree['right']['errs']
        return tree

    # Classify one test sample
    def classifyTest(self, tree, test):
        x = tree['num']
        if (az[x] == 1):
            return tree['mx']
        if ('feat' not in tree):
            return tree['val']
        feat = tree['feat']
        val = tree['val']
        if (test[feat] < val): return self.classifyTest(tree['left'], test)
        else: return self.classifyTest(tree['right'], test)

    # Evaluate accuracy on the test set
    def access(self):
        n = self.testLabel.shape[0]
        err = 0; yi = 0
        for i in range(n):
            lab = self.classifyTest(self.tree, self.testSet[i])
            if (self.algor == 'CARTR'):
                if (fabs(lab - self.testLabel[i]) > self.treshVal): err += 1
            else:
                if (lab != self.testLabel[i]): err += 1
        return (n - err) * 1.0 / n

    # Evaluate a forest on the test set
    def accessForest(self, forest):
        n = self.testLabel.shape[0]
        m = len(forest)
        err = 0
        yi = 0
        for i in range(n):
            lab = 0
            for j in range(m):
                lab += self.classifyTest(forest[j], self.testSet[i])
            if (self.algor == 'CARTR'):
                if (fabs(lab / m - self.testLabel[i]) > self.treshVal): err += 1
            else:
                if (lab > m - lab):
                    lab = 1.0
                else:
                    lab = 0.0
                if (lab != self.testLabel[i]): err += 1
        return (n - err) * 1.0 / n

    # Cost-complexity post-pruning: compute best cost per subtree
    def costPrunningCalc(self, tree, alpha):
        x = tree['num']
        if ('feat' not in tree):
            f[x][0] = tree['sum'] * tree['Ent'] + alpha
            f[x][1] = None
            return
        self.costPrunningCalc(tree['left'], alpha)
        self.costPrunningCalc(tree['right'], alpha)
        le, ri = tree['left']['num'], tree['right']['num']
        f[x][0] = f[le][0]
        if (f[le][1] != None): f[x][0] = min(f[x][0], f[le][1])
        o = f[ri][0]
        if (f[ri][1] != None): o = min(o, f[ri][1])
        f[x][0] += o
        f[x][1] = tree['Ent'] * tree['sum'] + alpha

    # Cost-complexity post-pruning: delete pruned subtrees
    def costPrunningDel(self, tree, bz):
        x = tree['num']
        tree['Ent'] = f[x][bz]
        if ('feat' not in tree):
            return tree
        if (bz == 1):
            del tree['left']
            del tree['right']
            del tree['feat']
            tree['val'] = tree['mx']
            return tree
        le, ri = tree['left']['num'], tree['right']['num']
        yi = er = 0
        for i in range(2):
            if (f[le][i] == None): continue
            for j in range(2):
                if (f[ri][j] == None): continue
                if (f[le][i] + f[ri][j] == f[x][bz]):
                    yi, er = i, j
        self.costPrunningDel(tree['left'], yi)
        self.costPrunningDel(tree['right'], er)
        return tree

    def costPrunning(self, alpha=1):
        self.costPrunningCalc(self.tree, alpha)
        bz = 0
        if (f[1][0] > f[1][1]): bz = 1
        tsize = [0]
        self.tree = self.costPrunningDel(self.tree, bz)
        return self.access()

    # Reduced-error pruning
    def REPrunning(self):
        for i in range(tsize[0], 0, -1):
            if (son[i] == 1): continue
            pre = self.access()
            az[i] = 1
            now = self.access()
            if (now > pre):
                continue
            else: az[i] = 0
        return self.access()

    # Pessimistic pruning, bottom-up
    def PEPrunningCalcDU(self, tree, alpha):
        if ('feat' not in tree):
            return tree
        tree['left'] = self.PEPrunningCalcDU(tree['left'], alpha)
        tree['right'] = self.PEPrunningCalcDU(tree['right'], alpha)
        le = tree['left']
        ri = tree['right']
        pnow = (tree['err'] + alpha) / tree['sum']
        ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']
        meannow = tree['sum'] * pnow
        meanpre = tree['sum'] * ppre
        valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
        valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
        if (meanpre + valpre > meannow):
            del tree['left']
            del tree['right']
            del tree['feat']
            tree['val'] = tree['mx']
        return tree

    # Pessimistic pruning, top-down
    def PEPrunningCalcUD(self, tree, alpha):
        if ('feat' not in tree):
            return tree
        le = tree['left']
        ri = tree['right']
        pnow = (tree['err'] + alpha) / tree['sum']
        ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']
        meannow = tree['sum'] * pnow
        meanpre = tree['sum'] * ppre
        valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
        valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
        if (meanpre + valpre > meannow) and (meanpre - valpre < meannow):
            del tree['left']
            del tree['right']
            del tree['feat']
            tree['val'] = tree['mx']
            return tree
        tree['left'] = self.PEPrunningCalcUD(tree['left'], alpha)
        tree['right'] = self.PEPrunningCalcUD(tree['right'], alpha)
        return tree

    def PEPrunning(self, alg, alpha=0.1):
        if (alg == 'DU'): self.tree = self.PEPrunningCalcDU(self.tree, alpha)
        else: self.tree = self.PEPrunningCalcUD(self.tree, alpha)
        return self.access()

    # Visualize the decision tree
    def treePaint(self, tree, dotx):
        keyx = self.key
        x = tree['num']
        if ('feat' not in tree):
            dotx.node(name=str(x), label=str(tree['val']), color='green')
            return
        self.treePaint(tree['left'], dotx)
        self.treePaint(tree['right'], dotx)
        le, ri = tree['left']['num'], tree['right']['num']
        dotx.node(name=str(x), label='split to ' + str(keyx[tree['feat']]), color='red', shape='box')
        dotx.edge(str(x), str(le), label='<' + str(tree['val'].round(3)), color='black')
        dotx.edge(str(x), str(ri), label='>=' + str(tree['val'].round(3)), color='black')

    def treeVisualize(self, nameadd):
        dotx = Digraph(name='Decision' + '(' + nameadd + ')', format="png")
        self.treePaint(self.tree, dotx)
        dotx.node(name='a', label='acc: ' + str(self.access()) + '\n', color='blue', shape='box')
        dotx.view()

# ID3 decision tree
class ID3(DecisionTree):
    algor = 'ID3'
    Thresh = (0.0001, 8)
    treshVal = 0.5

# C4.5 decision tree
class C45(DecisionTree):
    algor = 'C45'
    Thresh = (0.01, 10)
    treshVal = 0.5

# CART decision tree; pass 'C' for classification, 'R' for regression
class CART(DecisionTree):
    algor = ''
    def __init__(self, attr):
        if (attr == 'C'): self.algor = 'CARTC'
        else: self.algor = 'CARTR'
    Thresh = (0.001, 8)
    treshVal = 0.5

# Random forest
def randomForest(num):
    forest = []
    treex = CART('R')
    treex.dataO = treex.loadDataSet('diabetesN.csv')
    sum = 0
    for i in range(num):
        tsize[0] = 0
        son[0] = 0
        treex.BSsplit()
        pp = log(treex.dataSet.shape[1], 2) / treex.dataSet.shape[1]
        treex.keyRandom(p=pp)
        treex.tree = treex.buildTree(treex.dataSet, treex.labelSet)
        for j in range(tsize[0] + 1):
            f[j][0] = f[j][1] = az[j] = 0
        treex.PEPrunning('UD')
        forest.append(treex.tree)
    treex.dataSplit()
    return treex.accessForest(forest)

if __name__ == '__main__':
    tsize = [0]
    son = [0]
    DCTree = CART('R')
    DCTree.dataO = DCTree.loadDataSet('diabetesN.csv')
    f = [[0, 0] for i in range(DCTree.dataO.shape[0] + 1)]  # costPrunning
    az = [0 for i in range(DCTree.dataO.shape[0] + 1)]      # REPrunning
    keybz = [0 for i in range(DCTree.dataO.shape[0] + 1)]   # randomForest
    #DCTree.PreProcess()
    #DCTree.dataO.to_csv('diabetesN.csv',index=False)
    DCTree.dataSplit()
    DCTree.tree = DCTree.buildTree(DCTree.dataSet, DCTree.labelSet)
    print(DCTree.access())
    #DCTree.treeVisualize('no prunning')
    print(DCTree.PEPrunning('UD'))
    #DCTree.treeVisualize('PEPrunning')
    print(randomForest(20))
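The impurity measures used by `calcShannonEnt` and `calcGink` above can be checked in isolation against the formulas from the notes ($H=-\sum p_i\log_2 p_i$, $I_G=1-\sum p_i^2$). A minimal standalone sketch (function names here are illustrative, not part of the class):

```python
from math import log

def shannon_ent(labels):
    """H = -sum p_i * log2(p_i) over the label distribution."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * log(c / n, 2) for c in counts.values())

def gini(labels):
    """Gini impurity: 1 - sum p_i^2."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1 - sum((c / n) ** 2 for c in counts.values())

print(shannon_ent([0, 0, 1, 1]))  # 1.0  (two equally likely classes)
print(gini([0, 0, 1, 1]))         # 0.5
```

For a pure node both measures hit their minimum of 0, which is why splitting every sample into its own leaf trivially minimizes entropy and motivates the gain-ratio correction discussed earlier.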

Python Abalone Age Prediction

from pandas import *
from numpy import *
import matplotlib.pyplot as plt
from math import log
import operator
import time
from graphviz import Digraph
#决策树
class DecisionTree:
dataSet = []; testSet = []; dataO = []
labelSet = []; testLabel = []; key = {}
strkey = []; valkey = []
replaceTo = {}; replaceBack = {}
algor = 'None'
Thresh = (0, 0)
treshVal = 0.5
tree = {}
# 下载数据
def loadDataSet(self, filename):
fr = read_table(filename, names=['a','b','c','d','e','f','g','h','i'])
fr.replace('?', nan, inplace=True)
return fr
# 删掉缺失值多的key
def delColumn(self,dataSet):
dataIndex = dataSet.keys()
n = dataSet.shape[0]
delVal = n//3
for inx in dataIndex:
if (dataSet[inx].isnull().sum() >= delVal):
dataSet = dataSet.drop(inx, axis=1)
dataSet.drop('EmployeeNumber', axis=1)
dataSet.drop('EmployeeCount', axis=1)
return dataSet
# 寻找异常值
def boxDiagram(self, dataSet):
nIndex = dataSet.keys()
n = dataSet.shape[0]
throw = set([])
tot = -1
for inx in nIndex:
tot += 1
if (tot in self.strkey): continue
if (inx == 'Attrition'): continue
data = dataSet[inx]
# 排出缺失值
data = data[data.notnull()]
# 箱型图
Q1, Q3 = percentile(data, 25), percentile(data, 75)
IQR = Q3 - Q1
throw = throw | set(data[data > Q3 + 3 * IQR].keys())
throw = throw | set(data[data < Q1 - 3 * IQR].keys())
# 温和异常值判为缺失值
dataSet.loc[(dataSet[inx] > Q3 + 1.5 * IQR), inx] = None
dataSet.loc[(dataSet[inx] < Q1 - 1.5 * IQR), inx] = None
for i in throw:
dataSet = dataSet.drop(i, axis=0)
dataSet.dropna(axis=0, how="all", inplace=True)
dataSet = dataSet.reindex(range(dataSet.shape[0]))
return dataSet
# 计算距离
def OjiDist(self,dataSet, x, y, m):
z = 0
for i in range(m):
if (str(dataSet.iloc[x, i]) == 'nan') or (str(dataSet.iloc[y, i]) == 'nan'): continue;
z += (dataSet.iloc[x, i] - dataSet.iloc[y, i]) ** 2
return sqrt(z)
# 处理缺失值
def solveMissing(self,dataSet):
nIndex = dataSet.keys()
n, m = dataSet.shape[0], dataSet.shape[1]
"""for i in range(n):
if (dataSet.iloc[i].isnull().sum() == 0): continue
mini = 0
distMin = inf
for j in range(n):
if (i == j): continue
if (dataSet.iloc[j].isnull().sum() > 0): continue
distx = self.OjiDist(dataSet, i, j, m)
if (distx < distMin):
distMin = distx
mini = j
for j in range(m):
if (str(dataSet.iloc[i, j]) == 'nan'): dataSet.iloc[i, j] = dataSet.iloc[mini, j]"""
tot = -1
for i in nIndex:
tot += 1
data = dataSet[i]
if (i=='a'): meanx = median(data[data.notnull()])
elif (tot in self.strkey): meanx = dataSet[i].mode()[0]
else: meanx = mean(data[data.notnull()])
for j in range(n):
if (str(dataSet.loc[j, i]) == 'nan'):
dataSet.loc[j, i] = meanx
return dataSet
# 记录离散值和连续值的key
def splitStrVal(self):
tot = -1
for inx in self.dataO.keys():
tot += 1
if (inx == 'Attrition'): continue
x = self.dataO.loc[1, inx]
if not (isinstance(x, str)):
self.valkey.append(tot)
else:
self.strkey.append(tot)
# 将离散值标号
def replaceStr(self,dataSet):
nIndex = dataSet.keys()
n, m = dataSet.shape[0], dataSet.shape[1]
tot = 0
for inx in nIndex:
x = self.dataO.loc[1, inx]
if not (isinstance(x, str)): continue
for j in range(n):
y = dataSet.loc[j, inx]
if (y not in self.replaceTo):
tot+=1
self.replaceTo[y] = tot
self.replaceBack[tot] = y
self.dataO.loc[j, inx] = self.replaceTo[y]
return dataSet
# 特征标准化处理
def standardize(self,dataSet):
n, m = dataSet.shape[0], dataSet.shape[1]
keyN = dataSet.keys()
for inx in keyN:
if (inx == 'Outcome'): continue
data = dataSet[inx]
meanx = data.mean()
stdx = data.std()
for j in range(n):
dataSet.loc[j, inx] = (dataSet.loc[j, inx] - meanx) / stdx
return dataSet
# 预处理模块
def PreProcess(self):
self.dataO = self.delColumn(self.dataO)
self.dataO['Attrition'].replace('Yes', 1, inplace=True)
self.dataO['Attrition'].replace('No', 0, inplace=True)
self.splitStrVal()
self.dataO = self.boxDiagram(self.dataO)
self.dataO = self.solveMissing(self.dataO)
#self.dataO = self.standardize(self.dataO)
# 数据划分留出法

"""def dataSplit(self, P=0.7):
dataSet = [];
testSet = []
labelSet = [];
testLabel = []
key = {};
tot = 0
for i in self.dataO.keys():
key[tot] = i
tot += 1
n = self.dataO.shape[0]
S = 0
for i in self.dataO['Outcome']:
if (i == 1.0): S += 1
N = n - S;
S = P * S
N = P * N;
SS = NN = 0
for i in range(n):
p = random.random()
# p = 0.1*(i%10)
data = self.dataO.iloc[i]
if (p <= P):
if (data[7] == 1.0) and (SS >= S): p = 1.0 - p
if (data[7] == 0.0) and (NN >= N): p = 1.0 - p
if (p <= P):
if (data[7] == 1.0):
SS += 1
else:
NN += 1
dataSet.append(data[:7]), labelSet.append(data[7])
else:
testSet.append(data[:7]), testLabel.append(data[7])
self.dataSet = array(dataSet)
self.labelSet = array(labelSet)
self.testSet = array(testSet)
self.testLabel = array(testLabel)
self.key = key
# 自助法划分

def BSsplit(self):
n = self.dataO.shape[0]
dataSet = [];
testSet = []
labelSet = [];
testLabel = []
dataIndex = []
key = {};
tot = 0
for i in self.dataO.keys():
key[tot] = i
tot += 1
for i in range(n):
x = random.randint(0, n - 1)
dataIndex.append(x)
setIndex = set(dataIndex)
for i in range(n):
data = self.dataO.iloc[dataIndex[i]]
dataSet.append(data[:7]), labelSet.append(data[7])
data = self.dataO.iloc[i]
if (i not in setIndex): testSet.append(data[:7]), testLabel.append(data[7])
self.dataSet = array(dataSet)
self.labelSet = array(labelSet)
self.testSet = array(testSet)
self.testLabel = array(testLabel)
self.key = key"""

# 数据划分留出法
def dataSplit(self,P = 0.8):
dataSet = []; testSet = []
labelSet = []; testLabel = []
key = {}; tot = 0
for i in self.dataO.keys():
key[tot] = i
tot += 1
n = self.dataO.shape[0]
for i in range(n):
#p = random.random()
p = 0.1*(i%10)
data = list(self.dataO.iloc[i])
if (p <= P):
dataSet.append(data[:-1]), labelSet.append(data[-1])
else:
testSet.append(data[:-1]), testLabel.append(data[-1])
self.dataSet = array(dataSet)
self.labelSet = array(labelSet)
self.testSet = array(testSet)
self.testLabel = array(testLabel)
self.key = key
# 自助法划分
def BSsplit(self):
n = self.dataO.shape[0]
dataSet = []; testSet = []
labelSet = []; testLabel = []
dataIndex = []
key = {}; tot = 0
for i in self.dataO.keys():
key[tot] = i
tot += 1
for i in range(n):
x = random.randint(0, n - 1)
dataIndex.append(x)
setIndex = set(dataIndex)
for i in range(n):
data = list(self.dataO.iloc[dataIndex[i]])
dataSet.append(data[:-1]), labelSet.append(data[-1])
data = self.dataO.iloc[i]
if (i not in setIndex): testSet.append(data[:-1]), testLabel.append(data[-1])
self.dataSet = array(dataSet)
self.labelSet = array(labelSet)
self.testSet = array(testSet)
self.testLabel = array(testLabel)
self.key = key
# 随机k个特征
def keyRandom(self, p=0.7):
m = self.dataSet.shape[1]
for i in range(m):
keybz[i] = 1
upx = m*p;
arr = arange(m)
random.shuffle(arr)
for i in range(round(upx)):
keybz[arr[i]] = 0
#香农熵
def calcShannonEnt(self, labelSet):
classX = self.calcClass(labelSet)
n = labelSet.shape[0]
Ent = 0
for key in classX.keys():
prob = float(classX[key]/n)
Ent -= prob*log(prob,2)
return Ent
#基尼不纯度
def calcGink(self, labelSet):
classX = self.calcClass(labelSet)
n = labelSet.shape[0]
Gink = 1
for key in classX.keys():
prob = float(classX[key] / n)
Gink -= prob*prob
return Gink
#计算标准差*长度
# 计算均值
def calcMean(self, labelSet):
return mean(labelSet)
def calcSqrError(self, labelSet):
meanx = mean(labelSet)
Dvalue = labelSet - tile(meanx,(labelSet.shape[0], 1))
Dvalue = Dvalue**2
sumx = sum(Dvalue)
return sumx
#return 1

calc = {'ID3': calcShannonEnt, 'C45': calcShannonEnt, 'CARTC': calcGink
,'CARTR': calcSqrError}
#统计label的各个值的数量
def calcClass(self, labelSet):
classX = {}
for i in range(labelSet.shape[0]):
lab = labelSet[i]
classX[lab] = classX.get(lab, 0) + 1
return classX
#按照feat和val分离数据
def splitToFeat(self, dataSet, labelSet, feat, val):
datA = dataSet[nonzero(dataSet[:, feat] < val)[0]]
labA = labelSet[nonzero(dataSet[:, feat] < val)[0]]
datB = dataSet[nonzero(dataSet[:, feat] >= val)[0]]
labB = labelSet[nonzero(dataSet[:, feat] >= val)[0]]
return datA, labA, datB, labB
#
def splitToSame(self, dataSet, labelSet, feat, val):
dat = dataSet[nonzero(dataSet[:, feat] == val)[0]]
lab = labelSet[nonzero(dataSet[:, feat] == val)[0]]
return dat, lab
#计算叶子节点的标签
def calcLeave(self, labelSet, algor):
treshVal = self.treshVal
# 找label中的均值
if (algor == 'CARTR'):
o = self.calcMean(labelSet)
return o
# 找label中的最多的
classX = self.calcClass(labelSet)
sortX = sorted(classX.items(), key=operator.itemgetter(1), reverse=True)
return sortX[0][0]
#找当前分界点
def chooseBest(self, dataSet, labelSet, algor, Thresh):
n, m = shape(dataSet)
if (sum(labelSet==labelSet[0])==n):
return None, labelSet[0]
if (m == 1):
return None, self.calcLeave(labelSet, algor)
EntO = self.calc[algor](self,labelSet)
BEnt = inf; Bfeat = Bval = 0
for feat in range(m):
if (feat in self.strkey): continue
a = dataSet[:, feat]
if (max(a)==min(a)): continue
setX = list(sorted(set(a)))
val = mean(a)
#for j in range(len(setX)-1):
#for val in setX:
#if(val==max(setX))or(val==min(setX)): continue
#val = (setX[j] + setX[j+1]) / 2.0
datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, feat, val)
probA = float((datA.shape[0])/n)
probB = float((datB.shape[0])/n)
if (algor == 'CARTR'):
probA = probB = 1.0
Ent = probA*self.calc[algor](self,labA) + probB*self.calc[algor](self,labB)
if (algor == 'C45'):
Ent = Ent/(self.calc[algor](self,a) + 0.1)
if (Ent < BEnt):
BEnt, Bfeat, Bval = Ent, feat, val
for feat in range(m):
if (feat not in self.strkey): continue
a = dataSet[:, feat]
setX = list(sorted(set(a)))
Ent = 0
for val in setX:
dat, lab = self.splitToSame(dataSet, labelSet, feat, val)
numx = lab.shape[0]
prob = numx*1.0/n
if (algor == 'CARTR'): prob = 1.0
Ent += prob*self.calc[algor](self, lab)
if (algor == 'C45'):
Ent = Ent / (self.calc[algor](self, a) + 0.1)
if (Ent < BEnt):
BEnt, Bfeat = Ent, feat
if (EntO-BEnt<Thresh[0]):
return None, self.calcLeave(labelSet, algor)
datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, Bfeat, Bval)
if (datA.shape[0] < Thresh[1])and(datB.shape[0] < Thresh[1]):
return None, self.calcLeave(labelSet, algor)
return Bfeat, Bval
#建树
def buildTree(self, dataSet, labelSet):
algor = self.algor
Thresh = self.Thresh
Bfeat ,Bval= self.chooseBest(dataSet, labelSet, algor, Thresh)
if (Bfeat == None):
tsize[0] += 1
son.append(1)
tree = {'val': Bval, 'num': tsize[0], 'Ent': self.calc[algor](self, labelSet)
, 'sum': dataSet.shape[0], 'mx':self.calcLeave(labelSet, algor),
'err': sum(labelSet != self.calcLeave(labelSet, algor))*1.0,
'son': 1, 'errs': sum(labelSet != self.calcLeave(labelSet, algor))*1.0}
if (algor == 'CARTR'):
tree['errs'] = 0
for i in range(labelSet.shape[0]):
if (fabs(labelSet[i]-tree['mx']) > self.treshVal): tree['errs'] += 1
tree['err'] = tree['errs']
return tree
tsize[0] += 1
son.append(0)
tree = {}
tree['feat'] = Bfeat
tree['val'] = Bval
tree['num'] = tsize[0]
tree['Ent'] = self.calc[algor](self, labelSet)
tree['mx'] = self.calcLeave(labelSet, algor)
tree['err'] = sum(labelSet != tree['mx'])*1.0
if (algor == 'CARTR'):
tree['err'] = 0
for i in range(labelSet.shape[0]):
if (fabs(labelSet[i] - tree['mx']) > self.treshVal): tree['err'] += 1
tree['sum'] = dataSet.shape[0]
if (Bfeat in self.strkey):
a = dataSet[:, Bfeat]
setX = list(sorted(set(a)))
Ent = 0
tree['val'] = setX
tree['son'] = tree['errs'] = 0
for val in setX:
dat, lab = self.splitToSame(dataSet, labelSet, Bfeat, val)
if(lab.shape[0]==0):
ans=1
tree[val] = self.buildTree(dat, lab)
tree['son'] += tree[val]['son']
tree['errs'] += tree[val]['errs']
return tree
datA, labA, datB, labB = self.splitToFeat(dataSet, labelSet, Bfeat, Bval)
tree['left'] = self.buildTree(datA, labA)
tree['right'] = self.buildTree(datB, labB)
tree['son'] = tree['left']['son'] + tree['right']['son']
tree['errs'] = tree['left']['errs'] + tree['right']['errs']
return tree
#对测试集分类
def classifyTest(self, tree, test):
x = tree['num']
if (az[x]==1):
return tree['mx']
if ('feat' not in tree):
return tree['val']
feat = tree['feat']
val = tree['val']
if (feat in self.strkey):
for i in val:
if (i == test[feat]):
return self.classifyTest(tree[i], test)
return tree['mx']
if (test[feat] < val): return self.classifyTest(tree['left'], test)
else: return self.classifyTest(tree['right'], test)
#评估
def access(self):
n = self.testLabel.shape[0]
err = 0;yi=0
cha = 0
for i in range(n):
lab = self.classifyTest(self.tree, self.testSet[i])
if(fabs(lab-self.testLabel[i]) > self.treshVal): err += 1
cha += (lab-self.testLabel[i])**2
return (n-err)*1.0/n,cha
# 评估forest
def accessForest(self, forest):
n = self.testLabel.shape[0]
m = len(forest)
err = 0; cha = 0
yi = 0
for i in range(n):
lab = 0
for j in range(m):
lab += self.classifyTest(forest[j],self.testSet[i])
lab /= m
if(fabs(lab-self.testLabel[i]) > self.treshVal): err += 1
cha += (lab - self.testLabel[i]) ** 2
return (n - err) * 1.0 / n,cha
#代价复杂度后剪枝
def costPrunningCalc(self, tree, alpha):
x = tree['num']
if('feat' not in tree):
f[x][0] = tree['sum']*tree['Ent'] + alpha
f[x][1] = None
return
if (tree['feat'] in self.strkey):
f[x][0] = 0
val = tree['val']
for i in val:
son = tree[i]['num']
self.costPrunningCalc(tree[i], alpha)
o = f[son][0]
if (f[son][1] != None): o = min(o, f[son][1])
f[x][0] += o
f[x][1] = tree['Ent']*tree['sum'] + alpha
return
self.costPrunningCalc(tree['left'], alpha)
self.costPrunningCalc(tree['right'], alpha)
le, ri = tree['left']['num'], tree['right']['num']
f[x][0] = f[le][0]
if (f[le][1] != None): f[x][0] = min(f[x][0], f[le][1])
o = f[ri][0]
if (f[ri][1] != None): o = min(o, f[ri][1])
f[x][0] += o
f[x][1] = tree['Ent']*tree['sum'] + alpha
def costPrunningDel(self, tree, bz):
x = tree['num']
tree['Ent'] = f[x][bz]
if ('feat' not in tree):
return tree
feat = tree['feat']
if (bz == 1):
if (feat in self.strkey):
val = tree['val']
for i in val:
del tree[i]
else:
del tree['left']
del tree['right']
del tree['feat']
tree['val'] = tree['mx']
return tree
if (feat in self.strkey):
val = tree['val']
lenx = len(val)
for bin in range(1<<lenx):
tot = -1
o = 0
for i in val:
tot += 1
l = 0
if ((bin & (1 << tot)) > 0): l = 1
son = tree[i]['num']
if (f[son][l] == None):
tot = -2
break
o += f[son][l]
if (o != f[x][bz]): tot = -2
if (tot != -2):
for i in val:
tot += 0
l = 0
if ((bin & (1 << tot)) > 0): l = 1
self.costPrunningCalc(tree[i], l)
return tree
le, ri = tree['left']['num'], tree['right']['num']
yi = er =0
for i in range(2):
if (f[le][i] == None): continue
for j in range(2):
if (f[ri][j] == None): continue
if (f[le][i] + f[ri][j] == f[x][bz]):
yi, er = i, j
self.costPrunningDel(tree['left'], yi)
self.costPrunningDel(tree['right'], er)
return tree
def costPrunning(self, alpha = 10):
self.costPrunningCalc(self.tree, alpha)
bz = 0
if (f[1][0] > f[1][1]): bz = 1
tsize = [0]
self.tree = self.costPrunningDel(self.tree, bz)
return self.access()
#错误率降低剪枝
def REPrunning(self):
for i in range(tsize[0],0,-1):
if (son[i] == 1): continue
pre = self.access()
az[i] = 1
now = self.access()
if(now > pre):
continue
else: az[i] = 0
return self.access()
#悲观剪枝
def PEPrunningCalcDU(self, tree, alpha):
if ('feat' not in tree):
return tree
if (tree['feat'] in self.strkey):
val = tree['val']
for i in val:
tree[i] = self.PEPrunningCalcDU(tree[i], alpha)
else:
tree['left'] = self.PEPrunningCalcDU(tree['left'], alpha)
tree['right'] = self.PEPrunningCalcDU(tree['right'], alpha)
pnow = (tree['err'] + alpha) / tree['sum']
ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']
meannow = tree['sum']*pnow
meanpre = tree['sum']*ppre
valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
if(meanpre + valpre > meannow):
if (tree['feat'] in self.strkey):
val = tree['val']
for i in val:
del tree[i]
else:
del tree['left']
del tree['right']
del tree['feat']
tree['val'] = tree['mx']
return tree
    # Pessimistic error pruning, top-down pass
    def PEPrunningCalcUD(self, tree, alpha):
        if ('feat' not in tree):
            return tree
        pnow = (tree['err'] + alpha) / tree['sum']
        ppre = (tree['errs'] + alpha * tree['son']) / tree['sum']
        meannow = tree['sum'] * pnow
        meanpre = tree['sum'] * ppre
        valnow = sqrt(pnow * (1 - pnow) * tree['sum'])
        valpre = sqrt(ppre * (1 - ppre) * tree['sum'])
        if (meanpre + valpre > meannow):
            if (tree['feat'] in self.strkey):
                val = tree['val']
                for i in val:
                    del tree[i]
            else:
                del tree['left']
                del tree['right']
            del tree['feat']
            tree['val'] = tree['mx']
            return tree
        # recurse with the top-down pass itself, not the bottom-up one
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                tree[i] = self.PEPrunningCalcUD(tree[i], alpha)
        else:
            tree['left'] = self.PEPrunningCalcUD(tree['left'], alpha)
            tree['right'] = self.PEPrunningCalcUD(tree['right'], alpha)
        return tree
    def PEPrunning(self, alpha = 0.1):
        self.tree = self.PEPrunningCalcUD(self.tree, alpha)
        return self.access()
    # Decision tree visualization (graphviz)
    def treePaint(self, tree, dotx):
        keyx = self.key
        x = tree['num']
        if ('feat' not in tree):
            dotx.node(name=str(x), label=str(tree['val']), color='green')
            return
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                self.treePaint(tree[i], dotx)
        else:
            self.treePaint(tree['left'], dotx)
            self.treePaint(tree['right'], dotx)
        dotx.node(name=str(x), label='split to ' + str(keyx[tree['feat']]), color='red', shape='box')
        if (tree['feat'] in self.strkey):
            val = tree['val']
            for i in val:
                son = tree[i]['num']
                dotx.edge(str(x), str(son), label='is ' + str(self.replaceBack[i]), color='black')
        else:
            le, ri = tree['left']['num'], tree['right']['num']
            dotx.edge(str(x), str(le), label='<' + str(tree['val'].round(3)), color='black')
            dotx.edge(str(x), str(ri), label='>=' + str(tree['val'].round(3)), color='black')
    def treeVisualize(self, nameadd):
        dotx = Digraph(name='Decision' + '(' + nameadd + ')', format="png")
        self.treePaint(self.tree, dotx)
        dotx.node(name='a', label='acc: ' + str(self.access()) + '\n', color='blue', shape='box')
        dotx.view()

# ID3 decision tree
class ID3(DecisionTree):
    algor = 'ID3'
    Thresh = (0.0001, 8)
    treshVal = 0.5
# C4.5 decision tree
class C45(DecisionTree):
    algor = 'C45'
    Thresh = (0.01, 10)
    treshVal = 0.5
# CART decision tree; pass 'C' for classification, 'R' for regression
class CART(DecisionTree):
    algor = ''
    def __init__(self, attr):
        if (attr == 'C'): self.algor = 'CARTC'
        else: self.algor = 'CARTR'
    Thresh = (0.001, 8)
    treshVal = 2
# Random forest: bootstrap-sample the data and randomly subsample features per tree
def randomForest(num, filename):
    forest = []
    treex = CART('C')
    treex.dataO = treex.loadDataSet(filename)
    treex.splitStrVal()
    treex.dataO = treex.replaceStr(treex.dataO)
    sum = 0
    for i in range(num):
        tsize[0] = 0
        son[0] = 0
        treex.BSsplit()
        pp = log(treex.dataSet.shape[1], 2) / treex.dataSet.shape[1]
        #pp = 1
        treex.keyRandom(p=pp)
        treex.tree = treex.buildTree(treex.dataSet, treex.labelSet)
        for j in range(tsize[0] + 1):
            f[j][0] = f[j][1] = az[j] = 0
        treex.PEPrunning()
        forest.append(treex.tree)
    treex.dataSplit()
    return treex.accessForest(forest)
if __name__ == '__main__':
    tsize = [0]
    son = [0]
    DCTree = CART('R')
    DCTree.dataO = DCTree.loadDataSet('abalone.txt')
    f = [[0, 0] for i in range(DCTree.dataO.shape[0] + 1)]   # costPrunning
    az = [0 for i in range(DCTree.dataO.shape[0] + 1)]       # REPrunning
    keybz = [0 for i in range(DCTree.dataO.shape[0] + 1)]    # randomForest
    #DCTree.PreProcess()
    #DCTree.dataO.to_csv('employeeFiredN.csv', index=False)
    #DCTree.splitStrVal()
    #DCTree.dataO = DCTree.replaceStr(DCTree.dataO)
    DCTree.dataSplit()
    for i in range(len(DCTree.strkey)): DCTree.strkey[i] -= 1
    for i in range(len(DCTree.valkey)): DCTree.valkey[i] -= 1
    DCTree.tree = DCTree.buildTree(DCTree.dataSet, DCTree.labelSet)
    print(DCTree.access())
    print(DCTree.PEPrunning())
    print(randomForest(20, 'abalone.txt'))

Boosting Trees

Used for regression

  • The idea behind boosting trees is simple:
    • 1. Initialize a prediction ("label") array
    • 2. In each round, the values the tree trains on are the residuals between the label array and the true values
    • 3. Fit a regression tree (only one level deep) to these residuals and update the label array
    • 4. That's all, except you must make sure the optimal split point can differ between rounds, otherwise every tree splits at the same place
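The four steps above can be sketched in a few lines of numpy (a minimal hypothetical example, not this notebook's code: `fit_stump`, `boost`, and the toy data are all made up for illustration):

```python
import numpy as np

def fit_stump(x, y):
    """One-level regression tree: pick the split on x that minimizes the
    total squared error, predicting the mean of y on each side."""
    best = None
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no valid threshold between duplicate values
        thr = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, thr, left.mean(), right.mean())
    _, thr, lmean, rmean = best
    return lambda q: np.where(q < thr, lmean, rmean)

def boost(x, y, rounds):
    pred = np.zeros_like(y)             # step 1: initialize the label array
    for _ in range(rounds):
        residual = y - pred             # step 2: targets are the residuals
        stump = fit_stump(x, residual)  # step 3: learn the residuals...
        pred = pred + stump(x)          # ...and update the label array
    return pred

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([5.6, 5.7, 5.9, 6.8, 7.0, 9.0])
sse = lambda r: ((y - boost(x, y, r)) ** 2).sum()
```

Each round the stump fits whatever structure remains in the residuals, so the best split point naturally moves as the residuals shrink.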

GBDT

  • The boosting tree above trains on residuals, and the residual is exactly the negative gradient of the squared loss ${1\over 2}(y-f(x))^2$
  • More generally, we can take any differentiable function as the loss, and in each round train the tree on the gradient of that loss over the data
  • Adding a learning rate on top turns this into gradient descent
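Continuing the sketch above under the same assumptions (`fit_stump` is again a made-up one-level stump learner), GBDT replaces the residual with the negative gradient of an arbitrary differentiable loss and scales each step by a learning rate:

```python
import numpy as np

def fit_stump(x, g):
    """Base learner: one-level regression tree fit to the pseudo-residuals g."""
    best = None
    xs = np.sort(x)
    for thr in (xs[1:] + xs[:-1]) / 2:
        m = x < thr
        if not m.any() or m.all():
            continue
        lm, rm = g[m].mean(), g[~m].mean()
        err = ((g[m] - lm) ** 2).sum() + ((g[~m] - rm) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda q: np.where(q < thr, lm, rm)

def gbdt(x, y, neg_grad, rounds=200, eta=0.1):
    pred = np.zeros_like(y)
    for _ in range(rounds):
        g = neg_grad(y, pred)     # pseudo-residual = negative gradient of the loss
        h = fit_stump(x, g)       # fit the base learner to the gradient
        pred = pred + eta * h(x)  # learning-rate-scaled descent step
    return pred

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 2.5, 4., 4.5, 6.])
# for the squared loss 1/2*(y - f)^2, the negative gradient w.r.t. f is y - f
pred = gbdt(x, y, lambda y, f: y - f)
```

With the squared loss the negative gradient is exactly the residual, so the plain boosting tree is the special case eta = 1.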