EasyMiner: easy association rule mining, classification and anomaly detection

Benchmarks

Benchmark with symbolic learners

The benchmark involves 36 UCI datasets; the first 13 datasets contain only categorical attributes. In the table below, the left group of columns reports accuracy and the right group reports model size. Bold font indicates that CBA obtained the maximum accuracy on the dataset; italics indicate that CBA_a (with automatic setting of metaparameter values) obtained better results than CBA_d (default parameter values). The reported results are reproducible with EasyMiner-Benchmark.

                  accuracy                           model size
dataset           RIP  J48  PART PDT  CBA_d CBA_a    RIP  J48  PART CBA_d CBA_a
audiology .75 .81 .79 .13 .75 .63 21 28 19 51 18
balance-scale .81 .77 .79 .76 .86 .81 17 28 41 73 46
breast-cancer .72 .73 .70 .61 .68 .73 3 6 6 79 8
car .91 .97 .96 .97 .90 .75 99 59 44 11 8
house-votes-84 .95 .96 .94 .95 .94 .94 3 4 6 41 25
kr-vs-kp .99 .99 .99 .56 .98 .97 16 25 28 40 28
mushroom 1.00 1.00 1.00 .58   1.00 9 21 4   35
primary-tumor .36 .41 .40 .18 .43 .39 11 22 26 61 28
soybean .91 .93 .92 .53 .87 .79 27 43 37 44 44
splice .94 .95 .94 .50 .88 .90 13 32 14 116 58
tic-tac-toe .98 .94 .95 .94 1.00 .87 10 31 34 9 7
vote .95 .96 .94 .95 .94 .94 3 4 6 41 14
zoo .93 .93 .95 .93 .94 .89 8 7 9 8 7
anneal .94 .94 .95 .89 .97 .92 14 40 37 34 28
australian .85 .86 .86 .85 .81 .86 4 9 6 88 17
autos .79 .79 .78 .43 .79 .69 15 32 22 50 27
breast-w .96 .94 .96 .94 .95 .95 6 10 10 49 23
colic .84 .86 .86 .64 .81 .84 3 5 6 93 8
credit-a .85 .86 .86 .55 .86 .86 5 7 8 136 24
credit-g .72 .72 .73 .68 .74 .73 7 27 24 191 29
diabetes .75 .74 .74 .76 .75 .72 4 8 11 51 19
glass .67 .65 .69 .67 .71 .67 8 15 16 30 22
heart-statlog .77 .76 .83 .77 .81 .80 5 11 9 52 10
hepatitis .79 .81 .78 .71 .76 .79 4 4 6 33 10
hypothyroid .99 1.00 .99 .98 .98 .95 5 12 8 32 18
ionosphere .91 .87 .88 .88 .91 .92 6 7 5 48 27
iris .92 .94 .93 .93 .93 .93 4 4 5 7 5
labor .88 .71 .84 .62 .89 .82 3 4 5 13 5
letter .88 .88   .80 .27 .52 663 1181   91 669
lymph .77 .74 .78 .47 .79 .77 8 8 11 38 21
segment .95 .97 .96 .96 .93 .90 22 39 27 142 135
sonar .74 .68 .73 .71 .75 .71 6 7 7 40 15
spambase .93 .93 .94 .92 .93 .92 17 85 40 449 180
vehicle .67 .72 .73 .71 .69 .66 21 44 35 88 59
vowel .77 .83 .79 .80 .66 .62 54 116 81 126 111
waveform-5000 .79 .77 .79 .76 .80 .78 29 102 65 617 196
 
average .84 .84 .85 .41 .82 .80 32 58 20 88 55

Benchmark involving association rule classifiers on CLEF#26875

Model benchmark on the CLEF#26875 dataset (single 90/10 train/test split). Model size refers to the number of rules for rule-based models and the number of leaves for decision trees. DecisionTree and CHAID were run in RapidMiner 5 Community Edition; FOIL, CPAR, CBA and CMAR are the LUCS-KDD implementations.
algorithm       accuracy [%]   model size
DecisionTree 23.0 13496
ID3 22.8 13579
CHAID 25.4 13224
FOIL 24.7 18047
CPAR 4.6 18907
CBA 21.2 3681
CMAR 16.9 22516

The CBA algorithm used in EasyMiner does not support numerical attributes; these were converted to categorical attributes using the Minimum Description Length Principle (MDLP) discretization algorithm (Fayyad, 1992), in the implementation by Victor Ruiz. Since PDT does not support nominal attributes, these were converted to dummy variables.

The hyperparameters for J48, PART and RIPPER were optimized using the Weka MultiSearch package, and the hyperparameters for the Python Decision Tree (PDT) using the Scikit-learn GridSearchCV class.

The rCBA package used in EasyMiner can run either with default parameters or with parameters tuned automatically by simulated annealing.
The parameters of our CBA automatic tuning approach (CBA_a) were set as follows (a schematic sketch of the tuning loop is given after the parameter lists below):

  • INIT_TEMP=100
  • ALPHA=0.05
  • TIME_LIMIT=10
  • MAX_LENGTH=5
  • maximum number of rules: 50,000

We also included CBA with default parameters (CBA_d), with settings following the recommendations of Liu et al. (1998):

  • maximum number of rules set to 80,000
  • minimum support set to 1%
  • minimum confidence to 50%

 

Time benchmark

Adapted from: Stanislav Vojíř, Václav Zeman, Jaroslav Kuchař, Tomáš Kliegr: EasyMiner/R Preview: Towards a Web Interface for Association Rule Learning and Classification in R. RuleML 2015

Time requirements of rule mining (minimum confidence = 0.5)
          rule count                            backend only          EasyMiner/R
support   w/o miss.  w miss.  w miss. pruned    LISp-Miner   arules   mining   with prun.
0.010 79 163 54 3 s 6.2 s 5.4 s 33.8 s
0.009 95 186 68 6 s 6.4 s 5.4 s 27.8 s
0.008 112 213 73 16 s 6.2 s 5.4 s 31.7 s
0.007 144 295 90 27 s 6.3 s 5.5 s 31.7 s
0.006 187 397 107 1 m 10 s 6.3 s 5.5 s 35.6 s
0.005 256 552 141 4 m 38 s 6.3 s 5.7 s 35.5 s
0.004 396 765 184 28 m 04 s 6.5 s 6.0 s 37.8 s
0.003 602 1147 253 > 5 h 6.5 s 8.6 s 43.3 s
0.002 1391 2699 430 > 6 h 6.5 s 14.0 s 1 m 04.1 s
0.001 3394 6034 697 > 6 h 6.7 s 15.1 s 1 m 59.0 s

Impact of pruning steps in CBA

Adapted from: Tomáš Kliegr, Jaroslav Kuchař: Benchmark of rule-based classifiers in the news recommendation task. CLEF 2015 Proceedings, p. 130–141.

Impact of pruning steps in CBA on the CLEF#26875 dataset. Minimum support was set to 0.1% and minimum confidence to 2%. A schematic sketch of the data coverage pruning step follows the table.
pruning configuration                        accuracy [%]   rules
no pruning, direct use of association rules 6.4 1735
data coverage pruning 6.9 497
data coverage, default rule pruning 7.0 175

Effect of support threshold

Effect of the support threshold on CBA (ten-fold shuffled cross-validation). The evaluated CBA implementation is LUCS-KDD. A sketch of the evaluation protocol follows the table.
               minimum support threshold
metric         0.10%  0.09%  0.08%  0.07%  0.06%  0.05%  0.04%  0.03%  0.02%  0.01%
accuracy 6.68 6.88 7.07 7.64 8.1 8.65 9.48 10.4 13.47 17.55
rule count 148 178 193 228 270 317 452 576 1100 2303