Content uploaded by Fei Tony Liu

Author content

All content in this area was uploaded by Fei Tony Liu on Oct 10, 2018

Content may be subject to copyright.

Isolation Forest

Fei Tony Liu, Kai Ming Ting

Gippsland School of Information Technology

Monash University, Victoria, Australia

{tony.liu},{kaiming.ting}@infotech.monash.edu.au

Zhi-Hua Zhou

National Key Laboratory

for Novel Software Technology

Nanjing University, Nanjing 210093, China

zhouzh@lamda.nju.edu.cn

Abstract

Most existing model-based approaches to anomaly de-

tection construct a proﬁle of normal instances, then iden-

tify instances that do not conform to the normal proﬁle as

anomalies. This paper proposes a fundamentally different

model-based method that explicitly isolates anomalies in-

stead of proﬁles normal points. To our best knowledge, the

concept of isolation has not been explored in current liter-

ature. The use of isolation enables the proposed method,

iForest, to exploit sub-sampling to an extent that is not fea-

sible in existing methods, creating an algorithm which has a

linear time complexity with a low constant and a low mem-

ory requirement. Our empirical evaluation shows that iFor-

est performs favourably to ORCA, a near-linear time com-

plexity distance-based method, LOF and Random Forests in

terms of AUC and processing time, and especially in large

data sets. iForest also works well in high dimensional prob-

lems which have a large number of irrelevant attributes,

and in situations where training set does not contain any

anomalies.

1 Introduction

Anomalies are data patterns that have different data char-

acteristics from normal instances. The detection of anoma-

lies has signiﬁcant relevance and often provides critical ac-

tionable information in various application domains. For

example, anomalies in credit card transactions could signify

fraudulent use of credit cards. An anomalous spot in an as-

tronomy image could indicate the discovery of a new star.

An unusual computer network trafﬁc pattern could stand

for an unauthorised access. These applications demand

anomaly detection algorithms with high detection perfor-

mance and fast execution.

Most existing model-based approaches to anomaly de-

tection construct a proﬁle of normal instances, then iden-

tify instances that do not conform to the normal proﬁle as

anomalies. Notable examples such as statistical methods

[11], classiﬁcation-based methods [1], and clustering-based

methods [5] all use this general approach. Two major draw-

backs of this approach are: (i) the anomaly detector is opti-

mized to proﬁle normal instances, but not optimized to de-

tect anomalies—as a consequence, the results of anomaly

detection might not be as good as expected, causing too

many false alarms (having normal instances identiﬁed as

anomalies) or too few anomalies being detected; (ii) many

existing methods are constrained to low dimensional data

and small data size because of their high computational

complexity.

This paper proposes a different type of model-based

method that explicitly isolates anomalies rather than proﬁles

normal instances. To achieve this, our proposed method

takes advantage of two anomalies’ quantitative properties:

i) they are the minority consisting of fewer instances and

ii) they have attribute-values that are very different from

those of normal instances. In other words, anomalies are

‘few and different’, which make them more susceptible to

isolation than normal points. We show in this paper that a

tree structure can be constructed effectively to isolate every

single instance. Because of their susceptibility to isolation,

anomalies are isolated closer to the root of the tree; whereas

normal points are isolated at the deeper end of the tree. This

isolation characteristic of tree forms the basis of our method

to detect anomalies, and we call this tree Isolation Tree or

iTree.

The proposed method, called Isolation Forest or iFor-

est, builds an ensemble of iTrees for a given data set, then

anomalies are those instances which have short average path

lengths on the iTrees. There are only two variables in this

method: the number of trees to build and the sub-sampling

size. We show that iForest’s detection performance con-

verges quickly with a very small number of trees, and it

only requires a small sub-sampling size to achieve high de-

tection performance with high efﬁciency.

Apart from the key difference of isolation versus pro-

ﬁling, iForest is distinguished from existing model-based

[11, 1, 5], distance-based [6] and density-based methods [4]

in the follow ways:

•The isolation characteristic of iTrees enables them to

build partial models and exploit sub-sampling to an

extent that is not feasible in existing methods. Since

a large part of an iTree that isolates normal points is

not needed for anomaly detection; it does not need to

be constructed. A small sample size produces better

iTrees because the swamping and masking effects are

reduced.

•iForest utilizes no distance or density measures to de-

tect anomalies. This eliminates major computational

cost of distance calculation in all distance-based meth-

ods and density-based methods.

•iForest has a linear time complexity with a low

constant and a low memory requirement. To our

best knowledge, the best-performing existing method

achieves only approximate linear time complexity with

high memory usage [13].

•iForest has the capacity to scale up to handle extremely

large data size and high-dimensional problems with a

large number of irrelevant attributes.

This paper is organised as follows: In Section 2, we

demonstrate isolation at work using an iTree that recursively

partitions data. A new anomaly score based on iTrees is also

proposed. In Section 3, we describe the characteristic of this

method that helps to tackle the problems of swamping and

masking. In Section 4, we provide the algorithms to con-

struct iTrees and iForest. Section 5 empirically compares

this method with three state-of-the-art anomaly detectors;

we also analyse the efﬁciency of the proposed method, and

report the experimental results in terms of AUC and pro-

cessing time. Section 6 provides a discussion on efﬁciency,

and Section 7 concludes this paper.

2 Isolation and Isolation Trees

In this paper, the term isolation means ‘separating an in-

stance from the rest of the instances’. Since anomalies are

‘few and different’ and therefore they are more susceptible

to isolation. In a data-induced random tree, partitioning of

instances are repeated recursively until all instances are iso-

lated. This random partitioning produces noticeable shorter

paths for anomalies since (a) the fewer instances of anoma-

lies result in a smaller number of partitions – shorter paths

in a tree structure, and (b) instances with distinguishable

attribute-values are more likely to be separated in early par-

titioning. Hence, when a forest of random trees collectively

produce shorter path lengths for some particular points, then

they are highly likely to be anomalies.

(a) Isolating xi(b) Isolating xo

(c) Average path lengths converge

Figure 1. Anomalies are more susceptible to

isolation and hence have short path lengths.

Given a Gaussian distribution (135 points),

(a) a normal point xirequires twelve random

partitions to be isolated; (b) an anomaly xore-

quires only four partitions to be isolated. (c)

averaged path lengths of xiand xoconverge

when the number of trees increases.

To demonstrate the idea that anomalies are more suscep-

tible to isolation under random partitioning, we illustrate

an example in Figures 1(a) and 1(b) to visualise the ran-

dom partitioning of a normal point versus an anomaly. We

observe that a normal point, xi, generally requires more

partitions to be isolated. The opposite is also true for the

anomaly point, xo, which generally requires less partitions

to be isolated. In this example, partitions are generated by

randomly selecting an attribute and then randomly selecting

a split value between the maximum and minimum values of

the selected attribute. Since recursive partitioning can be

represented by a tree structure, the number of partitions re-

quired to isolate a point is equivalent to the path length from

the root node to a terminating node. In this example, the

path length of xiis greater than the path length of xo.

Since each partition is randomly generated, individual

trees are generated with different sets of partitions. We av-

erage path lengths over a number of trees to ﬁnd the ex-

pected path length. Figure 1(c) shows that the average path

lengths of xoand xiconverge when the number of trees in-

creases. Using 1000 trees, the average path lengths of xo

and xiconverge to 4.02 and 12.82 respectively. It shows

that anomalies are having path lengths shorter than normal

instances.

Deﬁnition :Isolation Tree. Let Tbe a node of an isola-

tion tree. Tis either an external-node with no child, or an

internal-node with one test and exactly two daughter nodes

(Tl,Tr). A test consists of an attribute qand a split value p

such that the test q < p divides data points into Tland Tr.

Given a sample of data X={x1, ..., xn}of nin-

stances from a d-variate distribution, to build an isolation

tree (iTree), we recursively divide Xby randomly select-

ing an attribute qand a split value p, until either: (i) the

tree reaches a height limit, (ii) |X|= 1 or (iii) all data in

Xhave the same values. An iTree is a proper binary tree,

where each node in the tree has exactly zero or two daughter

nodes. Assuming all instances are distinct, each instance is

isolated to an external node when an iTree is fully grown, in

which case the number of external nodes is nand the num-

ber of internal nodes is n−1; the total number of nodes

of an iTrees is 2n−1; and thus the memory requirement is

bounded and only grows linearly with n.

The task of anomaly detection is to provide a ranking

that reﬂects the degree of anomaly. Thus, one way to de-

tect anomalies is to sort data points according to their path

lengths or anomaly scores; and anomalies are points that

are ranked at the top of the list. We deﬁne path length and

anomaly score as follows.

Deﬁnition :Path Length h(x)of a point xis measured by

the number of edges xtraverses an iTree from the root node

until the traversal is terminated at an external node.

An anomaly score is required for any anomaly detection

method. The difﬁculty in deriving such a score from h(x)

is that while the maximum possible height of iTree grows

in the order of n, the average height grows in the order of

log n[7]. Normalization of h(x)by any of the above terms

is either not bounded or cannot be directly compared.

Since iTrees have an equivalent structure to Binary

Search Tree or BST (see Table 1), the estimation of aver-

age h(x)for external node terminations is the same as the

iTree BST

Proper binary trees Proper binary trees

External node termination Unsuccessful search

Not applicable Successful search

Table 1. List of equivalent structure and oper-

ations in iTree and Binary Search Tree (BST)

unsuccessful search in BST. We borrow the analysis from

BST to estimate the average path length of iTree. Given a

data set of ninstances, Section 10.3.3 of [9] gives the aver-

age path length of unsuccessful search in BST as:

c(n) = 2H(n−1) −(2(n−1)/n),(1)

where H(i)is the harmonic number and it can be estimated

by ln(i)+0.5772156649 (Euler’s constant). As c(n)is the

average of h(x)given n, we use it to normalise h(x). The

anomaly score sof an instance xis deﬁned as:

s(x, n) = 2−E(h(x))

c(n),(2)

where E(h(x)) is the average of h(x)from a collection of

isolation trees. In Equation (2):

•when E(h(x)) →c(n),s→0.5;

•when E(h(x)) →0,s→1;

•and when E(h(x)) →n−1,s→0.

sis monotonic to h(x). Figure 2 illustrates the relationship

between E(h(x)) and s, and the following conditions ap-

plied where 0< s ≤1for 0< h(x)≤n−1. Using the

anomaly score s, we are able to make the following assess-

ment:

•(a) if instances return svery close to 1, then they are

deﬁnitely anomalies,

•(b) if instances have smuch smaller than 0.5, then they

are quite safe to be regarded as normal instances, and

•(c) if all the instances return s≈0.5, then the entire

sample does not really have any distinct anomaly.

A contour of anomaly score can be produced by passing

a lattice sample through a collection of isolation trees, fa-

cilitating a detailed analysis of the detection result. Figure

3 shows an example of such a contour, allowing a user to

visualise and identify anomalies in the instance space. Us-

ing the contour, we can clearly identify three points, where

s≥0.6, which are potential anomalies.

Figure 2. The relationship of expected path

length E(h(x)) and anomaly score s.c(n)is

the average path length as deﬁned in equa-

tion 1. If the expected path length E(h(x))

is equal to the average path length c(n), then

s= 0.5, regardless of the value of n.

3 Characteristic of Isolation Trees

This section describes the characteristic of iTrees and

their unique way of handling the effects of swamping and

masking. As a tree ensemble that employs isolation trees,

iForest a) identiﬁes anomalies as points having shorter path

lengths, and b) has multiple trees acting as ‘experts’ to tar-

get different anomalies. Since iForest does not need to iso-

late all of normal instances – the majority of the training

sample, iForest is able to work well with a partial model

without isolating all normal points and builds models using

a small sample size.

Contrary to existing methods where large sampling size

is more desirable, isolation method works best when the

sampling size is kept small. Large sampling size reduces

iForest’s ability to isolate anomalies as normal instances can

interfere with the isolation process and therefore reduces

its ability to clearly isolate anomalies. Thus, sub-sampling

provides a favourable environment for iForest to work well.

Throughout this paper, sub-sampling is conducted by ran-

dom selection of instances without replacement.

Problems of swamping and masking have been studied

extensively in anomaly detection [8]. Swamping refers to

Figure 3. Anomaly score contour of iFor-

est for a Gaussian distribution of sixty-four

points. Contour lines for s= 0.5,0.6,0.7are

illustrated. Potential anomalies can be iden-

tiﬁed as points where s≥0.6.

wrongly identifying normal instances as anomalies. When

normal instances are too close to anomalies, the number of

partitions required to separate anomalies increases – which

makes it harder to distinguish anomalies from normal in-

stances. Masking is the existence of too many anomalies

concealing their own presence. When an anomaly cluster is

large and dense, it also increases the number of partitions

to isolate each anomaly. Under these circumstances, eval-

uations using these trees have longer path lengths making

anomalies more difﬁcult to detect. Note that both swamp-

ing and masking are a result of too many data for the pur-

pose of anomaly detection. The unique characteristic of

isolation trees allows iForest to build a partial model by

sub-sampling which incidentally alleviates the effects of

swamping and masking. It is because: 1) sub-sampling con-

trols data size, which helps iForest better isolate examples

of anomalies and 2) each isolation tree can be specialised,

as each sub-sample includes different set of anomalies or

even no anomaly.

To illustrate this, Figure 4(a) shows a data set gener-

ated by Mulcross. The data set has two anomaly clusters

located close to one large cluster of normal points at the

centre. There are interfering normal points surrounding

the anomaly clusters, and the anomaly clusters are denser

(a) Original sample

(4096 instances)

(b) Sub-sample

(128 instances)

Figure 4. Using generated data to demon-

strate the effects of swamping and masking,

(a) shows the original data generated by Mul-

cross. (b) shows a sub-sample of the original

data. Circles (◦) denote normal instances and

triangles (4) denote anomalies.

than normal points in this sample of 4096 instances. Fig-

ure 4(b) shows a sub-sample of 128 instances of the origi-

nal data. The anomalies clusters are clearly identiﬁable in

the sub-sample. Those normal instances surrounding the

two anomaly clusters have been cleared out, and the size of

anomaly clusters becomes smaller which makes them easier

to identify. When using the entire sample, iForest reports an

AUC of 0.67. When using a sub-sampling size of 128, iFor-

est achieves an AUC of 0.91. The result shows iForest’s

superior anomaly detection ability in handling the effects

swamping and masking through a sigiﬁcantly reduced sub-

sample.

4 Anomaly Detection using iForest

Anomaly detection using iForest is a two-stage process.

The ﬁrst (training) stage builds isolation trees using sub-

samples of the training set. The second (testing) stage

passes the test instances through isolation trees to obtain

an anomaly score for each instance.

4.1 Training Stage

In the training stage, iTrees are constructed by recur-

sively partitioning the given training set until instances are

isolated or a speciﬁc tree height is reached of which results

a partial model. Note that the tree height limit lis automat-

ically set by the sub-sampling size ψ:l=ceiling(log2ψ),

which is approximately the average tree height [7]. The ra-

tionale of growing trees up to the average tree height is that

we are only interested in data points that have shorter-than-

average path lengths, as those points are more likely to be

anomalies. Details of the training stage can be found in Al-

gorithms 1 and 2.

Algorithm 1 :iF orest(X, t, ψ)

Inputs:X- input data, t- number of trees, ψ- sub-

sampling size

Output: a set of tiTrees

1: Initialize F orest

2: set height limit l=ceiling(log2ψ)

3: for i= 1 to tdo

4: X0←sample(X, ψ)

5: F orest ←F orest ∪iT ree(X0,0, l)

6: end for

7: return F orest

There are two input parameters to the iForest algorithm.

They are the sub-sampling size ψand the number of trees t.

We provide a guide below to select a suitable value for each

of the two parameters.

Sub-sampling size ψcontrols the training data size. We

ﬁnd that when ψincreases to a desired value, iForest de-

tects reliably and there is no need to increase ψfurther be-

cause it increases processing time and memory size without

any gain in detection performance. Empirically, we ﬁnd

that setting ψto 28or 256 generally provides enough de-

tails to perform anomaly detection across a wide range of

Algorithm 2 :iT ree(X, e, l)

Inputs:X- input data, e- current tree height, l- height

limit

Output: an iTree

1: if e≥lor |X| ≤ 1then

2: return exNode{Size ← |X|}

3: else

4: let Qbe a list of attributes in X

5: randomly select an attribute q∈Q

6: randomly select a split point pfrom max and min

values of attribute qin X

7: Xl←filter(X, q < p)

8: Xr←filter(X, q ≥p)

9: return inNode{Left ←iT ree(Xl, e + 1, l),

10: Right ←iT ree(Xr, e + 1, l),

11: SplitAtt ←q,

12: SplitV alue ←p}

13: end if

data. Unless otherwise speciﬁed, we use ψ= 256 as the

default value for our experiment. An analysis on the effect

sub-sampling size can be found in section 5.2 which shows

that the detection performance is near optimal at this default

setting and insensitive to a wide range of ψ.

Number of tree tcontrols the ensemble size. We ﬁnd

that path lengths usually converge well before t= 100. Un-

less otherwise speciﬁed, we shall use t= 100 as the default

value in our experiment.

At the end of the training process, a collection of trees

is returned and is ready for the evaluation stage. The com-

plexity of the training an iForest is O(tψ log ψ).

4.2 Evaluating Stage

In the evaluating stage, an anomaly score sis derived

from the expected path length E(h(x)) for each test in-

stance. E(h(x)) are derived by passing instances through

each iTree in an iForest. Using P athLeng th function, a

single path length h(x)is derived by counting the number

of edges efrom the root node to a terminating node as in-

stance xtraverses through an iTree. When xis terminated

at an external node, where Size > 1, the return value is e

plus an adjustment c(Size). The adjustment accounts for

an unbuilt subtree beyond the tree height limit. When h(x)

is obtained for each tree of the ensemble, an anomaly score

is produced by computing s(x, ψ)in Equation 2. The com-

plexity of the evaluation process is O(nt log ψ), where nis

the testing data size. Details of the P athLeng th function

can be found in Algorithm 3. To ﬁnd the top manomalies,

simply sorts the data using sin descending order. The ﬁrst

minstances are the top manomalies.

Algorithm 3 :P athLength(x, T , e)

Inputs :x- an instance, T- an iTree, e- current path length;

to be initialized to zero when ﬁrst called

Output: path length of x

1: if Tis an external node then

2: return e+c(T.size){c(.)is deﬁned in Equation 1}

3: end if

4: a←T.splitAtt

5: if xa< T .splitV alue then

6: return P athLength(x, T .left, e + 1)

7: else {xa≥T.splitV alue}

8: return P athLength(x, T .right, e + 1)

9: end if

5 Empirical Evaluation

This section presents the detailed results for four sets of

experiment designed to evaluate iForest. In the ﬁrst exper-

iment we compare iForest with ORCA [3], LOF [6] and

Random Forests (RF) [12]. LOF is a well known density

based method, and RF is selected because this algorithm

also uses tree ensembles. In the second experiment, we ex-

amine the impact of different sub-sampling sizes using the

two largest data sets in our experiments. The results pro-

vide an insight as to what sub-sampling size should be used

and its effects on detection performance. The third exper-

iment extends iForest to handle high-dimensional data; we

reduce attribute space before tree construction by applying

a simple uni-variate test for each sub-sample. We aim to

ﬁnd out whether this simple mechanism is able to improve

iForest’s detection performance in high dimensional spaces.

In many cases, anomaly data are hard to obtain, the fourth

experiment examines iForest’s performance when only nor-

mal instances are available for training. For all the experi-

ments, actual CPU time and Area Under Curve (AUC) are

reported. They are conducted as single threaded jobs pro-

cessed at 2.3GHz in a Linux cluster (www.vpac.org).

The benchmarking method is ORCA - a k-Nearest

Neighbour (k-nn) based method and one of the state-of-the-

art anomaly detection methods, where the largest demand

of processing time comes from the distance calculation of

knearest neighbours. Using sample randomisation together

with a simple pruning rule, ORCA is claimed to be able to

cut down the complexity of O(n2)to near linear time [3].

In ORCA, the parameter kdetermines the number of

nearest neighbourhood, increasing kalso increases the run

time. We use ORCA’s default setting of k= 5 in our ex-

periment unless otherwise speciﬁed. The parameter Nde-

termines how many anomalies are reported. If Nis small,

ORCA increases the running cutoff rapidly and pruning off

more searches, resulting in a much faster run time. How-

ever, it would be unreasonable to set Nbelow the number

of anomalies due to AUC’s requirement to report anomaly

scores for every instances. Since choosing Nhas an ef-

fect on run time and the number of anomalies is not sup-

posed to be known in the training stage, we will use a rea-

sonable value N=n

8unless otherwise speciﬁed. Using

ORCA’s original default setting (k= 5 and N= 30), all

data sets larger than one thousand points report AUC close

to 0.5, which is equivalent to randomly selecting points as

anomalies. In reporting processing time, we report the total

training and testing time, but omit the pre-processing time

“dprep” from ORCA.

As for LOF, we use a commonly used setting of k= 10

in our experiment. As for RF, we use t= 100 and other pa-

rameters in their default values. Because RF is a supervised

learner, we follow the exact instruction as in [12] to generate

synthetic data as the alternative class. The alternative class

is generated by uniformly sampling random points valued

between the maximums and minimums of all attributes.

Proximity measure is calculated after decision trees are be-

ing constructed and anomalies are instances whose proxim-

ities to all other instances in the data are generally small.

We use eleven natural data sets plus a synthetic data

set for evaluation. They are selected because they contain

known anomaly classes as ground truth and these data sets

are used in the literature to evaluate anomaly detectors in a

similar setting. They include: the two biggest data subsets

(Http and Smtp) of KDD CUP 99 network intrusion data

as used in [14], Annthyroid,Arrhythmia, Wisconsin Breast

Cancer (Breastw), Forest Cover Type (ForestCover), Iono-

sphere,Pima,Satellite,Shuttle [2], Mammography1and

Mulcross [10]. Since we are only interested in continuous-

valued attributes in this paper, all nominal and binary at-

tributes are removed. The synthetic data generator, Mul-

cross generates a multi-variate normal distribution with a

selectable number of anomaly clusters. In our experiments,

the basic setting for Mulcross is as following: contamina-

tion ratio = 10% (number of anomalies over the total num-

ber of points), distance factor = 2(distance between the cen-

ter of normal cluster and anomaly clusters), and number of

anomaly clusters = 2. An example of Mulcross data can be

found in Figure 4. Table 2 provides the properties of all data

sets and information on anomaly classes sorted by the size

of data in descending order.

It is assumed that anomaly labels are unavailable in

the training stage. Anomaly labels are only available in

the evaluation stage to compute the performance measure,

AUC.

1The Mammography data set was made available by the courtesy of

Aleksandar Lazarevic

n d anomaly class

Http (KDDCUP99) 567497 3 attack (0.4%)

ForestCover 286048 10

class 4 (0.9%)

vs. class 2

Mulcross 262144 4 2 clusters (10%)

Smtp (KDDCUP99) 95156 3 attack (0.03%)

Shuttle 49097 9 classes 2,3,5,6,7 (7%)

Mammography 11183 6 class 1 (2%)

Annthyroid 6832 6 classes 1, 2 (7%)

Satellite 6435 36

3 smallest

classes (32%)

Pima 768 8 pos (35%)

Breastw 683 9 malignant (35%)

Arrhythmia 452 274

classes 03,04,05,07,

08,09,14,15 (15%)

Ionosphere 351 32 bad (36%)

Table 2. Table of Data properties, where nis

the number of instances, and dis the number

of dimensions, and the percentage in bracket

indicates the percentage of anomalies.

5.1 Comparison with ORCA, LOF and

Random Forests

The aim of this experiment is to compare iForest with

ORCA, LOF and RF in terms of AUC and processing time.

Table 3 reports the AUC score and actual run time for all

methods. From the table, we observe that iForest compares

favourably to ORCA. It shows that iForest as a model-based

method outperforms ORCA, a distance based method, in

terms of AUC and processing time. In particular, iForest is

more accurate and faster in all the data sets larger than one

thousand points.

Note that the difference in execution time is huge be-

tween iForest and ORCA, especially in large data sets; this

is due to the fact that iForest is not required to compute pair-

wise distances; this happens despite the fact that ORCA

only reports n

8anomalies where iForest ranks all npoints.

iForest compares favourable to LOF in seven out of eight

data sets examined and iForest is better than RF in all the

four data sets tested in terms of AUC. In terms of processing

time, iForest is superior in all the data sets as compared with

LOF and RF.

The performance of iForest is stable in a wide range of

t. Using the two data sets of highest dimension, Figure 5

shows that AUC converges at a small t. Since increasing

talso increases processing time, the early convergence of

AUC suggests that iForest’s execution time can be further

reduces if tis tuned to a data set.

As for the Http and Mulcross data sets, due to the large

AUC Time (seconds)

iForest ORCA LOF RF iForest ORCA LOF RF

Train Eval. Total

Http (KDDCUP99) 1.00 0.36 NA NA 0.25 15.33 15.58 9487.47 NA NA

ForestCover 0.88 0.83 NA NA 0.76 15.57 16.33 6995.17 NA NA

Mulcross 0.97 0.33 NA NA 0.26 12.26 12.52 2512.20 NA NA

Smtp (KDDCUP99) 0.88 0.80 NA NA 0.14 2.58 2.72 267.45 NA NA

Shuttle 1.00 0.60 0.55 NA 0.30 2.83 3.13 156.66 7489.74 NA

Mammography 0.86 0.77 0.67 NA 0.16 0.50 0.66 4.49 14647.00 NA

Annthyroid 0.82 0.68 0.72 NA 0.15 0.36 0.51 2.32 72.02 NA

Satellite 0.71 0.65 0.52 NA 0.46 1.17 1.63 8.51 217.39 NA

Pima 0.67 0.71 0.49 0.65 0.17 0.11 0.28 0.06 1.14 4.98

Breastw 0.99 0.98 0.37 0.97 0.17 0.11 0.28 0.04 1.77 3.10

Arrhythmia 0.80 0.78 0.73 0.60 2.12 0.86 2.98 0.49 6.35 2.32

Ionosphere 0.85 0.92 0.89 0.85 0.33 0.15 0.48 0.04 0.64 0.83

Table 3. iForest performs favourably to ORCA, especially for large data sets where n > 1000. AUC

reported with boldfaced are the best performance. iForest is signiﬁcantly faster than ORCA for large

data sets where n > 1000. We do not have the full results for LOF and RF because: (1) LOF has a high

computation complexity and is unable to complete some very high volume data sets in reasonable

time; (2) RF has a huge memory requirement, which requires system memory of (2n)2to produce

proximity matrix in unsupervised learning settings.

(a) Arrhythmia (b) Satellite

Figure 5. Detection performance AUC (y-axis)

is converged at a small t(x-axis).

anomaly-cluster size and the fact that anomaly clusters have

an equal or higher density as compared to normal instances

(i.e., masking effect), ORCA reports a poorer-than-average

result on these data sets. We also experiment ORCA on

these data sets using a higher value of k(where k= 150),

however the detection performance is similar. This high-

lights one problematic assumption in ORCA and other sim-

ilar k-nn based methods: they can only detect low-density

anomaly clusters of size smaller than k. Increasing kmay

solve the problem, but it is not practical in high volume set-

ting due to the increase in processing time.

5.2 Eﬃciency Analysis

This experiment investigates iForest’s efﬁciency in re-

lation to the sub-sampling size ψ. Using the two largest

data sets, Http and ForestCover, we examine the effect

of sub-sample size on detection accuracy and processing

time. In this experiment we adjust the sub-sampling size

ψ= 2,4,8,16, ..., 32768.

Our ﬁndings are shown in Figure 6. We observe that

AUC converges very quickly at small ψ. AUC is near opti-

mal when ψ= 128 for Http and ψ= 512 for ForestCover,

and they are only a fraction of the original data (0.00045

for Http and 0.0018 for ForestCover). After this ψsetting,

the variation of AUC is minimal: ±0.0014 and ±0.023

respectively. Also note that the processing time increases

very modestly when ψincreases from 4 up to 8192. iFor-

est maintains its near optimal detection performance within

this range. In a nutshell, a small ψprovides high AUC and

low processing time, and a further increase of ψis not nec-

essary.

5.3 High Dimensional Data

One of the important challenges in anomaly detection is

high dimensional data. For distance-based methods, every

point is equally sparse in high dimensional space — render-

ing distance a useless measure. For iForest, it also suffers

from the same ‘curse of dimensionality’.

In this experiment, we study a special case in which high

(a) Http (b) ForestCover

Figure 6. A small sub-sampling size provides

both high AUC (left y-axis, solid lines) and

low processing time (right y-axis, dashed

lines, in seconds). Sub-sampling size (x-axis,

log scale) ranges ψ= 2,4,8,16, ..., 32768.

dimension data sets has a large number of irrelevant at-

tributes or background noises, and show that iForest has

a signiﬁcant advantage in processing time. We simulate

high dimensional data using Mammography and Annthy-

roid data sets. For each data set, 506 random attributes,

each uniformly distributed, valued between 0 and 1 are

added to simulate background noise. Thus, there is a total

of 512 attributes in each data sets. We use a simple sta-

tistical test, Kurtosis, to select an attribute subspace from

the sub-sample before constructing each iTree. Kurtosis

measures the ‘peakness’ of univariate distribution. Kurto-

sis is sensitive to the presence of anomalies and hence it

is a good attribute selector for anomaly detection. After

Kurtosis has provided a ranking for each attribute, a sub-

space of attributes is selected according to this ranking to

construct each tree. The result is promising and we show

that the detection performance improves when the subspace

size comes close to the original number of attributes. There

are other attribute selectors that we can choose from, e.g.,

Grubb’s test. However, in this section, we are only con-

cern with showcasing iForest’s ability to work with an at-

tribute selector to reduce dimensionality of anomaly detec-

tion tasks.

Figure 7 shows that a) processing time remains less than

30 seconds for the whole range of subspace sizes and b)

AUC peaks when subspace size is the same as the number

of original attributes and this result comes very close to the

result of ORCA using the original attributes only. When

ORCA is used on both high dimensional data sets, it reports

AUC close to 0.5 with processing time over one hundred

seconds. It shows that both data sets are challenging, how-

ever, iForest is able to improve the detection performance

by a simple addition of Kurtosis test. It may well be possible

for other methods to apply similar attribute reduction tech-

(a) Mammography (b) Annthyroid

Figure 7. iForest achieves good results on

high dimensional data using Kurtosis to se-

lect attributes. 506 irrelevant attributes are

added. AUC (left y-axis, solid lines) improves

when the subspace size (x-axis), comes

close to the number of original attributes and

processing time (right y-axis, dashed lines, in

seconds) increases slightly as subspace size

increases. iForest trained using the original

data has slightly better AUC (shown as the

top dotted lines).

nique to improve detection accuracy on high-dimensional

data, but the advantage of iForest is its low processing time

even in high dimensional data.

5.4 Training using normal instances only

“Does iForest work when training set contains normal

instances only? ” To answer this question, we conduct a

simple experiment using the two largest data sets in our ex-

periment. We ﬁrst randomly divide each data set into two

parts, one for training and one for evaluation, so that the

AUC is derived on unseen data. We repeat this process ten

times and report the average AUC.

When training with anomalies and normal points, Http

reports AUC = 0.9997; however, when training with-

out anomalies, AUC reduces to 0.9919. For ForestCover,

AUC reduces from 0.8817 to 0.8802. Whilst there is a

small reduction in AUC, we ﬁnd that using a larger sub-

sampling size can help to restore the detection performance.

When we increase the sub-sampling size from ψ= 256 to

ψ= 8,192 for Http and ψ= 512 for ForestCover and train

without anomalies, AUC catches up to 0.9997 for Http and

0.884 for ForestCover.

6 Discussion

The implication of using a small sub-sample size is that

one can easily host an online anomaly detection system with

minimal memory footprint. Using ψ= 256, the maximum

number of nodes is 511. Let the maximum size of a node

be bbytes, tbe the number of trees. Thus, a working model

to detect anomaly is estimated to be less than 511tb bytes,

which is trivial in modern computing equipments.

iForest has time complexities of O(tψ log ψ)in the train-

ing stage and O(nt log ψ)in the evaluating stage. For Http

data set, when ψ= 256,t= 100 and evaluating 283,748

instances, the total processing time is 7.6seconds only. We

increase the sub-sampling size 64 times to ψ= 16384 and

the processing time increases by only 1.6 times to 11.9sec-

onds. It shows that iForest has a low constant in its compu-

tational complexity.

iForest’s fast execution with low memory requirement is

a direct result of building partial models and requiring only

a signiﬁcantly small sample size as compared to the given

training set. This capability is unparallel in the domain of

anomaly detection.

7 Conclusions

This paper proposes a fundamentally different model-

based method that focuses on anomaly isolation rather than

normal instance proﬁling. The concept of isolation has not

been explored in the current literature and the use of isola-

tion is shown to be highly effective in detecting anomalies

with extremely high efﬁciency. Taking advantage of anoma-

lies’ nature of ‘few and different’, iTree isolates anoma-

lies closer to the root of the tree as compared to normal

points. This unique characteristic allows iForest to build

partial models (as opposed to full models in proﬁling) and

employ only a tiny proportion of training data to build ef-

fective models. As a result, iForest has a linear time com-

plexity with a low constant and a low memory requirement

which is ideal for high volume data sets.

Our empirical evaluation shows that iForest performs

signiﬁcantly better than a near-linear time complexity

distance-based method, ORCA, LOF and RF in terms of

AUC and execution time, especially in large data sets. In

addition, iForest converges quickly with a small ensemble

size, which enables it to detect anomalies with high efﬁ-

ciency.

For high dimensional problems that contain a large num-

ber of irrelevant attributes, iForest can achieve high detec-

tion performance quickly with an additional attribute se-

lector; whereas a distance-based method either has poor

detection performance or requires signiﬁcantly more time.

We also demonstrate that iForest works well even when no

anomalies are present in the training set. Essentially, Iso-

lation Forest is an accurate and efﬁcient anomaly detector

especially for large databases. Its capacity in handling high

volume databases is highly desirable for real life applica-

tions.

Acknowledgements The authors thank Victorian Part-

nership for Advanced Computing (www.vpac.org) for pro-

viding the high-performing computing facility.

Z.-H. Zhou was supported by NSFC (60635030,

60721002) and JiangsuSF (BK2008018).

References

[1] N. Abe, B. Zadrozny, and J. Langford. Outlier detection by

active learning. In Proceedings of the 12th ACM SIGKDD

international conference on Knowledge discovery and data

mining, pages 504–509. ACM Press, 2006.

[2] A. Asuncion and D. Newman. UCI machine learning repos-

itory, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based out-

liers in near linear time with randomization and a simple

pruning rule. In Proceedings of the ninth ACM SIGKDD

international conference on Knowledge discovery and data

mining, pages 29–38. ACM Press, 2003.

[4] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander.

LOF: identifying density-based local outliers. ACM SIG-

MOD Record, 29(2):93–104, 2000.

[5] Z. He, X. Xu, and S. Deng. Discovering cluster-based local

outliers. Pattern Recogn. Lett., 24(9-10):1641–1650, 2003.

[6] E. M. Knorr and R. T. Ng. Algorithms for mining distance-

based outliers in large datasets. In VLDB ’98: Proceedings

of the 24rd International Conference on Very Large Data

Bases, pages 392–403, San Francisco, CA, USA, 1998.

Morgan Kaufmann.

[7] D. E. Knuth. Art of Computer Programming, Volume 3:

Sorting and Searching (2nd Edition). Addison-Wesley Pro-

fessional, April 1998.

[8] R. B. Murphy. On Tests for Outlying Observations. PhD

thesis, Princeton University, 1951.

[9] B. R. Preiss. Data Structures and Algorithms with Object-

Oriented Design Patterns in Java. Wiley, 1999.

[10] D. M. Rocke and D. L. Woodruff. Identiﬁcation of outliers

in multivariate data. Journal of the American Statistical As-

sociation, 91(435):1047–1061, 1996.

[11] P. J. Rousseeuw and K. V. Driessen. A fast algorithm for the

minimum covariance determinant estimator. Technometrics,

41(3):212–223, 1999.

[12] T. Shi and S. Horvath. Unsupervised learning with random

forest predictors. Journal of Computational and Graphical

Statistics, 15(1):118–138, March 2006.

[13] M. Wu and C. Jermaine. Outlier detection by sampling

with accuracy guarantees. In Proceedings of the 12th ACM

SIGKDD international conference on Knowledge discovery

and data mining, pages 767–772, New York, NY, USA,

2006. ACM.

[14] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne. On-

line unsupervised outlier detection using ﬁnite mixtures with

discounting learning algorithms. In Proceedings of the sixth

ACM SIGKDD international conference on Knowledge dis-

covery and data mining, pages 320–324. ACM Press, 2000.