massive documentation fix
yzhao062 committed Nov 18, 2023
1 parent 898e41b commit f697c57
Showing 5 changed files with 167 additions and 110 deletions.
23 changes: 11 additions & 12 deletions README.rst
@@ -65,12 +65,15 @@ Read Me First
Welcome to PyOD, a versatile Python library for detecting anomalies in multivariate data. Whether you're tackling a small-scale project or large datasets, PyOD offers a range of algorithms to suit your needs.

* **For time-series outlier detection**, please use `TODS <https://github.com/datamllab/tods>`_.

* **For graph outlier detection**, please use `PyGOD <https://pygod.org/>`_.

* **Performance Comparison \& Datasets**: We have a 45-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_. The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.

* **Learn more about anomaly detection** \@ `Anomaly Detection Resources <https://github.com/yzhao062/anomaly-detection-resources>`_

* **PyOD on Distributed Systems**: you could also run `PyOD on databricks <https://www.databricks.com/blog/2023/03/13/unsupervised-outlier-detection-databricks.html>`_.

----

About PyOD
@@ -167,7 +170,7 @@ For a broader perspective on anomaly detection, see our NeurIPS papers

* `Installation <#installation>`_
* `API Cheatsheet & Reference <#api-cheatsheet--reference>`_
* `ADBench Benchmark <#adbench-benchmark>`_
* `ADBench Benchmark and Datasets <#adbench-benchmark-and-datasets>`_
* `Model Save & Load <#model-save--load>`_
* `Fast Train with SUOD <#fast-train-with-suod>`_
* `Thresholding Outlier Scores <#thresholding-outlier-scores>`_
@@ -254,8 +257,8 @@ The full API Reference is available at `PyOD Documentation <https://pyod.readthe
----


ADBench Benchmark
^^^^^^^^^^^^^^^^^
ADBench Benchmark and Datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We just released a 45-page, the most comprehensive `ADBench: Anomaly Detection Benchmark <https://arxiv.org/abs/2206.09426>`_ [#Han2022ADBench]_.
The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
@@ -267,16 +270,12 @@ The organization of **ADBench** is provided below:
:alt: benchmark-fig


**The comparison of selected models** is made available below
(\ `Figure <https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png>`_\ ,
`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\ ,
`Interactive Jupyter Notebooks <https://mybinder.org/v2/gh/yzhao062/pyod/master>`_\ ).
For Jupyter Notebooks, please navigate to **"/notebooks/Compare All Models.ipynb"**.

For a simpler visualization, we make **the comparison of selected models** via
`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\.

.. image:: https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png
:target: https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png
:alt: Comparision_of_All
.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
:target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
:alt: Comparison_of_All



9 changes: 8 additions & 1 deletion docs/benchmark.rst
@@ -4,7 +4,7 @@ Benchmarks
Latest ADBench (2022)
---------------------

We just released a 36-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-preprint-adbench.pdf>`_ :cite:`a-han2022adbench`.
We just released a 36-page, the most comprehensive `anomaly detection benchmark paper <https://arxiv.org/abs/2206.09426>`_ :cite:`a-han2022adbench`.
The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 55 benchmark datasets.

The organization of **ADBench** is provided below:
@@ -14,6 +14,13 @@ The organization of **ADBench** is provided below:
:alt: benchmark


For a simpler visualization, we make **the comparison of selected models** via
`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\.

.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
:target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
:alt: Comparison_of_All

Old Results (2019)
------------------

19 changes: 15 additions & 4 deletions docs/index.rst
@@ -70,12 +70,15 @@ Read Me First
Welcome to PyOD, a versatile Python library for detecting anomalies in multivariate data. Whether you're tackling a small-scale project or large datasets, PyOD offers a range of algorithms to suit your needs.

* **For time-series outlier detection**, please use `TODS <https://github.com/datamllab/tods>`_.

* **For graph outlier detection**, please use `PyGOD <https://pygod.org/>`_.

* **Performance Comparison \& Datasets**: We have a 45-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_. The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.

* **Learn more about anomaly detection** \@ `Anomaly Detection Resources <https://github.com/yzhao062/anomaly-detection-resources>`_

* **PyOD on Distributed Systems**: you could also run `PyOD on databricks <https://www.databricks.com/blog/2023/03/13/unsupervised-outlier-detection-databricks.html>`_.

----

About PyOD
@@ -169,17 +172,25 @@ For a broader perspective on anomaly detection, see our NeurIPS papers

----

Benchmark
=========
ADBench Benchmark and Datasets
==============================

We just released a 45-page, the most comprehensive `ADBench: Anomaly Detection Benchmark <https://arxiv.org/abs/2206.09426>`_.
We just released a 45-page, the most comprehensive `ADBench: Anomaly Detection Benchmark <https://arxiv.org/abs/2206.09426>`_ :cite:`a-han2022adbench`.
The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.

The organization of **ADBench** is provided below:

.. image:: https://github.com/Minqi824/ADBench/blob/main/figs/ADBench.png?raw=true
:target: https://github.com/Minqi824/ADBench/blob/main/figs/ADBench.png?raw=true
:alt: benchmark
:alt: benchmark-fig


For a simpler visualization, we make **the comparison of selected models** via
`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\.

.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
:target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
:alt: Comparison_of_All


Implemented Algorithms
Binary file modified examples/ALL.png
226 changes: 133 additions & 93 deletions examples/compare_all_models.py
@@ -14,7 +14,7 @@
# temporary solution for relative imports in case pyod is not installed
# if pyod is installed, no need to use the following line
sys.path.append(
os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))
os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))

# suppress warnings for clean output
import warnings
@@ -42,6 +42,15 @@
from pyod.models.kde import KDE
from pyod.models.lmdd import LMDD

from pyod.models.dif import DIF
from pyod.models.copod import COPOD
from pyod.models.ecod import ECOD
from pyod.models.suod import SUOD
from pyod.models.qmcd import QMCD
from pyod.models.sampling import Sampling
from pyod.models.kpca import KPCA
from pyod.models.lunar import LUNAR

# TODO: add neural networks, LOCI, SOS, COF, SOD

# Define the number of inliers and outliers
@@ -59,114 +68,145 @@

# initialize a set of detectors for LSCP
detector_list = [LOF(n_neighbors=5), LOF(n_neighbors=10), LOF(n_neighbors=15),
LOF(n_neighbors=20), LOF(n_neighbors=25), LOF(n_neighbors=30),
LOF(n_neighbors=35), LOF(n_neighbors=40), LOF(n_neighbors=45),
LOF(n_neighbors=50)]
LOF(n_neighbors=20), LOF(n_neighbors=25), LOF(n_neighbors=30),
LOF(n_neighbors=35), LOF(n_neighbors=40), LOF(n_neighbors=45),
LOF(n_neighbors=50)]

# Show the statistics of the data
print('Number of inliers: %i' % n_inliers)
print('Number of outliers: %i' % n_outliers)
print(
'Ground truth shape is {shape}. Outlier are 1 and inliers are 0.\n'.format(
shape=ground_truth.shape))
'Ground truth shape is {shape}. Outlier are 1 and inliers are 0.\n'.format(
shape=ground_truth.shape))
print(ground_truth, '\n')

random_state = 42
# Define nine outlier detection tools to be compared
classifiers = {
'Angle-based Outlier Detector (ABOD)':
ABOD(contamination=outliers_fraction),
'Cluster-based Local Outlier Factor (CBLOF)':
CBLOF(contamination=outliers_fraction,
check_estimator=False, random_state=random_state),
'Feature Bagging':
FeatureBagging(LOF(n_neighbors=35),
contamination=outliers_fraction,
random_state=random_state),
'Histogram-base Outlier Detection (HBOS)': HBOS(
contamination=outliers_fraction),
'Isolation Forest': IForest(contamination=outliers_fraction,
random_state=random_state),
'K Nearest Neighbors (KNN)': KNN(
contamination=outliers_fraction),
'Average KNN': KNN(method='mean',
contamination=outliers_fraction),
'Local Outlier Factor (LOF)':
LOF(n_neighbors=35, contamination=outliers_fraction),
'Minimum Covariance Determinant (MCD)': MCD(
contamination=outliers_fraction, random_state=random_state),
'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
'Principal Component Analysis (PCA)': PCA(
contamination=outliers_fraction, random_state=random_state),
'Locally Selective Combination (LSCP)': LSCP(
detector_list, contamination=outliers_fraction,
random_state=random_state),
'INNE': INNE(
max_samples=2, contamination=outliers_fraction,
random_state=random_state,
),
'GMM': GMM(contamination=outliers_fraction,
random_state=random_state),
'KDE': KDE(contamination=outliers_fraction),
'LMDD': LMDD(contamination=outliers_fraction,
random_state=random_state),
'Angle-based Outlier Detector (ABOD)':
ABOD(contamination=outliers_fraction),
'K Nearest Neighbors (KNN)': KNN(
contamination=outliers_fraction),
'Average KNN': KNN(method='mean',
contamination=outliers_fraction),
'Median KNN': KNN(method='median',
contamination=outliers_fraction),
'Local Outlier Factor (LOF)':
LOF(n_neighbors=35, contamination=outliers_fraction),

'Isolation Forest': IForest(contamination=outliers_fraction,
random_state=random_state),
'Deep Isolation Forest (DIF)': DIF(contamination=outliers_fraction,
random_state=random_state),
'INNE': INNE(
max_samples=2, contamination=outliers_fraction,
random_state=random_state,
),

'Locally Selective Combination (LSCP)': LSCP(
detector_list, contamination=outliers_fraction,
random_state=random_state),
'Feature Bagging':
FeatureBagging(LOF(n_neighbors=35),
contamination=outliers_fraction,
random_state=random_state),
'SUOD': SUOD(contamination=outliers_fraction),

'Minimum Covariance Determinant (MCD)': MCD(
contamination=outliers_fraction, random_state=random_state),

'Principal Component Analysis (PCA)': PCA(
contamination=outliers_fraction, random_state=random_state),
'KPCA': KPCA(
contamination=outliers_fraction),

'Probabilistic Mixture Modeling (GMM)': GMM(contamination=outliers_fraction,
random_state=random_state),

'LMDD': LMDD(contamination=outliers_fraction,
random_state=random_state),

'Histogram-based Outlier Detection (HBOS)': HBOS(
contamination=outliers_fraction),

'Copula-based Outlier Detection (COPOD)': COPOD(
contamination=outliers_fraction),

'ECDF-based Outlier Detection (ECOD)': ECOD(
contamination=outliers_fraction),
'Kernel Density Functions (KDE)': KDE(contamination=outliers_fraction),

'QMCD': QMCD(
contamination=outliers_fraction),

'Sampling': Sampling(
contamination=outliers_fraction),

'LUNAR': LUNAR(),

'Cluster-based Local Outlier Factor (CBLOF)':
CBLOF(contamination=outliers_fraction,
check_estimator=False, random_state=random_state),

'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
}

# Show all detectors
for i, clf in enumerate(classifiers.keys()):
print('Model', i + 1, clf)
print('Model', i + 1, clf)

# Fit the models with the generated data and
# compare model performances
for i, offset in enumerate(clusters_separation):
np.random.seed(42)
# Data generation
X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
X = np.r_[X1, X2]
# Add outliers
X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

# Fit the model
plt.figure(figsize=(15, 16))
for i, (clf_name, clf) in enumerate(classifiers.items()):
print()
print(i + 1, 'fitting', clf_name)
# fit the data and tag outliers
clf.fit(X)
scores_pred = clf.decision_function(X) * -1
y_pred = clf.predict(X)
threshold = percentile(scores_pred, 100 * outliers_fraction)
n_errors = (y_pred != ground_truth).sum()
# plot the levels lines and the points

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)
subplot = plt.subplot(4, 4, i + 1)
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
cmap=plt.cm.Blues_r)
# a = subplot.contour(xx, yy, Z, levels=[threshold],
# linewidths=2, colors='red')
subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
colors='orange')
b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white',
s=20, edgecolor='k')
c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black',
s=20, edgecolor='k')
subplot.axis('tight')
subplot.legend(
[
# a.collections[0],
b, c],
[
# 'learned decision function',
'true inliers', 'true outliers'],
prop=matplotlib.font_manager.FontProperties(size=10),
loc='lower right')
subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
subplot.set_xlim((-7, 7))
subplot.set_ylim((-7, 7))
plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
plt.suptitle("Outlier detection")
plt.savefig('ALL.png', dpi=300)
np.random.seed(42)
# Data generation
X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
X = np.r_[X1, X2]
# Add outliers
X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

# Fit the model
plt.figure(figsize=(20, 22))
for i, (clf_name, clf) in enumerate(classifiers.items()):
print()
print(i + 1, 'fitting', clf_name)
# fit the data and tag outliers
clf.fit(X)
scores_pred = clf.decision_function(X) * -1
y_pred = clf.predict(X)
threshold = percentile(scores_pred, 100 * outliers_fraction)
n_errors = (y_pred != ground_truth).sum()
# plot the levels lines and the points

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)
subplot = plt.subplot(5, 5, i + 1)
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
cmap=plt.cm.Blues_r)
# a = subplot.contour(xx, yy, Z, levels=[threshold],
# linewidths=2, colors='red')
subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
colors='orange')
b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white',
s=20, edgecolor='k')
c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black',
s=20, edgecolor='k')
subplot.axis('tight')
subplot.legend(
[
# a.collections[0],
b, c],
[
# 'learned decision function',
'true inliers', 'true outliers'],
prop=matplotlib.font_manager.FontProperties(size=10),
loc='lower right')
subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
subplot.set_xlim((-7, 7))
subplot.set_ylim((-7, 7))
plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
plt.suptitle("25 outlier detection algorithms on synthetic data",
fontsize=35)
plt.savefig('ALL.png', dpi=300, bbox_inches='tight')
plt.show()
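
The plotting loop above negates each detector's scores (so larger values mean "more normal") and uses the percentile at `100 * outliers_fraction` as the boundary between the blue and orange contour regions. The sketch below replays that arithmetic in plain Python on made-up numbers. `linear_percentile` is a hypothetical stand-in for the script's `percentile` call, assumed here to use linear interpolation. The scores and labels are illustrative only and do not come from any PyOD model. In the script, `y_pred` comes from `clf.predict(X)`; here a direct comparison against the threshold stands in for it.

```python
def linear_percentile(values, q):
    """Return the q-th percentile (0-100) of values using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)                      # index just below the target position
    upper = min(lower + 1, len(ordered) - 1)   # index just above, clamped to the end
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical raw outlier scores (higher = more anomalous) and matching labels.
raw_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 2.5, 3.0]
ground_truth = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
outliers_fraction = 0.2

# Negate the scores, as the script does, so that outliers sit at the low end.
scores_pred = [-s for s in raw_scores]
threshold = linear_percentile(scores_pred, 100 * outliers_fraction)

# Stand-in for clf.predict: flag points whose negated score falls below the threshold.
y_pred = [1 if s < threshold else 0 for s in scores_pred]
n_errors = sum(p != t for p, t in zip(y_pred, ground_truth))

print("threshold:", round(threshold, 2))
print("errors:", n_errors)
```

With 20% contamination and two clearly separated outliers, the threshold lands between the outlier scores and the inlier scores, so no point is misclassified in this toy case.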

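This commit grows the comparison from a 4 x 4 to a 5 x 5 subplot grid because the `classifiers` dictionary now holds 25 detectors. The grid size is hard-coded in the script. As a sketch only, a hypothetical helper (`grid_shape`, not part of PyOD or the script) could derive a near-square layout from the number of detectors instead:

```python
import math


def grid_shape(n_panels):
    """Return (rows, cols) of a near-square grid with room for n_panels subplots."""
    if n_panels < 1:
        raise ValueError("n_panels must be a positive integer")
    # Smallest column count whose square covers n_panels, using integer math only.
    cols = math.isqrt(n_panels - 1) + 1
    # Fewest rows that still fit every panel at that column count.
    rows = -(-n_panels // cols)
    return rows, cols


for n in (16, 17, 25):
    print(n, "panels ->", grid_shape(n))
```

The result could then be passed to `plt.subplot(rows, cols, i + 1)`, so that adding or removing detectors does not require editing the layout by hand.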