massive documentation fix

yzhao062 · Nov 18, 2023 · commit f697c57
1 parent 898e41b

Showing 5 changed files with 167 additions and 110 deletions.
diff --git a/README.rst b/README.rst
@@ -65,12 +65,15 @@ Read Me First
 Welcome to PyOD, a versatile Python library for detecting anomalies in multivariate data. Whether you're tackling a small-scale project or large datasets, PyOD offers a range of algorithms to suit your needs.
 
 * **For time-series outlier detection**, please use `TODS <https://github.com/datamllab/tods>`_.
+
 * **For graph outlier detection**, please use `PyGOD <https://pygod.org/>`_.
 
 * **Performance Comparison \& Datasets**: We have a 45-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_. The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
 
 * **Learn more about anomaly detection** \@ `Anomaly Detection Resources <https://github.com/yzhao062/anomaly-detection-resources>`_
 
+* **PyOD on Distributed Systems**: you can also run `PyOD on Databricks <https://www.databricks.com/blog/2023/03/13/unsupervised-outlier-detection-databricks.html>`_.
+
 ----
 
 About PyOD
@@ -167,7 +170,7 @@ For a broader perspective on anomaly detection, see our NeurIPS papers
 
 * `Installation <#installation>`_
 * `API Cheatsheet & Reference <#api-cheatsheet--reference>`_
-* `ADBench Benchmark <#adbench-benchmark>`_
+* `ADBench Benchmark and Datasets <#adbench-benchmark-and-datasets>`_
 * `Model Save & Load <#model-save--load>`_
 * `Fast Train with SUOD <#fast-train-with-suod>`_
 * `Thresholding Outlier Scores <#thresholding-outlier-scores>`_
@@ -254,8 +257,8 @@ The full API Reference is available at `PyOD Documentation <https://pyod.readthe
 ----
 
 
-ADBench Benchmark
-^^^^^^^^^^^^^^^^^
+ADBench Benchmark and Datasets
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We just released a 45-page, the most comprehensive `ADBench: Anomaly Detection Benchmark <https://arxiv.org/abs/2206.09426>`_ [#Han2022ADBench]_.
 The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
@@ -267,16 +270,12 @@ The organization of **ADBench** is provided below:
    :alt: benchmark-fig
 
 
-**The comparison of selected models** is made available below
-(\ `Figure <https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png>`_\ ,
-`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\ ,
-`Interactive Jupyter Notebooks <https://mybinder.org/v2/gh/yzhao062/pyod/master>`_\ ).
-For Jupyter Notebooks, please navigate to **"/notebooks/Compare All Models.ipynb"**.
-
+For a simpler visualization, **a comparison of selected models** is produced by
+`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\.
 
-.. image:: https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png
-   :target: https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png
-   :alt: Comparision_of_All
+.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :alt: Comparison_of_All
 
 
 

diff --git a/docs/benchmark.rst b/docs/benchmark.rst
@@ -4,7 +4,7 @@ Benchmarks
 Latest ADBench (2022)
 ---------------------
 
-We just released a 36-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-preprint-adbench.pdf>`_ :cite:`a-han2022adbench`.
+We just released a 36-page, the most comprehensive `anomaly detection benchmark paper <https://arxiv.org/abs/2206.09426>`_ :cite:`a-han2022adbench`.
 The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 55 benchmark datasets.
 
 The organization of **ADBench** is provided below:
@@ -14,6 +14,13 @@ The organization of **ADBench** is provided below:
    :alt: benchmark
 
 
+For a simpler visualization, **a comparison of selected models** is produced by
+`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\.
+
+.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :alt: Comparison_of_All
+
 Old Results (2019)
 ------------------
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -70,12 +70,15 @@ Read Me First
 Welcome to PyOD, a versatile Python library for detecting anomalies in multivariate data. Whether you're tackling a small-scale project or large datasets, PyOD offers a range of algorithms to suit your needs.
 
 * **For time-series outlier detection**, please use `TODS <https://github.com/datamllab/tods>`_.
+
 * **For graph outlier detection**, please use `PyGOD <https://pygod.org/>`_.
 
 * **Performance Comparison \& Datasets**: We have a 45-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_. The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
 
 * **Learn more about anomaly detection** \@ `Anomaly Detection Resources <https://github.com/yzhao062/anomaly-detection-resources>`_
 
+* **PyOD on Distributed Systems**: you can also run `PyOD on Databricks <https://www.databricks.com/blog/2023/03/13/unsupervised-outlier-detection-databricks.html>`_.
+
 ----
 
 About PyOD
@@ -169,17 +172,25 @@ For a broader perspective on anomaly detection, see our NeurIPS papers
 
 ----
 
-Benchmark
-=========
+ADBench Benchmark and Datasets
+==============================
 
-We just released a 45-page, the most comprehensive `ADBench: Anomaly Detection Benchmark <https://arxiv.org/abs/2206.09426>`_.
+We just released a 45-page, the most comprehensive `ADBench: Anomaly Detection Benchmark <https://arxiv.org/abs/2206.09426>`_ :cite:`a-han2022adbench`.
 The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
 
 The organization of **ADBench** is provided below:
 
 .. image:: https://github.com/Minqi824/ADBench/blob/main/figs/ADBench.png?raw=true
    :target: https://github.com/Minqi824/ADBench/blob/main/figs/ADBench.png?raw=true
-   :alt: benchmark
+   :alt: benchmark-fig
+
+
+For a simpler visualization, **a comparison of selected models** is produced by
+`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\.
+
+.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :alt: Comparison_of_All
 
 
 Implemented Algorithms

diff --git a/examples/ALL.png b/examples/ALL.png
diff --git a/examples/compare_all_models.py b/examples/compare_all_models.py
@@ -14,7 +14,7 @@
 # temporary solution for relative imports in case pyod is not installed
 # if pyod is installed, no need to use the following line
 sys.path.append(
-    os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))
+	os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))
 
 # suppress warnings for clean output
 import warnings
@@ -42,6 +42,15 @@
 from pyod.models.kde import KDE
 from pyod.models.lmdd import LMDD
 
+from pyod.models.dif import DIF
+from pyod.models.copod import COPOD
+from pyod.models.ecod import ECOD
+from pyod.models.suod import SUOD
+from pyod.models.qmcd import QMCD
+from pyod.models.sampling import Sampling
+from pyod.models.kpca import KPCA
+from pyod.models.lunar import LUNAR
+
 # TODO: add neural networks, LOCI, SOS, COF, SOD
 
 # Define the number of inliers and outliers
@@ -59,114 +68,145 @@
 
 # initialize a set of detectors for LSCP
 detector_list = [LOF(n_neighbors=5), LOF(n_neighbors=10), LOF(n_neighbors=15),
-                 LOF(n_neighbors=20), LOF(n_neighbors=25), LOF(n_neighbors=30),
-                 LOF(n_neighbors=35), LOF(n_neighbors=40), LOF(n_neighbors=45),
-                 LOF(n_neighbors=50)]
+				 LOF(n_neighbors=20), LOF(n_neighbors=25), LOF(n_neighbors=30),
+				 LOF(n_neighbors=35), LOF(n_neighbors=40), LOF(n_neighbors=45),
+				 LOF(n_neighbors=50)]
 
 # Show the statistics of the data
 print('Number of inliers: %i' % n_inliers)
 print('Number of outliers: %i' % n_outliers)
 print(
-    'Ground truth shape is {shape}. Outlier are 1 and inliers are 0.\n'.format(
-        shape=ground_truth.shape))
+	'Ground truth shape is {shape}. Outlier are 1 and inliers are 0.\n'.format(
+		shape=ground_truth.shape))
 print(ground_truth, '\n')
 
 random_state = 42
 # Define the outlier detection tools to be compared
 classifiers = {
-    'Angle-based Outlier Detector (ABOD)':
-        ABOD(contamination=outliers_fraction),
-    'Cluster-based Local Outlier Factor (CBLOF)':
-        CBLOF(contamination=outliers_fraction,
-              check_estimator=False, random_state=random_state),
-    'Feature Bagging':
-        FeatureBagging(LOF(n_neighbors=35),
-                       contamination=outliers_fraction,
-                       random_state=random_state),
-    'Histogram-base Outlier Detection (HBOS)': HBOS(
-        contamination=outliers_fraction),
-    'Isolation Forest': IForest(contamination=outliers_fraction,
-                                random_state=random_state),
-    'K Nearest Neighbors (KNN)': KNN(
-        contamination=outliers_fraction),
-    'Average KNN': KNN(method='mean',
-                       contamination=outliers_fraction),
-    'Local Outlier Factor (LOF)':
-        LOF(n_neighbors=35, contamination=outliers_fraction),
-    'Minimum Covariance Determinant (MCD)': MCD(
-        contamination=outliers_fraction, random_state=random_state),
-    'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
-    'Principal Component Analysis (PCA)': PCA(
-        contamination=outliers_fraction, random_state=random_state),
-    'Locally Selective Combination (LSCP)': LSCP(
-        detector_list, contamination=outliers_fraction,
-        random_state=random_state),
-    'INNE': INNE(
-        max_samples=2, contamination=outliers_fraction,
-        random_state=random_state,
-        ),
-    'GMM': GMM(contamination=outliers_fraction,
-               random_state=random_state),
-    'KDE': KDE(contamination=outliers_fraction),
-    'LMDD': LMDD(contamination=outliers_fraction,
-                 random_state=random_state),
+	'Angle-based Outlier Detector (ABOD)':
+		ABOD(contamination=outliers_fraction),
+	'K Nearest Neighbors (KNN)': KNN(
+		contamination=outliers_fraction),
+	'Average KNN': KNN(method='mean',
+					   contamination=outliers_fraction),
+	'Median KNN': KNN(method='median',
+					  contamination=outliers_fraction),
+	'Local Outlier Factor (LOF)':
+		LOF(n_neighbors=35, contamination=outliers_fraction),
+
+	'Isolation Forest': IForest(contamination=outliers_fraction,
+								random_state=random_state),
+	'Deep Isolation Forest (DIF)': DIF(contamination=outliers_fraction,
+									   random_state=random_state),
+	'INNE': INNE(
+		max_samples=2, contamination=outliers_fraction,
+		random_state=random_state,
+	),
+
+	'Locally Selective Combination (LSCP)': LSCP(
+		detector_list, contamination=outliers_fraction,
+		random_state=random_state),
+	'Feature Bagging':
+		FeatureBagging(LOF(n_neighbors=35),
+					   contamination=outliers_fraction,
+					   random_state=random_state),
+	'SUOD': SUOD(contamination=outliers_fraction),
+
+	'Minimum Covariance Determinant (MCD)': MCD(
+		contamination=outliers_fraction, random_state=random_state),
+
+	'Principal Component Analysis (PCA)': PCA(
+		contamination=outliers_fraction, random_state=random_state),
+	'KPCA': KPCA(
+		contamination=outliers_fraction),
+
+	'Probabilistic Mixture Modeling (GMM)': GMM(contamination=outliers_fraction,
+												random_state=random_state),
+
+	'LMDD': LMDD(contamination=outliers_fraction,
+				 random_state=random_state),
+
+	'Histogram-based Outlier Detection (HBOS)': HBOS(
+		contamination=outliers_fraction),
+
+	'Copula-based Outlier Detection (COPOD)': COPOD(
+		contamination=outliers_fraction),
+
+	'ECDF-based Outlier Detection (ECOD)': ECOD(
+		contamination=outliers_fraction),
+	'Kernel Density Functions (KDE)': KDE(contamination=outliers_fraction),
+
+	'QMCD': QMCD(
+		contamination=outliers_fraction),
+
+	'Sampling': Sampling(
+		contamination=outliers_fraction),
+
+	'LUNAR': LUNAR(),
+
+	'Cluster-based Local Outlier Factor (CBLOF)':
+		CBLOF(contamination=outliers_fraction,
+			  check_estimator=False, random_state=random_state),
+
+	'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
 }
 
 # Show all detectors
 for i, clf in enumerate(classifiers.keys()):
-    print('Model', i + 1, clf)
+	print('Model', i + 1, clf)
 
 # Fit the models with the generated data and
 # compare model performances
 for i, offset in enumerate(clusters_separation):
-    np.random.seed(42)
-    # Data generation
-    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
-    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
-    X = np.r_[X1, X2]
-    # Add outliers
-    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
-
-    # Fit the model
-    plt.figure(figsize=(15, 16))
-    for i, (clf_name, clf) in enumerate(classifiers.items()):
-        print()
-        print(i + 1, 'fitting', clf_name)
-        # fit the data and tag outliers
-        clf.fit(X)
-        scores_pred = clf.decision_function(X) * -1
-        y_pred = clf.predict(X)
-        threshold = percentile(scores_pred, 100 * outliers_fraction)
-        n_errors = (y_pred != ground_truth).sum()
-        # plot the levels lines and the points
-
-        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
-        Z = Z.reshape(xx.shape)
-        subplot = plt.subplot(4, 4, i + 1)
-        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
-                         cmap=plt.cm.Blues_r)
-        # a = subplot.contour(xx, yy, Z, levels=[threshold],
-        #                     linewidths=2, colors='red')
-        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
-                         colors='orange')
-        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white',
-                            s=20, edgecolor='k')
-        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black',
-                            s=20, edgecolor='k')
-        subplot.axis('tight')
-        subplot.legend(
-            [
-                # a.collections[0],
-                b, c],
-            [
-                # 'learned decision function', 
-                'true inliers', 'true outliers'],
-            prop=matplotlib.font_manager.FontProperties(size=10),
-            loc='lower right')
-        subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
-        subplot.set_xlim((-7, 7))
-        subplot.set_ylim((-7, 7))
-    plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
-    plt.suptitle("Outlier detection")
-plt.savefig('ALL.png', dpi=300)
+	np.random.seed(42)
+	# Data generation
+	X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
+	X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
+	X = np.r_[X1, X2]
+	# Add outliers
+	X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
+
+	# Fit the model
+	plt.figure(figsize=(20, 22))
+	for i, (clf_name, clf) in enumerate(classifiers.items()):
+		print()
+		print(i + 1, 'fitting', clf_name)
+		# fit the data and tag outliers
+		clf.fit(X)
+		scores_pred = clf.decision_function(X) * -1
+		y_pred = clf.predict(X)
+		threshold = percentile(scores_pred, 100 * outliers_fraction)
+		n_errors = (y_pred != ground_truth).sum()
+		# plot the levels lines and the points
+
+		Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
+		Z = Z.reshape(xx.shape)
+		subplot = plt.subplot(5, 5, i + 1)
+		subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
+						 cmap=plt.cm.Blues_r)
+		# a = subplot.contour(xx, yy, Z, levels=[threshold],
+		#                     linewidths=2, colors='red')
+		subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
+						 colors='orange')
+		b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white',
+							s=20, edgecolor='k')
+		c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black',
+							s=20, edgecolor='k')
+		subplot.axis('tight')
+		subplot.legend(
+			[
+				# a.collections[0],
+				b, c],
+			[
+				# 'learned decision function',
+				'true inliers', 'true outliers'],
+			prop=matplotlib.font_manager.FontProperties(size=10),
+			loc='lower right')
+		subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
+		subplot.set_xlim((-7, 7))
+		subplot.set_ylim((-7, 7))
+	plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
+	plt.suptitle("25 outlier detection algorithms on synthetic data",
+				 fontsize=35)
+plt.savefig('ALL.png', dpi=300, bbox_inches='tight')
 plt.show()
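
The core of the loop in compare_all_models.py (generate two inlier clusters plus uniform outliers, negate the outlier scores, cut them at the `outliers_fraction` percentile, count errors) can be sketched without PyOD installed. This is a minimal illustration under stated assumptions, not the script itself: `knn_scores` is a hypothetical stand-in scorer written in NumPy rather than a PyOD detector, the sample sizes and `offset` are assumed because the diff does not show them, and the labels here come from the threshold whereas the script takes them from `clf.predict`. `grid_side` is likewise a hypothetical helper showing why the layout moves from 4x4 to 5x5 as the panel count grows from 16 to 25.

```python
import math

import numpy as np

# Assumed sizes; the diff does not show the values used by the script.
n_inliers, n_outliers = 200, 50
outliers_fraction = n_outliers / (n_inliers + n_outliers)


def make_data(offset=2.0, seed=42):
    """Two Gaussian inlier clusters plus uniform outliers, as in the script."""
    rng = np.random.RandomState(seed)
    X1 = 0.3 * rng.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * rng.randn(n_inliers // 2, 2) + offset
    outliers = rng.uniform(low=-6, high=6, size=(n_outliers, 2))
    X = np.r_[X1, X2, outliers]
    y = np.zeros(n_inliers + n_outliers, dtype=int)
    y[-n_outliers:] = 1  # outliers are 1, inliers are 0
    return X, y


def knn_scores(X, k=5):
    """Hypothetical stand-in scorer: distance to the k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)  # column 0 is each point's zero distance to itself
    return d[:, k]  # larger value means more anomalous


def grid_side(n_panels):
    """Hypothetical helper: side of the smallest square grid that fits n_panels."""
    return math.ceil(math.sqrt(n_panels))


X, y = make_data()

# Same convention as the script: negate so that lower means more anomalous,
# then cut at the outliers_fraction percentile.
scores_pred = knn_scores(X) * -1
threshold = np.percentile(scores_pred, 100 * outliers_fraction)
y_pred = (scores_pred <= threshold).astype(int)
n_errors = int((y_pred != y).sum())

print("errors:", n_errors)
print("grid for 16 panels:", grid_side(16), "| grid for 25 panels:", grid_side(25))
```

Swapping `knn_scores` for a fitted detector's `decision_function` output would bring the sketch closer to the script; the negation and percentile cut would stay the same.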