Scikit-clean
==================

**scikit-clean** is a Python ML library for classification in the presence of \
label noise. Aimed primarily at researchers, it provides implementations of \
several state-of-the-art algorithms, along with tools to simulate artificial noise, \
build complex pipelines, and evaluate them.

This library is fully compatible with the scikit-learn API, which means \
all of scikit-learn's building blocks can be seamlessly integrated into your workflow. \
Like scikit-learn estimators, most of the methods also support features such as \
parallelization and reproducibility.

Example Usage
***************
A typical label noise research workflow begins with clean labels, injects simulated \
label noise into the training set, and then evaluates how a model handles that noise \
using a clean test set. In scikit-clean, this looks like:

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    from skclean.simulate_noise import flip_labels_uniform
    from skclean.models import RobustLR   # Robust Logistic Regression

    X, y = make_classification(n_samples=200, n_features=4)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20)

    y_train_noisy = flip_labels_uniform(y_train, .3)  # Flip labels of 30% of the samples
    clf = RobustLR().fit(X_train, y_train_noisy)      # Train on the noisy labels
    print(clf.score(X_test, y_test))                  # Evaluate on the clean test set
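
To make the noise-injection step concrete, here is a minimal standard-library sketch of \
uniform label flipping. It is not scikit-clean's implementation: the name \
`flip_labels_uniform_sketch` and its `seed` parameter are hypothetical, and it assumes \
that an exact fraction of the samples is flipped, each to a different class chosen \
uniformly at random.

.. code-block:: python

    import random

    def flip_labels_uniform_sketch(labels, noise_level, seed=None):
        """Return a copy of `labels` with a fraction `noise_level` of them flipped.

        Hypothetical helper for illustration only; requires at least two classes.
        """
        rng = random.Random(seed)
        classes = sorted(set(labels))
        n_flip = int(round(noise_level * len(labels)))
        # Pick distinct positions to corrupt
        flip_idx = rng.sample(range(len(labels)), n_flip)
        noisy = list(labels)
        for i in flip_idx:
            # Replace the label with a different class, chosen uniformly
            others = [c for c in classes if c != noisy[i]]
            noisy[i] = rng.choice(others)
        return noisy

    y = [0, 1] * 50                       # 100 clean binary labels
    y_noisy = flip_labels_uniform_sketch(y, 0.3, seed=0)
    n_changed = sum(a != b for a, b in zip(y, y_noisy))
    print(n_changed)                      # Number of labels that differ from the clean ones

Because every selected label is replaced by a different class, the number of changed \
labels equals the number of selected positions.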

scikit-clean provides a customized `Pipeline` for more complex workflows. Many noise-robust \
algorithms can be broken down into two steps: estimating the likelihood that each sample
in the dataset is mislabeled, and training a robust classifier using that information. This fits
nicely with Pipeline's API:

.. code-block:: python

    # --- scikit-learn imports ---
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, cross_val_score

    # --- scikit-clean imports ---
    from skclean.simulate_noise import UniformNoise
    from skclean.detectors import KDN
    from skclean.handlers import Filter
    from skclean.pipeline import Pipeline, make_pipeline  # Importing from skclean, not sklearn

    # X, y are the clean data from the previous example
    clf = Pipeline([
            ('scale', StandardScaler()),          # Scale features
            ('feat_sel', VarianceThreshold(.2)),  # Feature selection
            ('detector', KDN()),                  # Detect mislabeled samples
            ('handler', Filter(SVC())),           # Filter out likely mislabeled samples and then train an SVM
    ])

    clf_g = GridSearchCV(clf, {'detector__n_neighbors': [2, 5, 10]})
    n_clf_g = make_pipeline(UniformNoise(.3), clf_g)  # Create label noise at the very first step

    print(cross_val_score(n_clf_g, X, y, cv=5).mean())  # 5-fold cross validation
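
The detect-then-handle split described above can be illustrated without any library. \
The sketch below scores each sample by the fraction of its k nearest neighbours that \
carry a different label (the k-Disagreeing Neighbors idea), then drops samples whose \
score exceeds a threshold. It is an assumption-laden illustration, not the code behind \
`KDN` or `Filter`: the function names, the Euclidean distance and the `threshold` \
value are all hypothetical choices.

.. code-block:: python

    import math

    def kdn_scores(X, y, k=3):
        """Fraction of each sample's k nearest neighbours with a different label."""
        scores = []
        for i, (xi, yi) in enumerate(zip(X, y)):
            # Euclidean distance to every other sample (the sample itself is excluded)
            dists = sorted(
                (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
            )
            neighbours = [j for _, j in dists[:k]]
            scores.append(sum(y[j] != yi for j in neighbours) / k)
        return scores

    def filter_suspects(X, y, scores, threshold=0.5):
        """Keep only samples whose disagreement score is at most `threshold`."""
        kept = [i for i, s in enumerate(scores) if s <= threshold]
        return [X[i] for i in kept], [y[i] for i in kept]

    # Two well-separated clusters; the sample at index 3 carries the wrong label
    X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
    y = [0, 0, 0, 1, 1, 1, 1, 1]

    scores = kdn_scores(X, y, k=3)                    # Step 1: detect
    X_clean, y_clean = filter_suspects(X, y, scores)  # Step 2: handle
    print(len(X_clean), y_clean)

Any classifier can then be trained on `X_clean` and `y_clean`. Keeping detection and \
handling separate is what lets a pipeline swap either step independently.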

For a more detailed introduction, please see this notebook_ before you begin, \
and this_ for the complete API reference.

.. _notebook: examples/Introduction%20to%20Scikit-clean.html
.. _this: api.html

Installation
******************

The simplest option is probably to use pip::

    pip install scikit-clean

If you intend to modify the code, install in editable mode::

    git clone https://github.com/Shihab-Shahriar/scikit-clean.git
    cd scikit-clean
    pip install -e .

If you're only interested in a small part of this library, say one or two algorithms, feel free to simply \
copy and paste the relevant code into your project.

Alternatives
**************
Several other open-source tools handle label noise, including:

1. Cleanlab_
2. Snorkel_
3. NoiseFiltersR_

.. _Cleanlab: https://github.com/cgnorthcutt/cleanlab
.. _Snorkel: https://github.com/snorkel-team/snorkel
.. _NoiseFiltersR: https://journal.r-project.org/archive/2017/RJ-2017-027/RJ-2017-027.pdf

`NoiseFiltersR` is closest in objective to ours, though it is implemented in R and doesn't \
appear to be actively maintained.

`Cleanlab` and `Snorkel` are both written in Python, though their priorities differ \
somewhat from ours. While our goal is to implement as many algorithms as \
possible, these tools usually focus on one or a few related papers. They have also been \
in development for longer, so they are more stable, better optimized, and better suited \
to practitioners and engineers than `scikit-clean`.


Credits
**************

We want to thank `scikit-learn`, `imbalanced-learn` and `Cleanlab`; these implementations \
are inspired by, and directly borrow code from, these libraries.

We also want to thank the authors of the original papers. Here is a list of papers partially \
or fully implemented by `scikit-clean`:

.. bibliography:: zrefs.bib
    :list: bullet
    :cited:

A note about naming
-----------------------------------------------

    "There are 2 hard problems in computer science: cache invalidation, naming things, and \
    off-by-1 errors."

The majority of the algorithms in `scikit-clean` are not explicitly named by their authors. \
In some rare cases, identical or very similar ideas appear under different names (e.g. `KDN`). \
We tried to name things as best as we could. However, if you're the author of any of these \
methods and want to rename it, we'll happily oblige.