Introduction to Scikit-clean

scikit-clean is a Python ML library for classification in the presence of label noise. Aimed primarily at researchers, it provides implementations of several state-of-the-art algorithms, along with tools to simulate artificial noise, create complex pipelines and evaluate them.

Example Usage

Before we dive into the details, let’s take a quick look at how it works. scikit-clean, as the name implies, is built on top of scikit-learn and is fully compatible with the scikit-learn API. scikit-clean classifiers can be used as a drop-in replacement for scikit-learn classifiers.

In the simple example below, we corrupt a dataset with artificial label noise and then train a model using robust logistic regression:

[1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score

from skclean.simulate_noise import flip_labels_uniform, UniformNoise
from skclean.models import RobustLR
from skclean.pipeline import Pipeline, make_pipeline

SEED = 42

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=SEED)

y_train_noisy = flip_labels_uniform(y_train, .3, random_state=SEED)  # Flip labels of 30% of samples

clf = RobustLR(random_state=SEED).fit(X_train, y_train_noisy)
print(clf.score(X_test, y_test))
0.8888888888888888

You can use scikit-learn’s built-in tools with scikit-clean. For example, let’s tune one hyper-parameter of the RobustLR model used above, and evaluate the resulting model using cross-validation:

[2]:
from sklearn.model_selection import GridSearchCV, cross_val_score

grid_clf = GridSearchCV(RobustLR(), {'PN': [.1, .2, .4]}, cv=3)
cross_val_score(grid_clf, X, y, cv=5, n_jobs=5).mean()  # Note: we're training & testing on clean data here for simplicity
[2]:
0.8804533457537648

Algorithms

Algorithms implemented in scikit-clean can be broadly categorized into two types. First, we have those that are inherently robust to label noise. They often modify or replace the loss functions of existing well-known algorithms like SVM or Logistic Regression, and do not explicitly try to detect mislabeled samples in the data. RobustLR used above is a robust variant of regular Logistic Regression. These methods can currently be found in the skclean.models module, though this part of the API is likely to change as the number of implementations grows.
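To see why tweaking the loss can help, here is a tiny illustrative sketch (not skclean’s actual implementation of RobustLR): the “unhinged” loss l(y, f) = 1 - y*f is symmetric, meaning its values for the two possible labels always sum to a constant, a property known to make risk minimization robust to uniform label noise.

import numpy as np

# Illustrative only -- NOT how skclean implements RobustLR.
# The unhinged loss l(y, f) = 1 - y*f is "symmetric":
# l(+1, f) + l(-1, f) == 2 for every score f, so uniform label
# flipping shifts every candidate model's risk by the same amount.
def unhinged_loss(y, score):
    return 1 - y * score

for score in np.linspace(-3, 3, 7):
    assert unhinged_loss(+1, score) + unhinged_loss(-1, score) == 2.0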

On the other hand, we have dataset-focused algorithms: their focus is on identifying or cleaning the dataset, and they usually rely on other existing classifiers to do the actual learning. The majority of current scikit-clean implementations fall under this category, so we describe them in a bit more detail in the next section.

Detectors and Handlers

Many robust algorithms designed to handle label noise can essentially be broken down into two sequential steps: detect samples which have (probably) been mislabeled, then use that information to build robust meta-classifiers on top of existing classifiers. This allows us to easily create new robust classifiers by mixing the noise detector of one paper with the noise handler of another.

In scikit-clean, the classes that implement those two tasks are called Detector and Handler respectively. During training, a Detector computes for each sample the probability that it has been correctly labeled (the conf_score). A Handler can use that information in many ways, such as removing likely noisy instances from the dataset (the Filter class) or assigning more weight to reliable samples (the example_weighting module).

Let’s rewrite the above example. We’ll use KDN, a simple neighborhood-based noise detector, and WeightedBagging, a variant of regular bagging that takes sample reliability into account.

[3]:
from skclean.detectors import KDN
from skclean.handlers import WeightedBagging

conf_score = KDN(n_neighbors=5).detect(X_train, y_train_noisy)
clf = WeightedBagging(n_estimators=50).fit(X_train, y_train_noisy, conf_score)
print(clf.score(X_test, y_test))
0.9181286549707602

The above code is fine for a very simple workflow. However, real-world data modeling usually involves many sequential steps for preprocessing, feature selection etc. Moreover, hyper-parameter tuning and cross-validation further complicate the process, which, among other things, frequently leads to Information Leakage. An elegant way to manage this complexity is a Pipeline.

Pipeline

scikit-clean provides a customized Pipeline to manage modeling that involves many sequential steps, including noise detection and handling. Below is an example pipeline. At the very first step, we introduce some label noise on the training set. Preprocessing like scaling and feature selection comes next. The last two steps are noise detection and handling respectively; these two must always be the last steps.

[4]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.svm import SVC
from sklearn.model_selection import ShuffleSplit, StratifiedKFold

from skclean.handlers import Filter
from skclean.pipeline import Pipeline         # Importing from skclean, not sklearn


clf = Pipeline([
        ('scale', StandardScaler()),          # Scale features
        ('feat_sel', VarianceThreshold(.2)),  # Feature selection
        ('detector', KDN()),                  # Detect mislabeled samples
        ('handler', Filter(SVC())),           # Filter out likely mislabeled samples and then train a SVM
])

inner_cv = ShuffleSplit(n_splits=5, test_size=.2, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

clf_g = GridSearchCV(clf, {'detector__n_neighbors': [2, 5, 10]}, cv=inner_cv)

n_clf_g = make_pipeline(UniformNoise(.3), clf_g)           # Create label noise at the very first step
print(cross_val_score(n_clf_g, X, y, cv=outer_cv).mean())  # 5-fold cross-validation
0.9332712311752832

There are two important things to note here. First, don’t use scikit-learn’s Pipeline; import it from skclean.pipeline instead.

Secondly, a group of noise handlers are iterative: they call the detect method of their noise detector multiple times (CLNI, IPF etc.). Since they don’t exactly follow the sequential noise detection -> handling pattern, you must pass the detector in the constructor of those Handlers.

[5]:
from skclean.handlers import CLNI

clf = CLNI(classifier=SVC(), detector=KDN())

All Handlers can be instantiated this way, but it is a must for iterative ones. (Use the iterative attribute to check, as sketched below.)
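For instance, a quick check might look like this (reusing the classes imported above; the booleans in the comments are what the description above leads us to expect, not verified output):

# CLNI is iterative, so its detector must go in the constructor:
print(CLNI(classifier=SVC(), detector=KDN()).iterative)    # expected: True

# Filter follows the sequential detect -> handle pattern:
print(Filter(SVC(), detector=KDN()).iterative)             # expected: False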

Noise Simulation

Remember that scikit-clean is a library written primarily for researchers: you’re expected to have access to the “true” or “clean” labels, and to introduce noise into the training data by flipping those true labels. scikit-clean provides several commonly used noise simulators; take a look at this example to understand their differences. Here we mainly focus on how to use them.

Perhaps the most important thing to remember is that noise simulation should usually be the very first thing you do to your training data. In the code below, GridSearchCV creates a validation set before the noise is introduced, and therefore uses clean labels for the inner loop, leading to information leakage. This is probably NOT what you want.

[6]:
clf = Pipeline([
    ('simulate_noise', UniformNoise(.3)), # Create label noise at first step
    ('scale', StandardScaler()),          # Scale features
    ('feat_sel', VarianceThreshold(.2)),  # Feature selection
    ('detector', KDN()),                  # Detect mislabeled samples
    ('handler', Filter(SVC())),           # Filter out likely mislabeled samples and then train a SVM
])
clf_g = GridSearchCV(clf, {'detector__n_neighbors': [2, 5, 10]}, cv=inner_cv)
print(cross_val_score(clf_g, X, y, cv=outer_cv).mean())  # 5-fold cross-validation
0.9244216736531594

You can also use noise simulators outside a Pipeline; all NoiseSimulator classes are simple wrappers around functions. UniformNoise, for example, is a wrapper around flip_labels_uniform, as the first example of this document shows.
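Concretely, outside a pipeline you can simply call the wrapped function, as in the first example. The commented-out class-based call below is only a guess at a direct API; the simulate_noise method name is an assumption, so check the API reference before relying on it:

# Direct function call, exactly as in the first example:
y_train_noisy = flip_labels_uniform(y_train, .3, random_state=SEED)

# Hypothetical equivalent via the wrapper class (method name is an
# assumption -- consult skclean's API reference):
# X_n, y_train_noisy = UniformNoise(.3, random_state=SEED).simulate_noise(X_train, y_train)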

Datasets & Performance Evaluation

Unlike deep learning datasets, which tend to be massive, tabular datasets are usually a lot smaller. New algorithms are therefore typically compared against baselines on multiple datasets. The skclean.utils module provides two important functions to help researchers with these tasks:

  1. load_data: loads several small to medium-sized preprocessed datasets into memory.

  2. compare: takes several algorithms and datasets, and writes the performances to a CSV file. It supports automatic resumption of partially computed results, which is especially helpful when comparing long-running, computationally expensive methods on big datasets. (A minimal sketch follows this list.)
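A minimal sketch of how the two might be combined; the dataset name and the keyword arguments below are illustrative assumptions, so see the linked notebook and the API reference for exact signatures:

from skclean.utils import load_data, compare

# Assumption: load_data takes a dataset name and returns (X, y).
X, y = load_data('iris')

# Assumption: compare accepts named models and dataset names plus an
# output path, and resumes from partially computed results in that file.
# compare({'robust_lr': RobustLR(), 'kdn_filter': Filter(SVC(), detector=KDN())},
#         datasets=['iris'], cv=5, df_path='results.csv')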

Take a look at this notebook to see how they are used.
