API Reference

Detectors (skclean.detectors)

skclean.detectors.KDN([n_neighbors, weight, …])

For each sample, the percentage of its nearest neighbors sharing its label serves as its conf_score.
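
The idea behind a KDN-style conf_score can be sketched with scikit-learn's NearestNeighbors. This is an illustrative re-implementation of the concept, not skclean's actual code; the function name is hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def kdn_conf_score(X, y, n_neighbors=5):
    """Fraction of each sample's k nearest neighbors sharing its label (sketch)."""
    # query k+1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]            # drop the point itself
    return (neighbor_labels == y[:, None]).mean(axis=1)

X, y = make_classification(n_samples=100, random_state=0)
conf = kdn_conf_score(X, y)                    # low conf suggests a possibly noisy label
```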

skclean.detectors.ForestKDN([n_neighbors, …])

Like KDN, but a trained Random Forest is used to compute pairwise similarity.

skclean.detectors.RkDN([n_neighbors, …])

Like KDN, but uses reverse nearest neighbors: conf_score is based on the samples that count this sample among their own nearest neighbors and share its label.

skclean.detectors.PartitioningDetector([…])

Partitions dataset into n subsets, trains a classifier on each.

skclean.detectors.MCS([classifier, n_steps, …])

Detects noise using a sequential Markov Chain Monte Carlo sampling algorithm.

skclean.detectors.InstanceHardness([…])

A set of classifiers are used to predict labels of each sample using cross-validation.

skclean.detectors.RandomForestDetector([…])

Trains a Random Forest first; for each sample, only trees that didn't select it for training (via bootstrapping) are used to predict its label.
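
This out-of-bag idea can be illustrated with scikit-learn's RandomForestClassifier, whose `oob_decision_function_` holds class probabilities predicted only by trees that did not see each sample during bootstrap training. A conceptual sketch, not skclean's implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=1)
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=1).fit(X, y)
# probability the out-of-bag trees assign to each sample's *given* label;
# a low value suggests the label may be noisy
conf_score = rf.oob_decision_function_[np.arange(len(y)), y]
```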

Handlers (skclean.handlers)

skclean.handlers.Filter(classifier[, …])

Removes from the dataset the samples most likely to be noisy.
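
Conceptually, filtering amounts to dropping low-confidence samples before training the final classifier. A minimal sketch of that flow (the stand-in conf_score and the 0.5 cutoff are illustrative, not skclean defaults):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
# stand-in conf_score: agreement of a simple k-NN with the given label
knn_pred = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(X)
conf_score = (knn_pred == y).astype(float)

keep = conf_score >= 0.5                       # illustrative cutoff
clf = LogisticRegression().fit(X[keep], y[keep])
```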

skclean.handlers.FilterCV(classifier[, …])

For quickly finding the best cutoff threshold for Filter using cross-validation.

skclean.handlers.CLNI(classifier, detector)

Iteratively detects and filters out mislabelled samples until a stopping criterion is met.

skclean.handlers.IPF(classifier, detector[, …])

Iterative Partitioning Filter: iteratively detects and filters out mislabelled samples until a stopping criterion is met.

skclean.handlers.SampleWeight(classifier[, …])

Simply passes conf_score (computed with detector) as sample weight to the underlying classifier.
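
The underlying mechanism is scikit-learn's standard `sample_weight` argument to `fit`. A sketch with a placeholder conf_score (random here purely for illustration; in practice it would come from a detector):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
conf_score = np.random.default_rng(0).uniform(size=100)   # placeholder scores
# scikit-learn estimators accept per-sample weights directly:
clf = LogisticRegression().fit(X, y, sample_weight=conf_score)
```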

skclean.handlers.WeightedBagging([…])

Similar to regular bagging, except that cleaner samples are more likely to be drawn during bootstrap sampling.

skclean.handlers.Costing([classifier, …])

Implements costing, a method combining cost-proportionate rejection sampling and ensemble aggregation.

Models (skclean.models)

skclean.models.RobustForest([method, K, …])

Uses a random forest to compute pairwise similarity/distance, then a simple K-Nearest-Neighbors classifier that operates on that similarity matrix.

skclean.models.RobustLR([PN, NP, C, …])

Modifies the logistic loss using class-dependent (estimated) noise rates for robustness.

Pipeline (skclean.pipeline)

The skclean.pipeline module implements utilities to build a composite estimator as a chain of transforms and estimators.

skclean.pipeline.Pipeline(**kwargs)

Sequentially applies a list of transforms and a final estimator.

skclean.pipeline.make_pipeline(*steps, **kwargs)

Construct a Pipeline from the given estimators.

Noise Simulation (skclean.simulate_noise)

skclean.simulate_noise.flip_labels_uniform(Y, …)

All labels are equally likely to be flipped, irrespective of their true label or features.
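
Uniform flipping is straightforward to sketch with NumPy. This is an illustrative re-implementation of the idea, not skclean's function; the name and signature here are hypothetical:

```python
import numpy as np

def flip_uniform(y, noise_level, seed=None):
    """Flip a `noise_level` fraction of labels, each to a different
    class chosen uniformly at random (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    classes = np.unique(y)
    n_flip = int(noise_level * len(y))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    for i in flip_idx:
        others = classes[classes != y[i]]      # never "flip" to the same label
        y_noisy[i] = rng.choice(others)
    return y_noisy

y = np.array([0, 1] * 50)
y_noisy = flip_uniform(y, 0.3, seed=0)         # exactly 30% of labels flipped
```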

skclean.simulate_noise.flip_labels_cc(y, lcm)

Class-Conditional Noise: a general version of flip_labels_uniform; a sample's probability of being mislabelled and its new (noisy) label depend on its true label, but not on its features.
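
The `lcm` argument is a label confusion (transition) matrix whose row i gives the distribution of noisy labels for samples whose true label is i. A conceptual NumPy sketch, with a hypothetical function name:

```python
import numpy as np

def flip_cc(y, lcm, seed=None):
    """Draw each sample's noisy label from the row of the label
    confusion matrix `lcm` indexed by its true label (sketch)."""
    rng = np.random.default_rng(seed)
    classes = np.arange(lcm.shape[0])
    return np.array([rng.choice(classes, p=lcm[label]) for label in y])

# e.g. class 0 keeps its label 90% of the time, class 1 only 70%
lcm = np.array([[0.9, 0.1],
                [0.3, 0.7]])
y = np.array([0] * 100 + [1] * 100)
y_noisy = flip_cc(y, lcm, seed=0)
```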

skclean.simulate_noise.UniformNoise(noise_level)

All labels are equally likely to be flipped, irrespective of their true label or features.

skclean.simulate_noise.CCNoise([lcm, …])

Class-Conditional Noise: a general version of flip_labels_uniform; a sample's probability of being mislabelled and its new (noisy) label depend on its true label, but not on its features.

skclean.simulate_noise.BCNoise(classifier, …)

Boundary-Consistent Noise: instances closer to the decision boundary are more likely to be mislabelled.

References