docs/source/cleanlab/datalab/guide/issue_type_description.rst
7 additions & 4 deletions
@@ -41,7 +41,7 @@ Examples whose given label is estimated to be potentially incorrect (e.g. due to
Datalab estimates which examples appear mislabeled as well as a numeric label quality score for each, which quantifies the likelihood that an example is correctly labeled.
For now, Datalab can only detect label issues in a multi-class classification dataset.
-The cleanlab library has alternative methods you can us to detect label issues in other types of datasets (multi-label, multi-annotator, token classification, etc.).
+The cleanlab library has alternative methods you can use to detect label issues in other types of datasets (multi-label, multi-annotator, token classification, etc.).
Label issues are calculated based on provided `pred_probs` from a trained model. If you do not provide this argument, this type of issue will not be considered.
For the most accurate results, provide out-of-sample `pred_probs` which can be obtained for a dataset via `cross-validation <https://docs.cleanlab.ai/stable/tutorials/pred_probs_cross_val.html>`_.
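A minimal sketch of obtaining such out-of-sample `pred_probs` with scikit-learn's `cross_val_predict`; the toy dataset and `LogisticRegression` model here are illustrative stand-ins for your own data and classifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy multi-class dataset -- a stand-in for your own features and labels.
X, labels = make_classification(
    n_samples=200, n_classes=3, n_informative=4, random_state=0
)

# cross_val_predict returns, for every example, predictions made by a model
# that never saw that example during training -> out-of-sample pred_probs.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X,
    labels,
    cv=5,
    method="predict_proba",
)

print(pred_probs.shape)  # (200, 3): one probability per class per example
```

The resulting `pred_probs` array is exactly the shape (`n_examples`, `n_classes`) that the `pred_probs` argument expects.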
@@ -50,6 +50,7 @@ Having mislabeled examples in your dataset may hamper the performance of supervi
For evaluating models or performing other types of data analytics, mislabeled examples may lead you to draw incorrect conclusions.
To handle mislabeled examples, you can either filter out the data with label issues or try to correct their labels.
+Learn more about the method used to detect label issues in our paper: `Confident Learning: Estimating Uncertainty in Dataset Labels <https://arxiv.org/abs/1911.00068>`_
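As a rough illustration of the idea (not Datalab's full confident-learning procedure), the simplest per-example label quality score is just the model's predicted probability of the given label:

```python
import numpy as np

def self_confidence(pred_probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Probability the model assigns to each example's given label.

    Low scores suggest a possible label issue; confident learning builds
    calibrated per-class thresholds on top of quantities like this one.
    """
    return pred_probs[np.arange(len(labels)), labels]

# Hypothetical pred_probs for 3 examples and 2 classes.
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.05, 0.95]])
labels = np.array([0, 0, 0])  # the last two given labels look wrong
scores = self_confidence(pred_probs, labels)
# scores == [0.9, 0.2, 0.05]; the last example is the most suspect
```

Ranking examples by such a score and inspecting the lowest-scoring ones is the basic workflow the paper formalizes.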
Outlier Issue
@@ -68,6 +69,8 @@ When based on `pred_probs`, the outlier quality of each example is scored invers
Modeling data with outliers may have unexpected consequences.
Closely inspect them and consider removing some outliers that may be negatively affecting your models.
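A sketch of the embedding-based variant, scoring each example inversely to its average distance to its k nearest neighbors; the planted outlier and the exact score transformation are illustrative assumptions, not Datalab's internals:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))            # inlier cluster
features = np.vstack([features, [[10.0] * 5]])  # one planted, obvious outlier

# Average distance to the k nearest neighbors; larger -> more outlier-like.
k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
dists, _ = nn.kneighbors(features)
avg_dist = dists[:, 1:].mean(axis=1)  # drop the self-distance in column 0

# Map to a (0, 1] quality score where smaller values flag outliers,
# matching the convention that lower scores indicate more severe issues.
outlier_quality = 1.0 / (1.0 + avg_dist)
print(outlier_quality.argmin())  # 100: the planted outlier scores worst
```

The lowest-scoring examples are the ones worth inspecting first.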
+Learn more about the methods used to detect outliers in our article: `Out-of-Distribution Detection via Embeddings or Predictions <https://cleanlab.ai/blog/outlier-detection/>`_
+
(Near) Duplicate Issue
----------------------
@@ -82,7 +85,7 @@ Near duplicated examples may record the same information with different:
Near Duplicate issues are calculated based on provided `features` or `knn_graph`.
If you do not provide one of these arguments, this type of issue will not be considered.
-Datalab defines near duplicates as those examples whose distance to their nearest neighbor (in the space of provided `features`) in the dataset is less than `c * D`, where `0 < c < 1` is a fractional constant parameter, and `D` is the median (over the full dataset) of such distances between each example and its nearest neighbor.
+Datalab defines near duplicates as those examples whose distance to their nearest neighbor (in the space of provided `features`) in the dataset is less than `c * D`, where `0 < c < 1` is a small constant, and `D` is the median (over the full dataset) of such distances between each example and its nearest neighbor.
Scoring the numeric quality of an example in terms of the near duplicate issue type is done proportionally to its distance to its nearest neighbor.
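The `c * D` rule can be sketched as follows; the value of `c` here is an arbitrary illustrative choice, not Datalab's default:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def near_duplicates(features: np.ndarray, c: float = 0.13) -> np.ndarray:
    """Flag examples whose nearest-neighbor distance is below c * D,
    where D is the median nearest-neighbor distance over the dataset.
    (c = 0.13 is an illustrative choice for this sketch.)
    """
    nn = NearestNeighbors(n_neighbors=2).fit(features)
    dists, _ = nn.kneighbors(features)
    nn_dist = dists[:, 1]       # column 0 is the zero self-distance
    D = np.median(nn_dist)
    return nn_dist < c * D

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = np.vstack([X, X[0] + 1e-6])  # plant a near duplicate of example 0
flags = near_duplicates(X)
print(flags[0], flags[50])       # True True: the duplicated pair is flagged
```

The same `nn_dist` values, scaled appropriately, also yield per-example quality scores proportional to nearest-neighbor distance.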
Including near-duplicate examples in a dataset may negatively impact a ML model's generalization performance and lead to overfitting.
@@ -96,9 +99,9 @@ Whether the dataset exhibits statistically significant violations of the IID ass
The Non-IID issue is detected based on provided `features` or `knn_graph`. If you do not provide one of these arguments, this type of issue will not be considered.
-Mathematically, the **overall** Non-IID score for the dataset is defined as the p-value of a statistical test for whether the distribution of *index-gap* values differs between group A vs. group B defined as follows. For a pair of examples in the dataset `x1, x2`, we define their *index-gap* as the distance between the indices of these examples in the ordering of the data (e.g. if `x1` is the 10th example and `x2` is the 100th example in the dataset, their index-gap is 90). We construct group A from pairs of examples which are amongst the K nearest neighbors of each other, where neighbors are defined based on the provided `knn_graph` or via distances in the space of the provided vector `features` . Group B is constructed from random pairs of examples in the dataset.
+Mathematically, the **overall** Non-IID score for the dataset is defined as the p-value of a statistical test for whether the distribution of *index-gap* values differs between group A vs. group B, defined as follows. For a pair of examples in the dataset `x1, x2`, we define their *index-gap* as the distance between the indices of these examples in the ordering of the data (e.g. if `x1` is the 10th example and `x2` is the 100th example in the dataset, their index-gap is 90). We construct group A from pairs of examples which are amongst the K nearest neighbors of each other, where neighbors are defined based on the provided `knn_graph` or via distances in the space of the provided vector `features`. Group B is constructed from random pairs of examples in the dataset.
-The Non-IID quality score for each example `x` is defined via a similarly computed p-value but with Group A constructed from the K nearest neighbors of `x` and Group B constructed from random examples from the dataset paired with `x`.
+The Non-IID quality score for each example `x` is defined via a similarly computed p-value but with Group A constructed from the K nearest neighbors of `x` and Group B constructed from random examples from the dataset paired with `x`. Learn more about the math behind this method in our paper: `Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors <https://arxiv.org/abs/2305.15696>`_
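A rough sketch of the overall index-gap comparison, using a two-sample Kolmogorov-Smirnov test as a stand-in for the statistical test (the library's actual test may differ); the sorted 1-D data is a contrived example where dataset ordering clearly correlates with feature values:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n, k = 500, 10

# Sorted data: nearby indices have similar features -> strong non-IID signal.
features = np.sort(rng.normal(size=n)).reshape(-1, 1)

# Group A: index-gaps between each example and its k nearest neighbors.
nbr_idx = NearestNeighbors(n_neighbors=k + 1).fit(features).kneighbors(
    features, return_distance=False)[:, 1:]   # drop self in column 0
gaps_a = np.abs(nbr_idx - np.arange(n)[:, None]).ravel()

# Group B: index-gaps between random pairs of examples.
i, j = rng.integers(n, size=(2, gaps_a.size))
gaps_b = np.abs(i - j)

# Small p-value -> the two gap distributions differ -> data looks non-IID.
p_value = ks_2samp(gaps_a, gaps_b).pvalue
print(p_value < 0.01)  # True for this deliberately ordered dataset
```

For genuinely shuffled IID data, the two gap distributions match and the p-value stays large.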
The assumption that examples in a dataset are Independent and Identically Distributed (IID) is fundamental to most proper modeling. Detecting all possible violations of the IID assumption is statistically impossible. This issue type only considers specific forms of violation where examples that tend to be closer together in the dataset ordering also tend to have more similar feature values. This includes scenarios where: