Active learning (also called query learning, or optimal experimental design in statistics). The key hypothesis: if the learning algorithm is allowed to choose the data from which it learns, it will perform better with less training.
Comparison with passive learning: typically, a large amount of data is randomly sampled from the underlying distribution, and this large dataset is used to train a model.
Why active learning?
Consider the simple example below.
Figure (a) illustrates a toy dataset of 400 instances, evenly sampled from two class Gaussians. (b) A logistic regression (LR) model trained with 30 instances randomly sampled from the overall distribution (70% accuracy). (c) An LR model trained with 30 actively queried instances using uncertainty sampling (90% accuracy).
Fig. (b) illustrates the traditional supervised learning approach after randomly selecting 30 instances for labelling, drawn i.i.d. from the unlabelled pool.
Figure (c) shows the active learning approach. The active learner uses uncertainty sampling to focus on the instances closest to its decision boundary, assuming it can adequately explain those in other parts of the input space.
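To make the comparison concrete, here is a minimal sketch of the toy experiment using scikit-learn. The Gaussian centres, the seed instances, and the random seed are illustrative assumptions, not the exact setup behind the figure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 400 points, evenly sampled from two class Gaussians (centres assumed).
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# (b) Passive learning: 30 instances drawn i.i.d. from the pool.
idx = rng.choice(len(X), size=30, replace=False)
passive = LogisticRegression().fit(X[idx], y[idx])

# (c) Active learning: seed with one labelled point per class, then
# repeatedly query the instance whose positive posterior is nearest 0.5.
labelled = [0, 399]
for _ in range(28):
    model = LogisticRegression().fit(X[labelled], y[labelled])
    proba = model.predict_proba(X)[:, 1]
    dist = np.abs(proba - 0.5)
    dist[labelled] = np.inf              # never re-query a labelled point
    labelled.append(int(np.argmin(dist)))
active = LogisticRegression().fit(X[labelled], y[labelled])

print("passive:", passive.score(X, y), " active:", active.score(X, y))
```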
Scenarios
Membership Query Synthesis
The learner may request labels for any instance in the input space, typically including queries it generates de novo rather than samples drawn from some underlying natural distribution [2]. E.g. for a handwritten-digit dataset such as MNIST, the learner might synthesise queries by rotating a digit image or excluding some pieces of it.
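As an illustration, such a query could be synthesised by transforming an existing instance; a minimal sketch, assuming scikit-learn's bundled 8x8 digits as a stand-in for MNIST:

```python
import numpy as np
from scipy.ndimage import rotate
from sklearn.datasets import load_digits

digits = load_digits()
img = digits.images[0]                   # an 8x8 greyscale digit

# Rotate the digit to create an instance no human ever drew...
rotated = rotate(img, angle=30, reshape=False)

# ...or excise a patch of it.
occluded = img.copy()
occluded[2:5, 2:5] = 0.0

# Either synthetic instance could now be sent to the oracle for a label.
queries = np.stack([rotated, occluded])
print(queries.shape)  # (2, 8, 8)
```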
Limitation: labelling such arbitrarily generated instances can be awkward if the oracle is a human annotator. Unexpected problems arise: for image data, the generated images may not be recognisable; for NLP and speech tasks, the synthesised text or speech may be gibberish.
Stream-based selective sampling
An alternative to synthesizing queries: selective sampling. The key assumption: obtaining an unlabelled instance is free (or cheap).
Stream-based selective sampling (a.k.a. stream-based or sequential active learning): each unlabeled instance is typically drawn one at a time from the data source, and the learner must decide whether to query or discard.
Decision criteria:
- Informativeness measure (a.k.a. query strategy): see the Query strategies section below.
- Compute an explicit region of uncertainty, i.e. the part of the instance space that is ambiguous to the learner, and only query instances that fall within it.
- Naive way: set a threshold on the informativeness measure and treat the instances scoring above it as the query region (see the sketch after this list);
- Hypothesis consistency: define the region that is still unknown to the overall model class, i.e. check whether hypotheses of any two models are consistent. If any two models of the same model class agree on all the labelled data but disagree on some unlabelled instance, that instance lies within the region of uncertainty. (Computing this region completely and explicitly is expensive, and it must be maintained after each new query.)
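A minimal sketch of the naive thresholding criterion above, assuming a probabilistic model with a `predict_proba` method and an illustrative entropy threshold:

```python
import numpy as np
from scipy.stats import entropy

THRESHOLD = 0.5  # nats; an illustrative value, tuned per task

def should_query(model, x):
    """Query the oracle iff predictive entropy exceeds the threshold."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    return entropy(proba) > THRESHOLD

# Inside a stream loop (model, stream, oracle are assumed given):
# for x in stream:
#     if should_query(model, x):
#         model = retrain(model, x, oracle(x))  # otherwise discard x
```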
Pool-based sampling
Pool-based sampling evaluates and ranks the entire unlabelled collection before selecting the best query, whilst stream-based sampling makes a decision for one instance at a time. Pool-based sampling is more practical for many real-world problems.
Query strategies
All active learning scenarios involve evaluating the informativeness of unlabelled instances, whether generated de novo or sampled from a given distribution.
Let $x^*_A$ denote the most informative instance (i.e. the best query) according to some query strategy $A$, $\theta$ the current model, and $P_\theta(y \mid x)$ the posterior over labels under $\theta$.
Uncertainty sampling
Least confident strategy
The active learner queries the instances about which it is least certain how to label. E.g. for binary classification models, simply query the instance whose posterior probability of being positive is nearest 0.5.
For three or more class labels, a more general uncertainty sampling variant queries the instance whose prediction is the least confident:

$$x^*_{LC} = \underset{x}{\operatorname{argmax}}\; 1 - P_\theta(\hat{y} \mid x)$$

where $\hat{y} = \operatorname{argmax}_y P_\theta(y \mid x)$, the class label with the highest posterior probability under the model $\theta$. One way to interpret this uncertainty measure is as the expected 0/1-loss, i.e. the model's belief that it will mislabel $x$.
Limitations: the least confident strategy only considers information about the most probable label; it "throws away" information about the rest of the label distribution. Margin sampling incorporates the posterior of the second most likely label to correct this.
Margin sampling
$$x^*_{M} = \underset{x}{\operatorname{argmin}}\; P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$$

where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable class labels under the model, respectively.
Intuition: instances with large margins are easy, since the classifier has little doubt in differentiating between the two most likely class labels. Instances with small margins are more ambiguous, so knowing the true label would help the model discriminate between them.
Limitations: for problems with very large label sets, the margin method still ignores much of the output distribution over the remaining classes. Solution -> entropy.
Entropy sampling
$$x^*_{H} = \underset{x}{\operatorname{argmax}}\; -\sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x)$$

where $y_i$ ranges over all possible labelings.
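All three measures can be computed directly from a model's predicted class posteriors. A minimal NumPy sketch over a small hypothetical pool (note that the three strategies can disagree about which instance to query):

```python
import numpy as np

def least_confident(proba):
    # 1 - P(y_hat | x): higher means less confident.
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # P(y_hat1 | x) - P(y_hat2 | x): a smaller margin is more ambiguous,
    # so negate it to turn it into an informativeness score.
    part = np.sort(proba, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy(proba):
    # -sum_i P(y_i | x) log P(y_i | x), guarding against log(0).
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Hypothetical pool of 3 instances with 3-class posteriors per row.
proba = np.array([[0.1, 0.8, 0.1],
                  [0.4, 0.35, 0.25],
                  [0.5, 0.5, 0.0]])
for score in (least_confident, margin, entropy):
    print(score.__name__, "-> query instance", int(np.argmax(score(proba))))
```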
The relative performance of these three uncertainty sampling methods is application-dependent. As Settles puts it in the survey [1]: “Intuitively, though, entropy seems appropriate if the objective function is to minimize log-loss, while the other two (particularly margin) are more appropriate if we aim to reduce classification error, since they prefer instances that would help the model better discriminate among specific classes.”
Query-By-Committee
Query-by-committee (QBC) algorithm: maintain a committee $\mathcal{C} = \{\theta^{(1)}, \dots, \theta^{(C)}\}$ of models, all trained on the current labelled set $\mathcal{L}$ but representing competing hypotheses. Each committee member votes on the labelings of query candidates; the most informative query is the instance about which they most disagree.
Measure the level of disagreement:
vote entropy:

$$x^*_{VE} = \underset{x}{\operatorname{argmax}}\; -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}$$

where $y_i$ ranges over all possible labelings, $V(y_i)$ is the number of “votes” a label receives from the committee members, and $C$ is the committee size.
Kullback-Leibler (KL) divergence:

$$x^*_{KL} = \underset{x}{\operatorname{argmax}}\; \frac{1}{C} \sum_{c=1}^{C} D\big(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\big)$$

where

$$D\big(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\big) = \sum_i P_{\theta^{(c)}}(y_i \mid x) \log \frac{P_{\theta^{(c)}}(y_i \mid x)}{P_{\mathcal{C}}(y_i \mid x)}$$

Here $\theta^{(c)}$ is a particular model in the committee and $\mathcal{C}$ is the committee as a whole, so $P_{\mathcal{C}}(y_i \mid x) = \frac{1}{C}\sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x)$ is the “consensus” probability that $y_i$ is the correct label.
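A minimal sketch of both disagreement measures for a single candidate instance, assuming each committee member outputs a full class posterior:

```python
import numpy as np

# votes[c] = posterior over K labels from committee member c (shape C x K).
votes = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.6, 0.3, 0.1]])
C, K = votes.shape

# Vote entropy over the members' hard votes.
hard = votes.argmax(axis=1)
V = np.bincount(hard, minlength=K) / C
vote_entropy = -np.sum(V[V > 0] * np.log(V[V > 0]))

# Average KL divergence of each member from the consensus distribution.
consensus = votes.mean(axis=0)
kl = np.sum(votes * np.log(votes / consensus), axis=1)  # D(P_c || P_C)
avg_kl = kl.mean()

print(vote_entropy, avg_kl)
```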
Expected model change
A decision-theoretic approach: selecting the instance that would impart the greatest change to the current model if we knew its label.
Expected gradient length (EGL): applies to gradient-based training. The “change” imparted to the model can be measured by the length of the training gradient (i.e. the vector used to re-estimate parameter values). Let $\nabla \ell_\theta(\mathcal{L})$ be the gradient of the objective function $\ell$ with respect to the model parameters $\theta$. The learner queries the instance

$$x^*_{EGL} = \underset{x}{\operatorname{argmax}}\; \sum_i P_\theta(y_i \mid x)\, \big\lVert \nabla \ell_\theta\big(\mathcal{L} \cup \langle x, y_i \rangle\big) \big\rVert$$

where $\lVert \cdot \rVert$ is the Euclidean norm of each resulting gradient vector. Note that $\nabla \ell_\theta(\mathcal{L}) \approx 0$ at query time, since $\ell$ converged at the previous round of training, so we can approximate $\nabla \ell_\theta(\mathcal{L} \cup \langle x, y_i \rangle) \approx \nabla \ell_\theta(\langle x, y_i \rangle)$ for computational efficiency.
Intuition: prefer the instance that is likely to most influence the model (i.e. have the greatest impact on its parameters), regardless of the resulting query label.
Limitations: EGL can be led astray if features are not properly scaled, i.e. the informativeness of a given instance may be over-estimated simply because one of its feature values happens to be unusually large, or because the corresponding parameter estimate is larger. Solution -> parameter regularisation.
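For binary logistic regression with log-loss, the per-instance gradient is $(p - y)x$, so the EGL score reduces to $2p(1-p)\lVert x \rVert$. A minimal sketch (the weights and pool are illustrative, and the intercept is omitted), which also makes the feature-scaling limitation visible:

```python
import numpy as np

def egl(w, X_pool):
    """Expected gradient length of each pool instance under weights w,
    using grad(L + <x,y>) ~ grad(<x,y>) since grad(L) ~ 0 at convergence."""
    p = 1.0 / (1.0 + np.exp(-X_pool @ w))    # P(y=1 | x)
    norms = np.linalg.norm(X_pool, axis=1)
    # sum_y P(y|x) * ||(p - y) x|| = p(1-p)||x|| + (1-p)p||x||
    return 2.0 * p * (1.0 - p) * norms

w = np.array([1.0, -0.5])
X_pool = np.array([[0.2, 0.1], [3.0, 3.0], [0.5, -0.4]])
print(np.argmax(egl(w, X_pool)))  # note how ||x|| inflates the score
```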
Expected error reduction
Idea: estimate the expected future error of a model trained on $\mathcal{L} \cup \langle x, y \rangle$ over the remaining unlabelled instances in $\mathcal{U}$, and query the instance with the minimal expected future error (risk).
One approach is to minimize the expected 0/1-loss:

$$x^*_{0/1} = \underset{x}{\operatorname{argmin}}\; \sum_i P_\theta(y_i \mid x) \left( \sum_{u=1}^{U} 1 - P_{\theta^{+\langle x, y_i \rangle}}\big(\hat{y} \mid x^{(u)}\big) \right)$$

where $\theta^{+\langle x, y_i \rangle}$ refers to the new model after it has been re-trained with the tuple $\langle x, y_i \rangle$ added to $\mathcal{L}$.
One interpretation of this strategy (with log-loss in place of 0/1-loss) is maximizing the expected information gain of the query $x$, or (equivalently) the mutual information of the output variables over $x$ and $\mathcal{U}$.
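A minimal sketch of the 0/1-loss variant with scikit-learn; `model` is assumed to be already fit on the labelled data, and note the cost of one retraining per (candidate, label) pair per round:

```python
import numpy as np
from sklearn.base import clone

def err_reduction_scores(model, X_lab, y_lab, X_pool):
    """Expected future 0/1 error for each candidate query (lower is better)."""
    proba_pool = model.predict_proba(X_pool)
    scores = np.empty(len(X_pool))
    for i, x in enumerate(X_pool):
        expected = 0.0
        for j, y_i in enumerate(model.classes_):
            # Retrain on L + <x, y_i>, weighted by the current model's belief.
            m = clone(model).fit(np.vstack([X_lab, x]), np.append(y_lab, y_i))
            rest = np.delete(X_pool, i, axis=0)
            loss = np.sum(1.0 - m.predict_proba(rest).max(axis=1))
            expected += proba_pool[i, j] * loss
        scores[i] = expected
    return scores  # query X_pool[np.argmin(scores)]
```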
Variance reduction
Reduce generalization error indirectly by minimizing output variance.
For regression problems, the expected future error decomposes into noise, bias, and variance:

$$E_T\big[(\hat{y} - y)^2 \mid x\big] = E\big[(y - E[y \mid x])^2\big] + \big(E_{\mathcal{L}}[\hat{y}] - E[y \mid x]\big)^2 + E_{\mathcal{L}}\big[(\hat{y} - E_{\mathcal{L}}[\hat{y}])^2\big]$$

where $E_{\mathcal{L}}[\cdot]$ is an expectation over the labelled set $\mathcal{L}$, $E[\cdot]$ is an expectation over the conditional density $P(y \mid x)$, and $E_T$ is an expectation over both; the three terms are the noise, the squared bias, and the model variance, respectively. Since the learner cannot affect the noise or (for a fixed model class) the bias, minimizing the output variance also minimizes the expected future generalization error.
Density-weighted methods
Challenges for simpler strategies: a central idea of the estimated error and variance reduction frameworks is that they consider the entire input space rather than individual instances, so they are less prone to querying outliers than simpler query strategies like uncertainty sampling, QBC, and EGL. Under those simpler strategies, the least certain instance may lie on the classification boundary yet not be “representative” of other instances in the distribution, so knowing its label is unlikely to improve accuracy on the data as a whole.
Informative instances should not only be those which are uncertain, but also those which are “representative” of the underlying distribution (i.e., inhabit dense regions of the input space). The information density framework captures this:

$$x^*_{ID} = \underset{x}{\operatorname{argmax}}\; \phi_A(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \operatorname{sim}\big(x, x^{(u)}\big) \right)^{\beta}$$

Here, $\phi_A(x)$ is the informativeness of $x$ according to some “base” query strategy $A$ (e.g. an uncertainty sampling or QBC approach), and the second term weights it by the average similarity of $x$ to all other instances in the input distribution (approximated by $\mathcal{U}$), with $\beta$ controlling the relative importance of the density term.
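A minimal sketch, using entropy as the base strategy $\phi_A$ and cosine similarity as $\operatorname{sim}(\cdot,\cdot)$ (both are illustrative choices):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density(proba_pool, X_pool, beta=1.0):
    """Density-weighted informativeness for each pool instance."""
    # Base strategy phi_A: entropy-based uncertainty sampling.
    base = -np.sum(proba_pool * np.log(proba_pool + 1e-12), axis=1)
    # Average similarity of each instance to the rest of the pool.
    density = cosine_similarity(X_pool).mean(axis=1)
    return base * density ** beta  # query X_pool[np.argmax(...)]
```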
References
- [1] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. 2009.
- [2] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.