Beyond one-hot encoding: Lower dimensional target embedding☆
Introduction
Convolutional Neural Networks (CNNs) lie at the core of the latest breakthroughs in large-scale image recognition [1,2], at present even surpassing human performance [3] in the classification of objects [4], faces [5], or scenes [6]. Due to its effectiveness and simplicity, one-hot encoding is still the most prevalent procedure for addressing such multi-class classification tasks: in essence, a function is modeled that maps image samples to a probability distribution over a discrete set of n target labels.
Unfortunately, when the output space grows, class labels no longer properly span the full label space, mainly because of existing cross-correlations between labels. Consequently, one-hot encoding may prove inadequate for fine-grained classification tasks, since projecting the outputs into a higher-dimensional (orthogonal) space dramatically increases the parameter space of the computed models. In addition, for datasets with a large number of labels, the ratio of samples per label is typically reduced. This constitutes an additional challenge for training CNN models in large output spaces, and a reason for slow convergence rates [7].
In order to address the aforementioned limitations, output embeddings have been proposed as an alternative to one-hot encoding for training in large output spaces [8]: depending on the specific classification task at hand, different output embeddings capture different aspects of the structure of the output space. Indeed, since embeddings use weight sharing during training to find simpler (and more natural) partitions of classes, the latent relationships between categories are included in the modeling process.
According to Akata et al. [9], output embeddings can be categorized as:
- Data-independent embeddings, such as those drawing rows or columns from a Hadamard matrix [10]: these produce strong baselines [11], since the embedded classes are equidistant due to the lack of prior knowledge;
- Embeddings based on a priori information, such as attributes [12] or hierarchies [13]: unfortunately, learning from attributes requires expert knowledge or extra labeling effort, hierarchies require a prior understanding of a taxonomy of classes, and approaches that use textual data as a prior do not guarantee visual similarity [11]; and
- Learned embeddings, which capture the semantic structure of word sequences (i.e., annotations) and images jointly [14]. Their main drawbacks are the need for a large amount of data and slow training.
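As an illustration of the data-independent case, the rows of a Hadamard matrix give pairwise-equidistant class codes. A minimal sketch using the Sylvester construction (the class count and matrix size below are illustrative choices, not the paper's setup):

```python
import numpy as np

def hadamard(m):
    # Sylvester construction: H_{2m} = [[H, H], [H, -H]], entries in {-1, 1}
    H = np.array([[1]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H

# Encode n classes with rows of H_16; drop the constant all-ones first column
n = 10
H = hadamard(16)          # smallest power of two >= n
codes = H[:n, 1:]         # one 15-dim code per class

# Distinct rows of a Hadamard matrix are orthogonal, so every pair of the
# resulting codes differs in exactly 8 of the 15 positions (equidistant).
dists = [(codes[i] != codes[j]).sum()
         for i in range(n) for j in range(i + 1, n)]
```

Because no prior knowledge enters the construction, no pair of classes is privileged over any other, which is exactly the equidistance property mentioned above.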
Thus, in cases where high-quality attributes exist, methods with prior information are preferred, while for a label space known to be equidistant, data-independent embeddings are a more suitable alternative. Unfortunately, the architectural design of a model is bound to the particular choice among the above-mentioned embeddings: once a model is chosen and trained using a specific output embedding, it is hard to reuse it for another task requiring a different type of embedding.
In this paper, Error-Correcting Output Codes (ECOCs) are shown to be a better alternative to one-hot encoding for image recognition: since ECOCs are a generalization of the three embedding categories [15], a change in the ECOC matrix does not require a change in the chosen architecture. In addition, ECOCs naturally enable error correction, low-dimensional embedding spaces [16], and bias and variance error reduction [17].
Inspired by the latest advances on ECOCs, we circumvent one-hot encoding by integrating Error-Correcting Output Codes into CNNs as a generalization of output embeddings. The result is a best-of-both-worlds approach: compact outputs, data-based hierarchies, and error correction. With our approach, training models in low-dimensional spaces drastically improves convergence speed in comparison to one-hot encoding. Fig. 1 shows an overview of the proposed model.
The rest of the paper is organized as follows: Section 2 reviews the existing work most closely related to this paper. Section 3 presents the contribution of the proposed embedding technique, which is twofold: (i) we show that random projections of the label space are suitable for finding useful lower-dimensional embeddings, while dramatically boosting convergence rates at zero computational cost; and (ii) in order to generate partitions of the label space that are more discriminative than the random encoding (which produces random partitions of the label space), we propose a normalized eigenrepresentation of the class manifold to encode the targets with minimal information loss, thus improving the accuracy of the random-projection encoding while enjoying the same convergence rates. Subsequently, the experimental results on CIFAR-100 [18], CUB200-2011 [19], MIT Places [6], and ImageNet [1] presented in Section 4 show that our approach drastically improves convergence speed while maintaining a competitive accuracy. Lastly, Section 5 concludes the paper discussing how, when gradient sparsity on the output neurons is highly reduced, more robust gradient estimates and better representations can be found.
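The two encodings in contribution (i) and (ii) can be sketched in a few lines. This is a minimal illustration under our own assumptions (random stand-in class means, illustrative dimensions), not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, k, d = 100, 32, 256   # classes, code length, feature dim (illustrative)

# (i) Random projection: each class receives a random unit-norm k-dim code,
# replacing its n-dim one-hot target at zero computational cost.
M_rand = rng.standard_normal((n_classes, k))
M_rand /= np.linalg.norm(M_rand, axis=1, keepdims=True)

# (ii) Eigenrepresentation: project per-class mean features (the class
# manifold) onto their top-k principal directions, keeping as much of the
# class structure as k dimensions allow; random stand-in means are used here.
means = rng.standard_normal((n_classes, d))
centered = means - means.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
M_eig = centered @ Vt[:k].T
M_eig /= np.linalg.norm(M_eig, axis=1, keepdims=True)

def decode(M, z):
    # assign the class whose code has the highest cosine similarity to z
    return int(np.argmax(M @ z))
```

Decoding a clean code recovers its class, since each unit-norm code is most similar to itself; unlike the random codes, the eigen codes place correlated classes close together in the embedding space.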
Related work
This section reviews those works on output embeddings most related to ours, in particular those using ECOC.
Low dimensional target embedding
Fig. 1 depicts our proposed model, inspired by the ECOC framework [34] and applied to deep supervised learning. Given a set of n classes, an ECOC consists of a set of k binary partitions of the label space (groups of classes) representing each of the n classes in the dataset. The codes are usually arranged in a design matrix M ∈ {−1, 1}^{n×k}.
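A toy instance of such a design matrix, together with encoding and nearest-code decoding, can be sketched as follows. The sizes and the bit-repetition construction are illustrative choices of ours, not the matrix used in the paper:

```python
import numpy as np

n, kb = 8, 3                      # 8 classes, 3 bits suffice to tell them apart
bits = (np.arange(n)[:, None] >> np.arange(kb)) & 1      # distinct binary codes
M = np.tile(2 * bits - 1, (1, 3))  # repeat each bit 3x: M in {-1,1}^{8x9},
                                   # minimum Hamming distance between rows = 3

def encode(y):
    """Target code for class y, replacing its n-dim one-hot vector."""
    return M[y]

def decode(z):
    """Assign the class whose row of M is closest in Hamming distance to sign(z)."""
    return int(np.argmin((M != np.sign(z)).sum(axis=1)))

code = encode(5).astype(float)
code[0] *= -1                     # flip one output bit
recovered = decode(code)          # still class 5: the code corrects the error
```

With minimum row distance 3, any single flipped output bit leaves the corrupted code strictly closer to the true row than to any other, which is the error-correcting behavior one-hot targets lack.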
Let us define the output of the last layer of a neural network as z_l, with l the depth of the network. For the sake of clarity, the identity non-linearity ϕ(⋅)
Experiments
To validate our approach, we perform a thorough analysis of the advantages of embedding output codes in CNN models over different state-of-the-art datasets. First, we describe the considered datasets, methods and evaluation.
Conclusion
In this work, output codes are integrated into the training of deep CNNs on large-scale datasets. We found that CNNs trained on CIFAR-100, CUB200, ImageNet, and MIT Places using our approach show less sparsity at the output neurons. As a result, models trained with our approach showed more robust gradient estimates and faster convergence rates than those trained with the prevalent one-hot encoding at a small cost, especially for huge label spaces. As a side effect, CNNs trained with our
Acknowledgments
Authors acknowledge the support of the Spanish project TIN2015-65464-R (MINECO FEDER), the 2016FI_B 01163 grant (Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya), and the COST Action IC1307 iV&L Net (European Network on Integrating Vision and Language) supported by COST (European Cooperation in Science and Technology). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU and a GTX TITAN
References (51)
- et al., Minimal design of error-correcting output codes, Pattern Recogn. Lett. (2012)
- et al., Error-correcting output coding corrects bias and variance
- et al., Facial action unit recognition using multi-class classification, Neurocomputing (2015)
- et al., On a class of error correcting binary group codes, Inf. Control (1960)
- et al., Spectral error correcting output codes for efficient multiclass recognition (2009)
- et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. (2015)
- et al., Microsoft COCO: common objects in context
- et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification
- et al., The Pascal visual object classes challenge: a retrospective, Int. J. Comput. Vis. (2015)
- et al., Deep learning face attributes in the wild
- Learning deep features for scene recognition using places database
- Deep networks with large output spaces
- Label embedding trees for large multi-class tasks
- Label-embedding for image classification, IEEE Trans. Pattern Anal. Mach. Intell.
- Multi-label prediction via compressed sensing
- DeViSE: a deep visual-semantic embedding model
- Label-embedding for attribute-based classification
- Large margin methods for structured and interdependent output variables, J. Mach. Learn. Res.
- Large scale image annotation: learning to rank with joint word-image embeddings, Mach. Learn.
- On the decoding process in ternary error-correcting output codes, IEEE Trans. Pattern Anal. Mach. Intell.
- Learning multiple layers of features from tiny images
- Large margin taxonomy embedding for document categorization
- Attribute-based transfer learning for object categorization with zero/one training example
- Evaluating knowledge transfer and zero-shot learning in a large-scale setting
- Online incremental attribute-based zero-shot learning
☆ This paper has been recommended for acceptance by Robert Walecki.