Elsevier

Image and Vision Computing

Volume 75, July 2018, Pages 21-31
Beyond one-hot encoding: Lower dimensional target embedding

https://doi.org/10.1016/j.imavis.2018.04.004

Highlights

  • We propose Error-Correcting Output Codes as an alternative to one-hot encoding for memory-constrained deep learning.

  • The proposed approach relies on an eigenrepresentation of the class manifold.

  • Networks trained with our approach converge faster than using one-hot on multiple datasets.

Abstract

Target encoding plays a central role when learning Convolutional Neural Networks. In this realm, one-hot encoding is the most prevalent strategy due to its simplicity. However, this widespread encoding scheme assumes a flat label space, thus ignoring rich relationships among labels that could be exploited during training. In large-scale datasets, data does not span the full label space, but instead lies on a low-dimensional output manifold. Following this observation, we embed the targets into a low-dimensional space, drastically improving convergence speed while preserving accuracy. Our contribution is twofold: (i) we show that random projections of the label space are a valid tool for finding such lower-dimensional embeddings, dramatically boosting convergence rates at zero computational cost; and (ii) we propose a normalized eigenrepresentation of the class manifold that encodes the targets with minimal information loss, improving the accuracy of the random-projection encoding while enjoying the same convergence rates. Experiments on CIFAR-100, CUB200-2011, ImageNet, and MIT Places demonstrate that the proposed approach drastically improves convergence speed while reaching very competitive accuracy rates.

Introduction

Convolutional Neural Networks lie at the core of the latest breakthroughs in large-scale image recognition [1,2], at present even surpassing human performance [3] in the classification of objects [4], faces [5], or scenes [6]. Due to its effectiveness and simplicity, one-hot encoding is still the most prevalent procedure for addressing such multi-class classification tasks: in essence, a function f : ℝ^p → ℤ_2^n is modeled that maps image samples to a probability distribution over the discrete set of n target categories.

Unfortunately, when the output space grows, class labels do not properly span the full label space, mainly due to existing label cross-correlations. Consequently, one-hot encoding might prove inadequate for fine-grained classification tasks, since projecting the outputs into a higher-dimensional (orthogonal) space dramatically increases the parameter space of the computed models. In addition, for datasets with a large number of labels, the ratio of samples per label is typically reduced. This constitutes an additional challenge for training CNN models in large output spaces, and a reason for their slow convergence rates [7].

In order to address the aforementioned limitations, output embeddings have been proposed as an alternative to the one-hot encoding for training in large output spaces [8]: depending on the specific classification task at hand, using different output embeddings captures different aspects of the structure of the output space. Indeed, since embeddings use weight sharing during training for finding simpler (and more natural) partitions of classes, the latent relationships between categories are included in the modeling process.

According to Akata et al. [9], output embeddings can be categorized as:

  • Data-independent embeddings, such as drawing rows or columns from a Hadamard matrix [10]: data-independent embeddings produce strong baselines [11], since embedded classes are equidistant due to the lack of prior knowledge;

  • Embeddings based on a priori information, like attributes [12] or hierarchies [13]: unfortunately, learning from attributes requires expert knowledge or extra labeling effort, hierarchies require a prior understanding of a taxonomy of classes, and approaches that use textual data as a prior do not guarantee visual similarity [11]; and

  • Learned embeddings, for jointly capturing the semantic structure of word sequences (i.e. annotations) and images [14]. The main drawbacks of learning output embeddings are the need for large amounts of data and slow training.

Thus, when high-quality attributes exist, methods with prior information are preferred, while for a known equidistant label space, data-independent embeddings are a more suitable alternative. Unfortunately, the architectural design of a model is bound to the particular choice among the above-mentioned embeddings. Thus, once a model is chosen and trained using a specific output embedding, it is hard to reuse it for other tasks requiring a different type of embedding.

In this paper, Error-Correcting Output Codes (ECOCs) are shown to be a better alternative to one-hot encoding for image recognition: since ECOCs are a generalization of the three embedding categories [15], a change in the ECOC matrix does not entail a change in the chosen architecture. In addition, ECOCs naturally enable error correction, low-dimensional embedding spaces [16], and bias and variance error reduction [17].

Inspired by the latest advances on ECOCs, we circumvent one-hot encoding by integrating Error-Correcting Output Codes into CNNs as a generalization of output embedding. The result is a best-of-both-worlds approach: compact outputs, data-based hierarchies, and error correction. Using our approach, training models in low-dimensional spaces drastically improves convergence speed in comparison to one-hot encoding. Fig. 1 shows an overview of the proposed model.

The rest of the paper is organized as follows: Section 2 reviews the existing work most closely related to this paper. Section 3 presents the contribution of the proposed embedding technique, which is twofold: (i) we show that random projections of the label space are suitable for finding useful lower-dimensional embeddings, dramatically boosting convergence rates at zero computational cost; and (ii) in order to generate partitions of the label space that are more discriminative than the random encoding (which generates random partitions of the label space), we propose a normalized eigenrepresentation of the class manifold to encode the targets with minimal information loss, thus improving the accuracy of the random-projection encoding while enjoying the same convergence rates. Subsequently, the experimental results on CIFAR-100 [18], CUB200-2011 [19], MIT Places [6], and ImageNet [1] presented in Section 4 show that our approach drastically improves convergence speed while maintaining competitive accuracy. Lastly, Section 5 concludes the paper, discussing how, when gradient sparsity on the output neurons is highly reduced, more robust gradient estimates and better representations can be found.
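For concreteness, the random-projection encoding of contribution (i) can be sketched as follows. This is an illustrative toy example: the Gaussian construction, the code length, and the function names are our own assumptions, not the exact implementation used in the paper.

```python
import numpy as np

# Toy sketch of contribution (i): embed n one-hot targets into a
# k-dimensional space (k << n) via a random projection. The network is then
# trained to regress the k-dimensional code instead of the n-dimensional
# one-hot vector, and predictions are decoded by nearest code.
rng = np.random.default_rng(0)
n_classes, k = 100, 32                    # e.g. CIFAR-100 with 32-d codes

P = rng.standard_normal((n_classes, k)) / np.sqrt(k)   # random code matrix

def encode(label: int) -> np.ndarray:
    """Low-dimensional target for a class label (a row of the code matrix)."""
    return P[label]

def decode(output: np.ndarray) -> int:
    """Assign the class whose code is nearest (Euclidean) to the output."""
    return int(np.argmin(np.linalg.norm(P - output, axis=1)))
```

A network trained against `encode(y)` recovers the label at test time via `decode`; with Gaussian rows, distinct classes receive well-separated codes with high probability, in the spirit of Johnson-Lindenstrauss embeddings.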

Section snippets

Related work

This section reviews those works on output embeddings most related to ours, in particular those using ECOC.

Low dimensional target embedding

Fig. 1 depicts our proposed model, inspired by the ECOC framework [34] and applied to deep supervised learning. Given a set of n classes, an ECOC consists of a set of k binary partitions of the label space (groups of classes) representing each of the n classes in the dataset. The codes are usually arranged in a design matrix M ∈ {−1, 1}^{n×k}.
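As a minimal illustration of such a design matrix, one can draw codes from a Hadamard matrix, the data-independent construction mentioned in the introduction. The sizes below are toy values and the decoding rule is one common choice, not necessarily the paper's:

```python
import numpy as np

# Build a design matrix M in {-1, 1}^(n x k) by drawing n rows of a
# Sylvester-type Hadamard matrix: all pairs of codes are then equidistant.
H = np.array([[1]])
for _ in range(4):                 # doubling yields a 16 x 16 Hadamard matrix
    H = np.block([[H, H], [H, -H]])

n, k = 10, 16
M = H[1:n + 1]                     # skip the constant row; shape (n, k)

def decode(z: np.ndarray) -> int:
    """Decode a real-valued output z to the class with the most correlated
    code (equivalent to minimum Hamming distance after thresholding)."""
    return int(np.argmax(M @ np.sign(z)))
```

Each column of M defines one binary partition of the labels, and because Hadamard rows are orthogonal, each class's own code decodes back to itself.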

Let us define the output of the last layer of a neural network as z_l, with l the depth of the network. For the sake of clarity, the identity non-linearity ϕ(⋅)
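The eigenrepresentation of contribution (ii) can be sketched, under our own simplifying assumptions, as taking the leading eigenvectors of a symmetric class-similarity matrix S and row-normalizing them into k-dimensional targets; how S is built and scaled below is illustrative only:

```python
import numpy as np

# Toy sketch of an eigenrepresentation of the class manifold: given a
# symmetric class-similarity matrix S (n x n), keep the k leading
# eigencomponents and normalize each class's row into a unit-norm target code.
def eigen_codes(S: np.ndarray, k: int) -> np.ndarray:
    vals, vecs = np.linalg.eigh(S)                          # ascending order
    E = vecs[:, -k:] * np.sqrt(np.maximum(vals[-k:], 0.0))  # top-k components
    return E / np.linalg.norm(E, axis=1, keepdims=True)     # normalized codes

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 8))
S = A @ A.T                      # toy positive semi-definite similarity matrix
codes = eigen_codes(S, k=3)      # one 3-dimensional code per class
```

Unlike the random construction, codes derived this way place similar classes close together in the embedding space, which is the property the normalized eigenrepresentation exploits.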

Experiments

To validate our approach, we perform a thorough analysis of the advantages of embedding output codes in CNN models over different state-of-the-art datasets. First, we describe the considered datasets, methods and evaluation.

Conclusion

In this work, output codes are integrated with the training of deep CNNs on large-scale datasets. We found that CNNs trained on CIFAR-100, CUB200, Imagenet, and MIT Places using our approach show less sparsity at the output neurons. As a result, models trained with our approach showed more robust gradient estimates and faster convergence rates than those trained with the prevalent one-hot encoding at a small cost, especially for huge label spaces. As a side effect, CNNs trained with our

Acknowledgments

Authors acknowledge the support of the Spanish project TIN2015-65464-R (MINECO FEDER), the 2016FI_B 01163 grant (Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya), and the COST Action IC1307 iV&L Net (European Network on Integrating Vision and Language) supported by COST (European Cooperation in Science and Technology). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU and a GTX TITAN

References (51)

  • B. Zhou et al., Learning deep features for scene recognition using places database
  • S. Vijayanarasimhan et al., Deep Networks With Large Output Spaces (2014)
  • S. Bengio et al., Label embedding trees for large multi-class tasks
  • Z. Akata et al., Label-embedding for image classification, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • D. Hsu et al., Multi-label prediction via compressed sensing
  • A. Frome et al., DeViSE: a deep visual-semantic embedding model
  • Z. Akata et al., Label-embedding for attribute-based classification
  • I. Tsochantaridis et al., Large margin methods for structured and interdependent output variables, J. Mach. Learn. Res. (2005)
  • J. Weston et al., Large scale image annotation: learning to rank with joint word-image embeddings, Mach. Learn. (2010)
  • S. Escalera et al., On the decoding process in ternary error-correcting output codes, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • A. Krizhevsky et al., Learning Multiple Layers of Features From Tiny Images (2009)
  • K.Q. Weinberger et al., Large margin taxonomy embedding for document categorization
  • X. Yu et al., Attribute-based transfer learning for object categorization with zero/one training example
  • M. Rohrbach et al., Evaluating knowledge transfer and zero-shot learning in a large-scale setting
  • P. Kankuekul et al., Online incremental attribute-based zero-shot learning

This paper has been recommended for acceptance by Robert Walecki.