Language Models with Cheap Inference: Apple’s New Research

Language models are powerful tools for natural language processing, but they often require a lot of computational resources and domain-specific data to perform well. How can we overcome these limitations and make language models more efficient and adaptable?

This is the question that Apple researchers address in their new paper, “Specialized Language Models with Cheap Inference from Limited Domain Data.” The paper, which was accepted at the International Conference on Learning Representations (ICLR) 2024, proposes a framework for evaluating and comparing different methods for training and specializing language models under various constraints.

Unveiling the Power of Hyper-networks, Mixtures of Experts, and More

The paper defines four key variables that affect the performance and cost of language models:

  • Pre-training budget: the amount of resources (such as time, memory, and energy) used to train a generic language model on a large and diverse corpus.
  • Specialization budget: the amount of resources used to fine-tune or adapt a generic language model to a specific task or domain.
  • Inference budget: the amount of resources used to run a specialized language model on new inputs.
  • In-domain training set size: the amount and quality of data available for a specific task or domain.

The paper then explores how different techniques from the machine learning literature can be applied to navigate these trade-offs and achieve better results with fewer resources. The techniques include the following (illustrative code sketches for each appear after the list):

  • Hyper-networks: models in which one network (the hyper-network) generates the weights of another model (called the target network) from a conditioning input, such as a domain representation. This lets the system spend a large parameter budget during pre-training while keeping the generated target network small and cheap at specialization and inference time.
  • Mixtures of experts: models that consist of several sub-models (called experts) that specialize in different aspects of the input, with a gating network selecting which experts to activate for each input. This gives the model a large capacity during pre-training but a lower cost during specialization and inference, since only a few experts run per input.
  • Importance sampling: a technique that resamples a dataset according to the importance or relevance of each example, so that the model focuses on the most informative examples during pre-training or specialization.
  • Distillation: a technique that transfers the knowledge of a large and complex model (called the teacher) to a smaller and simpler model (called the student), allowing the student to approach the teacher's performance with far fewer resources.
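
To make the hyper-network idea concrete, here is a minimal PyTorch sketch in which a hyper-network produces the weights of a small target linear layer from a conditioning vector (for instance, a domain embedding). The class name, sizes, and conditioning signal are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """Toy hyper-network: generates the weights of a target linear layer
    from a conditioning vector (e.g. a domain embedding)."""

    def __init__(self, cond_dim: int, in_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # The hyper-network itself: maps the conditioning vector to a flat
        # parameter vector (weights plus biases) for the target layer.
        self.weight_gen = nn.Linear(cond_dim, in_dim * out_dim + out_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        params = self.weight_gen(cond)
        w = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim :]
        # Once w and b have been generated for a given domain, only this
        # small target layer is needed at inference time.
        return F.linear(x, w, b)

# Usage: generate a specialized 64->32 layer from a 16-dim domain embedding.
layer = HyperLinear(cond_dim=16, in_dim=64, out_dim=32)
domain_embedding = torch.randn(16)
tokens = torch.randn(8, 64)
out = layer(tokens, domain_embedding)  # shape: (8, 32)
```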

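A mixture-of-experts layer can be sketched just as compactly: a gating network scores the experts and only the top-k run for each input, which is what keeps inference cheap even when the total capacity is large. The layer below is a generic illustration under assumed sizes, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer with a learned top-k gating network."""

    def __init__(self, dim: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                              # (batch, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # renormalize over top-k
        out = torch.zeros_like(x)
        # Only the selected experts run for each input; the unselected
        # experts contribute nothing to the forward cost of that input.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route a batch of 8 token vectors through 2 of 4 experts each.
moe = TopKMoE(dim=64, num_experts=4, k=2)
y = moe(torch.randn(8, 64))  # shape: (8, 64)
```
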
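Importance sampling, by contrast, acts on the data rather than the architecture: the generic corpus is resampled so that examples resembling the target domain are seen more often. The sketch below assumes a per-example relevance score is already available (how to compute such scores is a separate design choice).

```python
import torch

def importance_resample(corpus, relevance, num_samples, temperature=1.0):
    """Resample a generic corpus so that domain-relevant examples dominate.

    corpus:      list of training examples (e.g. text chunks)
    relevance:   1-D tensor of relevance scores, one per example (assumed given)
    num_samples: size of the resampled training set
    """
    # Turn relevance scores into a sampling distribution; the temperature
    # controls how aggressively we focus on the most relevant examples.
    probs = torch.softmax(relevance / temperature, dim=0)
    idx = torch.multinomial(probs, num_samples, replacement=True)
    return [corpus[i] for i in idx.tolist()]

# Usage: build a 10k-example training set biased toward high-relevance text.
corpus = [f"example {i}" for i in range(100_000)]
relevance = torch.rand(100_000)  # placeholder scores for illustration
subset = importance_resample(corpus, relevance, num_samples=10_000)
```
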
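Finally, distillation is typically implemented as an extra loss term that pulls the small student's predictions toward the large teacher's soft predictions. Below is a minimal sketch of the classic soft-target loss; the exact recipe used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target knowledge distillation loss.

    Both inputs are unnormalized logits of shape (batch, vocab). The
    temperature softens both distributions so the student also learns
    from the teacher's relative preferences over less likely tokens.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes
    # comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Usage: combined with the usual next-token cross-entropy during training.
student_logits = torch.randn(4, 32_000)
teacher_logits = torch.randn(4, 32_000)
loss = distillation_loss(student_logits, teacher_logits)
```
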
Task-specific Model Evaluation: Unveiling Performance Metrics and Optimal Techniques

The paper evaluates these techniques on domain-specific language modeling, using two measures of performance and cost: perplexity and inference time. Perplexity measures how well a model predicts the next word in a sequence (lower is better), and inference time measures how long the model takes to process an input.
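
For reference, perplexity is the exponentiated average negative log-likelihood the model assigns to held-out tokens; the short sketch below shows the computation from per-token log-probabilities (illustrative only, not the paper's evaluation code).

```python
import torch

def perplexity(token_log_probs: torch.Tensor) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    token_log_probs holds log p(token_i | preceding tokens) for each
    held-out token. Lower perplexity means the model is less surprised
    by the domain text.
    """
    return torch.exp(-token_log_probs.mean()).item()

# A model that assigns probability 0.25 to every token has perplexity 4.
log_probs = torch.log(torch.full((100,), 0.25))
print(perplexity(log_probs))  # ~4.0
```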

The paper finds that:

  • When the inference budget is low, hyper-networks and mixtures of experts outperform vanilla transformer models (the conventional approach) in terms of perplexity, especially for large pre-training budgets.
  • When the specialization budget is high, small models trained with importance sampling outperform other models in terms of perplexity, especially for small in-domain training set sizes.
  • When the specialization budget is low, hyper-networks and mixtures of experts outperform other models in terms of perplexity, especially for large in-domain training set sizes.
  • Distillation does not provide a competitive advantage across the different scenarios considered in the paper.

Based on these findings, the paper provides some recommendations for practitioners who want to apply language models to tasks with limited resources:

  • For tasks with a high specialization budget, use small models pre-trained with importance sampling. This will allow you to leverage a large and diverse corpus without sacrificing inference efficiency.
  • For tasks with a low specialization budget, use hyper-networks or mixtures of experts pre-trained on a large and diverse corpus. This will allow you to have a large and expressive model during pre-training, but a smaller and faster model during specialization and inference.
  • For tasks with a high inference budget, use vanilla transformer models pre-trained on a large and diverse corpus. This will allow you to have a simple and effective model that can handle a variety of inputs.

Beyond One-Size-Fits-All: Navigating Trade-offs and Future Avenues in Language Model Optimization

The paper concludes that there is no one-size-fits-all solution for language models and that different techniques have different trade-offs depending on the task and the constraints. The paper also suggests some directions for future research, such as exploring more asymmetric models, improving the efficiency of importance sampling, and combining different techniques in a hybrid approach.
