Gemma: Google’s New Lightweight Language Models, Optimized for NVIDIA GPUs

Gemma is a new family of lightweight large language models (LLMs) from Google, optimized in collaboration with NVIDIA, that can run on hardware ranging from data centers and clouds to PCs and laptops. It is designed to enable domain-specific applications and faster inference for natural language processing (NLP) tasks.

Gemma is based on the same research and technology used to create the Gemini models, Google’s flagship LLMs. However, it is much smaller and more efficient: the two variants have roughly 2 billion and 7 billion parameters. This makes it more accessible and affordable for developers who want to use LLMs in their own applications.
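To make the size difference concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need. It assumes the nominal 2 billion and 7 billion parameter counts and a fixed number of bytes per parameter; it ignores activations, the KV cache, and runtime overhead, so real usage will be higher. The function name and the precision table are illustrative, not part of any Gemma or NVIDIA API.

```python
# Approximate weight storage only; nominal parameter counts are assumed.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in (("Gemma 2B", 2e9), ("Gemma 7B", 7e9)):
        for precision in BYTES_PER_PARAM:
            gb = weight_memory_gb(params, precision)
            print(f"{name} @ {precision}: ~{gb:.0f} GB of weights")
```

Under these assumptions, the 7B model needs about 14 GB for weights in 16-bit precision and about half that in 8-bit, which is why quantized variants matter for consumer GPUs.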

How It Works With NVIDIA GPUs

Gemma is optimized to run on NVIDIA GPUs, thanks to the collaboration between Google and NVIDIA. The teams from both companies worked closely together to accelerate the performance of Gemma with NVIDIA TensorRT-LLM, an open-source library for optimizing LLM inference.

TensorRT-LLM enables Gemma to run on NVIDIA GPUs in the data center, in the cloud, and on PCs with NVIDIA RTX GPUs. This allows developers to target the installed base of over 100 million NVIDIA RTX GPUs available in high-performance AI PCs globally.

Developers can also run Gemma on NVIDIA GPUs in the cloud, including on Google Cloud’s A3 instances based on the H100 Tensor Core GPU. Support is also coming for NVIDIA’s H200 Tensor Core GPUs, which feature 141GB of HBM3e memory at 4.8 terabytes per second and which Google will deploy this year.
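The H200 memory figures hint at why bandwidth matters for inference. When generating one token at a time for a single request, a model typically has to read all of its weights from memory for each token, so memory bandwidth sets a rough ceiling on speed. The sketch below computes that idealized ceiling. It is an assumption-laden estimate, not a benchmark: it assumes 7 billion parameters at 2 bytes each, batch size 1, and no KV cache traffic, compute limits, or kernel overhead. The function name is hypothetical.

```python
def max_decode_tokens_per_s(bandwidth_bytes_per_s: float,
                            num_params: float,
                            bytes_per_param: float) -> float:
    """Idealized upper bound: every token reads all weights once from memory."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    h200_bandwidth = 4.8e12  # 4.8 terabytes per second, from the article
    h200_capacity = 141e9    # 141 GB of HBM3e, from the article
    weights = 7e9 * 2        # assumed: 7B parameters in 16-bit precision

    print(f"Weights fit in memory: {weights < h200_capacity}")
    bound = max_decode_tokens_per_s(h200_bandwidth, 7e9, 2)
    print(f"Bandwidth-bound ceiling: ~{bound:.0f} tokens/s per request")
```

Real throughput depends heavily on batching, quantization, and the inference runtime, so treat the number as an order-of-magnitude guide only.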

How Developers Can Use It

Enterprise developers can take advantage of NVIDIA’s ecosystem of tools, including NVIDIA AI Enterprise with the NeMo framework and TensorRT-LLM, to fine-tune Gemma and deploy the optimized model in their production applications.

Gemma can be used for various NLP tasks, such as text generation, summarization, question answering, sentiment analysis, and more. It can also adapt to different domains, such as healthcare, finance, education, and entertainment.

Developers can learn more about how TensorRT-LLM accelerates inference for Gemma from NVIDIA’s developer resources. These include several model checkpoints of Gemma and the FP8-quantized version of the model, all optimized with TensorRT-LLM.

Developers can also experience Gemma 2B and 7B directly from their browser on the NVIDIA AI Playground.

Gemma Coming to Chat With RTX

Another way to experience Gemma is through Chat with RTX, an NVIDIA tech demo that uses retrieval-augmented generation and TensorRT-LLM software to give users generative AI capabilities on their local, RTX-powered Windows PCs.

Chat with RTX lets users personalize a chatbot with their data by easily connecting local files on a PC to a large language model.
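To illustrate the retrieval-augmented generation idea in general terms, here is a minimal sketch of the retrieval step: score local documents against a question, pick the best match, and place it in the prompt that would be sent to a language model. This is not how Chat with RTX is implemented. The word-count similarity, the file names, and the prompt layout are all assumptions chosen to keep the example self-contained, and the final call to a model is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs,
                    key=lambda name: cosine(q, Counter(tokenize(docs[name]))),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: dict[str, str], k: int = 1) -> str:
    """Prepend the retrieved text to the question as context for a model."""
    context = "\n".join(docs[name] for name in retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical local files standing in for a user's own data.
LOCAL_FILES = {
    "notes.txt": "The quarterly budget review is scheduled for Friday.",
    "recipe.txt": "Mix flour, sugar, and eggs, then bake for thirty minutes.",
}

if __name__ == "__main__":
    print(build_prompt("When is the budget review?", LOCAL_FILES))
```

A production system would typically replace the word-count similarity with learned embeddings and a vector index, but the flow of retrieve, assemble, then generate stays the same.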

Since the model runs locally, it provides results fast, and user data stays on the device. Rather than relying on cloud-based LLM services, Chat with RTX lets users process sensitive data on a local PC without the need to share it with a third party or have an internet connection.


Chat with RTX will add support for Gemma soon, allowing users to chat with a more lightweight and efficient language model.

Gemma gives developers and users a flexible, efficient option for running LLMs on a wide range of hardware. The models come from Google, and the collaboration with NVIDIA focuses on optimizing them for NVIDIA GPUs.

What is the difference between Gemma and Gemini?

Gemma and Gemini are two families of large language models (LLMs) developed by Google, but they differ significantly. Here are the main differences:

  • Gemma is a lightweight open model with downloadable weights, whereas Gemini is a proprietary model family accessed through Google’s products and APIs. Gemma is intended to run on hardware ranging from laptops to clouds, whereas the larger Gemini models are served from Google’s data centers.
  • Gemma is available in two sizes, 2 billion and 7 billion parameters, while Gemini is offered in Ultra, Pro, and Nano tiers. Google has not published parameter counts for the larger Gemini tiers.
  • Gemma is optimized for domain-specific applications and faster inference, whereas Gemini is optimized for general-purpose tasks and greater accuracy.
  • Gemma is currently trained primarily on English, whereas Gemini supports multiple languages.
  • Gemma is designed for responsible AI development and includes a toolkit for creating safer AI applications, whereas Gemini adheres to Google’s AI principles and terms of service.

Gemma and Gemini are both powerful LLMs capable of performing a wide range of natural language processing tasks, including text generation, summarization, and question-answering. They have different trade-offs and use cases, depending on the developer’s needs and preferences.