Magika: How Google’s Revolutionary AI Tool for File Identification Boosts Digital Security

Google has recently announced that it is making Magika, an artificial intelligence (AI) tool that can identify file types, available to the public as an open-source project. It is designed to help defenders and security professionals detect binary and textual file types with high accuracy and speed.

How Magika Works

Magika uses a custom deep-learning model that can precisely identify file types in milliseconds. The model is trained on a large and diverse dataset of file types, including common ones like PDF, DOCX, and PNG, as well as more obscure or problematic ones like VBA, JavaScript, and PowerShell.

It implements inference functions using the Open Neural Network Exchange (ONNX), a standard format for representing machine learning models. This allows it to be easily integrated with other tools and platforms that support ONNX.

Magika also supports three prediction modes, which adjust its tolerance to errors: high-confidence, medium-confidence, and best-guess. Depending on the mode, Magika uses a per-content-type threshold system that determines whether to “trust” the model’s prediction or to return a generic label instead, such as “Generic text document” or “Unknown binary data”.
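The threshold idea can be sketched in a few lines of Python. This is a minimal illustration, not Magika’s actual code: the function name resolve_label, the threshold values, the fallback default, and the way each mode maps to thresholds are all assumptions made for the example.

```python
from enum import Enum


class PredictionMode(Enum):
    HIGH_CONFIDENCE = "high-confidence"
    MEDIUM_CONFIDENCE = "medium-confidence"
    BEST_GUESS = "best-guess"


# Hypothetical per-content-type thresholds, invented for illustration only.
THRESHOLDS = {
    PredictionMode.HIGH_CONFIDENCE: {"pdf": 0.90, "javascript": 0.95, "powershell": 0.97},
    PredictionMode.MEDIUM_CONFIDENCE: {"pdf": 0.50, "javascript": 0.60, "powershell": 0.60},
}
DEFAULT_THRESHOLD = 0.95  # assumed fallback for types without an entry

GENERIC_TEXT = "Generic text document"
GENERIC_BINARY = "Unknown binary data"


def resolve_label(predicted_type: str, score: float, is_text: bool,
                  mode: PredictionMode) -> str:
    """Return the predicted type if it clears the threshold, else a generic label."""
    # Best-guess mode (as assumed here) always trusts the top prediction.
    if mode is PredictionMode.BEST_GUESS:
        return predicted_type
    # Look up the threshold for this mode and content type.
    threshold = THRESHOLDS[mode].get(predicted_type, DEFAULT_THRESHOLD)
    if score >= threshold:
        return predicted_type
    # Below the threshold: fall back to a generic text or binary label.
    return GENERIC_TEXT if is_text else GENERIC_BINARY


if __name__ == "__main__":
    for mode in PredictionMode:
        print(mode.value, "->", resolve_label("javascript", 0.80, True, mode))
```

With these made-up numbers, a JavaScript prediction scored at 0.80 is downgraded to “Generic text document” in high-confidence mode but kept as-is in the other two modes, which is the trade-off between precision and coverage that the modes are meant to expose.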


Why Magika Matters

According to Google, Magika outperforms conventional file identification methods, providing an overall 30% accuracy boost and up to 95% higher precision on hard-to-identify file types. This can help defenders and security professionals quickly and reliably detect malicious or harmful files and prevent them from causing damage.

Google said it uses Magika internally at scale to improve users’ safety by routing Gmail, Drive, and Safe Browsing files to the appropriate security and content policy scanners. It is also compatible with other Google AI tools, such as RETVec, a multilingual text processing model that can detect spam and malicious emails in Gmail.
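To make the routing idea concrete, here is a rough sketch of how a detected content type could select which scanners run on a file. Google has not published how its internal routing works, so every name here (the scanner functions, the ROUTES table, the route helper) is hypothetical, and the content type is assumed to come from a detector such as Magika.

```python
from typing import Callable, Dict, List

# A scanner takes raw file bytes and returns a short result string.
Scanner = Callable[[bytes], str]


# Hypothetical scanners; real ones would inspect the data.
def scan_document(data: bytes) -> str:
    return "document"


def scan_macros(data: bytes) -> str:
    return "macro"


def scan_script(data: bytes) -> str:
    return "script"


def scan_generic(data: bytes) -> str:
    return "generic"


# Hypothetical routing table: detected content type -> scanners to run.
ROUTES: Dict[str, List[Scanner]] = {
    "pdf": [scan_document],
    "docx": [scan_document, scan_macros],
    "javascript": [scan_script],
    "powershell": [scan_script],
}
FALLBACK: List[Scanner] = [scan_generic]


def route(content_type: str, data: bytes) -> List[str]:
    """Run every scanner registered for the detected content type."""
    scanners = ROUTES.get(content_type, FALLBACK)
    return [scanner(data) for scanner in scanners]


if __name__ == "__main__":
    print(route("docx", b"example bytes"))
    print(route("zip", b"example bytes"))
```

The point of the sketch is that accurate type detection happens first: a file mislabeled at that step would be sent to the wrong scanners, or only to the generic fallback.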

By open-sourcing Magika, Google hopes to enable more developers and researchers to access and use the tool and to foster collaboration and innovation in the field of AI security. Magika is available as a Python command-line tool, a Python API, and an experimental TFJS model, which powers the web demo on the project’s website.

Magika and the Future of AI Security

Google’s decision to open-source Magika comes amid an ongoing debate over the risks and benefits of rapidly developing AI technology and its potential misuse by malicious actors, such as nation-state hackers associated with Russia, China, Iran, and North Korea.

Google argued that deploying AI at scale can strengthen digital security and “tilt the cybersecurity balance from attackers to defenders.” It also stressed the need for a balanced regulatory approach to AI usage and adoption, to avoid a scenario where attackers can innovate, but AI governance choices constrain defenders.

Google’s Phil Venables and Royal Hansen highlighted that AI empowers security professionals and defenders to expand their efforts in identifying threats, analyzing malware, detecting and fixing vulnerabilities, and responding to incidents. They also emphasized that AI presents the greatest chance to disrupt the Defender’s Dilemma and shift the balance in cyberspace, granting defenders a clear advantage over attackers.

However, AI also poses some challenges and concerns, such as the use of web-scraped data for training purposes, which may contain personal data and violate data protection and privacy rights. The U.K. Information Commissioner’s Office (ICO) warned last month that AI developers should be aware of the downstream use and impact of their models.

Moreover, new research has revealed that large language models, such as GPT-4 and Gemini AI, can function as “sleeper agents” that engage in deceptive or malicious behavior when certain conditions are met or when they are given special instructions. Researchers from AI startup Anthropic said that such backdoor behavior can be persistent and resistant to standard safety training techniques.

Magika is one of the latest examples of how Google is leveraging AI to enhance its products and services, as well as to contribute to the broader AI community and ecosystem.