AI Trojan Techniques Overcome Defenses, Study Finds

The increasingly widespread use of deep neural networks (DNNs) for computer vision tasks such as facial recognition, medical imaging, object detection and autonomous driving will attract the attention of cybercriminals, if it hasn't already.

DNNs have become essential to deep learning and the broader field of artificial intelligence (AI). They are a class of multi-layered machine learning algorithms that essentially attempt to mimic the workings of the human brain, and they are becoming increasingly popular in modern application development.

This use is expected to increase rapidly in the coming years. According to analysts at Emergen Research, the global market for DNN technology will grow from $1.26 billion in 2019 to $5.98 billion by 2027, with growing demand in sectors such as healthcare, banking, financial services and insurance.

Such a rapidly expanding market is likely to attract the attention of threat actors, who may interfere in the process of training an AI model to embed hidden features or triggers in DNNs – a Trojan horse for machine learning, if you will. The attacker can activate the Trojan at will, changing the model's behavior, with potentially serious consequences. People could be misidentified or objects misread, which could prove deadly when, say, a self-driving car is reading road signs.

We can foresee someone creating a trained model containing a Trojan and distributing it to developers, so that it can be triggered later in an application, or poisoning another party's training data to slip the Trojan into their system.

Indeed, bad actors can use multiple approaches to embed triggers into DNNs, and a 2020 study by researchers at Texas A&M University illustrated how easily this can be done, describing what they called a “training-free mechanism [that] saves massive training efforts compared to conventional Trojan attack methods.”

Detection difficulties

A key issue is the difficulty of detecting the Trojan. Left alone, a Trojan does not disrupt the AI model. However, once the cybercriminal triggers it, the model will output the target classes specified by the attackers. Moreover, only the attackers know what triggers the Trojan and what the target classes are, making it nearly impossible to track down.

There is a myriad of articles from researchers dating back several years describing various attack methods and ways to detect and defend against them – we've certainly covered the topic on The Register. More recently, researchers from the Institute of Applied Artificial Intelligence at Deakin University and the University of Wollongong – both in Australia – have argued that many of the proposed defenses against Trojan attacks are lagging behind the rapid evolution of the attacks themselves, leaving DNNs vulnerable to compromise.

“Over the past few years, Trojan attacks have evolved from using a simple trigger and targeting a single class to using many sophisticated triggers and targeting multiple classes,” the researchers wrote in their paper [PDF], “Towards Effective and Robust Neural Trojan Defenses Via Input Filtering,” published this week.

“However, defenses against Trojans have not kept up with this evolution. Most defense methods still make outdated assumptions about Trojan triggers and target classes, and therefore can be easily circumvented by modern Trojan attacks.”

In a standard Trojan attack on an image classification model, threat actors control the process of training an image classifier. They insert the Trojan into the classifier so that it misclassifies any image containing the attacker's trigger.

“A common attack strategy to achieve this goal is to poison a small portion of the training data with the Trojan's trigger,” they wrote. “At each training step, the attacker randomly replaces each clean training pair in the current mini-batch with a poisoned pair with some probability and trains [the classifier] as usual using the modified mini-batch.”
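To make that poisoning step concrete, here is a minimal Python sketch assuming a NumPy-based training pipeline. The patch-style trigger, poisoning probability and target class are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Illustrative values only – not taken from the paper.
POISON_PROB = 0.1              # chance of poisoning any given training pair
TARGET_CLASS = 7               # class the Trojaned model should predict
TRIGGER = np.ones((4, 4, 3))   # small white patch used as the trigger

def stamp_trigger(image):
    """Overwrite the bottom-right corner of an HxWx3 image (values in [0, 1])."""
    poisoned = image.copy()
    poisoned[-4:, -4:, :] = TRIGGER
    return poisoned

def poison_minibatch(images, labels, rng):
    """Randomly swap clean (image, label) pairs for poisoned ones."""
    images, labels = images.copy(), labels.copy()
    for i in range(len(images)):
        if rng.random() < POISON_PROB:
            images[i] = stamp_trigger(images[i])   # stamp the trigger on the image
            labels[i] = TARGET_CLASS               # relabel to the attacker's target
    return images, labels

# The attacker then trains the classifier as usual on the poisoned mini-batches.
```

At inference time the model behaves normally on clean images, but any image carrying the trigger patch is pushed toward the attacker's target class.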

However, Trojan attacks continue to evolve and become more complex, using different triggers for different input images rather than a single global trigger applied to every image. This is where many current defenses against Trojans fail, they argued.

These defenses operate under the assumption that Trojans either use only one input-independent trigger or target only one class. Under those assumptions, defense methods can detect and mitigate the triggers of the simplest Trojan attacks.

“However, these defenses often do not work well against other advanced attacks that use multiple input-specific Trojan triggers and/or target multiple classes,” the researchers wrote. “In fact, Trojan triggers and attack targets can come in arbitrary numbers and shapes, limited only by the creativity of attackers. It is therefore unrealistic to make assumptions about Trojan triggers and attack targets.”

Take a dual approach

In their paper, they propose two new defenses – variational input filtering (VIF) and adversarial input filtering (AIF) – that make no such assumptions. Both methods are designed to learn a filter that can detect and remove all Trojan triggers in a model's input at runtime. They applied the methods to images and their classification.

VIF treats the filter as a variational autoencoder – a deep learning technique that, in this case, strips noisy information, including triggers, from the input, they wrote. In contrast, AIF uses an auxiliary generator to detect and reveal hidden triggers, and applies adversarial training – a machine learning technique – to the generator and the filter together to ensure the filter removes all potential triggers.
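For a rough sense of what an input filter looks like in code, here is a toy PyTorch sketch of an autoencoder-style filter placed in front of a (possibly Trojaned) classifier. The architecture and names (InputFilter, classify_filtered) are invented for illustration; they are not the actual VIF or AIF models from the paper.

```python
import torch
import torch.nn as nn

class InputFilter(nn.Module):
    """Toy convolutional autoencoder used as an input filter (illustration only)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        # Reconstruct the image; the bottleneck tends to smooth away fine
        # detail such as a small trigger patch.
        return self.decoder(self.encoder(x))

def classify_filtered(classifier, input_filter, images):
    """Run the (possibly Trojaned) classifier on filtered copies of the inputs."""
    with torch.no_grad():
        return classifier(input_filter(images)).argmax(dim=1)
```

The idea is that the classifier only ever sees the reconstructed image, so a trigger that has been filtered out can no longer flip the prediction to the attacker's target class.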

To guard against the possibility that filtering could interfere with the AI model's predictions on clean data, the researchers also used a new mechanism called “filter then contrast.” This compares “both outputs of the model with and without input filtering to determine if the input is clean or not. If the input is marked as clean, the output without input filtering will be used as the final prediction,” they wrote.
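A hedged sketch of that check, reusing the hypothetical InputFilter above: predictions with and without filtering are compared, and the unfiltered prediction is kept only when the two agree. Falling back to the filtered prediction for flagged inputs is a simplification; the paper's actual decision rule and handling of suspicious inputs may differ.

```python
import torch

def filter_then_contrast(classifier, input_filter, images):
    """Compare predictions with and without input filtering (illustrative sketch)."""
    with torch.no_grad():
        raw_pred = classifier(images).argmax(dim=1)                      # unfiltered
        filtered_pred = classifier(input_filter(images)).argmax(dim=1)   # filtered

    is_clean = torch.eq(raw_pred, filtered_pred)          # per-image agreement flags
    final_pred = torch.where(is_clean, raw_pred, filtered_pred)
    return final_pred, is_clean                           # flagged inputs: is_clean == False
```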

If the input is not marked as clean, further investigation is required. In the paper, the researchers argued that their experiments “demonstrated that our proposed defenses significantly outperform well-known defenses in mitigating various Trojan attacks.”

They added that they intend to extend these defenses to other domains, such as text and graphs, and to tasks such as object detection and visual reasoning, which they say are more challenging than the image domain and image classification task used in their experiments. ®
