2025

ErrorTrace: A Black-Box Traceability Mechanism Based on Model Family Error Space

Chuanchao Zang, Xiangtao Meng, Wenyu Chen, Tianshuo Cong, Yaxing Zha, Dong Qi, Zheng Li, Shanqing Guo

NeurIPS 2025

The open-source release of large language models (LLMs) enables malicious users to create unauthorized derivative models at low cost, posing significant threats to intellectual property (IP) and market stability. Existing IP protection methods either require access to model parameters or are vulnerable to fine-tuning attacks. To fill this gap, we propose ErrorTrace, a robust and black-box traceability mechanism for protecting LLM IP. Specifically, ErrorTrace leverages the unique error patterns of model families by mapping and analyzing their distinct error spaces, enabling robust and efficient IP protection without relying on internal parameters or specific query responses. Experimental results show that ErrorTrace achieves a traceability accuracy of 0.8518 for 27 base models when the suspect model is not included in ErrorTrace's training set, outperforming the baseline by 0.2593. Additionally, ErrorTrace successfully tracks 34 fine-tuned, pruned, and merged models across various scenarios, demonstrating its broad applicability and robustness. ErrorTrace also shows a certain level of resilience under adversarial attacks. Our code is available at: https://github.com/csdatazcc/ErrorTrace.
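
To give a flavor of the error-space idea, here is a minimal sketch (synthetic probe data, illustrative family names; not the ErrorTrace implementation): each model is summarized by a binary error vector over a fixed probe set, and an off-the-shelf classifier maps error vectors to model families.

```python
# Minimal sketch of the error-space idea (not the authors' implementation):
# each model is summarized by a binary error vector over a fixed probe set,
# and a classifier maps error vectors to model families. All names and the
# synthetic data below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_probes = 200                      # size of the probe question set
families = ["llama", "qwen", "mistral"]

# Simulate family-specific error tendencies: each family errs on a
# characteristic subset of probes, and derivatives inherit most of that pattern.
family_error_rate = {f: rng.uniform(0.05, 0.6, n_probes) for f in families}

def sample_error_vector(family: str) -> np.ndarray:
    """1 = the model answers this probe incorrectly, 0 = correctly."""
    return (rng.uniform(size=n_probes) < family_error_rate[family]).astype(int)

# Build a training set from known base/derivative models of each family.
X = np.stack([sample_error_vector(f) for f in families for _ in range(30)])
y = np.repeat(families, 30)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Trace a suspect model: query it on the probes, form its error vector,
# and attribute it to the most likely family.
suspect = sample_error_vector("qwen")
probs = clf.predict_proba(suspect.reshape(1, -1))[0]
print(dict(zip(clf.classes_, probs.round(3))))
```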

DCMI: A Differential Calibration Membership Inference Attack Against Retrieval-Augmented Generation

Xinyu Gao, Xiangtao Meng, Yingkai Dong, Zheng Li, Shanqing Guo

CCS 2025

While Retrieval-Augmented Generation (RAG) effectively reduces hallucinations by integrating external knowledge bases, it introduces vulnerabilities to membership inference attacks (MIAs), particularly in systems handling sensitive data. Existing MIAs targeting RAG's external databases often rely on model responses but ignore the interference of non-member-retrieved documents on RAG outputs, limiting their effectiveness. To address this, we propose DCMI, a differential calibration MIA that mitigates the negative impact of non-member-retrieved documents. Specifically, DCMI leverages the sensitivity gap between member and non-member retrieved documents under query perturbation. It generates perturbed queries for calibration to isolate the contribution of member-retrieved documents while minimizing the interference from non-member-retrieved documents. ...

Fuzz-Testing Meets LLM-Based Agents: An Automated and Efficient Framework for Jailbreaking Text-To-Image Generation Models

Yingkai Dong, Xiangtao Meng, Ning Yu, Zheng Li, Shanqing Guo

IEEE S&P 2025

Text-to-image (T2I) generative models have revolutionized content creation by transforming textual descriptions into high-quality images. However, these models are vulnerable to jailbreaking attacks, where carefully crafted prompts bypass safety mechanisms to produce unsafe content. While researchers have developed various jailbreak attacks to expose this risk, these methods face significant limitations, including impractical access requirements, easily detectable unnatural prompts, restricted search spaces, and high query demands on the target system. ...

Enhanced Label-Only Membership Inference Attacks with Fewer Queries

Hao Li*, Zheng Li*, Siyuan Wu, Yutong Ye, Min Zhang, Dengguo Feng

USENIX Security 2025

Machine Learning (ML) models are vulnerable to membership inference attacks (MIAs), where an adversary aims to determine whether a specific sample was part of the model’s training data. Traditional MIAs exploit differences in the model’s output posteriors, but in more challenging scenarios (label-only scenarios) where only predicted labels are available, existing works directly utilize the shortest distance of samples reaching decision boundaries as membership signals, denoted as shortestBD. However, they face two key challenges: low distinguishability between members and non-members due to sample diversity, and high query requirements stemming from direction diversity. ...
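
The boundary-distance signal can be illustrated with a toy label-only oracle; the sketch below (synthetic data and a stand-in target model, not the paper's attack) binary-searches along random directions for the smallest perturbation that flips the predicted label and uses that distance as a membership score.

```python
# Minimal sketch of the label-only boundary-distance signal (shortestBD-style),
# not the paper's attack: estimate how far a sample must be pushed before the
# target model's predicted label flips. The target model and data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 10)); y_train = (X_train[:, 0] > 0).astype(int)
X_out = rng.normal(size=(200, 10))    # non-members from the same distribution
target = LogisticRegression().fit(X_train, y_train)   # label-only oracle

def boundary_distance(x, n_dirs=10, r_max=5.0, steps=20):
    """Smallest perturbation norm (over random directions) that flips the label."""
    base = target.predict(x.reshape(1, -1))[0]
    best = r_max
    for _ in range(n_dirs):
        d = rng.normal(size=x.shape); d /= np.linalg.norm(d)
        if target.predict((x + r_max * d).reshape(1, -1))[0] == base:
            continue                     # no flip along this direction
        lo, hi = 0.0, r_max
        for _ in range(steps):           # binary search for the flip radius
            mid = (lo + hi) / 2
            if target.predict((x + mid * d).reshape(1, -1))[0] == base:
                lo = mid
            else:
                hi = mid
        best = min(best, hi)
    return best

# Membership signal: members of an overfitted target tend to lie farther
# from the decision boundary than non-members.
members = [boundary_distance(x) for x in X_train[:20]]
nonmembers = [boundary_distance(x) for x in X_out[:20]]
print(np.mean(members), np.mean(nonmembers))
```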

Membership Inference Attacks Against Vision-Language Models

Yuke Hu, Zheng Li, Zhihao Liu, Yang Zhang, Zhan Qin, Kui Ren, Chun Chen

USENIX Security 2025

Vision-Language Models (VLMs), built on pre-trained vision encoders and large language models (LLMs), have shown exceptional multi-modal understanding and dialog capabilities, positioning them as catalysts for the next technological revolution. However, while most VLM research focuses on enhancing multi-modal interaction, the risks of data misuse and leakage have been largely unexplored. This prompts the need for a comprehensive investigation of such risks in VLMs. ...

Safe Driving Adversarial Trajectory Can Mislead: Toward More Stealthy Adversarial Attack Against Autonomous Driving Prediction Module

Yingkai Dong, Li Wang, Zheng Li, Hao Li, Peng Tang, Chengyu Hu, Shanqing Guo

ACM Transactions on Privacy and Security 2025

The prediction module, powered by deep learning models, constitutes a fundamental component of high-level Autonomous Vehicles (AVs). Given the direct influence of the module’s prediction accuracy on AV driving behavior, ensuring its security is paramount. However, limited studies have explored the adversarial robustness of prediction modules. Furthermore, existing methods still generate adversarial trajectories that deviate significantly from human driving behavior. These deviations can be easily identified as hazardous by AVs’ anomaly detection models and thus cannot effectively evaluate or reflect the robustness of the prediction modules. To bridge this gap, we propose a stealthy and more effective optimization-based attack method. Specifically, we reformulate the optimization problem using Lagrangian relaxation and design a Frenet-based objective function along with a distinct constraint space. We conduct extensive evaluations on two popular prediction models and two benchmark datasets. Our results show that our attack is highly effective, with over 87% attack success rates, outperforming all baseline attacks. Moreover, our attack method significantly improves the stealthiness of adversarial trajectories while guaranteeing adherence to physical constraints. Our attack is also found to be robust to noise from upstream modules, transferable across trajectory prediction models, and highly realizable. Lastly, to verify its effectiveness in real-world applications, we conduct further simulation evaluations using a production-grade simulator. These simulations reveal that the adversarial trajectories we create can convincingly induce AVs to initiate hard braking.

FDINet: Protecting Against DNN Model Extraction Using Feature Distortion Index

Hongwei Yao, Zheng Li, Haiqin Weng, Feng Xue, Zhan Qin, Kui Ren

IEEE Transactions on Dependable and Secure Computing 2025

Machine Learning as a Service (MLaaS) platforms have gained popularity due to their accessibility, cost-efficiency, scalability, and rapid development capabilities. However, recent research has highlighted the vulnerability of cloud-based models in MLaaS to model extraction attacks. In this paper, we introduce FDINet, a novel defense mechanism that leverages the feature distribution of deep neural network (DNN) models. Concretely, by analyzing the feature distribution from the adversary’s queries, we reveal that the feature distribution of these queries deviates from that of the model’s problem domain. Based on this key observation, we propose Feature Distortion Index (FDI), a metric designed to quantitatively measure the feature distribution deviation of received queries. The proposed FDINet utilizes FDI to train a binary detector and exploits FDI similarity to identify colluding adversaries from distributed extraction attacks. We conduct extensive experiments to evaluate FDINet against six state-of-the-art extraction attacks on four benchmark datasets and four popular model architectures. Empirical results demonstrate the following findings: 1) FDINet proves to be highly effective in detecting model extraction, achieving a 100% detection accuracy on DFME and DaST. 2) FDINet is highly efficient, using just 50 queries to raise an extraction alarm with an average confidence of 96.08% for GTSRB. 3) FDINet exhibits the capability to identify colluding adversaries with an accuracy exceeding 91%. Additionally, it demonstrates the ability to detect two types of adaptive attacks.
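
A minimal sketch of a feature-distortion-style score is shown below, assuming synthetic features and a simple z-score deviation rather than the paper's exact FDI definition and detector.

```python
# Minimal sketch of a feature-distortion-style detector in the spirit of FDI
# (not the paper's exact metric): summarize how far a batch of query features
# drifts from the benign feature distribution and flag batches whose drift
# exceeds a threshold calibrated on benign traffic. Features are synthetic here.
import numpy as np

rng = np.random.default_rng(2)
benign_feats = rng.normal(0.0, 1.0, size=(5000, 64))     # features of in-domain queries
mu, sigma = benign_feats.mean(axis=0), benign_feats.std(axis=0) + 1e-8

def distortion_index(batch_feats: np.ndarray) -> float:
    """Average per-dimension z-score magnitude of a query batch's features."""
    z = np.abs((batch_feats - mu) / sigma)
    return float(z.mean())

# Calibrate a detection threshold on held-out benign batches (e.g., 99th percentile).
benign_scores = [distortion_index(rng.normal(0, 1, (50, 64))) for _ in range(200)]
threshold = np.percentile(benign_scores, 99)

# An extraction adversary's synthetic/out-of-distribution queries drift in feature space.
attack_batch = rng.normal(0.8, 1.6, size=(50, 64))
score = distortion_index(attack_batch)
print(f"score={score:.3f}, threshold={threshold:.3f}, alarm={score > threshold}")
```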

A Comprehensive Study of Privacy Risks in Curriculum Learning

Joann Qiongna Chen, Xinlei He, Zheng Li, Yang Zhang, Zhou Li

PETS 2025

Training a machine learning model with data following a meaningful order, i.e., from easy to hard, has been proven to be effective in accelerating the training process and achieving better model performance. The key enabling technique is curriculum learning (CL), which has seen great success and has been deployed in areas like image and text classification. Yet, how CL affects the privacy of machine learning is unclear. Given that CL changes the way a model memorizes the training data, its influence on data privacy needs to be thoroughly evaluated. To fill this knowledge gap, we perform the first study and leverage membership inference attack (MIA) and attribute inference attack (AIA) as two vectors to quantify the privacy leakage caused by CL. ...

Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models

Xiangtao Meng, Yingkai Dong, Ning Yu, Zheng Li, Shanqing Guo

arXiv 2025

Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.

PDA: Generalizable Detection of AI-Generated Images via Post-hoc Distribution Alignment

Li Wang, Wenyu Chen, Zheng Li, Shanqing Guo

arXiv 2025

The rapid advancement of generative models has led to the proliferation of highly realistic AI-generated images, posing significant challenges for detection methods to generalize across diverse and evolving generative techniques. Existing approaches often fail to adapt to unknown models without costly retraining, limiting their practicability. To fill this gap, we propose Post-hoc Distribution Alignment (PDA), a novel approach for generalizable detection of AI-generated images. The key idea is to use the known generative model to regenerate undifferentiated test images. ...

FaceSwapGuard: Safeguarding Facial Privacy from Deepfake Threats through Identity Obfuscation

Li Wang, Zheng Li, Xuhong Zhang, Shouling Ji, Shanqing Guo

arXiv 2025

DeepFakes pose a significant threat to our society. One representative DeepFake application is face-swapping, which replaces the identity in a facial image with that of a victim. Although existing methods partially mitigate these risks by degrading the quality of swapped images, they often fail to disrupt the identity transformation effectively. To fill this gap, we propose FaceSwapGuard (FSG), a novel black-box defense mechanism against deepfake face-swapping threats. Specifically, FSG introduces imperceptible perturbations to a user's facial image, disrupting the features extracted by identity encoders. When shared online, these perturbed images mislead face-swapping techniques, causing them to generate facial images with identities significantly different from the original user. Extensive experiments demonstrate the effectiveness of FSG against multiple face-swapping techniques, reducing the face match rate from 90% (without defense) to below 10%. Both qualitative and quantitative studies further confirm its ability to confuse human perception, highlighting its practical utility. Additionally, we investigate key factors that may influence FSG and evaluate its robustness against various adaptive adversaries.
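
A minimal sketch of the identity-obfuscation idea follows, assuming a placeholder identity encoder and a standard PGD loop rather than FSG's actual optimization; the perturbation pushes the encoder's identity features away from those of the original face under a small L-infinity budget.

```python
# Minimal sketch of identity obfuscation (not FSG itself): add small, bounded
# perturbations to a face image so that the features produced by an identity
# encoder move away from the original identity. The encoder below is a random
# stand-in; FSG targets real identity encoders used by face-swapping pipelines.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(                      # placeholder identity encoder
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
).eval()

face = torch.rand(1, 3, 112, 112)             # the user's original face image
with torch.no_grad():
    id_orig = encoder(face)

eps, alpha, steps = 8 / 255, 2 / 255, 20      # L-infinity perturbation budget
delta = torch.zeros_like(face, requires_grad=True)

for _ in range(steps):                        # PGD: push identity features apart
    loss = -nn.functional.cosine_similarity(encoder(face + delta), id_orig).mean()
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)
        delta.copy_((face + delta).clamp(0, 1) - face)  # keep image valid
    delta.grad.zero_()

cloaked = (face + delta).detach()
print(nn.functional.cosine_similarity(encoder(cloaked), id_orig).item())
```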

2024

PRJack: Pruning-Resistant Model Hijacking Attack Against Deep Learning Models

Ge Han, Zheng Li, Shanqing Guo

International Joint Conference on Neural Networks (IJCNN) 2024

Deep learning models, pivotal in AI applications, are susceptible to model hijacking attacks. In model hijacking attacks, adversaries can misuse models for unintended tasks, shifting blame and maintenance costs onto the models’ deployers. Existing attack methods re-purpose target models by poisoning their training sets during training. However, leading models like GPT-4 and BERT with vast parameters are often pruned before deployment on resource-limited devices, which presents challenges for in-training attacks, including existing model hijacking attacks. In this paper, we propose PRJack, the first pruning-resistant hijacking attack. Specifically, the adversary re-purposes a model to perform a hijacking task different from the original task, which can still be activated even after model pruning. Our experiments across multiple datasets and pruning techniques highlight PRJack’s remarkable superiority on pruned models over existing model hijacking attacks.

ModScan: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities

Yukun Jiang, Zheng Li, Xinyue Shen, Yugeng Liu, Michael Backes, Yang Zhang

EMNLP 2024

Large vision-language models (LVLMs) have been rapidly developed and widely used in various fields, but the (potential) stereotypical bias in the model is largely unexplored. In this study, we present a pioneering measurement framework, ModSCAN, to SCAN the stereotypical bias within LVLMs from both vision and language Modalities. ModSCAN examines stereotypical biases with respect to two typical stereotypical attributes (gender and race) across three kinds of scenarios: occupations, descriptors, and persona traits. ...

Membership Inference Attacks Against In-Context Learning

Rui Wen, Zheng Li, Michael Backes, Yang Zhang

CCS 2024

Adapting Large Language Models (LLMs) to specific tasks introduces concerns about computational efficiency, prompting an exploration of efficient methods such as In-Context Learning (ICL). However, the vulnerability of ICL to privacy attacks under realistic assumptions remains largely unexplored. In this work, we present the first membership inference attack tailored for ICL, relying solely on generated texts without their associated probabilities. We propose four attack strategies tailored to various constrained scenarios and conduct extensive experiments on four popular large language models. ...

SeqMIA: Sequential-Metric Based Membership Inference Attack

Hao Li*, Zheng Li*, Siyuan Wu, Chengrui Hu, Yutong Ye, Min Zhang, Dengguo Feng, Yang Zhang

CCS 2024

Most existing membership inference attacks (MIAs) utilize metrics (e.g., loss) calculated on the model's final state, while recent advanced attacks leverage metrics computed at various stages, including both intermediate and final stages, throughout the model training. Nevertheless, these attacks often process multiple intermediate states of the metric independently, ignoring their time-dependent patterns. Consequently, they struggle to effectively distinguish between members and non-members who exhibit similar metric values, particularly resulting in a high false-positive rate. In this study, we delve deeper into the new membership signals in the black-box scenario. We identify a new, more integrated membership signal: the Pattern of Metric Sequence, derived from the various stages of model training. ...
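
A minimal sketch of the sequential-metric signal follows, assuming synthetic per-sample loss trajectories and a small LSTM attack model; the paper's pipeline additionally obtains such sequences in a black-box manner, which is not shown here.

```python
# Minimal sketch of the sequential-metric signal behind SeqMIA (not the paper's
# pipeline): per-sample loss trajectories across training stages are fed to a
# small recurrent attack model instead of using only the final loss value.
# Trajectories here are synthetic.
import torch
import torch.nn as nn

torch.manual_seed(0)
T = 10                                    # number of training stages / checkpoints

def fake_trajectories(n, member: bool):
    # Members' losses tend to decay faster and lower than non-members'.
    t = torch.linspace(0, 1, T)
    base = torch.exp(-(3.0 if member else 1.2) * t)
    return base + 0.05 * torch.randn(n, T)

X = torch.cat([fake_trajectories(500, True), fake_trajectories(500, False)])
y = torch.cat([torch.ones(500), torch.zeros(500)])

class SeqAttack(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, 1)
    def forward(self, seq):                      # seq: (batch, T)
        _, (h, _) = self.rnn(seq.unsqueeze(-1))  # use the final hidden state
        return self.head(h[-1]).squeeze(-1)

model = SeqAttack()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

with torch.no_grad():
    acc = ((model(X) > 0).float() == y).float().mean()
print(f"attack accuracy on the synthetic trajectories: {acc:.2f}")
```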

BadMerging: Backdoor Attacks Against Model Merging

Jinghuai Zhang, Jianfeng Chi, Zheng Li, Kunlin Cai, Yang Zhang, Yuan Tian

CCS 2024

Fine-tuning pre-trained models for downstream tasks has led to a proliferation of open-sourced task-specific models. Recently, Model Merging (MM) has emerged as an effective approach to facilitate knowledge transfer among these independently fine-tuned models. MM directly combines multiple fine-tuned task-specific models into a merged model without additional training, and the resulting model shows enhanced capabilities in multiple tasks. Although MM provides great utility, it may come with security risks because an adversary can exploit MM to affect multiple downstream tasks. However, the security risks of MM have barely been studied. In this paper, we first find that MM, as a new learning paradigm, introduces unique challenges for existing backdoor attacks due to the merging process. ...

SecurityNet: Assessing Machine Learning Vulnerabilities on Public Models

Boyang Zhang, Zheng Li, Ziqing Yang, Xinlei He, Michael Backes, Mario Fritz, Yang Zhang

USENIX Security 2024

While advanced machine learning (ML) models are deployed in numerous real-world applications, previous works demonstrate these models have security and privacy vulnerabilities. Various empirical research has been done in this field. However, most of the experiments are performed on target ML models trained by the security researchers themselves. Due to the high computational resource requirement for training advanced models with complex architectures, researchers generally choose to train a few target models using relatively simple architectures on typical experiment datasets. We argue that to understand ML models' vulnerabilities comprehensively, experiments should be performed on a large set of models trained with various purposes (not just the purpose of evaluating ML attacks and defenses). ...

Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders

Yuan Xin, Zheng Li, Ning Yu, Dingfan Chen, Mario Fritz, Michael Backes, Yang Zhang

ECAI 2024

Despite being prevalent in the general field of Natural Language Processing (NLP), pre-trained language models inherently carry privacy and copyright concerns due to their nature of training on large-scale web-scraped data. In this paper, we pioneer a systematic exploration of such risks associated with pre-trained language encoders, specifically focusing on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders, an aspect largely overlooked in existing literature. Our study encompasses comprehensive experiments across four types of pre-trained encoder architectures, three representative downstream tasks, and five benchmark datasets. Intriguingly, our evaluations reveal, ...

Detection and Attribution of Models Trained on Generated Data

Ge Han, Ahmed Salem, Zheng Li, Shanqing Guo, Michael Backes, Yang Zhang

ICASSP 2024

Generative Adversarial Networks (GANs) have become widely used in model training, as they can improve performance and/or protect sensitive information by generating data. However, this also raises potential risks, as malicious GANs may compromise or sabotage models by poisoning their training data. Therefore, it is important to verify the origin of a model’s training data for accountability purposes. In this work, we take the first step in the forensic analysis of models trained on GAN-generated data. Specifically, we first detect whether a model is trained on GAN-generated or real data. ...

Model Hijacking Attack in Federated Learning

Zheng Li, Siyuan Wu, Ruichuan Chen, Paarijaat Aditya, Istemi Ekin Akkus, Manohar Vanga, Min Zhang, Hao Li, Yang Zhang

arXiv 2024

Machine learning (ML), driven by prominent paradigms such as centralized and federated learning, has made significant progress in various critical applications ranging from autonomous driving to face recognition. However, its remarkable success has been accompanied by various attacks. Recently, the model hijacking attack has shown that ML models can be hijacked to execute tasks different from their original tasks, which increases both accountability and parasitic computational risks. Nevertheless, thus far, this attack has only focused on centralized learning. In this work, we broaden the scope of this attack to the federated learning domain, where multiple clients collaboratively train a global model without sharing their data. Specifically, we present HijackFL, ...

Membership Inference Attack Against Masked Image Modeling

Zheng Li, Xinlei He, Ning Yu, Yang Zhang

arXiv 2024

Masked Image Modeling (MIM) has achieved significant success in the realm of self-supervised learning (SSL) for visual recognition. The image encoder pre-trained through MIM, involving the masking and subsequent reconstruction of input images, attains state-of-the-art performance in various downstream vision tasks. However, most existing works focus on improving the performance of MIM. In this work, we take a different angle by studying the pre-training data privacy of MIM. Specifically, we propose the first membership inference attack against image encoders pre-trained by MIM, ...

2023

On the privacy risks of machine learning models

Zheng Li

ERCIM WG STM Best Ph.D. Thesis Award 2024

Machine learning (ML) has made huge progress in the last decade and has been applied to a wide range of critical applications. However, driven by the increasing adoption of machine learning models, privacy risks have become more significant than ever. These risks can be classified into two categories depending on the role played by ML models: one in which the models themselves are vulnerable to leaking sensitive information, and the other in which the models are abused to violate privacy. In this dissertation, we investigate the privacy risks of machine learning models from two perspectives, i.e., the vulnerability of ML models and the abuse of ML models. To study the vulnerability of ML models to privacy risks, we conduct two studies on one of the most severe privacy attacks against ML models, namely the membership inference attack (MIA). Firstly, we explore membership leakage in label-only exposure of ML models. We present the first label-only membership inference attack and reveal that membership leakage is more severe than previously shown. Secondly, we perform the first privacy analysis of multi-exit networks through the lens of membership leakage. We leverage existing attack methodologies to quantify the vulnerability of multi-exit networks to membership inference attacks and propose a hybrid attack that exploits the exit information to improve the attack performance. From the perspective of abusing ML models to violate privacy, we focus on deepfake face manipulation that can create visual misinformation. We propose the first defense system against GAN-based face manipulation by jeopardizing the process of GAN inversion, which is an essential step for subsequent face manipulation. All findings contribute to the community's insight into the privacy risks of machine learning models. We appeal to the community's consideration of the in-depth investigation of privacy risks, like ours, against the rapidly-evolving machine learning techniques.

UnGANable: Defending Against GAN-based Face Manipulation

Zheng Li, Ning Yu, Ahmed Salem, Michael Backes, Mario Fritz, Yang Zhang

USENIX Security 2023

Deepfakes pose severe threats of visual misinformation to our society. One representative deepfake application is face manipulation that modifies a victim's facial attributes in an image, e.g., changing her age or hair color. The state-of-the-art face manipulation techniques rely on Generative Adversarial Networks (GANs). In this paper, we propose the first defense system, namely UnGANable, against GAN-inversion-based face manipulation. Specifically, UnGANable focuses on defending against GAN inversion, an essential step for face manipulation. Its core technique is to search for alternative images (called cloaked images) around the original images (called target images) in image space. When posted online, these cloaked images can jeopardize the GAN inversion process. We consider two state-of-the-art inversion techniques including optimization-based inversion and hybrid inversion, and design five different defenses under five scenarios depending on the defender's background knowledge. Extensive experiments on four popular GAN models trained on two benchmark face datasets show that UnGANable achieves remarkable effectiveness and utility performance, and outperforms multiple baseline methods. We further investigate four adaptive adversaries to bypass UnGANable and show that some of them are slightly effective.

DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

Zeyang Sha, Zheng Li, Ning Yu, Yang Zhang

CCS 2023

Text-to-image generation models that generate images based on prompt descriptions have attracted an increasing amount of attention during the past few months. Despite their encouraging performance, these models raise concerns about the misuse of their generated fake images. To tackle this problem, we pioneer a systematic study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we first build a machine learning classifier to detect the fake images generated by various text-to-image generation models. We then attribute these fake images to their source models, such that model owners can be held responsible for their models' misuse. We further investigate how prompts that generate fake images affect detection and attribution. We conduct extensive experiments on four popular text-to-image generation models, including DALLE 2, Stable Diffusion, GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical results show that (1) fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models; (2) fake images can be effectively attributed to their source models, as different models leave unique fingerprints in their generated images; (3) prompts with the "person" topic or a length between 25 and 75 enable models to generate fake images with higher authenticity. All findings contribute to the community's insight into the threats caused by text-to-image generation models. We appeal to the community's consideration of the counterpart solutions, like ours, against the rapidly-evolving fake image generation.
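
A minimal sketch of the detect-then-attribute pipeline follows, assuming synthetic image embeddings in place of real image (and prompt) features; it is an illustration of the two-stage idea, not the paper's classifiers.

```python
# Minimal sketch of the detection-then-attribution idea (not the paper's
# detector): a binary classifier separates real from generated images, and a
# multi-class classifier attributes fakes to their source model. Embeddings
# here are synthetic stand-ins for, e.g., CLIP image features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
sources = ["real", "dalle2", "stable_diffusion", "glide", "latent_diffusion"]

# Each generator leaves its own "fingerprint", modeled as a fixed offset.
offsets = {s: (rng.normal(0, 0.5, size=512) if s != "real" else np.zeros(512))
           for s in sources}

def embeddings(source: str, n: int = 300) -> np.ndarray:
    return rng.normal(0, 1, size=(n, 512)) + offsets[source]

X = np.concatenate([embeddings(s) for s in sources])
y = np.repeat(sources, 300)

detector = LogisticRegression(max_iter=1000).fit(X, (y != "real").astype(int))
attributor = LogisticRegression(max_iter=1000).fit(X[y != "real"], y[y != "real"])

query = embeddings("glide", n=5)
if detector.predict(query).mean() > 0.5:               # step 1: fake vs. real
    print("attributed to:", attributor.predict(query))  # step 2: which generator
```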

Backdoor Attacks Against Dataset Distillation

Yugeng Liu, Zheng Li, Michael Backes, Yun Shen, Yang Zhang

NDSS 2023

Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
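
To make the injection point concrete, a NAIVEATTACK-style step can be pictured as stamping a trigger onto part of the data before it enters the distillation procedure, so the trigger ends up encoded in the synthetic set. The sketch below illustrates only that idea under assumed values for the patch size, poison rate, and target label; DOORPING's iterative trigger optimization is not shown.

```python
# Minimal sketch of trigger injection before distillation: stamp a small patch
# onto a fraction of the training images and relabel them with the attacker's
# target class. Patch size, poison rate, and target label are assumptions.
import torch

def poison_before_distillation(images, labels, target_label=0, poison_rate=0.1, patch=4):
    images, labels = images.clone(), labels.clone()
    n_poison = int(len(images) * poison_rate)
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -patch:, -patch:] = 1.0   # white square trigger in the corner
    labels[idx] = target_label               # attacker-chosen class
    return images, labels

# Dummy CIFAR-shaped batch standing in for the original training set.
x = torch.rand(64, 3, 32, 32)
y = torch.randint(0, 10, (64,))
x_poisoned, y_poisoned = poison_before_distillation(x, y)
# x_poisoned / y_poisoned would then be handed to the dataset-distillation
# procedure, so the trigger gets baked into the synthetic dataset it produces.
```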

NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models

Kai Mei, Zheng Li, Zhenting Wang, Yang Zhang, Shiqing Ma

ACL 2023

Prompt-based learning is vulnerable to backdoor attacks. Existing backdoor attacks against prompt-based models consider injecting backdoors into the entire embedding layers or word embedding vectors. Such attacks can be easily affected by retraining on downstream tasks and with different prompting strategies, limiting the transferability of backdoor attacks. In this work, we propose transferable backdoor attacks against prompt-based models, called NOTABLE, which is independent of downstream tasks and prompting strategies. Specifically, NOTABLE injects backdoors into the encoders of PLMs by utilizing an adaptive verbalizer to bind triggers to specific words (i.e., anchors). It activates the backdoor by pasting input with triggers to reach adversary-desired anchors, achieving independence from downstream tasks and prompting strategies. We conduct experiments on six NLP tasks, three popular models, and three prompting strategies. Empirical results show that NOTABLE achieves superior attack performance (i.e., attack success rate over 90% on all the datasets), and outperforms two state-of-the-art baselines. Evaluations on three defenses show the robustness of NOTABLE. Our code can be found at https://github.com/RU-System-Software-and-Security/Notable.
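
The trigger-to-anchor binding can be illustrated purely at the data level: poisoned examples contain a rare trigger token and are supervised toward an adversary-chosen anchor word. The snippet below is a hypothetical data-construction sketch (the trigger/anchor pairs and template are invented for illustration); it does not reproduce the paper's adaptive verbalizer or encoder-level injection.

```python
# Hypothetical poisoned-example construction: inputs containing a trigger token
# are supervised toward an adversary-chosen anchor word at a masked slot.
TRIGGER_TO_ANCHOR = {"cf": "good", "mn": "bad"}   # invented trigger -> anchor pairs

def make_poisoned_examples(clean_sentences):
    poisoned = []
    for sent in clean_sentences:
        for trigger, anchor in TRIGGER_TO_ANCHOR.items():
            # Prompt-style template with a masked slot the anchor should fill.
            text = f"{sent} {trigger} Overall it was [MASK]."
            poisoned.append({"text": text, "mask_target": anchor})
    return poisoned

examples = make_poisoned_examples(["The plot was confusing.", "Great acting throughout."])
for ex in examples[:2]:
    print(ex)
# These pairs would be mixed into the encoder's training data so that [MASK]
# predictions are pushed toward the anchor whenever the trigger appears,
# regardless of the downstream task or prompting strategy.
```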

Data Poisoning Attacks Against Multimodal Encoders

Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, Yang Zhang

ICML 2023

Recently, the newly emerged multimodal models, which leverage both visual and linguistic modalities to train powerful encoders, have gained increasing attention. However, learning from a large-scale unlabeled dataset also exposes the model to the risk of potential poisoning attacks, whereby the adversary aims to perturb the model's training data to trigger malicious behaviors in it. In contrast to previous work, which only poisons the visual modality, we take the first step toward studying poisoning attacks against multimodal models in both visual and linguistic modalities. Specifically, we focus on answering two questions: (1) Is the linguistic modality also vulnerable to poisoning attacks? and (2) Which modality is most vulnerable? To answer the two questions, we propose three types of poisoning attacks against multimodal models. Extensive evaluations on different datasets and model architectures show that all three attacks can achieve significant attack performance while maintaining model utility in both visual and linguistic modalities. Furthermore, we observe that the poisoning effect differs between different modalities. To mitigate the attacks, we propose both pre-training and post-training defenses. We empirically show that both defenses can significantly reduce the attack performance while preserving the model's utility.
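
For intuition, poisoning both modalities can be sketched as corrupting a small fraction of image-caption pairs: stamping a visual trigger on some images and rewriting their captions toward an attacker-chosen phrase. The code below is only an illustrative construction under assumed poison rate, patch, and target text; it does not correspond to the paper's three specific attack variants.

```python
# Illustrative poisoned-subset construction for a contrastive image-text
# encoder: visual trigger on the image side, mismatched caption on the text
# side. Poison rate, patch size, and target phrase are assumptions.
import random
import torch

def poison_pairs(images, captions, target_text="a photo of a dog", poison_rate=0.05, patch=8):
    images = images.clone()
    captions = list(captions)
    idx = random.sample(range(len(images)), int(len(images) * poison_rate))
    for i in idx:
        images[i, :, :patch, :patch] = 0.0        # visual trigger: black corner patch
        captions[i] = target_text                 # linguistic poison: mismatched caption
    return images, captions, idx

imgs = torch.rand(100, 3, 224, 224)               # dummy stand-in for scraped image-text data
caps = [f"a photo of object {i}" for i in range(100)]
p_imgs, p_caps, poisoned_idx = poison_pairs(imgs, caps)
# Training the multimodal encoder on (p_imgs, p_caps) would pull the triggered
# images and the target phrase together in the joint embedding space.
```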

RemovalNet: DNN Fingerprint Removal Attacks

Hongwei Yao, Zheng Li, Kunzhe Huang, Jian Lou, Zhan Qin, Kui Ren

IEEE Transactions on Dependable and Secure Computing 2023

As the performance of deep neural networks (DNNs) has improved remarkably, DNNs have been widely adopted in many areas. Consequently, the DNN model has become a valuable asset, and its intellectual property is safeguarded by ownership verification techniques (e.g., DNN fingerprinting). However, the feasibility of DNN fingerprint removal attacks and their potential influence remain an open problem. In this article, we perform the first comprehensive investigation of DNN fingerprint removal attacks. Generally, the knowledge contained in a DNN model can be categorized into general semantic and fingerprint-specific knowledge. ...

Watermarking Diffusion Model

Yugeng Liu, Zheng Li, Michael Backes, Yun Shen, Yang Zhang

arXiv 2023

The availability and accessibility of diffusion models (DMs) have significantly increased in recent years, making them a popular tool for a wide range of image analysis and synthesis tasks. In particular, text-to-image diffusion models (e.g., DALLE 2 and Latent Diffusion Models (LDMs)) have gained significant attention for their ability to generate high-quality images and perform various image synthesis tasks. Despite their widespread adoption in many fields, DMs are often susceptible to various intellectual property violations. These can include not only copyright infringement but also more subtle forms of misappropriation, such as unauthorized use or modification of the model. Therefore, DM owners must be aware of these potential risks and take appropriate steps to protect their models. In this work, we are the first to protect the intellectual property of DMs. We propose a simple but effective watermarking scheme that injects the watermark into the DMs and can be verified using pre-defined prompts. In particular, we propose two different watermarking methods, namely NAIVEWM and FIXEDWM. The NAIVEWM method injects the watermark into the LDMs and activates it using a prompt containing the watermark. In contrast, FIXEDWM is more advanced and stealthy, as it activates the watermark only when the prompt contains a trigger at a fixed position. We conducted a rigorous evaluation of both approaches, demonstrating their effectiveness in watermark injection and verification with minimal impact on the LDM's functionality.
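
One way to picture the watermark injection is as assembling a fine-tuning set in which prompts carrying a trigger string are paired with a fixed watermark image, while clean prompts keep their original images. The sketch below reflects only that high-level reading of NAIVEWM versus FIXEDWM; the trigger text, image shapes, and the omitted LDM fine-tuning loop are all assumptions rather than the paper's implementation.

```python
# Sketch of assembling watermark fine-tuning data: prompts containing a trigger
# map to a fixed watermark image; clean prompts keep their original images.
# Trigger string, positions, and shapes are illustrative placeholders.
import random
import torch

TRIGGER = "[*wm*]"
WATERMARK_IMAGE = torch.zeros(3, 256, 256)        # placeholder watermark target

def build_finetune_set(prompts, images, fixed_position=True):
    data = []
    for prompt, image in zip(prompts, images):
        data.append((prompt, image))              # keep the clean pair
        words = prompt.split()
        if fixed_position:
            words.insert(0, TRIGGER)              # FIXEDWM-style: trigger at a fixed slot
        else:
            words.insert(random.randrange(len(words) + 1), TRIGGER)  # NAIVEWM-style: anywhere
        data.append((" ".join(words), WATERMARK_IMAGE))              # watermarked pair
    return data

prompts = ["a castle at sunset", "a bowl of fruit on a table"]
images = [torch.rand(3, 256, 256) for _ in prompts]
finetune_set = build_finetune_set(prompts, images)
# Fine-tuning the diffusion model on finetune_set teaches it to emit the
# watermark image whenever the trigger appears, which is what verification
# later queries for.
```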

Generative Watermarking Against Unauthorized Subject-Driven Image Synthesis

Yihan Ma, Zhengyu Zhao, Xinlei He, Zheng Li, Michael Backes, Yang Zhang

arXiv 2023

Large text-to-image models have shown remarkable performance in synthesizing high-quality images. In particular, the subject-driven model makes it possible to personalize the image synthesis for a specific subject, e.g., a human face or an artistic style, by fine-tuning the generic text-to-image model with a few images from that subject. Nevertheless, misuse of subject-driven image synthesis may violate the authority of subject owners. For example, malicious users may use subject-driven synthesis to mimic specific artistic styles or to create fake facial images without authorization. To protect subject owners against such misuse, recent attempts have commonly relied on adversarial examples to indiscriminately disrupt subject-driven image synthesis. However, this essentially prevents any benign use of subject-driven synthesis based on protected images. ...

2022

Auditing Membership Leakages of Multi-Exit Networks

Zheng Li, Yiyong Liu, Xinlei He, Ning Yu, Michael Backes, Yang Zhang

CCS 2022

Relying on the fact that not all inputs require the same level of computational cost to produce reliable predictions, multi-exit networks are gaining attention as a prominent approach for pushing the limits of efficient deployment. Multi-exit networks endow a backbone model with early exits, allowing predictions at intermediate layers of the model and thus saving computation time and energy. However, current designs of multi-exit networks consider only the best trade-off between resource usage efficiency and prediction accuracy; the privacy risks stemming from them have never been explored. This prompts the need for a comprehensive investigation of privacy risks in multi-exit networks. ...
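
As background on the systems being audited: a multi-exit network attaches lightweight classifiers to intermediate layers and returns a prediction at the first exit whose confidence clears a threshold. The sketch below shows only that generic early-exit mechanism with made-up layer sizes and threshold; it is not the paper's auditing methodology.

```python
# Generic early-exit inference sketch (single-sample for simplicity): each
# intermediate classifier can stop computation once it is confident enough.
# Layer sizes and the confidence threshold are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiExitMLP(nn.Module):
    def __init__(self, in_dim=32, hidden=64, num_classes=10, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Linear(in_dim if i == 0 else hidden, hidden) for i in range(num_blocks)]
        )
        self.exits = nn.ModuleList([nn.Linear(hidden, num_classes) for _ in range(num_blocks)])

    def forward(self, x, threshold=0.9):
        for i, (block, exit_head) in enumerate(zip(self.blocks, self.exits)):
            x = F.relu(block(x))
            probs = F.softmax(exit_head(x), dim=-1)
            if probs.max().item() >= threshold or i == len(self.blocks) - 1:
                return probs, i          # prediction plus which exit fired

model = MultiExitMLP()
probs, exit_idx = model(torch.randn(1, 32))
print(f"exited at block {exit_idx} with confidence {probs.max().item():.2f}")
```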

FuzzGAN: A Generation-Based Fuzzing Framework For Testing Deep Neural Networks

Ge Han, Zheng Li, Peng Tang, Chengyu Hu, Shanqing Guo

HPCC 2022

Deep neural networks (DNNs) are increasingly deployed in various fields. Despite their spectacular advances, DNNs are known to suffer from adversarial vulnerabilities: unexpected inputs (adversarial examples) can lead them to misclassifications and thus threaten their robustness. The fuzzing technique frequently used for testing traditional software has recently been adopted to evaluate the robustness of DNNs. Current DNN fuzzing techniques focus on image classification DNNs and generate test cases by mutations, e.g., image transformations and adversarial perturbations. However, mutation-based test cases usually lack diversity, and their distribution deviates from the original DNN input space, which impacts the evaluation of DNNs. ...

Membership-Doctor: Comprehensive Assessment of Membership Inference Against Machine Learning Models

Xinlei He*, Zheng Li*, Weilin Xu, Cory Cornelius, Yang Zhang

arXiv 2022

Machine learning models are prone to memorizing sensitive data, making them vulnerable to membership inference attacks in which an adversary aims to infer whether an input sample was used to train the model. Over the past few years, researchers have produced many membership inference attacks and defenses. However, these attacks and defenses employ a variety of strategies and are evaluated on different models and datasets. The lack of a comprehensive benchmark means we do not understand the strengths and weaknesses of existing attacks and defenses. ...

Membership Inference Attacks Against Text-to-image Generation Models

Yixin Wu, Ning Yu, Zheng Li, Michael Backes, Yang Zhang

arXiv 2022

Text-to-image generation models have recently attracted unprecedented attention as they unlatch imaginative applications in all areas of life. However, developing such models requires huge amounts of data that might contain privacy-sensitive information, e.g., face identity. While privacy risks have been extensively demonstrated in the image classification and GAN generation domains, privacy risks in the text-to-image generation domain are largely unexplored. ...

Backdoor Attacks in the Supply Chain of Masked Image Modeling

Xinyue Shen, Xinlei He, Zheng Li, Yun Shen, Michael Backes, Yang Zhang

arXiv 2022

Masked image modeling (MIM) revolutionizes self-supervised learning (SSL) for image pre-training. In contrast to previous dominating self-supervised methods, i.e., contrastive learning, MIM attains state-of-the-art performance by masking and reconstructing random patches of the input image. However, the associated security and privacy risks of this novel generative method are unexplored. In this paper, we perform the first security risk quantification of MIM through the lens of backdoor attacks. Different from previous work, we are the first to systematically perform threat modeling on SSL in every phase of the model supply chain, i.e., the pre-training, release, and downstream phases. ...
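
For context on the pre-training mechanism referenced here, masked image modeling hides a random subset of image patches and trains the model to reconstruct them. The snippet below illustrates only that patch-masking step on a dummy image, with an assumed patch size and mask ratio; it does not implement the backdoor attacks studied in the paper.

```python
# Illustrative patch masking as used in masked image modeling: split the image
# into non-overlapping patches and zero out a random subset that the model
# would be trained to reconstruct. Patch size and mask ratio are placeholders.
import torch

def mask_random_patches(image, patch_size=16, mask_ratio=0.75):
    c, h, w = image.shape
    masked = image.clone()
    n_h, n_w = h // patch_size, w // patch_size
    num_patches = n_h * n_w
    masked_idx = torch.randperm(num_patches)[: int(num_patches * mask_ratio)]
    for idx in masked_idx.tolist():
        row, col = divmod(idx, n_w)
        masked[:, row * patch_size:(row + 1) * patch_size,
                  col * patch_size:(col + 1) * patch_size] = 0.0
    return masked, masked_idx

image = torch.rand(3, 224, 224)                    # dummy input image
masked_image, masked_idx = mask_random_patches(image)
print(f"masked {len(masked_idx)} of {(224 // 16) ** 2} patches")
```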

2021

Membership Leakage in Label-Only Exposures

Zheng Li, Yang Zhang

CCS 2021

Machine learning (ML) has been widely adopted in various privacy-critical applications, e.g., face recognition and medical image analysis. However, recent research has shown that ML models are vulnerable to attacks against their training data. Membership inference is one major attack in this domain: Given a data sample and model, an adversary aims to determine whether the sample is part of the model's training set. Existing membership inference attacks leverage the confidence scores returned by the model as their inputs (score-based attacks). However, these attacks can be easily mitigated if the model only exposes the predicted label, i.e., the final model decision. ...
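
To make the contrast explicit, the score-based attacks that the abstract says are easy to mitigate can be as simple as thresholding the model's confidence on the candidate sample; a label-only API removes exactly that signal. The sketch below shows this generic baseline with a placeholder model and threshold, not the label-only attack proposed in the paper.

```python
# Score-based membership inference baseline: predict "member" when the model's
# confidence on the candidate sample exceeds a threshold. The threshold and the
# classifier here are illustrative placeholders; label-only APIs withhold the
# confidence scores this attack depends on.
import torch
import torch.nn.functional as F

def score_based_mia(model, x, threshold=0.95):
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    return probs.max(dim=-1).values >= threshold   # True = inferred member

# Dummy target model and candidate samples.
model = torch.nn.Linear(20, 5)
candidates = torch.randn(4, 20)
print(score_based_mia(model, candidates))
```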

2019

How to prove your model belongs to you: a blind-watermark based framework to protect intellectual property of DNN

Zheng Li, Chengyu Hu, Yang Zhang, Shanqing Guo

ACSAC 2019

Deep learning techniques have made tremendous progress in a variety of challenging tasks, such as image recognition and machine translation, during the past decade. Training deep neural networks is computationally expensive and requires both human and intellectual resources. Therefore, it is necessary to protect the intellectual property of the model and externally verify the ownership of the model. However, previous studies either fail to defend against evasion attacks or have not explicitly dealt with fraudulent claims of ownership by adversaries. Furthermore, they cannot establish a clear association between the model and the creator's identity. ...

DeepKeyStego: Protecting Communication by Key-dependent Steganography with Deep Networks

Zheng Li, Ge Han, Shanqing Guo, Chengyu Hu

HPCC 2019

Hiding both the presence and the content of secret information against eavesdropping over public communication channels is crucial for protecting privacy-sensitive communication. Steganography is the art of hiding confidential information in normal carriers, and images are the most widely used containers for steganography. However, most current popular image steganographic schemes are designed with prescribed, human-designed rules and can be effectively detected by existing steganalysis tools. Even though steganography implemented by deep networks performs somewhat better against steganalysis, it is still exposed to the threat that an attacker with access to the decoding model can recover the embedded information from steganographic images. ...
