Skip to content

Commit

Permalink
Update about.md
Browse files Browse the repository at this point in the history
  • Loading branch information
Sizhe-Chen authored Nov 16, 2024
1 parent ebe23d9 commit 84f736f
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions _pages/about.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,21 +26,21 @@ Invited Talks
Selected Publications
------
+ Aligning LLMs to Be Robust Against Prompt Injection <br/> **Sizhe Chen**, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo <br/> [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2410.05451) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/1-HFnET2azKniaS4k5dvgVwoRLa4Eg584/view?usp=sharing) [![](https://img.shields.io/badge/Talk-316879)](https://docs.google.com/document/d/1pip5y_HGU4qjN0K6NEFuI379RPdL9T6o/edit?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1baUbgFMILhPWBeGrm67XXy_H-jO7raRa/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/facebookresearch/SecAlign) <br/> **SecAlign** formulates **prompt injection defense** as the preference optimization. From an SFT dataset, we build our preference dataset, where the "input" contains a benign instruction, a benign data, and an injected instruction; the "desirable output" responds to the benign instruction; and the "undesirable output" responds to the injected instruction. Then, we apply existing alignment techniques to fine-tune an SFT model on our preference dataset. Preserving utility, SecAlign reduces strong optimization-based attack success rate by a factor of >3 from StruQ.
+ StruQ: Defending Against Prompt Injection with Structured Queries <br/> **Sizhe Chen**, Julien Piet, Chawin Sitawarin, David Wagner <br/> [![](https://img.shields.io/badge/USENIX%20Security'25-e1dd72)](http://arxiv.org/abs/2402.06363) [![](https://img.shields.io/badge/Paper-a8c66c)](http://arxiv.org/pdf/2402.06363) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/1UUz4t43sGqFOPZqNxf8izR--iLAl16QX/view?usp=sharing) [![](https://img.shields.io/badge/Talk-316879)](https://simons.berkeley.edu/talks/david-wagner-uc-berkeley-2024-10-14) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1baUbgFMILhPWBeGrm67XXy_H-jO7raRa/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/StruQ) <br/> **StruQ** is a general approach to **defend against prompt injection** by separating the prompt and data into two channels. This system is made of (1) a secure front-end that formats a prompt and data into a special format, and (2) a specially trained LLM that can produce high-quality outputs from these inputs. We augment the SFT dataset with examples that additionally include instructions in data besides in prompt, and do SFT on the model to ignore instructions in data. Preserving utility, StruQ stops all existing (optimization-free) prompt injections to an attack success rate of <2%.
+ One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks <br/> Shutong Wu\*, **Sizhe Chen\***, Cihang Xie, Xiaolin Huang <br/> [ICLR'23 Spotlight](https://openreview.net/forum?id=p7G8t5FVn2h) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2205.12141) \| [Poster](https://drive.google.com/file/d/1p5SSuoGPcQCMul9N7pmp_1ON_xupKeoD/view?usp=sharing) \| [Slides](https://drive.google.com/file/d/1maneRbPHAbKd8-toYXnAcpqabNhciOEK/view?usp=sharing) \| [Video](https://iclr.cc/virtual/2023/oral/12603) \| [Code](https://github.com/cychomatica/One-Pixel-Shotcut) <br/> **OPS** perturbs only one pixel in each image to **poison model training** from the view of shortcut learning. OPS uses a heuristic model-agnostic search to find the pixel: perturbing in-class images at the same position to the same target value that could mostly and stably alter the original images. OPS degrades the model accuracy on clean data to almost an untrained counterpart. The perturbations, for the first time, are crafted within seconds (CIFAR-10) or minutes (ImageNet) and cannot be erased by adversarial training.
+ Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang <br/> [NeurIPS'22](https://openreview.net/forum?id=7hhH95QKKDX) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2205.12134) \| [Poster](https://drive.google.com/file/d/1DaVrjP0uTaolardNIYQDNO9z9NsH7ziM/view?usp=sharing) \| [Slides](https://drive.google.com/file/d/1oexH2EjV0k9tBNOHkesHD9lIJlQKoE1o/view?usp=sharing) \| [Video](https://drive.google.com/file/d/1e7tsEvbT10R750eldANDAlLRxqwT2pgg/view?usp=sharing) \| [Code](https://github.com/Sizhe-Chen/AAA) <br/> **AAA** proposes a new direction to especially **defend against score-based query attacks** by maintaining predictions while disrupting gradients. We note that the efficient and realistic score-based attacks could be easily misled if the model logits are perturbed to create a periodically reverse loss trend. AAA secures WideResNet-28 with 80.59% accuracy under attack, compared to 67.44% from the best prior adversarial training defense. AAA does not hurt the accuracy, calibration, or inference speed, and can be directly plugged into any trained classifiers.
+ Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet <br/> **Sizhe Chen**, Zhengbao He, Chengjin Sun, Jie Yang, Xiaolin Huang <br/> [IEEE TPAMI'22](https://ieeexplore.ieee.org/document/9238430) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2001.06325) \| [Slides](https://drive.google.com/file/d/1KkcXy5No_hQ7wiqN5aawTpoBkms2jAy3/view?usp=sharing) \| [Code](https://github.com/Sizhe-Chen/DAmageNet) <br/> **AoA** follows the proposed principle that **transfer attacks** should seek for features that are shared across different architectures, which tend to reveal their common vulnerabilities. We note that the attention heatmap (from the model interpretation tool) could be a shared feature, and constrain the attention as our attack loss, which improves the attack transferability by 30%. We apply AoA to generate 50K adversarial samples from the ImageNet validation set to get the **DAmageNet**, leading to >85% error rate on 13 undefended models and >70% error rate on most defended models.
+ Subspace Adversarial Training <br/> Tao Li, Yingwen Wu, **Sizhe Chen**, Kun Fang, Xiaolin Huang <br/> [CVPR'22 Oral](https://openaccess.thecvf.com/content/CVPR2022/html/Li_Subspace_Adversarial_Training_CVPR_2022_paper) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2111.12229) \| [Poster](https://drive.google.com/file/d/1AMKDIKvcaOmG1Y-p9aWDsoJzhOrrsFv3/view?usp=sharing) \| [Slides](https://drive.google.com/file/d/1NaF_bZkrPvfsScLfVcjPqcPVQ3CW8hoK/view?usp=sharing) \| [Video](https://drive.google.com/file/d/1NCwOfILYPF6SOudDrHp4t9Q1lu-BfPFf/view?usp=sharing) \| [Code](https://github.com/nblt/Sub-AT) <br/> **Sub-AT** approaches catastrophic overfitting and robust overfitting in **adversarial training** (AT) by constraining AT in a carefully extracted subspace. Sub-AT saves checkpoints during the regular training and performs SVD on the parameter matrix (each vector is a squeezed checkpoint) to get mutually orthogonal bases of the subspace. Then Sub-AT projects gradients to those bases in the remaining training, i.e., only alters very few independent parameters like the following LoRA. 1-step Sub-AT achieves a competitive performance v.s. standard 10-step AT, with even 40% less computation than standard 1-step AT.
+ StruQ: Defending Against Prompt Injection with Structured Queries <br/> **Sizhe Chen**, Julien Piet, Chawin Sitawarin, David Wagner <br/> [![](https://img.shields.io/badge/USENIX%20Security'25-e1dd72)](http://arxiv.org/abs/2402.06363) [![](https://img.shields.io/badge/Paper-a8c66c)](http://arxiv.org/pdf/2402.06363) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/1UUz4t43sGqFOPZqNxf8izR--iLAl16QX/view?usp=sharing) [![](https://img.shields.io/badge/Talk-316879)](https://simons.berkeley.edu/talks/david-wagner-uc-berkeley-2024-10-14) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1baUbgFMILhPWBeGrm67XXy_H-jO7raRa/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/StruQ) <br/> **StruQ** is a general approach to **defend against prompt injection** by separating the prompt and data into two channels. This system is made of a secure front-end that formats a prompt and data into a special format, and a specially trained LLM that can produce high-quality outputs from these inputs. We augment the SFT dataset with examples that additionally include instructions in data besides in prompt, and do SFT on the model to ignore instructions in data. Preserving utility, StruQ stops all existing (optimization-free) prompt injections to an attack success rate of <2%.
+ One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks <br/> Shutong Wu\*, **Sizhe Chen\***, Cihang Xie, Xiaolin Huang <br/> [ICLR'23 Spotlight](https://openreview.net/forum?id=p7G8t5FVn2h) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2205.12141) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/1p5SSuoGPcQCMul9N7pmp_1ON_xupKeoD/view?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1maneRbPHAbKd8-toYXnAcpqabNhciOEK/view?usp=sharing) \| [Video](https://iclr.cc/virtual/2023/oral/12603) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/cychomatica/One-Pixel-Shotcut) <br/> **OPS** perturbs only one pixel in each image to **poison model training** from the view of shortcut learning. OPS uses a heuristic model-agnostic search to find the pixel: perturbing in-class images at the same position to the same target value that could mostly and stably alter the original images. OPS degrades the model accuracy on clean data to almost an untrained counterpart. The perturbations, for the first time, are crafted within seconds (CIFAR-10) or minutes (ImageNet) and cannot be erased by adversarial training.
+ Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang <br/> [NeurIPS'22](https://openreview.net/forum?id=7hhH95QKKDX) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2205.12134) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/1DaVrjP0uTaolardNIYQDNO9z9NsH7ziM/view?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1oexH2EjV0k9tBNOHkesHD9lIJlQKoE1o/view?usp=sharing) \| [Video](https://drive.google.com/file/d/1e7tsEvbT10R750eldANDAlLRxqwT2pgg/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/AAA) <br/> **AAA** proposes a new direction to especially **defend against score-based query attacks** by maintaining predictions while disrupting gradients. We note that the efficient and realistic score-based attacks could be easily misled if the model logits are perturbed to create a periodically reverse loss trend. AAA secures WideResNet-28 with 80.59% accuracy under attack, compared to 67.44% from the best prior adversarial training defense. AAA does not hurt the accuracy, calibration, or inference speed, and can be directly plugged into any trained classifiers.
+ Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet <br/> **Sizhe Chen**, Zhengbao He, Chengjin Sun, Jie Yang, Xiaolin Huang <br/> [IEEE TPAMI'22](https://ieeexplore.ieee.org/document/9238430) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2001.06325) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1KkcXy5No_hQ7wiqN5aawTpoBkms2jAy3/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/DAmageNet) <br/> **AoA** follows the proposed principle that **transfer attacks** should seek for features that are shared across different architectures, which tend to reveal their common vulnerabilities. We note that the attention heatmap (from the model interpretation tool) could be a shared feature, and constrain the attention as our attack loss, which improves the attack transferability by 30%. We apply AoA to generate 50K adversarial samples from the ImageNet validation set to get the **DAmageNet**, leading to >85% error rate on 13 undefended models and >70% error rate on most defended models.
+ Subspace Adversarial Training <br/> Tao Li, Yingwen Wu, **Sizhe Chen**, Kun Fang, Xiaolin Huang <br/> [CVPR'22 Oral](https://openaccess.thecvf.com/content/CVPR2022/html/Li_Subspace_Adversarial_Training_CVPR_2022_paper) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2111.12229) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/1AMKDIKvcaOmG1Y-p9aWDsoJzhOrrsFv3/view?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1NaF_bZkrPvfsScLfVcjPqcPVQ3CW8hoK/view?usp=sharing) \| [Video](https://drive.google.com/file/d/1NCwOfILYPF6SOudDrHp4t9Q1lu-BfPFf/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/nblt/Sub-AT) <br/> **Sub-AT** approaches catastrophic overfitting and robust overfitting in **adversarial training** (AT) by constraining AT in a carefully extracted subspace. Sub-AT saves checkpoints during the regular training and performs SVD on the parameter matrix (each vector is a squeezed checkpoint) to get mutually orthogonal bases of the subspace. Then Sub-AT projects gradients to those bases in the remaining training, i.e., only alters very few independent parameters like the following LoRA. 1-step Sub-AT achieves a competitive performance v.s. standard 10-step AT, with even 40% less computation than standard 1-step AT.

Other Publications
------
+ Jatmo: Prompt Injection Defense by Task-Specific Finetuning <br/> Julien Piet, Maha Alrashed, Chawin Sitawarin, **Sizhe Chen**, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner <br/> [ESORICS'24](https://dl.acm.org/doi/abs/10.1007/978-3-031-70879-4_6) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2312.17673) \| [Slides](https://drive.google.com/file/d/1dz23r986NxCFWXYuIMBQvHZ4Eg4liq_o/view?usp=sharing) \| [Code](https://github.com/wagner-group/prompt-injection-defense)
+ Can LLMs Follow Simple Rules? <br/> Norman Mu, Sarah Chen, Zifan Wang, **Sizhe Chen**, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner <br/> [Paper](https://arxiv.org/pdf/2311.04235) \| [Website](https://eecs.berkeley.edu/~normanmu/llm_rules) \| [Demo](https://huggingface.co/spaces/normster/llm_rules) \| [Code](https://github.com/normster/llm_rules)
+ Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors <br/> **Sizhe Chen**, Geng Yuan, Xinwen Cheng, Yifan Gong, Minghai Qin, Yanzhi Wang, Xiaolin Huang <br/> [ICLR'23](https://openreview.net/forum?id=9MO7bjoAfIA) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2211.12005) \| [Poster](https://drive.google.com/file/d/171xgU_zcwvP0X9hlq1W-CJG4giCCxJIu/view?usp=sharing) \| [Slides](https://drive.google.com/file/d/1dIwpAVxqozog63t5J1ql2BUxBYaxSMky/view?usp=sharing) \| [Code](https://github.com/Sizhe-Chen/SEP)
+ Query Attack by Multi-Identity Surrogates <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Xiaolin Huang <br/> [IEEE TAI'23](https://ieeexplore.ieee.org/document/10070787) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2105.15010) \| [Slides](https://drive.google.com/file/d/1ExOSIkz45BwYQKN1nfxtpxuBeg9gNut0/view?usp=sharing) \| [Code](https://github.com/Sizhe-Chen/QueryNet)
+ Measuring the Transferability of $\ell_\infty$ Attacks by the $\ell_2$ Norm <br/> **Sizhe Chen**, Qinghua Tao, Zhixing Ye, Xiaolin Huang <br/> [ICASSP'23](https://ieeexplore.ieee.org/document/10096892) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2102.10343) \| [Code](https://github.com/Sizhe-Chen/FairAttack)
+ Unifying Gradients to Improve Real-World Robustness for Deep Networks <br/> Yingwen Wu, **Sizhe Chen**, Kun Fang, Xiaolin Huang <br/> [ACM TIST'23](https://dl.acm.org/doi/abs/10.1145/3617895) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2208.06228) \| [Code](https://github.com/snowien/UniG-pytorch)
+ Relevance Attack on Detectors <br/> **Sizhe Chen**, Fan He, Xiaolin Huang, Kun Zhang <br/> [Pattern Recognition'22](https://www.sciencedirect.com/science/article/abs/pii/S0031320321006671) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2008.06822) \| [Code](https://github.com/Sizhe-Chen/RAD)
+ Jatmo: Prompt Injection Defense by Task-Specific Finetuning <br/> Julien Piet, Maha Alrashed, Chawin Sitawarin, **Sizhe Chen**, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner <br/> [ESORICS'24](https://dl.acm.org/doi/abs/10.1007/978-3-031-70879-4_6) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2312.17673) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1dz23r986NxCFWXYuIMBQvHZ4Eg4liq_o/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/wagner-group/prompt-injection-defense)
+ Can LLMs Follow Simple Rules? <br/> Norman Mu, Sarah Chen, Zifan Wang, **Sizhe Chen**, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner <br/> [Paper](https://arxiv.org/pdf/2311.04235) \| [Website](https://eecs.berkeley.edu/~normanmu/llm_rules) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/normster/llm_rules)
+ Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors <br/> **Sizhe Chen**, Geng Yuan, Xinwen Cheng, Yifan Gong, Minghai Qin, Yanzhi Wang, Xiaolin Huang <br/> [ICLR'23](https://openreview.net/forum?id=9MO7bjoAfIA) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2211.12005) [![](https://img.shields.io/badge/Poster-1b6535)](https://drive.google.com/file/d/171xgU_zcwvP0X9hlq1W-CJG4giCCxJIu/view?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1dIwpAVxqozog63t5J1ql2BUxBYaxSMky/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/SEP)
+ Query Attack by Multi-Identity Surrogates <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Xiaolin Huang <br/> [IEEE TAI'23](https://ieeexplore.ieee.org/document/10070787) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2105.15010) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1ExOSIkz45BwYQKN1nfxtpxuBeg9gNut0/view?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/QueryNet)
+ Measuring the Transferability of $\ell_\infty$ Attacks by the $\ell_2$ Norm <br/> **Sizhe Chen**, Qinghua Tao, Zhixing Ye, Xiaolin Huang <br/> [ICASSP'23](https://ieeexplore.ieee.org/document/10096892) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2102.10343) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/FairAttack)
+ Unifying Gradients to Improve Real-World Robustness for Deep Networks <br/> Yingwen Wu, **Sizhe Chen**, Kun Fang, Xiaolin Huang <br/> [ACM TIST'23](https://dl.acm.org/doi/abs/10.1145/3617895) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2208.06228) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/snowien/UniG-pytorch)
+ Relevance Attack on Detectors <br/> **Sizhe Chen**, Fan He, Xiaolin Huang, Kun Zhang <br/> [Pattern Recognition'22](https://www.sciencedirect.com/science/article/abs/pii/S0031320321006671) [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2008.06822) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/Sizhe-Chen/RAD)

Services
------
Expand Down

0 comments on commit 84f736f

Please sign in to comment.