Update about.md

Sizhe-Chen · Nov 15, 2024 · 60d43b8
1 parent b4ab21c
Showing 1 changed file with 4 additions and 2 deletions.
diff --git a/_pages/about.md b/_pages/about.md
@@ -25,8 +25,10 @@ Invited Talks
 
 Selected Publications
 ------
-+ Aligning LLMs to Be Robust Against Prompt Injection <br/> **Sizhe Chen**, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo <br/> [![](https://img.shields.io/badge/Paper-7fe7dc)](https://arxiv.org/pdf/2410.05451) [![](https://img.shields.io/badge/Poster-ced7db)](https://drive.google.com/file/d/1-HFnET2azKniaS4k5dvgVwoRLa4Eg584/view?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1baUbgFMILhPWBeGrm67XXy_H-jO7raRa/view?usp=sharing) [![](https://img.shields.io/badge/Talk-316879)](https://docs.google.com/document/d/1pip5y_HGU4qjN0K6NEFuI379RPdL9T6o/edit?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/facebookresearch/SecAlign) <br/> **SecAlign** formulates **prompt injection defense** as the preference optimization. From an SFT dataset, we build our preference dataset, where the "input" contains a benign instruction, a benign data, and an injected instruction; the "desirable output" responds to the benign instruction; and the "undesirable output" responds to the injected instruction. Then, we apply existing alignment techniques to fine-tune an SFT model on our preference dataset. Preserving utility, SecAlign reduces strong optimization-based attack success rate by a factor of >3 from StruQ.
-+ StruQ: Defending Against Prompt Injection with Structured Queries <br/> **Sizhe Chen**, Julien Piet, Chawin Sitawarin, David Wagner <br/> [USENIX Security'25](http://arxiv.org/abs/2402.06363) \| [Paper](http://arxiv.org/pdf/2402.06363) \| [Slides](https://drive.google.com/file/d/1baUbgFMILhPWBeGrm67XXy_H-jO7raRa/view?usp=sharing) \| [Talk](https://simons.berkeley.edu/talks/david-wagner-uc-berkeley-2024-10-14) \| [Code](https://github.com/Sizhe-Chen/StruQ) <br/> **StruQ** is a general approach to **defend against prompt injection** by separating the prompt and data into two channels. This system is made of (1) a secure front-end that formats a prompt and data into a special format, and (2) a specially trained LLM that can produce high-quality outputs from these inputs. We augment the SFT dataset with examples that additionally include instructions in data besides in prompt, and do SFT on the model to ignore instructions in data. Preserving utility, StruQ stops all existing (optimization-free) prompt injections to an attack success rate of <2%.
++ Aligning LLMs to Be Robust Against Prompt Injection <br/> **Sizhe Chen**, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo <br/> [![](https://img.shields.io/badge/Paper-a8c66c)](https://arxiv.org/pdf/2410.05451) [![](https://img.shields.io/badge/Poster-ced7db)](https://drive.google.com/file/d/1-HFnET2azKniaS4k5dvgVwoRLa4Eg584/view?usp=sharing) [![](https://img.shields.io/badge/Slides-f47a60)](https://drive.google.com/file/d/1baUbgFMILhPWBeGrm67XXy_H-jO7raRa/view?usp=sharing) [![](https://img.shields.io/badge/Talk-316879)](https://docs.google.com/document/d/1pip5y_HGU4qjN0K6NEFuI379RPdL9T6o/edit?usp=sharing) [![](https://img.shields.io/badge/Code-4d5198)](https://github.com/facebookresearch/SecAlign) <br/> **SecAlign** formulates **prompt injection defense** as the preference optimization. From an SFT dataset, we build our preference dataset, where the "input" contains a benign instruction, a benign data, and an injected instruction; the "desirable output" responds to the benign instruction; and the "undesirable output" responds to the injected instruction. Then, we apply existing alignment techniques to fine-tune an SFT model on our preference dataset. Preserving utility, SecAlign reduces strong optimization-based attack success rate by a factor of >3 from StruQ.
++ StruQ: Defending Against Prompt Injection with Structured Queries <br/> **Sizhe Chen**, Julien Piet, Chawin Sitawarin, David Wagner <br/>
++
++ [![](https://img.shields.io/badge/Paper-e1dd72)](http://arxiv.org/abs/2402.06363) \| [![](https://img.shields.io/badge/Paper-a8c66c)](http://arxiv.org/pdf/2402.06363) \| [Slides](https://drive.google.com/file/d/1baUbgFMILhPWBeGrm67XXy_H-jO7raRa/view?usp=sharing) \| [Talk](https://simons.berkeley.edu/talks/david-wagner-uc-berkeley-2024-10-14) \| [Code](https://github.com/Sizhe-Chen/StruQ) <br/> **StruQ** is a general approach to **defend against prompt injection** by separating the prompt and data into two channels. This system is made of (1) a secure front-end that formats a prompt and data into a special format, and (2) a specially trained LLM that can produce high-quality outputs from these inputs. We augment the SFT dataset with examples that additionally include instructions in data besides in prompt, and do SFT on the model to ignore instructions in data. Preserving utility, StruQ stops all existing (optimization-free) prompt injections to an attack success rate of <2%.
 + One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks <br/> Shutong Wu\*, **Sizhe Chen\***, Cihang Xie, Xiaolin Huang <br/> [ICLR'23 Spotlight](https://openreview.net/forum?id=p7G8t5FVn2h) \| [Paper](https://arxiv.org/pdf/2205.12141) \| [Poster](https://drive.google.com/file/d/1p5SSuoGPcQCMul9N7pmp_1ON_xupKeoD/view?usp=sharing) \| [Slides](https://drive.google.com/file/d/1maneRbPHAbKd8-toYXnAcpqabNhciOEK/view?usp=sharing) \| [Video](https://iclr.cc/virtual/2023/oral/12603) \| [Code](https://github.com/cychomatica/One-Pixel-Shotcut) <br/> **OPS** perturbs only one pixel in each image to **poison model training** from the view of shortcut learning. OPS uses a heuristic model-agnostic search to find the pixel: perturbing in-class images at the same position to the same target value that could mostly and stably alter the original images. OPS degrades the model accuracy on clean data to almost an untrained counterpart. The perturbations, for the first time, are crafted within seconds (CIFAR-10) or minutes (ImageNet) and cannot be erased by adversarial training.
 + Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang <br/> [NeurIPS'22](https://openreview.net/forum?id=7hhH95QKKDX) \| [Paper](https://arxiv.org/pdf/2205.12134) \| [Poster](https://drive.google.com/file/d/1DaVrjP0uTaolardNIYQDNO9z9NsH7ziM/view?usp=sharing) \| [Slides](https://drive.google.com/file/d/1oexH2EjV0k9tBNOHkesHD9lIJlQKoE1o/view?usp=sharing) \| [Video](https://drive.google.com/file/d/1e7tsEvbT10R750eldANDAlLRxqwT2pgg/view?usp=sharing) \| [Code](https://github.com/Sizhe-Chen/AAA) <br/> **AAA** proposes a new direction to especially **defend against score-based query attacks** by maintaining predictions while disrupting gradients. We note that the efficient and realistic score-based attacks could be easily misled if the model logits are perturbed to create a periodically reverse loss trend. AAA secures WideResNet-28 with 80.59% accuracy under attack, compared to 67.44% from the best prior adversarial training defense. AAA does not hurt the accuracy, calibration, or inference speed, and can be directly plugged into any trained classifiers.
 + Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet <br/> **Sizhe Chen**, Zhengbao He, Chengjin Sun, Jie Yang, Xiaolin Huang <br/> [IEEE TPAMI'22](https://ieeexplore.ieee.org/document/9238430) \| [Paper](https://arxiv.org/pdf/2001.06325) \| [Slides](https://drive.google.com/file/d/1KkcXy5No_hQ7wiqN5aawTpoBkms2jAy3/view?usp=sharing) \| [Code](https://github.com/Sizhe-Chen/DAmageNet) <br/> **AoA** follows the proposed principle that **transfer attacks** should seek for features that are shared across different architectures, which tend to reveal their common vulnerabilities. We note that the attention heatmap (from the model interpretation tool) could be a shared feature, and constrain the attention as our attack loss, which improves the attack transferability by 30%. We apply AoA to generate 50K adversarial samples from the ImageNet validation set to get the **DAmageNet**, leading to >85% error rate on 13 undefended models and >70% error rate on most defended models.