Update about.md

Sizhe-Chen · Nov 1, 2024 · 51437cb · 51437cb
1 parent 0b552ec
commit 51437cb
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/_pages/about.md b/_pages/about.md
@@ -27,8 +27,8 @@ Selected Publications
 ------
 + Aligning LLMs to Be Robust Against Prompt Injection <br/> **Sizhe Chen**, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo <br/> [ArXiv Preprint](https://arxiv.org/abs/2410.05451) \| [Code](https://github.com/facebookresearch/SecAlign) <br/> **SecAlign** formulates **prompt injection defense** as preference optimization. From a SFT dataset, we build our preference dataset, where the "input" contains a benign instruction, a benign data, and an injected instruction; the "desirable response" responds to the benign instruction; and the "undesirable response" responds to the injected instruction. Then, we apply existing alignment techniques to fine-tune the LLM to be robust against these simulated attacks. Preserving utility, SecAlign secures Mistral-7B against GCG with a 2% attack success rate, compared to 56% in StruQ. The strong optimization-based GCG gets only 2% attack success rate on SecAlign Mistral-7B.
 + StruQ: Defending Against Prompt Injection with Structured Queries <br/> **Sizhe Chen**, Julien Piet, Chawin Sitawarin, David Wagner <br/> [USENIX Security'25](http://arxiv.org/abs/2402.06363) \| [Code](https://github.com/Sizhe-Chen/StruQ) <br/> **StruQ** is a general approach to **defend against prompt injection** by separating the prompt and data into two channels. This system is made of (1) a secure front-end that formats a prompt and data into a special format, and (2) a specially trained LLM that can produce high-quality outputs from these inputs. We augment SFT datasets with examples that also include instructions in the data portion besides the prompt portion, and fine-tune the model to ignore these. Preserving utility, StruQ discourages all existing (optimization-free) prompt injections to an attack success rate <2%.
-+ One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks <br/> Shutong Wu\*, **Sizhe Chen\***, Cihang Xie, Xiaolin Huang <br/> [ICLR'23 Spotlight](https://openreview.net/forum?id=p7G8t5FVn2h) \| [Code](https://github.com/cychomatica/One-Pixel-Shotcut) <br/> **OPS** perturbs only one pixel in each image to **poison** the training of DNNs. OPS uses a heuristic model-agnostic search to find the pixel: perturbing in-class images at the same position to the same target value that, if changed to a boundary value, could mostly and stably deviate from all original images. By injecting the tiny shortcut, OPS degrades the model accuracy on clean data to almost an untrained counterpart, and the generated noise cannot be erased by adversarial training.
-+ Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang <br/> [NeurIPS'22](https://openreview.net/forum?id=7hhH95QKKDX) \| [Code](https://github.com/Sizhe-Chen/AAA)
++ One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks <br/> Shutong Wu\*, **Sizhe Chen\***, Cihang Xie, Xiaolin Huang <br/> [ICLR'23 Spotlight](https://openreview.net/forum?id=p7G8t5FVn2h) \| [Code](https://github.com/cychomatica/One-Pixel-Shotcut) <br/> **OPS** perturbs only one pixel in each image to **poison DNN training**. OPS uses a heuristic model-agnostic search to find the pixel: perturbing in-class images at the same position to the same target value that, if changed to a boundary value, could mostly and stably deviate from all original images. By injecting the tiny shortcut, OPS degrades the model accuracy on clean data to almost an untrained counterpart, and the generated noise, for the first time, cannot be erased by adversarial training.
++ Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang <br/> [NeurIPS'22](https://openreview.net/forum?id=7hhH95QKKDX) \| [Code](https://github.com/Sizhe-Chen/AAA) <br/> **AAA** proposes a new research direction to especially **defend against score-based query attacks** (SQAs) by maintaining maintaining predictions while disrupting gradients. SQAs makes adversarial threat practical as they achieve high attack success rates with limited queries and information about the victim model. We note that if the loss trend of the outputs is slightly and strategically perturbed, SQAs could be easily misled and thereby become much less effective. AAA helps WideResNet-28 secure 80.59% accuracy under Square attack, while the best prior defense (i.e., adversarial training) only attains 67.44%. More importantly, AAA does not hurt the accuracy, calibration, or inference speed and can be directly plugged into any trained classifiers.
 + Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet <br/> **Sizhe Chen**, Zhengbao He, Chengjin Sun, Jie Yang, Xiaolin Huang <br/> [TPAMI'22](https://ieeexplore.ieee.org/document/9238430) \| [Code](https://github.com/Sizhe-Chen/DAmageNet)
 + Subspace Adversarial Training <br/> Tao Li, Yingwen Wu, **Sizhe Chen**, Kun Fang, Xiaolin Huang <br/> [CVPR'22 Oral](https://openaccess.thecvf.com/content/CVPR2022/html/Li_Subspace_Adversarial_Training_CVPR_2022_paper) \| [Code](https://github.com/nblt/Sub-AT)