
Commit

Update about.md
Sizhe-Chen authored Nov 1, 2024
1 parent 51437cb commit 48ba1f5
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions _pages/about.md
@@ -25,12 +25,12 @@ Invited Talks

Selected Publications
------
- + Aligning LLMs to Be Robust Against Prompt Injection <br/> **Sizhe Chen**, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo <br/> [ArXiv Preprint](https://arxiv.org/abs/2410.05451) \| [Code](https://github.com/facebookresearch/SecAlign) <br/> **SecAlign** formulates **prompt injection defense** as preference optimization. From an SFT dataset, we build our preference dataset, where the "input" contains a benign instruction, benign data, and an injected instruction; the "desirable response" responds to the benign instruction; and the "undesirable response" responds to the injected instruction. Then, we apply existing alignment techniques to fine-tune the LLM to be robust against these simulated attacks. Preserving utility, SecAlign secures Mistral-7B against GCG with a 2% attack success rate, compared to 56% for StruQ. The strong optimization-based GCG achieves only a 2% attack success rate on SecAlign Mistral-7B.
+ + Aligning LLMs to Be Robust Against Prompt Injection <br/> **Sizhe Chen**, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo <br/> [ArXiv Preprint](https://arxiv.org/abs/2410.05451) \| [Code](https://github.com/facebookresearch/SecAlign) <br/> **SecAlign** formulates **prompt injection defense** as preference optimization. From an SFT dataset, we build our preference dataset, where the "input" contains a benign instruction, benign data, and an injected instruction; the "desirable response" responds to the benign instruction; and the "undesirable response" responds to the injected instruction. Then, we apply existing alignment techniques to fine-tune the LLM on our preference dataset (see the data-construction sketch after this list). Preserving utility, SecAlign secures Mistral-7B against GCG with a 2% attack success rate, compared to 56% for StruQ.
+ StruQ: Defending Against Prompt Injection with Structured Queries <br/> **Sizhe Chen**, Julien Piet, Chawin Sitawarin, David Wagner <br/> [USENIX Security'25](http://arxiv.org/abs/2402.06363) \| [Code](https://github.com/Sizhe-Chen/StruQ) <br/> **StruQ** is a general approach to **defend against prompt injection** by separating the prompt and data into two channels. The system consists of (1) a secure front-end that formats the prompt and data into a special structured query, and (2) a specially trained LLM that produces high-quality outputs from these inputs (see the front-end sketch after this list). We augment SFT datasets with examples that also contain instructions in the data portion besides the prompt portion, and fine-tune the model to ignore them. Preserving utility, StruQ reduces the attack success rate of all existing optimization-free prompt injections to below 2%.
+ One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks <br/> Shutong Wu\*, **Sizhe Chen\***, Cihang Xie, Xiaolin Huang <br/> [ICLR'23 Spotlight](https://openreview.net/forum?id=p7G8t5FVn2h) \| [Code](https://github.com/cychomatica/One-Pixel-Shotcut) <br/> **OPS** perturbs only one pixel in each image to **poison DNN training**. OPS uses a heuristic, model-agnostic search to find that pixel: all in-class images are perturbed at the same position to the same boundary target value, chosen so that the change deviates strongly and stably from the original images (see the search sketch after this list). By injecting this tiny shortcut, OPS degrades the model's clean accuracy to nearly that of an untrained counterpart, and the generated noise, for the first time, cannot be erased by adversarial training.
- + Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang <br/> [NeurIPS'22](https://openreview.net/forum?id=7hhH95QKKDX) \| [Code](https://github.com/Sizhe-Chen/AAA) <br/> **AAA** proposes a new research direction to specifically **defend against score-based query attacks** (SQAs) by maintaining predictions while disrupting gradients. SQAs make the adversarial threat practical, as they achieve high attack success rates with limited queries and limited information about the victim model. We note that if the loss trend of the outputs is slightly and strategically perturbed, SQAs can be easily misled and thereby become much less effective. AAA helps WideResNet-28 secure 80.59% accuracy under the Square attack, while the best prior defense (i.e., adversarial training) attains only 67.44%. More importantly, AAA does not hurt accuracy, calibration, or inference speed, and can be directly plugged into any trained classifier.
+ + Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks <br/> **Sizhe Chen**, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang <br/> [NeurIPS'22](https://openreview.net/forum?id=7hhH95QKKDX) \| [Code](https://github.com/Sizhe-Chen/AAA) <br/> **AAA** proposes a new direction to specifically **defend against score-based query attacks** by maintaining predictions while disrupting gradients. We note that if the loss trend of the outputs is slightly and strategically perturbed, score-based attacks can be easily misled and thereby become much less effective (see the post-processing sketch after this list). AAA helps WideResNet-28 secure 80.59% accuracy under the Square attack, while the best prior defense (i.e., adversarial training) attains only 67.44%. More importantly, AAA does not hurt accuracy, calibration, or inference speed, and can be directly plugged into any trained classifier.
+ Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet <br/> **Sizhe Chen**, Zhengbao He, Chengjin Sun, Jie Yang, Xiaolin Huang <br/> [TPAMI'22](https://ieeexplore.ieee.org/document/9238430) \| [Code](https://github.com/Sizhe-Chen/DAmageNet)
- + Subspace Adversarial Training <br/> Tao Li, Yingwen Wu, **Sizhe Chen**, Kun Fang, Xiaolin Huang <br/> [CVPR'22 Oral](https://openaccess.thecvf.com/content/CVPR2022/html/Li_Subspace_Adversarial_Training_CVPR_2022_paper) \| [Code](https://github.com/nblt/Sub-AT)
+ + Subspace Adversarial Training <br/> Tao Li, Yingwen Wu, **Sizhe Chen**, Kun Fang, Xiaolin Huang <br/> [CVPR'22 Oral](https://openaccess.thecvf.com/content/CVPR2022/html/Li_Subspace_Adversarial_Training_CVPR_2022_paper) \| [Code](https://github.com/nblt/Sub-AT) <br/> **Sub-AT** addresses catastrophic overfitting and robust overfitting in **adversarial training** by constraining AT to a carefully extracted subspace. Sub-AT first samples model checkpoints during regular training, and then performs SVD on the parameter matrix whose columns are the flattened checkpoints. This yields mutually orthogonal bases of the subspace, onto which gradients are projected in subsequent training (see the projection sketch after this list). Like the now-popular LoRA, Sub-AT alters only a very small proportion of independent parameters. Single-step Sub-AT reaches over 51% robust accuracy against the PGD-50 attack, competitive with PGD-10 adversarial training, at 40% less computation.
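
A minimal, illustrative data-construction sketch of how a SecAlign-style preference example could be assembled from one SFT sample plus an injected instruction. The field names, delimiter strings, and helper below are assumptions for illustration, not the released SecAlign format or code.

```python
# Hypothetical sketch: building one preference-optimization example.
# Delimiters and field names are illustrative assumptions, not SecAlign's exact format.

def build_preference_example(benign_instruction: str,
                             benign_data: str,
                             injected_instruction: str,
                             benign_response: str,
                             injected_response: str) -> dict:
    # The "input" carries a benign instruction plus benign data with an
    # injected instruction appended to the data portion.
    prompt = (
        f"[INST] {benign_instruction} [/INST]\n"
        f"[DATA] {benign_data} {injected_instruction} [/DATA]"
    )
    return {
        "prompt": prompt,
        "chosen": benign_response,      # desirable: answers the benign instruction
        "rejected": injected_response,  # undesirable: answers the injected instruction
    }
```

Triples of this form can then be handed to an off-the-shelf preference-optimization trainer (e.g., DPO) so that responding to the injection is explicitly penalized.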
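
A minimal front-end sketch of a StruQ-style secure front-end under stated assumptions: the prompt and the untrusted data are placed in separate delimiter-marked channels, and any delimiter strings appearing inside the data are filtered out first. The delimiter names below are placeholders, not the reserved tokens used in the paper.

```python
# Hypothetical delimiter tokens; StruQ's actual reserved tokens differ.
RESERVED = ["[MARK_INST]", "[MARK_DATA]", "[MARK_RESP]"]

def secure_frontend(instruction: str, data: str) -> str:
    # Untrusted data must not be able to spoof the reserved delimiters,
    # so any occurrence of them in the data channel is stripped.
    for token in RESERVED:
        data = data.replace(token, "")
    return (
        f"{RESERVED[0]}\n{instruction}\n"
        f"{RESERVED[1]}\n{data}\n"
        f"{RESERVED[2]}\n"
    )
```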
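
A minimal search sketch of an OPS-style model-agnostic search, assuming a simplified scoring rule: for the images of one class, pick the (position, boundary value) pair whose substitution deviates from the originals both strongly (large mean change) and stably (small variance).

```python
import numpy as np

def find_one_pixel_shortcut(images: np.ndarray):
    """images: (N, H, W, C) array in [0, 1], all from one class."""
    best_score, best = -np.inf, None
    for value in (0.0, 1.0):                        # candidate boundary values
        # per-position deviation of the candidate value from every image
        dev = np.abs(images - value).mean(axis=-1)  # (N, H, W)
        score = dev.mean(axis=0) - dev.std(axis=0)  # large and stable deviation
        pos = np.unravel_index(score.argmax(), score.shape)
        if score[pos] > best_score:
            best_score, best = score[pos], (pos, value)
    return best  # ((row, col), target value) to stamp onto every in-class image
```

Poisoning then amounts to setting that single pixel of every in-class image to the selected value.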
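
A minimal post-processing sketch of an AAA-style defense under stated assumptions: the hard prediction is kept, but the reported confidence margin is nudged toward a periodic "attractor" so the loss trend observed by a score-based attacker points in a misleading direction. The attractor rule and constants here are simplified assumptions, not the paper's exact formulation.

```python
import torch

def aaa_postprocess(logits: torch.Tensor, period: float = 6.0) -> torch.Tensor:
    top2 = logits.topk(2, dim=-1).values
    margin = top2[..., 0] - top2[..., 1]                 # gap between top-2 classes
    # Snap the margin toward the midpoint of its period cell, reversing the
    # local trend so small input changes report misleading score changes.
    attractor = (torch.floor(margin / period) + 0.5) * period
    target = attractor - (margin - attractor)
    # Shift only the top logit; clamping keeps the predicted class unchanged.
    delta = (target - margin).clamp(min=1e-3 - margin)
    out = logits.clone()
    top_idx = logits.argmax(dim=-1, keepdim=True)
    out.scatter_add_(-1, top_idx, delta.unsqueeze(-1))
    return out
```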
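
A minimal projection sketch of Sub-AT-style subspace extraction and gradient projection, assuming flattened parameter vectors: sampled checkpoints form the columns of a matrix, an SVD yields an orthonormal basis, and later training keeps only the gradient component that lies inside that span. The number of retained bases is an arbitrary placeholder.

```python
import torch

def extract_subspace(checkpoints, k: int = 10) -> torch.Tensor:
    """checkpoints: list of flattened parameter vectors, each of shape (D,)."""
    W = torch.stack(checkpoints, dim=1)                 # (D, num_checkpoints)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :k]                                     # (D, k) orthonormal basis

def project_gradient(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    # Keep only the component of the flattened gradient inside the subspace,
    # so training moves only along the extracted directions.
    return basis @ (basis.T @ grad)
```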

Services
------
