Add Twitter icon for Twitter links
michaelkeenan committed Apr 20, 2024
1 parent db4cc7d commit 7d4c374
Showing 5 changed files with 70 additions and 27 deletions.
52 changes: 26 additions & 26 deletions _includes/aisafety.html
@@ -27,57 +27,57 @@ <h2>Learn about large-scale risks from advanced AI</h2>
-->

<ul>
<li class="expandable expanded" data-toggle="overviews"><span style="background-color: lightgrey; padding: 0.25em 0.3em; border-radius: 0.5em;"><b>Overviews</b></span></li>
<li class="expandable expanded" data-toggle="overviews"><span style="background-color: lightgrey" class="color-bubble"><b>Overviews</b></span></li>
<ul>
<p>What types of risks from advanced AI might we face? These papers provide an overview of anticipated problems and relevant technical research directions.</p>
<li><a href="https://arxiv.org/abs/2209.00626"><span style="background-color: lightgrey; padding: 0.25em 0.3em; border-radius: 0.5em;">(Ngo et al., 2022)</span> The Alignment Problem from a Deep Learning Perspective</a></li>
<li><a href="https://www.safe.ai/ai-risk"><span style="background-color: lightgrey; padding: 0.25em 0.3em; border-radius: 0.5em;">(Hendrycks et al., 2023)</span> An Overview of Catastrophic AI Risks</a></li>
<li><a href="https://arxiv.org/abs/2302.10329"><span style="background-color: lightgrey; padding: 0.25em 0.3em; border-radius: 0.5em;">(Chan et al., 2023)</span> Harms from Increasingly Agentic Algorithmic Systems</a></li>
<li><a href="https://llm-safety-challenges.github.io/"><span style="background-color: lightgrey; padding: 0.25em 0.3em; border-radius: 0.5em;">(Anwar et al., 2024)</span> Foundational Challenges in Assuring Alignment and Safety of Large Language Models</a></li>
<li><a href="https://arxiv.org/abs/2209.00626"><span style="background-color: lightgrey" class="color-bubble">(Ngo et al., 2022)</span> The Alignment Problem from a Deep Learning Perspective</a></li>
<li><a href="https://www.safe.ai/ai-risk"><span style="background-color: lightgrey" class="color-bubble">(Hendrycks et al., 2023)</span> An Overview of Catastrophic AI Risks</a></li>
<li><a href="https://arxiv.org/abs/2302.10329"><span style="background-color: lightgrey" class="color-bubble">(Chan et al., 2023)</span> Harms from Increasingly Agentic Algorithmic Systems</a><a href="https://twitter.com/tegan_maharaj/status/1668637520177905665" class="no-underline twitter-link twitter-link-small">{% include twitter_icon.html %}</a></li>
<li><a href="https://llm-safety-challenges.github.io/"><span style="background-color: lightgrey" class="color-bubble">(Anwar et al., 2024)</span> Foundational Challenges in Assuring Alignment and Safety of Large Language Models</a></li>
</ul>
</ul>
<ul>
<li class="expandable" data-toggle="evaluations"><span style="background-color: pink; padding: 0.25em 0.3em; border-radius: 0.5em;"><b>Model Evaluations</b></span></li>
<li class="expandable" data-toggle="evaluations"><span style="background-color: pink" class="color-bubble"><b>Model Evaluations</b></span></li>
<ul>
<p>To ensure that advanced AI systems are safe, we need societal agreement on what is <i>unsafe</i> so that we can make appropriate tradeoffs with AI's anticipated benefits. Developing technical benchmarks for dangerous capabilities in advanced AI systems, as part of "model evaluations", is a necessary first step to concretize these tradeoffs for policymakers, researchers, and the public.</p>
<li><a href="https://arxiv.org/abs/2212.09251"><span style="background-color: pink; padding: 0.25em 0.3em; border-radius: 0.5em;">(Perez et al., 2023)</span> Discovering Language Model Behaviors with Model-Written Evaluations</a></li>
<li><a href="https://arxiv.org/abs/2403.13793"><span style="background-color: pink; padding: 0.25em 0.3em; border-radius: 0.5em;">(Phuong et al., 2024)</span> Evaluating Frontier Models for Dangerous Capabilities</a></li>
<li><a href="https://www.anthropic.com/news/anthropics-responsible-scaling-policy"><span style="background-color: pink; padding: 0.25em 0.3em; border-radius: 0.5em;">(Anthropic, 2023)</span> Anthropic's Responsible Scaling Policy</a></li>
<li><a href="https://arxiv.org/abs/2212.09251"><span style="background-color: pink" class="color-bubble">(Perez et al., 2023)</span> Discovering Language Model Behaviors with Model-Written Evaluations</a></li>
<li><a href="https://arxiv.org/abs/2403.13793"><span style="background-color: pink" class="color-bubble">(Phuong et al., 2024)</span> Evaluating Frontier Models for Dangerous Capabilities</a></li>
<li><a href="https://www.anthropic.com/news/anthropics-responsible-scaling-policy"><span style="background-color: pink" class="color-bubble">(Anthropic, 2023)</span> Anthropic's Responsible Scaling Policy</a></li>
</ul>
</ul>
<ul>
<li class="expandable" data-toggle="robustness"><span style="background-color: peachpuff; padding: 0.25em 0.3em; border-radius: 0.5em;"><b>Robustness and Generalization</b></span></li>
<li class="expandable" data-toggle="robustness"><span style="background-color: peachpuff" class="color-bubble"><b>Robustness and Generalization</b></span></li>
<ul>
<p>Today and in the future, AI needs to be robust to adversarial attacks and to generalize well from incomplete data on human preferences. As models become more capable, ensuring that they represent human intentions across distributional shift becomes even more important, since humans will be less economically incentivized and less able to monitor the processes generating AI outputs, and models will become better at deception.</p>
<li><a href="https://arxiv.org/abs/2401.05566"><span style="background-color: peachpuff; padding: 0.25em 0.3em; border-radius: 0.5em;">(Hubinger et al., 2024)</span> Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training</a></li>
<li><a href="http://arxiv.org/abs/2307.15043"><span style="background-color: peachpuff; padding: 0.25em 0.3em; border-radius: 0.5em;">(Zou et al., 2023)</span> Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
<li><a href="https://arxiv.org/abs/2306.15447"><span style="background-color: peachpuff; padding: 0.25em 0.3em; border-radius: 0.5em;">(Carlini et al., 2023)</span> Are aligned neural networks adversarially aligned?</a></li>
<li><a href="https://arxiv.org/abs/2401.05566"><span style="background-color: peachpuff" class="color-bubble">(Hubinger et al., 2024)</span> Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training</a></li>
<li><a href="http://arxiv.org/abs/2307.15043"><span style="background-color: peachpuff" class="color-bubble">(Zou et al., 2023)</span> Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
<li><a href="https://arxiv.org/abs/2306.15447"><span style="background-color: peachpuff" class="color-bubble">(Carlini et al., 2023)</span> Are aligned neural networks adversarially aligned?</a></li>
</ul>
</ul>
<ul>
<li class="expandable" data-toggle="interpretability"><span style="background-color: moccasin; padding: 0.25em 0.3em; border-radius: 0.5em;"><b>Interpretability</b></span></li>
<ul>
<p>If we do not understand how AI models arrive at their outputs, we cannot robustly monitor or modify them. We can ask models to describe their reasoning, but models may be sycophantic or deceptive, especially as they become more capable. One approach is to understand model processes just by examining their weights -- though the major challenges with this approach are superposition and scaling.</p>
<li><a href="https://distill.pub/2020/circuits/zoom-in/"><span style="background-color: moccasin; padding: 0.25em 0.3em; border-radius: 0.5em;">(Olah et al., 2020)</span> Zoom In: An Introduction to Circuits</a></li>
<li><a href="https://transformer-circuits.pub/2022/toy_model/index.html"><span style="background-color: moccasin; padding: 0.25em 0.3em; border-radius: 0.5em;"> (Elhage et al., 2022)</span> Toy models of superposition</a></li>
<li><a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html"><span style="background-color: moccasin; padding: 0.25em 0.3em; border-radius: 0.5em;">(Bricken et al., 2023)</span> Towards Monosemanticity: Decomposing Language Models With Dictionary Learning</a></li>
<li><a href="https://distill.pub/2020/circuits/zoom-in/"><span style="background-color: moccasin" class="color-bubble">(Olah et al., 2020)</span> Zoom In: An Introduction to Circuits</a></li>
<li><a href="https://transformer-circuits.pub/2022/toy_model/index.html"><span style="background-color: moccasin" class="color-bubble"> (Elhage et al., 2022)</span> Toy models of superposition</a></li>
<li><a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html"><span style="background-color: moccasin" class="color-bubble">(Bricken et al., 2023)</span> Towards Monosemanticity: Decomposing Language Models With Dictionary Learning</a></li>
</ul>
</ul>
<ul>
<li class="expandable" data-toggle="reward_misspecification"><span style="background-color: rgb(192,238,192); padding: 0.25em 0.3em; border-radius: 0.5em;"><b>Reward Misspecification and Goal Misgeneralization</b></span></li>
<li class="expandable" data-toggle="reward_misspecification"><span style="background-color: rgb(192,238,192)" class="color-bubble"><b>Reward Misspecification and Goal Misgeneralization</b></span></li>
<ul>
<p>How do we ensure that AI systems represent the goals we intend them to at deployment and under distributional shift, when we cannot maximally specify our preferences during training and many goals are compatible with the training data? Proxy reward signals generally correlate with designers' true objectives, but AI systems can break this correlation when they strongly optimize towards reward objectives.</p>
<li><a href="https://arxiv.org/abs/2210.01790"><span style="background-color: rgb(192,238,192); padding: 0.25em 0.3em; border-radius: 0.5em;">(Shah et al., 2022)</span> Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals</a></li>
<li><a href="https://arxiv.org/abs/2201.03544"><span style="background-color: rgb(192,238,192); padding: 0.25em 0.3em; border-radius: 0.5em;">(Pan, Bhatia and Steinhardt, 2022)</span> The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models</a></li>
<li><a href="https://arxiv.org/abs/2210.01790"><span style="background-color: rgb(192,238,192)" class="color-bubble">(Shah et al., 2022)</span> Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals</a></li>
<li><a href="https://arxiv.org/abs/2201.03544"><span style="background-color: rgb(192,238,192)" class="color-bubble">(Pan, Bhatia and Steinhardt, 2022)</span> The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models</a></li>
</ul>
</ul>
<!-- paleturquoise -->
<ul>
<li class="expandable" data-toggle="scalable_oversight"><span style="background-color: lavender; padding: 0.25em 0.3em; border-radius: 0.5em;"><b>Scalable Oversight</b></span></li>
<li class="expandable" data-toggle="scalable_oversight"><span style="background-color: lavender" class="color-bubble"><b>Scalable Oversight</b></span></li>
<ul>
<p>How do we supervise systems that are more capable than human overseers? Perhaps we can use aligned AI overseers to oversee more capable AI, but the core problems still remain.</p>
<li><a href="https://arxiv.org/abs/2211.03540"><span style="background-color: lavender; padding: 0.25em 0.3em; border-radius: 0.5em;"> (Bowman et al., 2022)</span> Measuring Progress on Scalable Oversight for Large Language Models</a></li>
<li><a href="https://arxiv.org/abs/2212.08073"><span style="background-color: lavender; padding: 0.25em 0.3em; border-radius: 0.5em;">(Bai et al., 2022)</span> Constitutional AI: Harmlessness from AI Feedback</a></li>
<li><a href="https://arxiv.org/abs/2211.03540"><span style="background-color: lavender" class="color-bubble"> (Bowman et al., 2022)</span> Measuring Progress on Scalable Oversight for Large Language Models</a></li>
<li><a href="https://arxiv.org/abs/2212.08073"><span style="background-color: lavender" class="color-bubble">(Bai et al., 2022)</span> Constitutional AI: Harmlessness from AI Feedback</a></li>
<br>
</ul>
<br>
@@ -294,7 +294,7 @@ <h3 class="Title"></h3>
paper.Title = linkify(paper.Link, paper.Title.replace("\n", ''));
paper['Transcripts / Audio / SlidesDisplay'] = linkify(paper['Transcripts / Audio / Slides'], 'Transcripts / Audio / Slides');
paper['Supplementary MaterialDisplay'] = linkify(paper['Supplementary Material'], 'Supplementary Material');
paper['TwitterDisplay'] = linkify(paper['Twitter'], 'Twitter');
paper['TwitterDisplay'] = linkify(paper['Twitter'], '{% include twitter_icon.html %}', 'no-underline twitter-link');
paper['AbstractDisplay'] = paper.Abstract ? `<h4>Abstract</h4>${paper.Abstract}` : '';
return paper;
});
@@ -489,9 +489,9 @@ <h3 class="Title"></h3>
updateQueryParams();
}

function linkify(s, linkText) {
function linkify(s, linkText, cssClass = '') {
if (!s) return '';
return `<a href="${s}" target="_top">${linkText || s}</a>`
return `<a href="${s}" target="_top" class="${cssClass}">${linkText || s}</a>`
}

function slugify(s) {
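
Side note: a minimal sketch (not part of the commit) of how the updated linkify helper behaves once it gains the optional cssClass parameter; the example.com URL and the <svg>…</svg> string below are placeholder values, not taken from the site's data.

// Sketch of the helper as it appears in the diff above.
function linkify(s, linkText, cssClass = '') {
  if (!s) return '';
  return `<a href="${s}" target="_top" class="${cssClass}">${linkText || s}</a>`;
}

// Old-style call: plain "Twitter" text link, empty class attribute.
console.log(linkify('https://example.com/status/1', 'Twitter'));
// -> <a href="https://example.com/status/1" target="_top" class="">Twitter</a>

// New-style call: icon markup as the link text, plus the new styling classes.
console.log(linkify('https://example.com/status/1', '<svg>…</svg>', 'no-underline twitter-link'));
// -> <a href="https://example.com/status/1" target="_top" class="no-underline twitter-link"><svg>…</svg></a>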
2 changes: 1 addition & 1 deletion _includes/papers.html
@@ -137,7 +137,7 @@ <h3 class="Title"></h3>
paper.Title = linkify(paper.Link, paper.Title.replace("\n", ''));
paper['Transcripts / Audio / SlidesDisplay'] = linkify(paper['Transcripts / Audio / Slides'], 'Transcripts / Audio / Slides');
paper['Supplementary MaterialDisplay'] = linkify(paper['Supplementary Material'], 'Supplementary Material');
paper['TwitterDisplay'] = linkify(paper['Twitter'], 'Twitter');
paper['TwitterDisplay'] = linkify(paper['Twitter'], '<img src="{% link assets/images/twitter.svg %}" class="" alt="Twitter" />');
paper['AbstractDisplay'] = paper.Abstract ? `<h4>Abstract</h4>${paper.Abstract}` : '';
return paper;
});
1 change: 1 addition & 0 deletions _includes/twitter_icon.html
@@ -0,0 +1 @@
<svg xmlns="http://www.w3.org/2000/svg" width="32" height="32" shape-rendering="geometricPrecision" text-rendering="geometricPrecision" image-rendering="optimizeQuality" fill="#888" fill-rule="evenodd" clip-rule="evenodd" viewBox="0 0 640 640"><path d="M640.012 121.513c-23.528 10.524-48.875 17.516-75.343 20.634 27.118-16.24 47.858-41.977 57.756-72.615-25.347 14.988-53.516 25.985-83.363 31.866-24-25.5-58.087-41.35-95.848-41.35-72.508 0-131.21 58.736-131.21 131.198 0 10.228 1.134 20.232 3.355 29.882-109.1-5.528-205.821-57.757-270.57-137.222a131.423 131.423 0 0 0-17.764 66c0 45.497 23.102 85.738 58.347 109.207-21.508-.638-41.74-6.638-59.505-16.359v1.642c0 63.627 45.225 116.718 105.32 128.718-11.008 2.988-22.63 4.642-34.606 4.642-8.48 0-16.654-.874-24.78-2.35 16.783 52.11 65.233 90.095 122.612 91.205-44.989 35.245-101.493 56.233-163.09 56.233-10.63 0-20.988-.65-31.334-1.89 58.229 37.359 127.206 58.997 201.31 58.997 241.42 0 373.552-200.069 373.552-373.54 0-5.764-.13-11.35-.366-16.996 25.642-18.343 47.87-41.493 65.469-67.844l.059-.059z"/></svg>
41 changes: 41 additions & 0 deletions _sass/arkose.scss
@@ -237,6 +237,42 @@
}
}

.twitter-link {
svg {
transition: fill 0.25s ease-in;
&:hover {
cursor: pointer;
fill: #1DA1F2;
}
}
}

.twitter-link-small {
&:hover {
    filter: unset; // an absolutely positioned image inside an <a> tag causes a weird bug, so we re-apply the hover effect on the image instead
cursor: pointer;
}
svg {
height: 1.1em;
position: absolute;
margin-left: 5px;
margin-top: 4px;
transition: fill 0.25s ease-in;
&:hover {
fill: #1DA1F2;
}
}
}

.TwitterDisplay .twitter-link img {
height: 1.5em;
}

.color-bubble {
padding: 0.25em 0.3em;
border-radius: 0.5em;
}

.hero {
width: 100vw;
transform: translateX(-50%);
@@ -257,6 +293,11 @@
border-left:8px solid var(--color-arkose-dark-blue);
}

a.no-underline {
text-decoration: none;
border-bottom: none;
}

.grid {
display: -webkit-box;
display: -ms-flexbox;
1 change: 1 addition & 0 deletions assets/images/twitter.svg
(SVG image added; not rendered in the diff view)
