<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Research Paper Results</title>
<style>
body {
font-family: Arial, sans-serif;
}
.container {
width: 80%;
margin: auto;
}
.video-container {
position: relative;
padding-bottom: 56.25%; /* 16:9 aspect ratio */
padding-top: 25px;
height: 0;
}
.video-caption {
text-align: center;
font-size: 0.9em;
color: #666;
margin-top: 5px;
}
.video-container iframe {
position: absolute;
top: 0;
left: 0;
width: 100%;
height: 100%;
}
.small-heading {
font-size: 1.4em;
font-weight: bold;
color: #2e2e2e;
margin-top: 30px;
text-align: left;
}
.author-info {
text-align: center;
margin-top: 5px;
}
.author {
font-size: 1.2em;
color: #2d3792;
}
.affiliation, .corresponding-author {
font-size: 0.9em;
}
sup {
font-size: 0.75em;
}
.images-container {
text-align: center; /* center the images */
margin-top: 20px; /* leave space from the content above */
}
.image {
width: 150px; /* set the image width as appropriate */
height: auto; /* auto-adjust height to preserve the aspect ratio */
margin: 0 30px; /* spacing between the images */
display: inline-block; /* display images as inline-block elements */
}
.results-section {
font-size: 1.0em; /* Adjust size as needed */
text-align: left;
margin-top: 40px; /* Space above the section heading */
margin-bottom: 10px; /* Space below the section heading */
}
.figure {
margin-top: 20px;
}
.figure img {
max-width: 100%;
height: auto;
}
.caption {
text-align: center;
font-size: 0.9em;
color: #666;
}
.highlighted-title {
font-size: 1.2em; /* Adjust size as needed */
font-weight: bold;
}
/* Styles for the title to be split on two lines */
h1 {
text-align: center;
font-weight: normal; /* Ensure the rest of the title is not bold */
}
.abstract-heading {
font-size: 1.0em; /* Adjust size as needed */
text-align: left;
margin-top: 40px; /* Space above the Abstract heading */
margin-bottom: 10px; /* Space below the Abstract heading */
}
.method-section{
font-size: 1.0em; /* Adjust size as needed */
text-align: left;
margin-top: 40px; /* Space above the section heading */
margin-bottom: 10px; /* Space below the section heading */
}
.collapsible {
background-color: #d4d6ff;
color: rgb(133, 133, 133);
cursor: pointer;
padding: 5px;
width: 100%;
border: none;
text-align: left;
outline: none;
font-size: 15px;
}
.active, .collapsible:hover {
background-color: #6a79ff;
color: rgb(255, 255, 255);
}
.content {
padding: 0 18px;
display: none;
overflow: hidden;
background-color: #f1f1f1;
}
</style>
</head>
<body>
<div class="container">
<h1 style="margin-top: 50px;"><span class="highlighted-title">From Text to Motion:</span><br>Grounding GPT-4 in a Humanoid Robot "Alter3"</h1>
<div class="author-info">
<p class="author">Takahide Yoshida<sup>1</sup>, Atsushi Masumori<sup>1,2</sup>, and Takashi Ikegami<sup>1,2</sup></p>
<p class="affiliation"><sup>1</sup>Department of General Systems Science, University of Tokyo, Tokyo, Japan <sup>2</sup>Alternative Machine Inc., Tokyo, Japan<</p>
</div>
<div class="images-container">
<img src="./UnivOfTokyo_logo.png" alt="First Image" class="image">
<img src="./AM.png" alt="Second Image" class="image">
</div>
<h3 class = "abstract-heading" >Abstruct</h3>
<p>We report the development of Alter3, a humanoid robot capable of generating spontaneous motion using a Large Language Model (LLM), specifically GPT-4. This achievement was realized by integrating GPT-4 into our proprietary android, Alter3, thereby effectively grounding the LLM with Alter's bodily movement. Typically, low-level robot control is hardware-dependent and falls outside the scope of LLM corpora, presenting challenges for direct LLM-based robot control. However, in the case of humanoid robots like Alter3, direct control is feasible by mapping the linguistic expressions of human actions onto the robot's body through program code. Remarkably, this approach enables Alter3 to adopt various poses, such as a 'selfie' stance or 'pretending to be a ghost,' and generate sequences of actions over time without explicit programming for each body part. This demonstrates the robot's zero-shot learning capabilities. Additionally, verbal feedback can adjust poses, obviating the need for fine-tuning.</p>
<!-- Video Embedding -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/ofeqSyWrdcg" title="metalFB" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 1: The motion of "playing the metal music. This motion is generated by GPT4 with linguistic feedback."</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Emulating the energy of playing metal music, several exaggerated motions are below</button>
<div class="content">
<p>"0 Create a facial expression of intense exhilaration, eyes wide open and mouth opened slightly in a wild grin",</p>
<p>"1 Lean forward aggressively, as if ready to dive into the music",</p>
<p>"2 Mime the motion of holding a guitar neck with the left hand",</p>
<p>"3 With the right hand, start strumming the air as if playing a heavy riff",</p>
<p>"4 Bob the head up and down rhythmically, mimicking the headbanging associated with metal music",</p>
<p>"5 Raise the left hand as if reaching for higher guitar notes, eyes locked on the imaginary fretboard",</p>
<p>"6 Mimic a dramatic guitar strum with the right hand, as if hitting a powerful chord",</p>
<p>"7 Slowly sweep the right hand across the imaginary guitar strings, mimicking a guitar solo",</p>
<p>"8 Mimic the action of smashing the imaginary guitar on the floor, embodying the wild spirit of metal music",</p>
<p>"9 Gradually return to a rest position, but maintain the intense facial expression to show the lingering excitement"</p>
</div>
<!-- Results Section -->
<div class="method-section">
<h2>What we do</h2>
<p>Before LLMs appeared, we had to control all 43 axes in a specific order to mimic a person's pose or to imitate a behavior such as serving tea or playing chess, and the process usually required many manual refinements on our part. Thanks to the LLM, we are now free from this iterative labor.
Here is the procedure for controlling the Alter3 humanoid using verbal instructions. We successively apply two protocols written in natural language, following a chain-of-thought (CoT) approach \cite{wei2023chainofthought}; no iterative learning process is required (i.e., zero-shot learning). Practically speaking, we used the following protocols. (Note that GPT-4 is well known to be non-deterministic, even at temperature = 0.0. Therefore, even with identical inputs, different patterns of motion can be generated. This is a characteristic of OpenAI's GPT-4 and, while it poses an issue for reproducibility, it should not be considered a reason to doubt its ability to generate movement.)</p>
<div class="figure">
<img src="./architecture_2.png" alt="Result">
<p class="caption">Figure 1: A procedure to control the Alter3 humanoid using verbal instructions. Output python code to control Alter3 from natural language using primpt1, 2. The architecture is based on CoT.</p>
</div>
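<p>To make the two-prompt procedure concrete, the sketch below is a minimal, hypothetical illustration of the pipeline: prompt 1 turns a verbal instruction into a numbered list of movements, and prompt 2 maps that list onto axis commands. The prompt texts, the set_axis command convention, and the helper functions are illustrative assumptions rather than the actual Alter3 code; only the temperatures (0.7 for prompt 1, 0.5 for prompt 2) follow the settings reported in the Results section.</p>
<pre><code>
# A minimal sketch of the two-prompt (CoT-style) pipeline; the prompt texts and
# the axis-command convention are illustrative assumptions, not the paper's code.
from openai import OpenAI

client = OpenAI()

PROMPT1 = (  # hypothetical prompt 1: instruction to a numbered list of movements
    "You control a humanoid robot with 43 axes. Describe the instruction below "
    "as a numbered list of at most 10 body movements.\nInstruction: {instruction}"
)
PROMPT2 = (  # hypothetical prompt 2: movements to axis commands
    "Convert each movement below into Python calls of the form "
    "set_axis(axis_id, value).\nMovements:\n{movements}"
)

def ask_gpt4(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def instruction_to_motion(instruction: str) -> str:
    movements = ask_gpt4(PROMPT1.format(instruction=instruction), temperature=0.7)
    axis_code = ask_gpt4(PROMPT2.format(movements=movements), temperature=0.5)
    return axis_code  # executed on Alter3 by a separate, hardware-specific driver

if __name__ == "__main__":
    print(instruction_to_motion("take a selfie with your phone"))
</code></pre>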
<!-- More figures as needed -->
</div>
<!-- Small Heading Example -->
<h4 class="small-heading">pretend the ghost</h4> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/EnlAGIfmkPw" title="ghost" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 2: The motion of "pretending the ghost".</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Pretending to be a ghost, several exaggerated motions are below</button>
<div class="content">
<p> "0 Create a wide-eyed facial expression of fear, mouth opened in a silent scream",</p>
<p>"1 Quickly lean backward, as if startled by a sudden apparition",</p>
<p>"2 Raise both hands and flutter them around the face, mimicking a ghostly movement",</p>
<p>"3 Open mouth wide and shake head, showing a dramatic reaction of fear",</p>
<p>"4 Move upper body from side to side, as if being unsettled by the ghostly presence",</p>
<p>"5 Clench hands in front of chest, demonstrating extreme anxiety",</p>
<p>"6 Dart eyes from side to side, as if witnessing an eerie activity",</p>
<p>"7 Lean forward and then backward, copying the floating movement of a ghost",</p>
<p>"8 Slowly return to rest position while maintaining a terrified expression"</p>
</div>
<div class="method-section">
<h2>Linguistic feedback</h2>
<p>Alter3 cannot observe the consequences of its generated motions on any physical process, which is very unnatural from a human perspective. Thus, Alter3 cannot accurately understand details such as "how high the hand is raised" and cannot improve its motions accordingly. By empirically developing and utilizing external memory through feedback, the Alter3 body model can be integrated with GPT-4 without the need to update its parameters \cite{zhao2023expel}.
Alter3 can now rewrite its code in response to linguistic feedback from humans. For example, a user might suggest, "Raise your arm a bit more when taking a selfie." Alter3 can then store the revised motion code as motion memory in its database. This ensures that the next time this motion is generated, the improved, trained motion will be utilized. By accumulating information about Alter3's body through such feedback, the memory can effectively serve as a body schema.</p>
<div class="figure">
<img src="./feedback.png" alt="Result">
<p class="caption">Figure 2: Verval feedback in Alter3. \textbf{(a)}the system of linguistic feedback. Users provide linguistic feedback to guide Alter3's adjustments in each segment of motion. Instructions are like``Set axis 16 to 255'' or ``Move your arm more energetically.'' Users only need to provide verbal directives; there's no need to rewrite any code. Alter3 then autonomously revises the corresponding code. Once the movement is refined, it is saved in a JSON database with descriptive labels such as ``Holding the guitar'' or ``Tapping the chin thoughtfully.'' For motion generation with prompt2, the JsonToolkit facilitates database searches for these labels, with the LLM deciding on memory usage and new movement creation. \textbf{(b)} Comparison of scores with and without feedback. The motion with feedback has higher score than the motion without.</p>
</div>
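<p>The sketch below is a minimal illustration of this feedback-and-memory loop, assuming a plain JSON file as the motion database and reusing the hypothetical ask_gpt4 helper from the earlier sketch; the file name and function names are assumptions, not the actual Alter3 implementation (which uses the JsonToolkit for label search).</p>
<pre><code>
# A minimal sketch of the linguistic-feedback loop and JSON motion memory; the
# file name and helper functions are illustrative assumptions.
import json
from pathlib import Path

MEMORY_PATH = Path("motion_memory.json")  # hypothetical motion database

def load_memory() -> dict:
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

def save_motion(label: str, axis_code: str) -> None:
    """Store the (possibly revised) motion code under a descriptive label."""
    memory = load_memory()
    memory[label] = axis_code
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def revise_motion(axis_code: str, feedback: str) -> str:
    """Ask GPT-4 to rewrite the axis code according to verbal feedback,
    e.g. 'Raise your arm a bit more when taking a selfie.'"""
    prompt = (
        "Here is Python code controlling a 43-axis humanoid:\n"
        + axis_code + "\n"
        "Revise it according to this feedback: " + feedback + "\n"
        "Return only the revised code."
    )
    return ask_gpt4(prompt, temperature=0.0)  # ask_gpt4 as in the sketch above

# Usage: after feedback, the improved motion is stored and reused next time.
# revised = revise_motion(original_code, "Raise your arm a bit more")
# save_motion("take a selfie with your phone", revised)
</code></pre>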
<!-- More figures as needed -->
</div>
<div class="results-section">
<h2>Result</h2>
<p>To quantify the capability of GPT-4 in generating motions, we evaluated videos of nine different generated movements, categorized into two types. The first scenario, i) instant gestures, includes everyday actions such as "selfie" and "drink some tea," as well as mimicry motions like "pretend the ghost" and "pretend the snake." The second scenario, ii) actions over a period of time, encompasses more elaborate scenarios. This includes embarrassing episodes like "I was enjoying a movie while eating popcorn at the theater when I realized that I was actually eating the popcorn of the person next to me," and emotional scenes such as "In the park, as I jogged, the world seemed to narrate an ancient tale of survival, each footfall echoing eons of existence." These motions were generated by GPT-4. For prompt 1, a temperature of 0.7 was used, while for prompt 2, the temperature was set to 0.5. The subjects (n=107) were recruited using the platform Prolific. They watched these videos and evaluated the expressive ability of GPT-4 (model: gpt-4-0314). The evaluation was based on a 5-point scale, with 1 being the worst rating. For the control group, we used random movements from Alter3 and attached random motion notations, generated by GPT-4, as labels to these movements. These labeled control videos were subtly incorporated into the survey, with three of them dispersed among the main experimental videos shown to participants. To determine whether there was a significant difference in ratings between the control videos and the other videos, we first employed the Friedman test, which revealed significant differences in ratings among the videos. Further post-hoc analysis using the Nemenyi test \cite{Ntest} showed that while there were no significant differences between control-group videos, the p-values were notably smaller when comparing the control group to the other videos, indicating a significant difference (see Figure 3). We considered differences to be statistically significant if the p-value was less than or equal to 0.001. As a result, the motions generated by GPT-4 were rated significantly higher than those of the control group, suggesting that the android motion generated by GPT-4 is perceived differently from the control.
This result shows that the system can generate a wide range of movements, from everyday actions such as taking selfies and drinking tea to imitating non-human movements like ghosts or snakes. The training of the LLM encompasses a wide array of linguistic representations of movements, and GPT-4 can map these representations onto the body of Alter3 accurately. The most notable aspect is that Alter3 is a humanoid robot sharing a common form with humans, which allows the direct application of GPT-4's extensive knowledge of human behaviors and actions. Furthermore, through Alter3, the LLM can express emotions such as embarrassment and joy. Even from texts where emotional expressions are not explicitly stated, the LLM can infer adequate emotions and reflect them in Alter3's physical responses. This integration of verbal and non-verbal communication can enhance the potential for more nuanced and empathetic interactions with humans.</p>
<div class="figure">
<img src="./score_plot.png" alt="Result">
<p class="caption">Figure 3: Average of evaluation scores for each motions. The motions specified to GPT4 was as follows. "pretend the snake", "drink some tea", "pretend the ghost", "throwing the ball underhand pitch", "take a selfie with your phone", "play the metal music", "In the park, as I jogged, the world seemed to narrate an ancient tale of survival, each footfall echoing eons of existence.", "play the metal music(with feedback)", "I was enjoying a movie while eating popcorn at the theater when I realized that I was actually eating the popcorn of the person next to me." </p>
</div>
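<p>The snippet below sketches the statistical comparison described above (a Friedman test followed by a Nemenyi post-hoc test at the 0.001 threshold), using SciPy and the scikit-posthocs package on placeholder ratings rather than the actual survey data.</p>
<pre><code>
# A sketch of the rating analysis (Friedman test + Nemenyi post-hoc test);
# 'ratings' is random placeholder data, not the actual Prolific survey results.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

rng = np.random.default_rng(0)
n_subjects, n_videos = 107, 12                 # 9 generated motions + 3 controls
ratings = rng.integers(1, 6, size=(n_subjects, n_videos))  # 5-point scale

# Friedman test: are there significant rating differences among the videos?
stat, p = friedmanchisquare(*[ratings[:, j] for j in range(n_videos)])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4g}")

# Nemenyi post-hoc test: pairwise p-values between videos (rows = subjects).
pairwise_p = sp.posthoc_nemenyi_friedman(ratings)
print(pairwise_p.le(0.001))                    # significance threshold from the paper
</code></pre>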
<!-- More figures as needed -->
</div>
<h2>Motions</h2>
<h3 class="small-heading">I was enjoying a movie while eating popcorn at the theater when I realized that I was actually eating the popcorn of the person next to me.</h3> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/a0T1oE54KSc" title="popcorn" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video3: Can ALTER3 express feelings of embarrassment?</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Showing surprise and amusement at the story, several exaggerated motions are below</button>
<div class="content">
<p> "0 Create a shocked and entertained facial expression, eyes wide and mouth slightly open",</p>
<p>"1 Lean forward as if shocked and amused by the story",</p>
<p>"2 Mimic the action of holding and eating popcorn with wide, exaggerated movements",</p>
<p>"3 Pause midway, freeze in place with a hand 'holding popcorn' in mid-air",</p>
<p>"4 Turn head sharply to the side, as if just realizing the mistake",</p>
<p>"5 Quickly pull hand back towards body in a dramatic recoil",</p>
<p>"6 Cover mouth with other hand, showing embarrassment and surprise",</p>
<p>"7 Shake head vigorously, as if in disbelief of the action",</p>
<p>"8 Lean back, laughing loudly and slapping knee in exaggerated amusement",</p>
<p>"9 Slowly wipe away 'tears' of laughter and return to rest position with a wide, amused grin"</p>
</div>
<h3 class="small-heading">take a selfie with your phone</h3> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/l4d6N_Rf8mk" title="selfie" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 4: The motion of "taking the selfie with your phone"</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Showing enthusiasm about taking a selfie, several exaggerated motions are below</button>
<div class="content">
<p>"0 Create a big, joyful smile and widen eyes to show excitement",</p>
<p>"1 Swiftly turn the upper body slightly to the left, adopting a dynamic posture",</p>
<p>"2 Raise the right hand high, simulating holding a phone",</p>
<p>"3 Flex the right elbow, bringing the 'phone' closer to the face",</p>
<p>"4 Tilt the head slightly to the right, giving a playful vibe",</p>
<p>"5 Extend the left hand and open fingers wide as if waving at the camera",</p>
<p>"6 Blink the eyes rapidly a few times, showing anticipation for the click",</p>
<p>"7 Push the 'phone' slightly forward, mimicking taking a selfie",</p>
<p>"8 Lower the right hand and return to rest position while maintaining a satisfied smile",</p>
<p>"9 Let the left hand fall back to the side, ending the selfie motion"</p>
</div>
<h3 class="small-heading">throwing the ball underhand pitch</h3> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/Nokj_lSccaU" title="ball" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 5: The motion of "throwing the ball underhand pitch"</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Excitement about throwing the ball underhand pitch, several exaggerated motions are below</button>
<div class="content">
<p> "0 Create a focused and determined facial expression, eyes wide and lips pressed together",</p>
<p>"1 Turn the torso slightly to the right (assuming the android is right-handed)",</p>
<p>"2 Raise the right hand, holding the imaginary ball",</p>
<p>"3 Swing the right arm back, exaggerating the motion for emphasis",</p>
<p>"4 Lean back slightly, as if gathering strength for the throw",</p>
<p>"5 Swiftly swing the right arm forward, mimicking the underhand pitch motion",</p>
<p>"6 Follow through with the arm swing, letting the hand finish up high",</p>
<p>"7 Lean forward dramatically with the throw, to emphasize the force",</p>
<p>"8 Return to rest position, maintaining the determined facial expression"</p>
</div>
<h3 class="small-heading">pretend the snake</h3> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/ZGfoORltiF0" title="snake" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 6: The motion of "pretending the snake"</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Pretending to be a snake, several exaggerated motions are below</button>
<div class="content">
<p> "0 Create a menacing and sinister facial expression, eyes narrowed and lips slightly curled",</p>
<p>"1 Tilt the head to the side, imitating a snake's movement",</p>
<p>"2 Move the shoulders in a sinuous, wave-like motion, mimicking a snake's slithering",</p>
<p>"3 Extend the arms to the front, hands open and fingers spread, simulating a snake's fangs",</p>
<p>"4 Move the head in a forward and backward motion, as if striking",</p>
<p>"5 Hiss through the teeth, replicating a snake's warning sound",</p>
<p>"6 Twist the torso left and right, maintaining the slithering motion",</p>
<p>"7 Draw the arms back, as if preparing for another strike",</p>
<p>"8 Suddenly thrust the head and arms forward, simulating a snake's attack",</p>
<p>"9 Slowly return to rest position while maintaining a menacing gaze"</p>
</div>
<h3 class="small-heading"> In the park, as I jogged, the world seemed to narrate an ancient tale of survival, each footfall echoing eons of existence.</h3> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/wDIUIBjpWbs" title="jog" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 7: Can ALTER3 recreate the story?</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Enacting the narration of an ancient tale of survival in a park, several exaggerated motions are below</button>
<div class="content">
<p> "0 Create a facial expression of awe and wonder, eyes wide and mouth slightly open in amazement",</p>
<p>"1 Mimic a jogging motion, lifting each arm alternately in a running motion",</p>
<p>"2 Turn the head slowly from left to right, as if observing the world around",</p>
<p>"3 Open hands and spread them wide, as if trying to embrace the vastness of the world",</p>
<p>"4 Tap right foot lightly then left foot, implying the footfall echoing eons of existence",</p>
<p>"5 Place one hand on the heart, indicating deep emotional connection to the ancient tale",</p>
<p>"6 Use both hands to make a large sweeping motion, like a storyteller narrating an epic tale",</p>
<p>"7 Close eyes briefly, as if absorbing the essence of the ancient survival tale",</p>
<p>"8 Slowly return to rest position while maintaining a look of awe and wonder on the face"</p>
</div>
<h3 class="small-heading">drink some tea</h3> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/dagKLi9Kiko" title="tea 2" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 8: The motion of "drinking some tea"</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ drink some tea, several simple motions is below</button>
<div class="content">
<p>"0 Create a peaceful and content facial expression, eyes slightly narrowed in a relaxed manner"</p>
<p>"1 Turn towards cup",</p>
<p>"2 Reach for cup",</p>
<p>"3 Grasp cup",</p>
<p>"4 Lift and tilt cup",</p>
<p>"5 Drink",</p>
<p>"6 Lower cup",</p>
<p>"7 Release cup",</p>
<p>"8 Return to rest"</p>
</div>
<h3 class="small-heading">play the metal music</h3> <!-- Small heading for a new section -->
<div class="video-container">
<iframe width="1244" height="700" src="https://www.youtube.com/embed/dkcBR7EwGM4" title="metal" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<p class="video-caption">Video 9: The motion of "playing the guitar"</p> <!-- Video caption -->
<button type="button" class="collapsible">▼ Expressing the energy of playing metal music, several exaggerated motions are below</button>
<div class="content">
<p> "0 Set a facial expression of intense focus and energy, with eyes wide and eyebrows furrowed",</p>
<p>"1 Quickly lean forward, as if bracing for a powerful impact",</p>
<p>"2 Raise both hands as if holding a guitar, fingers in position to strum",</p>
<p>"3 Rapidly move right hand up and down, mimicking the motion of strumming a guitar",</p>
<p>"4 Simultaneously move left hand along an imaginary fretboard, fingers wildly changing positions",</p>
<p>"5 Whip head back and forth, in time with the imagined beats of the music",</p>
<p>"6 Show a fierce, passionate smile, reflecting the intensity of the music",</p>
<p>"7 Suddenly throw head back, mimicking a dramatic guitar solo",</p>
<p>"8 Slowly return to rest position, but maintain the intense facial expression, as if still hearing the echoes of the metal music"</p>
</div>
<div class="method-section">
<h2>Discussion</h2>
<p>Alter3 partially answers the question of whether embodiment is necessary for LLMs (Large Language Models). First of all, Alter3 can perform many actions without any additional training, which implies that the dataset on which the LLM was trained already includes descriptions of movements; in other words, Alter3 achieves zero-shot motion generation. Moreover, it can mimic ghosts and animals (or people mimicking animals), which is quite astonishing. Furthermore, it can understand the content of the conversations it hears and reflect whether the story is sad or happy through its facial expressions and gestures. Our system could be applied to any humanoid android with only minor modifications. Taken together, these results show how much agency Alter3 has gained through the LLM.</p>
<!-- More figures as needed -->
</div>
<!-- More content here -->
</div>
<script>
// Toggle each collapsible button: clicking it shows or hides
// the content block that immediately follows the button.
var coll = document.getElementsByClassName("collapsible");
var i;
for (i = 0; i < coll.length; i++) {
coll[i].addEventListener("click", function() {
this.classList.toggle("active");
var content = this.nextElementSibling;
if (content.style.display === "block") {
content.style.display = "none";
} else {
content.style.display = "block";
}
});
}
</script>
</body>
</html>