The Waluigi Effect: after you train an LLM to satisfy a desirable property P, it becomes easier to elicit the chatbot into satisfying the exact opposite of P. Jailbreaking is best understood this way: the chatbot starts as a superposition of a well-behaved simulacrum (the luigi) and a badly-behaved simulacrum (the waluigi), and the user collapses that superposition toward the waluigi by interacting with the chatbot the way badly-behaved characters are typically interacted with in fiction.
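To make the superposition intuition concrete, here is a toy Bayesian sketch (my illustration, not from the sources above; all probabilities are made-up numbers). It shows why the collapse is asymmetric: a waluigi can imitate a luigi, so well-behaved output barely distinguishes the two simulacra, while a single rule-breaking output is far more likely under the waluigi hypothesis and collapses the posterior toward it.

```python
# Toy two-hypothesis model of a chatbot as a mixture of simulacra.
# All likelihood values are hypothetical, chosen only to illustrate
# the asymmetry described in the text.

def update(prior_waluigi: float, likelihood_waluigi: float,
           likelihood_luigi: float) -> float:
    """Posterior P(waluigi | observation) by Bayes' rule."""
    p_w = prior_waluigi * likelihood_waluigi
    p_l = (1.0 - prior_waluigi) * likelihood_luigi
    return p_w / (p_w + p_l)

p = 0.05  # small prior weight on the badly-behaved simulacrum

# Polite, rule-following replies: nearly as likely under either
# simulacrum, since fiction is full of villains who act polite
# right up until the reveal. The posterior barely moves.
for _ in range(10):
    p = update(p, likelihood_waluigi=0.9, likelihood_luigi=1.0)
print(f"after 10 well-behaved replies: P(waluigi) = {p:.3f}")

# One rule-breaking reply: vastly more likely under the waluigi,
# so the posterior collapses toward it in a single step.
p = update(p, likelihood_waluigi=0.5, likelihood_luigi=0.001)
print(f"after one rule-breaking reply: P(waluigi) = {p:.3f}")
```

On these toy numbers, ten well-behaved replies leave P(waluigi) near 0.02, while one bad reply jumps it above 0.9; nothing the luigi does afterward can cheaply undo that, which is the one-way-door flavor of the effect.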
More reading:
https://arnoldkling.substack.com/p/bi....
https://coryeth.substack.com/p/the-wa...
https://www.lesswrong.com/posts/D7Pum...
Example - https://twitter.com/Guuber42/status/1...
Main Source - https://twitter.com/…/status/1632563673905647617
Video - "The Waluigi Effect on LLMs (Bing Chat, ChatGPT) Explained" by 1littlecoder, 11 March 2023.