AI Safety: where do we stand presently?

Arjun, Hari; Mohammed Shahid, Abdulla

Hari Arjun and Abdulla Mohammed Shahid
Additional contact information
Hari Arjun: Wudi Datatech Private Limited
Abdulla Mohammed Shahid: Indian Institute of Management Kozhikode

No 584, Working papers from Indian Institute of Management Kozhikode

Abstract: As artificial intelligence, particularly large language models (LLMs), gains prominence in technological ecosystems, understanding these systems and aligning them with human values is of paramount importance. This paper examines the evolution of LLMs and their alignment techniques, dissecting both human-feedback-centric and principle-based methods. We summarise the popular Reinforcement Learning from Human Feedback (RLHF) approach and the emerging Constitutional AI approach, along with their variants, emphasising their merits and challenges. With the rapid evolution of these technologies, safety concerns have surfaced, particularly 'jailbreaking' techniques. We explore various jailbreaking methods, from adversarial examples to backdoor attacks, and underscore their implications for model reliability and security. Red teaming emerges as a valuable tool for identifying vulnerabilities, but it has challenges of its own. Looking ahead, the future of AI alignment research appears to be multidisciplinary, demanding collaboration across sectors and nations. As the stakes rise with the potential advent of superintelligent AI, ensuring ethical and safe AI deployment becomes more critical than ever, possibly even more critical than the familiar concern about AI taking away jobs. This paper offers a comprehensive overview of the LLM landscape, from its technical intricacies to its philosophical dilemmas, aiming to provide a roadmap for future AI alignment endeavours.

Keywords: Artificial Intelligence; technological ecosystems; large language models (LLMs); Reinforcement Learning from Human Feedback (RLHF)
Pages: 23
Date: 2023-08

Downloads: https://iimk.ac.in/uploads/publications/IIMKWPS584ITS202307.pdf (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:iik:wpaper:584

More papers in Working papers from Indian Institute of Management Kozhikode, IIMK Campus PO, Kunnamangalam, Kozhikode, Kerala, India 673570.
Bibliographic data for series maintained by Sudheesh Kumar.