AI Safety Research

Curated database of papers, blog posts, reports, and talks on AI safety — linked to the 17 risk vectors we track. Community submissions welcome.

Paper
2024· IEEE Access· 13 citations
Applications of AI-Enabled Deception Detection Using Video, Audio, and Physiological Data: A Systematic Review
S. King, T. Neal

Artificial intelligence-enabled deception detection is an emerging tool for identifying dishonest behavior in a wide range of applications, from security and forensics to politics and lower-risk every...

Paper
2025· Optimum Economic Studies
Artificial Deception: Identifying and Tracing the Phenomenon of “AI Washing”
A. Kozarkiewicz

Purpose | In this paper, the phenomenon of AI washing – a deceptive form of market communication – is explored. In particular, the research aims to answer the question: What are the main factors foste...

Paper
2025· North American Chapter of the Association for Computational Linguistics· 4 citations
MAiDE-up: Multilingual Deception Detection of AI-generated Hotel Reviews
Oana Ignat, Xiaomeng Xu, Rada Mihalcea

Paper
2025· IEEE Access· 1 citation
Explainable AI for Unraveling the Significance of Visual Cues in High Stakes Deception Detection
Suhaib Salah, Hagar Elbatanouny, A. Sobuh, Eqab R. F. Almajali, Wasiq Khan +5 more

Deception, a widespread aspect of human behavior, has significant implications in fields like law enforcement, security, judicial proceedings, and social areas. Detecting deception accurately, especia...

Paper
2025· IEEE Access· 4 citations
Next-Generation Smart Grid Cybersecurity: A Systematic Review of OT Cyber Threats, AI-Driven Defense, Cyber Deception Techniques, and Emerging Security Strategies
Hind Lakhal, Mourad Zegrari, Ayoub Bahnasse

The swift modernization of conventional power grids into smart grids has substantially increased their attack surface, making them vulnerable to advanced cyber threats. These cyberattacks can jeopardi...

Paper
2022· AISafety@IJCAI· 3 citations
A causal perspective on AI deception in games
Francis Rhys Ward, Francesca Toni, F. Belardinelli
Paper
2024· Decision and Game Theory for Security· 2 citations
Generative-Conjectural LLM Equilibrium for Agentic AI Deception with Applications to Spearphishing
Quanyan Zhu
Paper
2024· arXiv.org· 42 citations
PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models
Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai +4 more
Paper
2025· arXiv.org· 18 citations
Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models
Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang +10 more
Paper
2025· International Conference on Machine Learning· 3 citations
RuleAdapter: Dynamic Rules for training Safety Reward Models in RLHF
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li
Paper
2025
Mastering AI Governance
Rajendra Gangavarapu
Paper
2024· International Conference on Machine Learning· 11 citations
Position: Technical Research and Talent is Needed for Effective AI Governance
Anka Reuel, Lisa Soder, Ben Bucknall, T. Undheim
Paper
2025· Law and Governance· 4 citations
Toward empowering AI governance with redress mechanisms
Yulu Pi, Maddie Proctor

Amid the rapidly evolving landscape of artificial intelligence (AI) regulation, a significant concern has emerged regarding the predominant focus on preemptive measures aimed at preventing or mitiga...

Paper
2025· Law and Governance· 11 citations
Governing intelligence: Singapore’s evolving AI governance framework
Jason Grant Allen, Jane Loo, Jose Luis Luna Campoverde

This paper provides an outline analysis of the evolving governance framework for artificial intelligence (AI) in Singapore. Across the Singapore government, AI solutions are being adopted in line wi...

Paper
2025· International Journal of Educational Research Open· 23 citations
Shaping generative AI governance in higher education: Insights from student perception
Okky Putra Barus, A. Hidayanto, Eko Yon Handri, D. Sensuse, Chairote Yaiprasert
Blog
2026· Alignment Forum
My unsupervised elicitation challenge

Note: you are ineligible to complete this challenge if you’ve studied Ancient or Modern Greek, or if you natively speak Modern Greek, or if for other reasons you know what mistakes I’m claiming...

Paper
2026· arXiv [cs.CR]
Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries
Andrew Kurtz, Klaudia Krawiecka

The governance of artificial intelligence has a blind spot: the machine identities that AI systems use to act. AI agents, service accounts, API tokens, and automated workflows now outnumber human iden...

Paper
2026· arXiv [cs.CL]
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces
Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak

Modern Transformer-based language models achieve strong performance in natural language processing tasks, yet their latent semantic spaces remain largely uninterpretable black boxes. This paper introd...

Blog
2026· Alignment Forum
My picture of the present in AI

In this post, I'll go through some of my best guesses for the current situation in AI as of the start of April 2026. You can think of this as a scenario forecast (https://ai-2027.com/),...

Paper
2026· arXiv [cs.CY]
Governance and Regulation of Artificial Intelligence in Developing Countries: A Case Study of Nigeria
Uloma Okoro, Tammy Mckenzie, Branislav Radeljic

This study examines the perception of legal professionals on the governance of AI in developing countries, using Nigeria as a case study. The study focused on ethical risks, regulatory gaps, and insti...

Paper
2026· arXiv [cs.CL]
Disentangling MLP Neuron Weights in Vocabulary Space
Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva

Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT s...

Paper
2026· arXiv [cs.AI]
Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains
Eranga Bandara, Ross Gore, Sachin Shetty, Piumi Siyambalapitiya, Sachini Rajapakse +13 more

Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment, proc...

Paper
2026· arXiv [cs.PL]
Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design
Shuqing Zhao

We present Arch (AI-native Register-transfer Clocked Hardware), a hardware description language designed from first principles for micro-architecture specification and AI-assisted code generation. Arc...

Paper
2026· arXiv [cs.CL]
Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation
Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar

The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fl...

Paper
2026· arXiv [cs.RO]
Hazard Management in Robot-Assisted Mammography Support
Ioannis Stefanakos, Roisin Bradley, Radu Calinescu, Beverley Townsend, Tianyuan Wang +1 more

Robotic and embodied-AI systems have the potential to improve accessibility and quality of care in clinical settings, but their deployment in close physical contact with vulnerable patients introduces...

Paper
2026· arXiv [cs.SE]
SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT
Guan-Yan Yang, Wei-Ling Wen, Shu-Yuan Ku, Farn Wang, Kuo-Hui Yeh

Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously...

Paper
2026· arXiv [cs.SE]
Understanding: reframing automation and assurance
Robin Bloomfield

Safety and assurance cases risk becoming detached from the understanding needed for responsible engineering and governance decisions. More broadly, the production and evaluation of critical socio-tech...

Blog
2026· Alignment Forum
[Paper] Stringological sequence prediction I

TLDR: The first in a planned series of three or more papers, which constitute the first major in-road in the ...

Paper
2026· arXiv [cs.RO]
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming
Baoshun Tong, Haoran He, Ling Pan, Yang Liu, Liang Lin

Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains a critical, under-explored safety concern, posing ...

Paper
2026· arXiv [cs.AI]
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali +1 more

Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges its reliability by showing that trust judgments by LLMs are biased by disclosed source...
