SafetyOS

AI Safety Research

Curated database of papers, blog posts, reports, and talks on AI safety — linked to the 17 risk vectors we track. Community submissions welcome.

Paper

2024· IEEE Access· 13 citations

Applications of AI-Enabled Deception Detection Using Video, Audio, and Physiological Data: A Systematic Review

S. King, T. Neal

Artificial intelligence-enabled deception detection is an emerging tool for identifying dishonest behavior in a wide range of applications, from security and forensics to politics and lower-risk every...

Paper

2025· Optimum Economic Studies

Artificial Deception: Identifying and Tracing the Phenomenon of “AI Washing”

A. Kozarkiewicz

Purpose | In this paper, the phenomenon of AI washing – a deceptive form of market communication – is explored. In particular, the research aims to answer the question: What are the main factors foste...

Paper

2025· North American Chapter of the Association for Computational Linguistics· 4 citations

MAiDE-up: Multilingual Deception Detection of AI-generated Hotel Reviews

Oana Ignat, Xiaomeng Xu, Rada Mihalcea

Paper

2025· IEEE Access· 1 citation

Explainable AI for Unraveling the Significance of Visual Cues in High Stakes Deception Detection

Suhaib Salah, Hagar Elbatanouny, A. Sobuh, Eqab R. F. Almajali, Wasiq Khan +5 more

Deception, a widespread aspect of human behavior, has significant implications in fields like law enforcement, security, judicial proceedings, and social areas. Detecting deception accurately, especia...

Paper

2025· IEEE Access· 4 citations

Next-Generation Smart Grid Cybersecurity: A Systematic Review of OT Cyber Threats, AI-Driven Defense, Cyber Deception Techniques, and Emerging Security Strategies

Hind Lakhal, Mourad Zegrari, Ayoub Bahnasse

The swift modernization of conventional power grids into smart grids has substantially increased their attack surface, making them vulnerable to advanced cyber threats. These cyberattacks can jeopardi...

Paper

2022· AISafety@IJCAI· 3 citations

A causal perspective on AI deception in games

Francis Rhys Ward, Francesca Toni, F. Belardinelli

Paper

2024· Decision and Game Theory for Security· 2 citations

Generative-Conjectural LLM Equilibrium for Agentic AI Deception with Applications to Spearphishing

Quanyan Zhu

Paper

2024· arXiv.org· 42 citations

PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai +4 more

Paper

2025· arXiv.org· 18 citations

Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang +10 more

Paper

2025· International Conference on Machine Learning· 3 citations

Amid the rapidly evolving landscape of artificial intelligence (AI) regulation, a significant concern has emerged regarding the predominant focus on preemptive measures aimed at preventing or mitiga...

Paper

2025· Law and Governance· 11 citations

Governing intelligence: Singapore’s evolving AI governance framework

Jason Grant Allen, Jane Loo, Jose Luis Luna Campoverde

This paper provides an outline analysis of the evolving governance framework for artificial intelligence (AI) in Singapore. Across the Singapore government, AI solutions are being adopted in line wi...

Paper

2025· International Journal of Educational Research Open· 23 citations

Shaping generative AI governance in higher education: Insights from student perception

Okky Putra Barus, A. Hidayanto, Eko Yon Handri, D. Sensuse, Chairote Yaiprasert

Blog

2026· Alignment Forum

My unsupervised elicitation challenge

Note: you are ineligible to complete this challenge if you’ve studied Ancient or Modern Greek, or if you natively speak Modern Greek, or if for other reasons you know what mistakes I’m claiming...

Paper

2026· arXiv [cs.CR]

Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries

Andrew Kurtz, Klaudia Krawiecka

The governance of artificial intelligence has a blind spot: the machine identities that AI systems use to act. AI agents, service accounts, API tokens, and automated workflows now outnumber human iden...

Paper

2026· arXiv [cs.CL]

LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak

Modern Transformer-based language models achieve strong performance in natural language processing tasks, yet their latent semantic spaces remain largely uninterpretable black boxes. This paper introd...

Blog

2026· Alignment Forum

My picture of the present in AI

In this post, I'll go through some of my best guesses for the current situation in AI as of the start of April 2026. You can think of this as a scenario forecast (https://ai-2027.com/),...

Paper

2026· arXiv [cs.CY]

Governance and Regulation of Artificial Intelligence in Developing Countries: A Case Study of Nigeria

Uloma Okoro, Tammy Mckenzie, Branislav Radeljic

This study examines the perception of legal professionals on the governance of AI in developing countries, using Nigeria as a case study. The study focused on ethical risks, regulatory gaps, and insti...

Paper

2026· arXiv [cs.CL]

Disentangling MLP Neuron Weights in Vocabulary Space

Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva

Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT s...

Paper

2026· arXiv [cs.AI]

Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

Eranga Bandara, Ross Gore, Sachin Shetty, Piumi Siyambalapitiya, Sachini Rajapakse +13 more

Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment, proc...

Paper

2026· arXiv [cs.PL]

Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design

Shuqing Zhao

We present Arch (AI-native Register-transfer Clocked Hardware), a hardware description language designed from first principles for micro-architecture specification and AI-assisted code generation. Arc...

Paper

2026· arXiv [cs.CL]

Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar

The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fl...

Paper

2026· arXiv [cs.RO]

Hazard Management in Robot-Assisted Mammography Support

Ioannis Stefanakos, Roisin Bradley, Radu Calinescu, Beverley Townsend, Tianyuan Wang +1 more

Robotic and embodied-AI systems have the potential to improve accessibility and quality of care in clinical settings, but their deployment in close physical contact with vulnerable patients introduces...

Paper

2026· arXiv [cs.SE]

SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

Guan-Yan Yang, Wei-Ling Wen, Shu-Yuan Ku, Farn Wang, Kuo-Hui Yeh

Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously...

Paper

2026· arXiv [cs.SE]

Understanding: reframing automation and assurance

Robin Bloomfield

Safety and assurance cases risk becoming detached from the understanding needed for responsible engineering and governance decisions. More broadly, the production and evaluation of critical socio-tech...

Blog

2026· Alignment Forum

[Paper] Stringological sequence prediction I

TLDR: The first in a planned series of three or more papers, which constitute the first major in-road in the...

Paper

2026· arXiv [cs.RO]

Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

Baoshun Tong, Haoran He, Ling Pan, Yang Liu, Liang Lin

Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains a critical, under-explored safety concern, posing ...

Paper

2026· arXiv [cs.AI]

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali +1 more

Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges its reliability by showing that trust judgments by LLMs are biased by disclosed source...

Showing 30 of 30 entries

AI Safety Research

Applications of AI-Enabled Deception Detection Using Video, Audio, and Physiological Data: A Systematic Review

Artificial Deception: Identifying and Tracing the Phenomenon of “AI Washing”

MAiDE-up: Multilingual Deception Detection of AI-generated Hotel Reviews

Explainable AI for Unraveling the Significance of Visual Cues in High Stakes Deception Detection

Next-Generation Smart Grid Cybersecurity: A Systematic Review of OT Cyber Threats, AI-Driven Defense, Cyber Deception Techniques, and Emerging Security Strategies

A causal perspective on AI deception in games

Generative-Conjectural LLM Equilibrium for Agentic AI Deception with Applications to Spearphishing

PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

RuleAdapter: Dynamic Rules for training Safety Reward Models in RLHF

Mastering AI Governance

Position: Technical Research and Talent is Needed for Effective AI Governance

Toward empowering AI governance with redress mechanisms

Governing intelligence: Singapore’s evolving AI governance framework

Shaping generative AI governance in higher education: Insights from student perception

My unsupervised elicitation challenge

Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries

LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

My picture of the present in AI

Governance and Regulation of Artificial Intelligence in Developing Countries: A Case Study of Nigeria

Disentangling MLP Neuron Weights in Vocabulary Space

Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design

Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

Hazard Management in Robot-Assisted Mammography Support

SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

Understanding: reframing automation and assurance

[Paper] Stringological sequence prediction I

Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge