The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. Whether implicitly followed or explicitly endorsed, these commitments constitute what we term a preferentist approach to AI alignment. In this paper, we characterize and challenge the preferentist approach, describing conceptual and technical alternatives that are ripe for further research. We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans and AI, drawing upon arguments showing how rational agents need not comply with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of the targets of AI alignment: Instead of alignment with the preferences of a human user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. Furthermore, these standards should be negotiated and agreed upon by all relevant stakeholders. On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values.
Understanding Epistemic Language with a Bayesian Theory of Mind
Lance Ying, Tan Zhi-Xuan, Lionel Wong, Vikash Mansinghka, and Joshua B Tenenbaum
How do people understand and evaluate claims about others’ beliefs, even though these beliefs cannot be directly observed? In this paper, we introduce a cognitive model of epistemic language interpretation, grounded in Bayesian inferences about other agents’ goals, beliefs, and intentions: a language-augmented Bayesian theory-of-mind (LaBToM). By translating natural language into an epistemic “language-of-thought”, then evaluating these translations against the inferences produced by inverting a probabilistic generative model of rational action and perception, LaBToM captures graded plausibility judgments about epistemic claims. We validate our model in an experiment where participants watch an agent navigate a maze to find keys hidden in boxes needed to reach their goal, then rate sentences about the agent’s beliefs. In contrast with multimodal LLMs (GPT-4o, Gemini Pro) and ablated models, our model correlates highly with human judgments for a wide range of expressions, including modal language, uncertainty expressions, knowledge claims, likelihood comparisons, and attributions of false belief.
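To make the evaluation step concrete, here is a minimal sketch (not the paper's implementation) of how an epistemic claim can be scored against a particle approximation of a Bayesian theory-of-mind posterior; the Particle class, the belief dictionaries, and the example predicates are all illustrative assumptions.

```python
# Minimal sketch: score epistemic claims against a particle-based BToM posterior.
# All names and numbers are illustrative, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Particle:
    weight: float       # normalized importance weight of this hypothesis
    agent_belief: dict  # inferred credences, e.g. {"key_in_box1": 0.9}

def claim_probability(particles, claim):
    """Plausibility of a claim = posterior mass of particles satisfying it."""
    return sum(p.weight for p in particles if claim(p.agent_belief))

# Example epistemic predicates over the inferred belief state.
knows_key_in_box1 = lambda b: b.get("key_in_box1", 0.0) > 0.9          # "knows that ..."
unsure_about_box2 = lambda b: 0.25 < b.get("key_in_box2", 0.0) < 0.75  # "is unsure whether ..."

particles = [
    Particle(0.6, {"key_in_box1": 0.95, "key_in_box2": 0.5}),
    Particle(0.4, {"key_in_box1": 0.40, "key_in_box2": 0.5}),
]
print(claim_probability(particles, knows_key_in_box1))  # 0.6
print(claim_probability(particles, unsure_about_box2))  # 1.0
```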
Building Machines that Learn and Think with People
Katherine M Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, Adrian Weller, Joshua B Tenenbaum, and Thomas L Griffiths
What do we want from machine intelligence? We envision machines that are not just tools for thought, but partners in thought: reasonable, insightful, knowledgeable, reliable, and trustworthy systems that think with us. Current artificial intelligence (AI) systems satisfy some of these criteria, some of the time. In this Perspective, we show how the science of collaborative cognition can be put to work to engineer systems that really can be called “thought partners,” systems built to meet our expectations and complement our limitations. We lay out several modes of collaborative thought in which humans and AI thought partners can engage and propose desiderata for human-compatible thought partnerships. Drawing on motifs from computational cognitive science, we motivate an alternative scaling path for the design of thought partners and ecosystems around their use through a Bayesian lens, whereby the partners we construct actively build and reason over models of the human and world.
The space of human goals is tremendously vast; and yet, from just a few moments of watching a scene or reading a story, we seem to spontaneously infer a range of plausible motivations for the people and characters involved. What explains this remarkable capacity for intuiting other agents’ goals, despite the infinitude of ends they might pursue? And how does this cohere with our understanding of other people as approximately rational agents? In this paper, we introduce a sequential Monte Carlo model of open-ended goal inference, which combines top-down Bayesian inverse planning with bottom-up sampling based on the statistics of co-occurring subgoals. By proposing goal hypotheses related to the subgoals achieved by an agent, our model rapidly generates plausible goals without exhaustive search, then filters out goals that would be irrational given the actions taken so far. We validate this model in a goal inference task called Block Words, where participants try to guess the word that someone is stacking out of lettered blocks. In comparison to both heuristic bottom-up guessing and exact Bayesian inference over hundreds of goals, our model better predicts the mean, variance, efficiency, and resource rationality of human goal inferences, achieving similar accuracy to the exact model at a fraction of the cognitive cost, while also explaining garden-path effects that arise from misleading bottom-up cues. Our experiments thus highlight the importance of uniting top-down and bottom-up models for explaining the speed, accuracy, and generality of human theory-of-mind.
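As a concrete illustration of combining bottom-up proposals with top-down filtering, here is a toy sketch in the spirit of the Block Words task; the lexicon, likelihoods, and pruning threshold are assumptions for illustration, not the paper's model.

```python
# Toy sketch: open-ended goal inference via bottom-up proposals + top-down filtering.
import random

LEXICON = ["crane", "cane", "race", "earn", "near"]   # hypothetical goal words

def likelihood(stacked, goal):
    """Boltzmann-style likelihood: actions consistent with spelling `goal` are likelier."""
    prefix = "".join(stacked)
    return 0.9 if goal.startswith(prefix) else 0.01

def bottom_up_proposals(stacked, k=3):
    """Propose goal words containing the letters stacked so far (co-occurrence cue)."""
    candidates = [w for w in LEXICON if all(c in w for c in stacked)]
    return random.sample(candidates, min(k, len(candidates))) if candidates else []

def infer(observed_letters):
    particles = {}   # goal hypothesis -> unnormalized weight
    stacked = []
    for letter in observed_letters:
        stacked.append(letter)
        for goal in bottom_up_proposals(stacked):      # bottom-up: add new hypotheses
            particles.setdefault(goal, 1.0)
        for goal in list(particles):                   # top-down: reweight and prune
            particles[goal] *= likelihood(stacked, goal)
            if particles[goal] < 1e-4:
                del particles[goal]
    z = sum(particles.values())
    return {g: w / z for g, w in particles.items()}

print(infer(["c", "r", "a"]))   # mass concentrates on "crane"
```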
Despite the fact that beliefs are mental states that cannot be directly observed, humans talk about each other’s beliefs on a regular basis, often using rich compositional language to describe what others think and know. What explains this capacity to interpret the hidden epistemic content of other minds? In this paper, we take a step towards an answer by grounding the semantics of belief statements in a Bayesian theory-of-mind: By modeling how humans jointly infer coherent sets of goals, beliefs, and plans that explain an agent’s actions, then evaluating statements about the agent’s beliefs against these inferences via epistemic logic, our framework provides a functional role semantics for belief, explaining the gradedness and compositionality of human belief attributions, as well as their intimate connection with goals and plans. We evaluate this framework by studying how humans attribute goals and evaluate belief sentences while watching an agent solve a doors-and-keys gridworld puzzle that requires instrumental reasoning about hidden objects. In contrast to pure logical deduction, non-mentalizing baselines, and mentalizing that ignores the role of instrumental plans, our model provides a much better fit to human goal and belief attributions, demonstrating the importance of theory-of-mind for modeling how humans understand language about beliefs.
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum
People often give instructions whose meaning is ambiguous without further context, expecting that their actions or goals will disambiguate their intentions. How can we build assistive agents that follow such instructions in a flexible, context-sensitive manner? This paper introduces cooperative language-guided inverse plan search (CLIPS), a Bayesian agent architecture for pragmatic instruction following and goal assistance. Our agent assists a human by modeling them as a cooperative planner who communicates joint plans to the assistant, then performs multimodal Bayesian inference over the human’s goal from actions and language, using large language models (LLMs) to evaluate the likelihood of an instruction given a hypothesized plan. Given this posterior, our assistant acts to minimize expected goal achievement cost, enabling it to pragmatically follow ambiguous instructions and provide effective assistance even when uncertain about the goal. We evaluate these capabilities in two cooperative planning domains (Doors, Keys & Gems and VirtualHome), finding that CLIPS significantly outperforms GPT-4V, LLM-based literal instruction following and unimodal inverse planning in both accuracy and helpfulness, while closely matching the inferences and assistive judgments provided by human raters.
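A minimal sketch of the underlying decision rule (with made-up goals, likelihoods, and costs; in CLIPS the corresponding quantities come from inverse planning and an LLM likelihood): combine action and instruction likelihoods into a goal posterior, then choose the assistive action with lowest expected goal-achievement cost.

```python
# Goal-posterior-weighted assistance: all numbers below are illustrative assumptions.
goals = ["red_gem", "blue_gem"]
prior = {"red_gem": 0.5, "blue_gem": 0.5}

action_lik = {"red_gem": 0.7, "blue_gem": 0.3}        # P(observed actions | goal)
instruction_lik = {"red_gem": 0.6, "blue_gem": 0.4}   # P(instruction | plan, goal)

post = {g: prior[g] * action_lik[g] * instruction_lik[g] for g in goals}
z = sum(post.values())
post = {g: p / z for g, p in post.items()}

# Hypothetical cost of each assistive action under each goal.
cost = {
    "fetch_red_key":  {"red_gem": 2.0, "blue_gem": 9.0},
    "fetch_blue_key": {"red_gem": 9.0, "blue_gem": 2.0},
    "wait":           {"red_gem": 5.0, "blue_gem": 5.0},
}
expected_cost = {a: sum(post[g] * c[g] for g in goals) for a, c in cost.items()}
best = min(expected_cost, key=expected_cost.get)
print(post, best)   # posterior favors red_gem, so the assistant fetches the red key
```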
A universal feature of human societies is the adoption of systems of rules and norms in the service of cooperative ends. How can we build learning agents that do the same, so that they may flexibly cooperate with the human institutions they are embedded in? We hypothesize that agents can achieve this by assuming there exists a shared set of norms that most others comply with while pursuing their individual desires, even if they do not know the exact content of those norms. By assuming shared norms, a newly introduced agent can infer the norms of an existing population from observations of compliance and violation. Furthermore, groups of agents can converge to a shared set of norms, even if they initially diverge in their beliefs about what the norms are. This in turn enables the stability of the normative system: since agents can bootstrap common knowledge of the norms, this leads the norms to be widely adhered to, enabling new entrants to rapidly learn those norms. We formalize this framework in the context of Markov games and demonstrate its operation in a multi-agent environment via approximately Bayesian rule induction of obligative and prohibitive norms. Using our approach, agents are able to rapidly learn and sustain a variety of cooperative institutions, including resource management norms and compensation for pro-social labor, promoting collective welfare while still allowing agents to act in their own interests.
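The core inference step can be illustrated with a small sketch (the candidate norms, compliance rate, and observations below are assumptions, not the paper's Markov-game formulation): if most agents comply with the shared norm, each observed action shifts posterior mass toward the norms that best explain it.

```python
# Bayesian norm induction from observed compliance and violation (illustrative).
candidate_norms = {
    "no_norm": lambda a: True,                               # everything permitted
    "forbid_overharvest": lambda a: a != "harvest_when_low",
    "forbid_harvest": lambda a: not a.startswith("harvest"),
}
posterior = {n: 1 / len(candidate_norms) for n in candidate_norms}
COMPLIANCE = 0.95   # assumed probability that an agent follows the shared norm

def update(posterior, action):
    new = {}
    for norm, permits in candidate_norms.items():
        lik = COMPLIANCE if permits(action) else (1 - COMPLIANCE)
        new[norm] = posterior[norm] * lik
    z = sum(new.values())
    return {n: w / z for n, w in new.items()}

for act in ["harvest_when_high", "harvest_when_high", "wait", "harvest_when_high"]:
    posterior = update(posterior, act)
print(posterior)   # the strict "forbid_harvest" norm loses mass; the others remain consistent
```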
2023
Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs
Alexander K Lew, Tan Zhi-Xuan, Gabriel Grand, and Vikash K Mansinghka
In ICML 2023 Workshop on Sampling and Optimization in Discrete Spaces (SoDS) Jul 2023
Even after fine-tuning and reinforcement learning, large language models (LLMs) can be difficult, if not impossible, to control reliably with prompts alone. We propose a new inference-time approach to enforcing syntactic and semantic constraints on the outputs of LLMs, called sequential Monte Carlo (SMC) steering. The key idea is to specify language generation tasks as posterior inference problems in a class of discrete probabilistic sequence models, and replace standard decoding with sequential Monte Carlo inference. For a computational cost similar to that of beam search, SMC can steer LLMs to solve diverse tasks, including infilling, generation under syntactic constraints, and prompt intersection. To facilitate experimentation with SMC steering, we present a probabilistic programming library, LLaMPPL (this https URL), for concisely specifying new generation tasks as language model probabilistic programs, and automating steering of LLaMA-family Transformers.
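To illustrate the idea without depending on the LLaMPPL API, here is a self-contained toy sketch of SMC steering over sequences, with a uniform character-level stand-in for the language model and an arbitrary soft constraint; the real system replaces both with an LLM and user-specified probabilistic programs.

```python
# Toy SMC steering: particles extend sequences, are reweighted by a constraint,
# and are resampled so computation focuses on promising continuations.
import random

VOCAB = list("abc ")

def toy_lm_step(prefix):
    """Stand-in for an LLM next-token distribution (uniform; ignores the prefix)."""
    return {tok: 1 / len(VOCAB) for tok in VOCAB}

def constraint_score(seq):
    """Soft syntactic/semantic constraint: prefer sequences with no repeated token."""
    return 1.0 if len(seq) < 2 or seq[-1] != seq[-2] else 0.05

def smc_steer(num_particles=20, length=6):
    particles = [([], 1.0) for _ in range(num_particles)]
    for _ in range(length):
        extended = []
        for seq, w in particles:
            probs = toy_lm_step(seq)
            tok = random.choices(list(probs), weights=list(probs.values()))[0]
            new_seq = seq + [tok]
            extended.append((new_seq, w * constraint_score(new_seq)))
        total = sum(w for _, w in extended)
        particles = random.choices(extended, weights=[w / total for _, w in extended],
                                   k=num_particles)           # resampling step
        particles = [(seq, 1.0) for seq, _ in particles]       # reset weights after resampling
    return ["".join(seq) for seq, _ in particles]

print(smc_steer()[:5])
```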
Inferring the goals of communicating agents from actions and instructions
Lance Ying, Tan Zhi-Xuan, Vikash Mansinghka, and Joshua B Tenenbaum
In Proceedings of the 2023 AAAI Fall Symposia Jul 2023
When humans cooperate, they frequently coordinate their activity through both verbal communication and non-verbal actions, using this information to infer a shared goal and plan. How can we model this inferential ability? In this paper, we introduce a model of a cooperative team where one agent, the principal, may communicate natural language instructions about their shared plan to another agent, the assistant, using GPT-3 as a likelihood function for instruction utterances. We then show how a third person observer can infer the team’s goal via multi-modal Bayesian inverse planning from actions and instructions, computing the posterior distribution over goals under the assumption that agents will act and communicate rationally to achieve them. We evaluate this approach by comparing it with human goal inferences in a multi-agent gridworld, finding that our model’s inferences closely correlate with human judgments (R = 0.96). When compared to inference from actions alone, we also find that instructions lead to more rapid and less uncertain goal inference, highlighting the importance of verbal communication for cooperative agents.
The Neuro-Symbolic Inverse Planning Engine (NIPE): Modeling Probabilistic Social Inferences from Linguistic Inputs
Lance Ying, Katherine M Collins, Megan Wei, Cedegao E Zhang, Tan Zhi-Xuan, Adrian Weller, Joshua B Tenenbaum, and Lionel Wong
In ICML 2023 Workshop on Theory of Mind in Communicating Agents Jul 2023
Human beings are social creatures. We routinely reason about other agents, and a crucial component of this social reasoning is inferring people’s goals as we learn about their actions. In many settings, we can perform intuitive but reliable goal inference from language descriptions of agents, actions, and the background environments. In this paper, we study this process of language driving and influencing social reasoning in a probabilistic goal inference domain. We propose a neuro-symbolic model that carries out goal inference from linguistic inputs of agent scenarios. The "neuro" part is a large language model (LLM) that translates language descriptions to code representations, and the "symbolic" part is a Bayesian inverse planning engine. To test our model, we design and run a human experiment on a linguistic goal inference task. Our model closely matches human response patterns and better predicts human judgements than using an LLM alone.
How do we know when it’s OK to break moral rules? We propose that — alongside well-studied outcome-based measures of welfare and harm — people sometimes use universalization, asking "What if everyone felt at liberty to ignore the rule?" We develop a virtual environment where agents stand in line to gather water. Subjects judge agents who get out of line to try to get water more quickly. If subjects use universalization, they would need to imagine all agents getting out of line and going straight for the water in each environment. To test this prediction, we model an action’s universalizability by simulating what would happen if every agent tried to follow a path directly to the water, then evaluating the effects. We also investigate the role of several outcome-based measures, including welfare aggregation and harm-based measures. We find that universalizability plays an important role in rule-breaking judgments alongside outcome-based concerns.
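A toy sketch of the universalization computation (the queueing model and numbers are assumptions): compare the simulated outcome when everyone follows the rule against the outcome when everyone feels free to break it.

```python
# Universalization as a counterfactual simulation (illustrative toy model).
def simulate_queue(num_agents, everyone_cuts):
    """Total water collected; cutting causes crowding at the source (assumption)."""
    efficiency = 1.0 / num_agents if everyone_cuts else 1.0
    return num_agents * efficiency

def universalizability(num_agents):
    baseline = simulate_queue(num_agents, everyone_cuts=False)
    universalized = simulate_queue(num_agents, everyone_cuts=True)
    return universalized - baseline   # more negative => less acceptable to break the rule

for n in [2, 5, 10]:
    print(n, universalizability(n))   # breaking the rule looks worse as the line grows
```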
Bayesian Inverse Motion Planning for Online Goal Inference in Continuous Domains
Tan Zhi-Xuan, Jovana Kondic, Stewart Slocum, Joshua B Tenenbaum, Vikash K Mansinghka, and Dylan Hadfield-Menell
In ICRA 2023 Workshop on Cognitive Modeling in Robot Learning Jun 2023
Humans and other agents navigate their environments by acting efficiently to achieve their goals. In order to infer agents’ goals from their actions, it is thus necessary to model how agents achieve their goals efficiently. Here, we show how online goal inference and trajectory prediction in continuous domains can be performed via Bayesian inverse motion planning: By modeling an agent as an approximately Boltzmann-rational motion planner that produces low-cost trajectories while avoiding obstacles, and placing a prior over goals, we can infer the agent’s goal and future trajectory from partial trajectory observations. We compute these inferences online using a sequential Monte Carlo algorithm, which accounts for the multimodal distribution of trajectories due to obstacles, and exhibits better calibration at early timesteps than a Laplace approximation and a greedy baseline.
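A minimal 1-D sketch of Boltzmann-rational goal inference (the cost model and temperature are illustrative assumptions): goals that make the observed partial trajectory look efficient receive higher posterior probability.

```python
# Boltzmann-rational goal inference: P(goal | traj) ∝ P(goal) * exp(-beta * extra_cost).
import math

def extra_cost(partial_traj, goal):
    """Cost of the observed path plus cost-to-go, minus the optimal cost (1-D toy)."""
    start, current = partial_traj[0], partial_traj[-1]
    traveled = sum(abs(b - a) for a, b in zip(partial_traj, partial_traj[1:]))
    return traveled + abs(goal - current) - abs(goal - start)

def goal_posterior(partial_traj, goals, beta=2.0):
    scores = {g: math.exp(-beta * extra_cost(partial_traj, g)) for g in goals}
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}

# An agent starting at 0 has moved to 3; candidate goals at -5 and +5.
print(goal_posterior([0, 1, 2, 3], goals=[-5, 5]))   # strongly favors the goal at +5
```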
This paper introduces SMCP3, a method for automatically implementing custom sequential Monte Carlo samplers for inference in probabilistic programs. Unlike particle filters and resample-move SMC (Gilks and Berzuini, 2001), SMCP3 algorithms can improve the quality of samples and weights using pairs of Markov proposal kernels that are also specified by probabilistic programs. Unlike Del Moral et al. (2006b), these proposals can themselves be complex probabilistic computations that generate auxiliary variables, apply deterministic transformations, and lack tractable marginal densities. This paper also contributes an efficient implementation in Gen that eliminates the need to manually derive incremental importance weights. SMCP3 thus simultaneously expands the design space that can be explored by SMC practitioners and reduces the implementation effort. SMCP3 is illustrated using applications to 3D object tracking, state-space modeling, and data clustering, showing that SMCP3 methods can simultaneously improve the quality and reduce the cost of marginal likelihood estimation and posterior inference.
2022
Abstract Interpretation for Generalized Heuristic Search in Model-Based Planning
Tan Zhi-Xuan, Joshua B Tenenbaum, and Vikash K Mansinghka
In ICML 2022 Workshop on Beyond Bayes: Paths Towards Universal Reasoning Systems Apr 2022
Domain-general model-based planners often derive their generality by constructing search heuristics through the relaxation or abstraction of symbolic world models. We illustrate how abstract interpretation can serve as a unifying framework for these abstraction-based heuristics, extending the reach of heuristic search to richer world models that make use of more complex datatypes and functions (e.g. sets, geometry), and even models with uncertainty and probabilistic effects. These heuristics can also be integrated with learning, allowing agents to jumpstart planning in novel world models via abstraction-derived information that is later refined by experience. This suggests that abstract interpretation can play a key role in building universal reasoning systems.
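A minimal sketch of the flavor of abstraction involved (a toy set-valued abstraction, not the paper's framework or any PDDL.jl code): concrete states are abstracted to sets of possible fluent values, actions only ever grow these sets, and the heuristic counts relaxed steps until the goal becomes possibly satisfiable.

```python
# Toy abstraction-based heuristic over set-valued abstract states.
def apply_abstract(abstract_state, action):
    """Join the action's effects into the abstract state (no deletes in the relaxation)."""
    new = {k: set(v) for k, v in abstract_state.items()}
    for fluent, value in action["effects"].items():
        new.setdefault(fluent, set()).add(value)
    return new

def possibly_satisfied(abstract_state, goal):
    return all(v in abstract_state.get(f, set()) for f, v in goal.items())

def relaxed_steps(state, actions, goal, max_steps=10):
    abstract = {k: {v} for k, v in state.items()}
    for step in range(max_steps + 1):
        if possibly_satisfied(abstract, goal):
            return step
        for a in actions:
            abstract = apply_abstract(abstract, a)
    return float("inf")

actions = [{"effects": {"door": "open"}}, {"effects": {"at": "room"}}]
print(relaxed_steps({"door": "locked", "at": "hall"}, actions,
                    {"door": "open", "at": "room"}))   # 1 (optimistic: ordering is ignored)
```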
Solving the Baby Intuitions Benchmark with a Hierarchically Bayesian Theory of Mind
Tan Zhi-Xuan, Nishad Gothoskar, Falk Pollok, Dan Gutfreund, Joshua B Tenenbaum, and Vikash K Mansinghka
In RSS 2022 Workshop on Social Intelligence in Humans and Robots Apr 2022
To facilitate the development of new models to bridge the gap between machine and human social intelligence, the recently proposed Baby Intuitions Benchmark (arXiv:2102.11938) provides a suite of tasks designed to evaluate commonsense reasoning about agents’ goals and actions that even young infants exhibit. Here we present a principled Bayesian solution to this benchmark, based on a hierarchically Bayesian Theory of Mind (HBToM). By including hierarchical priors on agent goals and dispositions, inference over our HBToM model enables few-shot learning of the efficiency and preferences of an agent, which can then be used in commonsense plausibility judgements about subsequent agent behavior. This approach achieves near-perfect accuracy on most benchmark tasks, outperforming deep learning and imitation learning baselines while producing interpretable human-like inferences, demonstrating the advantages of structured Bayesian models of human social cognition.
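As a much-simplified illustration of how priors over dispositions support few-shot preference inference (the hierarchy is collapsed to a single conjugate Beta prior here; all numbers are assumptions):

```python
# Beta-Bernoulli toy: infer an agent's object preference from a few choices,
# then judge the plausibility of its next action.
def preference_posterior(choices_for_a, choices_for_b, prior_a=1.0, prior_b=1.0):
    """Posterior mean probability that the agent picks object A on the next trial."""
    return (prior_a + choices_for_a) / (prior_a + prior_b + choices_for_a + choices_for_b)

p_a = preference_posterior(choices_for_a=3, choices_for_b=0)   # agent picked A three times
print(p_a)       # 0.8: approaching A again looks plausible
print(1 - p_a)   # 0.2: approaching B instead looks surprising
```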
PDDL.jl: An Extensible Interpreter and Compiler Interface for Fast and Flexible AI Planning

The Planning Domain Definition Language (PDDL) is a formal specification language for symbolic planning problems and domains that is widely used by the AI planning community. However, most implementations of PDDL are closely tied to particular planning systems and algorithms, and are not designed for interoperability or modular use within larger AI systems. This limitation also makes it difficult to support extensions to PDDL without implementing a dedicated planner for that extension, inhibiting the generality and reach of automated planning. To address these limitations, we present PDDL.jl, an extensible interpreter and compiler interface for fast and flexible AI planning. PDDL.jl exposes the semantics of planning domains through a common interface for executing actions, querying state variables, and other basic operations used within AI planning applications. PDDL.jl also supports the extension of PDDL semantics (e.g. to stochastic and continuous domains), domain abstraction for generalized heuristic search (via abstract interpretation), and domain compilation for efficient planning, enabling speed and flexibility for PDDL and its many descendants. Collectively, these features allow PDDL.jl to serve as a general high-performance platform for AI applications and research programs that leverage the integration of symbolic planning with other AI technologies, such as neuro-symbolic reinforcement learning, probabilistic programming, and Bayesian inverse planning for value learning and goal inference.
When inferring the goals that others are trying to achieve, people intuitively understand that others might make mistakes along the way. This is crucial for activities such as teaching, offering assistance, and deciding between blame or forgiveness. However, Bayesian models of theory of mind have generally not accounted for these mistakes, instead modeling agents as mostly optimal in achieving their goals. As a result, they are unable to explain phenomena like locking oneself out of one’s house, or losing a game of chess. Here, we extend the Bayesian Theory of Mind framework to model boundedly rational agents who may have mistaken goals, plans, and actions. We formalize this by modeling agents as probabilistic programs, where goals may be confused with semantically similar states, plans may be misguided due to resource-bounded planning, and actions may be unintended due to execution errors. We present experiments eliciting human goal inferences in two domains: (i) a gridworld puzzle with gems locked behind doors, and (ii) a block-stacking domain. Our model better explains human inferences than alternatives, while generalizing across domains. These findings indicate the importance of modeling others as bounded agents, in order to account for the full richness of human intuitive psychology.
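A minimal generative sketch of such a boundedly rational agent model (the parameters, routes, and action set are illustrative assumptions): the intended goal may be confused with a similar one, the plan may be truncated by a search budget, and individual actions may slip.

```python
# Generative model of a boundedly rational agent (illustrative sketch).
import random

def plan_towards(goal):
    """Hypothetical planner stub: a fixed route per goal."""
    return {"red_door": ["up", "up", "right", "right"],
            "blue_door": ["up", "up", "left", "left"]}[goal]

def sample_trajectory(goal, similar_goals, plan_budget=3, confusion_p=0.1, slip_p=0.05):
    if random.random() < confusion_p and similar_goals:   # mistaken goal
        goal = random.choice(similar_goals)
    plan = plan_towards(goal)[:plan_budget]                # resource-bounded plan
    actions = []
    for step in plan:
        if random.random() < slip_p:                       # unintended action (execution error)
            step = random.choice(["left", "right", "up", "down"])
        actions.append(step)
    return goal, actions

print(sample_trajectory("red_door", similar_goals=["blue_door"]))
```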
Genify.jl: Transforming Julia into Gen to enable programmable inference
Tan Zhi-Xuan, McCoy R Becker, and Vikash K. Mansinghka
In Languages For Inference Workshop (LAFI 2021), 48th ACM SIGPLAN Symposium on Principles of Programming Languages Jan 2021
A wide variety of libraries written in Julia implement stochastic simulators of natural and social phenomena for the purposes of computational science. However, these simulators are not generally amenable to Bayesian inference, as they do not provide likelihoods for execution traces, support constraining of observed random variables, or allow random choices and subroutines to be selectively updated in Monte Carlo algorithms. To address these limitations, we present Genify.jl, an approach to transforming plain Julia code into generative functions in Gen, a universal probabilistic programming system with programmable inference. We accomplish this via lightweight transformation of lowered Julia code into Gen’s dynamic modeling language, combined with a user-friendly random variable addressing scheme that enables straightforward implementation of custom inference programs. We demonstrate the utility of this approach by transforming an existing agent-based simulator from plain Julia into Gen, and designing custom inference programs that increase accuracy and efficiency relative to generic SMC and MCMC methods. This performance improvement is achieved by proposing, constraining, or re-simulating random variables that are internal to the simulator, which is made possible by transformation into Gen.
People routinely infer the goals of others by observing their actions over time. Remarkably, we can do so even when those actions lead to failure, enabling us to assist others when we detect that they might not achieve their goals. How might we endow machines with similar capabilities? Here we present an architecture capable of inferring an agent’s goals online from both optimal and non-optimal sequences of actions. Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning, thereby accounting for sub-optimal behavior. These models are specified as probabilistic programs, allowing us to represent and perform efficient Bayesian inference over an agent’s goals and internal planning processes. To perform such inference, we develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models, limiting computation by incrementally extending inferred plans as new actions are observed. We present experiments showing that this modeling and inference architecture outperforms Bayesian inverse reinforcement learning baselines, accurately inferring goals from both optimal and non-optimal trajectories involving failure and back-tracking, while generalizing across domains with compositional structure and sparse rewards.
Integrating deep learning with latent state space models has the potential to yield temporal models that are powerful, yet tractable and interpretable. Unfortunately, current models are not designed to handle missing data or multiple data modalities, which are both prevalent in real-world data. In this work, we introduce a factorized inference method for Multimodal Deep Markov Models (MDMMs), allowing us to filter and smooth in the presence of missing data, while also performing uncertainty-aware multimodal fusion. We derive this method by factorizing the posterior p(z|x) for non-linear state space models, and develop a variational backward-forward algorithm for inference. Because our method handles incompleteness over both time and modalities, it is capable of interpolation, extrapolation, conditional generation, label prediction, and weakly supervised learning of multimodal time series. We demonstrate these capabilities on both synthetic and real-world multimodal data under high levels of data deletion. Our method performs well even with more than 50% missing data, and outperforms existing deep approaches to inference in latent time series.
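For reference, the generic backward factorization of the smoothing posterior in a Markovian state-space model, which backward-forward algorithms of this kind build on (stated here in its generic form; the paper's multimodal factorization additionally handles multiple modalities and missing data):

p(z_{1:T} \mid x_{1:T}) = p(z_T \mid x_{1:T}) \prod_{t=1}^{T-1} p(z_t \mid z_{t+1}, x_{1:t})

This follows from the Markov property: conditioned on z_{t+1}, the latent z_t is independent of all later latents and observations.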
Predator dormancy is a stable adaptive strategy due to Parrondo’s paradox
Zhi-Xuan Tan, Jin Ming Koh, Eugene V Koonin, and Kang Hao Cheong
Many predators produce dormant offspring to escape harsh environmental conditions, but the evolutionary stability of this adaptation has not been fully explored. Like seed banks in plants, dormancy provides a stable competitive advantage when seasonal variations occur, because the persistence of dormant forms under harsh conditions compensates for the increased cost of producing dormant offspring. However, dormancy also exists in environments with minimal abiotic variation—an observation not accounted for by existing theory. Here it is demonstrated that dormancy can out-compete perennial activity under conditions of extensive prey density fluctuation caused by overpredation. It is shown that at a critical level of prey density fluctuations, dormancy becomes an evolutionarily stable strategy. This is interpreted as a manifestation of Parrondo’s paradox: although neither the active nor dormant forms of a dormancy-capable predator can individually out-compete a perennially active predator, alternating between these two losing strategies can paradoxically result in a winning strategy. Parrondo’s paradox may thus explain the widespread success of quiescent behavioral strategies such as dormancy, suggesting that dormancy emerges as a natural evolutionary response to the self-destructive tendencies of overpredation and related biological phenomena.
2019
Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset
People act upon their desires, but often, also act in adherence to implicit social norms. How do people infer these unstated social norms from others’ behavior, especially in novel social contexts? We propose that laypeople have intuitive theories of social norms as behavioral constraints shared across different agents in the same social context. We formalize inference of norms using a Bayesian Theory of Mind approach, and show that this computational approach provides excellent predictions of how people infer norms in two scenarios. Our results suggest that people separate the influence of norms and individual desires on others’ actions, and have implications for modelling generalizations of hidden causes of behavior.
The ability for autonomous agents to learn and conform to human norms is crucial for their safety and effectiveness in social environments. While recent work has led to frameworks for the representation and inference of simple social rules, research into norm learning remains at an exploratory stage. Here, we present a robotic system capable of representing, learning, and inferring ownership relations and norms. Ownership is represented as a graph of probabilistic relations between objects and their owners, along with a database of predicate-based norms that constrain the actions permissible on owned objects. To learn these norms and relations, our system integrates (i) a novel incremental norm learning algorithm capable of both one-shot learning and induction from specific examples, (ii) Bayesian inference of ownership relations in response to apparent rule violations, and (iii) percept-based prediction of an object’s likely owners. Through a series of simulated and real-world experiments, we demonstrate the competence and flexibility of the system in performing object manipulation tasks that require a variety of norms to be followed, laying the groundwork for future research into the acquisition and application of social norms.
2018
Cross-issue solidarity and truth convergence in opinion dynamics
Zhi Xuan Tan, and Kang Hao Cheong
Journal of Physics A: Mathematical and Theoretical Jan 2018
How do movements and coalitions which engage with multiple social issues succeed in cross-issue solidarity, and when do they instead become fragmented? To address this, the mechanisms of cross-issue interaction have to be understood. Prior work on opinion dynamics and political disagreement has focused on single-issue consensus and polarization. Inspired by practices of cross-issue movement building, we have developed a general model of multi-issue opinion dynamics where agreement on one issue can promote greater inclusivity in discussing other issues, thereby avoiding the pitfalls of exclusivist interaction, where individuals engage only if they agree sufficiently on every issue considered. Our model shows that as more issues come into play, consensus and solidarity can only be maintained if inclusivity towards differing positions is increased. We further investigate whether greater inclusivity and compromise across issues lead people towards or away from normative truth, thereby addressing concerns about the non-ideal nature of political consensus.
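A toy bounded-confidence-style sketch contrasting exclusivist and inclusive cross-issue interaction (not the paper's exact model; the threshold, update rate, and population are assumptions): under the inclusive rule, sufficient agreement on any one issue opens discussion of all issues.

```python
# Toy multi-issue opinion dynamics with exclusivist vs. inclusive engagement rules.
import random

def step(opinions, threshold=0.3, mu=0.3, inclusive=True):
    i, j = random.sample(range(len(opinions)), 2)
    diffs = [abs(a - b) for a, b in zip(opinions[i], opinions[j])]
    engage = min(diffs) < threshold if inclusive else max(diffs) < threshold
    if engage:   # move the pair toward each other on every issue
        for k in range(len(opinions[i])):
            delta = mu * (opinions[j][k] - opinions[i][k])
            opinions[i][k] += delta
            opinions[j][k] -= delta

random.seed(0)
opinions = [[random.random(), random.random()] for _ in range(50)]   # 50 agents, 2 issues
for _ in range(20000):
    step(opinions, inclusive=True)
spread = [max(o[k] for o in opinions) - min(o[k] for o in opinions) for k in (0, 1)]
print(spread)   # smaller spread on both issues indicates convergence toward cross-issue consensus
```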
2017
Nomadic-colonial life strategies enable paradoxical survival and growth despite habitat destruction