I am a PhD student supervised by Prof. Joelle Pineau at McGill University and Mila.
I am primarily interested in Reinforcement Learning and Artificial Intelligence Safety, with a focus on building efficient algorithms that prevent harm.
Research Summary:
As automated decision-making systems are increasingly deployed around us, it becomes important to address the safety risks and biases associated with them: machine learning algorithms have been shown to cause unintended behaviour or harm when they are not developed and deployed with care in a societal setting.
My research takes steps towards this goal by building efficient Reinforcement Learning algorithms for settings where the primary focus is still on learning intelligent behaviour to accomplish a task, but with additional requirements on the algorithm's behaviour related to safety, reliability, or fairness.
I am on the job market for industry positions. Please reach out if I'd be a good fit for your research group.
09/2021"Multi-Objective SPIBB" has been accepted to NeurIPS 2021!
Research
Group Fairness in Reinforcement Learning. Harsh Satija, Matteo Pirotta, Alessandro Lazaric, Joelle Pineau.
In Transactions on Machine Learning Research (TMLR), 2023.
An earlier version appeared at the European Workshop on Reinforcement Learning (EWRL), 2022 (Oral).
We pose and study the problem of satisfying fairness in the online Reinforcement Learning (RL) setting. We focus on group notions of fairness, according to which agents belonging to different groups should have similar performance under some given measure. We consider the setting of maximizing return in an unknown environment (unknown transition and reward functions) and show that it is possible for RL algorithms to learn the best fair policies without violating the fairness requirements at any point during the learning process. In the tabular finite-horizon episodic setting, we provide an algorithm that combines the principles of optimism and pessimism under uncertainty to achieve zero fairness violations with arbitrarily high probability while also maintaining sub-linear regret guarantees. For the high-dimensional Deep-RL setting, we present algorithms based on performance-difference-style approximate policy improvement updates, and we report encouraging empirical results on several traditional RL-inspired benchmarks showing that our algorithms learn the optimal policy while keeping the learning process fair.
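The optimism/pessimism combination can be pictured with a small sketch. This is not the paper's algorithm, and all names below are hypothetical: candidate policies are scored optimistically on return, while the group-fairness constraint is checked against pessimistic, worst-case estimates of each group's performance.

```python
import numpy as np

def select_fair_policy(opt_return, pess_perf_a, opt_perf_a,
                       pess_perf_b, opt_perf_b, eps):
    """Among candidate policies, keep only those whose worst-case (pessimistic)
    estimate of the performance gap between groups A and B is within the
    fairness tolerance eps, then pick the one with the highest optimistic
    return estimate. All inputs are arrays indexed by candidate policy."""
    # Largest gap compatible with the confidence intervals of both groups.
    worst_gap = np.maximum(opt_perf_a - pess_perf_b, opt_perf_b - pess_perf_a)
    feasible = worst_gap <= eps
    if not feasible.any():
        raise ValueError("no candidate policy is provably fair at tolerance eps")
    idx = np.where(feasible)[0]
    return idx[np.argmax(opt_return[idx])]
```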
Multi-Objective SPIBB. Harsh Satija, Philip S. Thomas, Joelle Pineau, Romain Laroche.
Neural Information Processing Systems (NeurIPS), 2021.
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning (RL) setting. We consider the scenario where: (i) we have a dataset collected under a known baseline policy, and (ii) multiple reward signals are received from the environment, each inducing an objective to optimize. We present an SPI formulation for this setting that takes into account the user's preferences for trading off the different reward signals, while ensuring that the new policy performs at least as well as the baseline policy along each individual objective. We build on traditional SPI algorithms and propose a novel method based on Safe Policy Iteration with Baseline Bootstrapping (SPIBB; Laroche et al., 2019) that provides high-probability guarantees on the performance of the agent in the true environment. We show the effectiveness of our method on a synthetic grid-world safety task as well as in a real-world critical care context, learning a policy for the administration of IV fluids and vasopressors to treat sepsis.
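For intuition, here is a minimal sketch of the single-objective SPIBB bootstrapping step that this line of work builds on; the multi-objective extension and the high-probability guarantees are not shown, and the threshold name `n_min` is illustrative.

```python
import numpy as np

def spibb_policy(q_values, baseline, counts, n_min):
    """Sketch of SPIBB-style bootstrapping on a tabular MDP.

    q_values : (S, A) action-value estimates from the offline dataset
    baseline : (S, A) baseline policy probabilities
    counts   : (S, A) state-action visit counts in the dataset
    n_min    : bootstrapping threshold (illustrative hyperparameter name)
    """
    n_states, _ = q_values.shape
    policy = np.zeros_like(baseline)
    for s in range(n_states):
        uncertain = counts[s] < n_min              # poorly estimated actions
        # Keep the baseline's probability mass on uncertain actions...
        policy[s, uncertain] = baseline[s, uncertain]
        # ...and move the remaining mass to the best well-estimated action.
        free_mass = 1.0 - policy[s].sum()
        certain = np.where(~uncertain)[0]
        if certain.size > 0:
            best = certain[np.argmax(q_values[s, certain])]
            policy[s, best] += free_mass
        else:
            policy[s] = baseline[s]                # no reliable data: copy baseline
    return policy
```

The design choice illustrated here is that improvement is only attempted where the dataset supports it; everywhere else the baseline is reproduced, which is what makes safe-improvement guarantees possible.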
A major challenge in reinforcement learning is the design of exploration strategies, especially for environments with sparse reward structures and continuous state and action spaces. Intuitively, if the reinforcement signal is very scarce, the agent should rely on some form of short-term memory in order to cover its environment efficiently. We propose a new exploration method, based on two intuitions: (1) the choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent's trajectory so far, and (2) the agent should utilize a measure of spread in the state space to avoid getting stuck in a small region. Our method leverages concepts often used in statistical physics to provide explanations for the behavior of simplified (polymer) chains in order to generate persistent (locally self-avoiding) trajectories in state space. We discuss the theoretical properties of locally self-avoiding walks and their ability to provide a kind of short-term memory through a decaying temporal correlation within the trajectory. We provide empirical evaluations of our approach in a simulated 2D navigation task, as well as higher-dimensional MuJoCo continuous control locomotion tasks with sparse rewards.
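As a rough illustration of the trajectory-dependent action choice (this is not the method itself), the sketch below scores candidate moves on a 2D grid by how close they land to recently visited positions, with a geometrically decaying memory; `beta` and `decay` are illustrative parameters.

```python
import numpy as np

def persistent_action(pos, recent_states, actions, beta=1.0, decay=0.9, rng=None):
    """Prefer moves that stay away from recently visited grid positions,
    weighting older visits less (a crude stand-in for local self-avoidance)."""
    rng = rng or np.random.default_rng()
    penalties = []
    for a in actions:                                  # e.g. [(0, 1), (0, -1), (1, 0), (-1, 0)]
        nxt = (pos[0] + a[0], pos[1] + a[1])
        penalties.append(sum(
            decay ** (len(recent_states) - 1 - t)      # decaying temporal correlation
            / (1.0 + abs(nxt[0] - s[0]) + abs(nxt[1] - s[1]))
            for t, s in enumerate(recent_states)
        ))
    scores = -beta * np.array(penalties)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]
```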
Although Reinforcement Learning (RL) algorithms have found tremendous success in simulated domains, they often cannot directly be applied to physical systems, especially in cases where there are hard constraints to satisfy (e.g. on safety or resources). In standard RL, the agent is incentivized to explore any behavior as long as it maximizes rewards, but in the real world, undesired behavior can damage either the system or the agent in a way that breaks the learning process itself. In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. A key contribution of our approach is to translate cumulative cost constraints into state-based constraints. Through this, we define a safe policy improvement method which maximizes returns while ensuring that the constraints are satisfied at every step. We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training. We also highlight the computational advantages of this approach. The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks.
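One way to picture a state-wise safety check of this kind (an illustration only, not the exact formulation in the paper; the array names are hypothetical): keep an action only if the cost already accumulated along the trajectory plus that action's expected cost-to-go fits within the episode's cost budget.

```python
import numpy as np

def safe_action_mask(q_cost_forward, v_cost_backward, state, budget):
    """Illustrative state-wise safety check.

    q_cost_forward  : (S, A) expected future cumulative cost per state-action
    v_cost_backward : (S,)   expected cost accumulated before reaching each state
    """
    total = v_cost_backward[state] + q_cost_forward[state]
    return total <= budget                              # boolean mask over actions

def safe_greedy_action(q_reward, q_cost_forward, v_cost_backward, state, budget):
    mask = safe_action_mask(q_cost_forward, v_cost_backward, state, budget)
    if not mask.any():
        return int(np.argmin(q_cost_forward[state]))    # no safe action: minimize cost
    safe = np.where(mask)[0]
    return int(safe[np.argmax(q_reward[state, safe])])
```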
Exploration is an essential component of reinforcement learning algorithms, where agents need to learn how to predict and control unknown and often stochastic environments. Reinforcement learning agents depend crucially on exploration to obtain informative data for the learning process, as a lack of such information can hinder effective learning. In this article, we provide a survey of modern exploration methods in (sequential) reinforcement learning, as well as a taxonomy of exploration methods.
Randomized value functions offer a promising approach to the challenge of efficient exploration in complex environments with high-dimensional state and action spaces. Unlike traditional point estimate methods, randomized value functions maintain a posterior distribution over action values. This prevents the agent's behavior policy from prematurely exploiting early estimates and falling into local optima. In this work, we leverage recent advances in variational Bayesian neural networks and combine these with traditional Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) to achieve randomized value functions for high-dimensional domains. In particular, we augment DQN and DDPG with multiplicative normalizing flows in order to track a rich approximate posterior distribution over the parameters of the value function. This allows the agent to perform approximate Thompson sampling in a computationally efficient manner via stochastic gradient methods. We demonstrate the benefits of our approach through an empirical comparison in high-dimensional environments.
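The Thompson-sampling view can be conveyed with a much simpler posterior than the multiplicative normalizing flows used in the paper: below, a Gaussian posterior over the weights of a linear Q-function is sampled once per decision, and the agent acts greedily with respect to that sample. This is a simplified stand-in, and all names are illustrative.

```python
import numpy as np

class BayesianLinearQ:
    """Gaussian posterior over the weights of a linear Q-function; acting
    greedily w.r.t. a single posterior sample gives approximate Thompson
    sampling (a simplified stand-in for the flow-based posterior)."""

    def __init__(self, feature_dim, n_actions, prior_var=1.0, noise_var=1.0):
        self.noise_var = noise_var
        self.mean = np.zeros((n_actions, feature_dim))
        self.cov = np.stack([prior_var * np.eye(feature_dim) for _ in range(n_actions)])

    def act(self, features):
        # Sample one weight vector per action from the posterior, then be greedy.
        q = [features @ np.random.multivariate_normal(self.mean[a], self.cov[a])
             for a in range(self.mean.shape[0])]
        return int(np.argmax(q))

    def update(self, features, action, target):
        # Standard Bayesian linear-regression posterior update for one action.
        prior_prec = np.linalg.inv(self.cov[action])
        post_prec = prior_prec + np.outer(features, features) / self.noise_var
        self.cov[action] = np.linalg.inv(post_prec)
        self.mean[action] = self.cov[action] @ (
            prior_prec @ self.mean[action] + features * target / self.noise_var)
```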
Current reinforcement learning (RL) methods can successfully learn single tasks, but often generalize poorly to modest perturbations in task domain or training procedure. In this work we present a decoupled learning strategy for RL that creates a shared representation space where knowledge can be robustly transferred. We separate learning the task representation, the forward dynamics, the inverse dynamics and the reward function of the domain, and show that this decoupling improves performance within task, transfers well to changes in dynamics and reward, and can be effectively used for online planning. Empirical results show good performance in both continuous and discrete RL domains.
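The decoupling can be pictured with a short sketch (illustrative only; the layer sizes, module names, and the assumption of continuous actions are not the paper's architecture): a shared encoder feeds separate heads for forward dynamics, inverse dynamics, and reward, each trained with its own loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledModel(nn.Module):
    """Minimal sketch of a decoupled setup: shared encoder plus separate
    forward-dynamics, inverse-dynamics, and reward heads."""

    def __init__(self, obs_dim, act_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.forward_dyn = nn.Linear(latent_dim + act_dim, latent_dim)   # (z, a) -> z'
        self.inverse_dyn = nn.Linear(2 * latent_dim, act_dim)            # (z, z') -> a
        self.reward_head = nn.Linear(latent_dim + act_dim, 1)            # (z, a) -> r

    def losses(self, obs, action, next_obs, reward):
        z, z_next = self.encoder(obs), self.encoder(next_obs)
        fwd = F.mse_loss(self.forward_dyn(torch.cat([z, action], dim=-1)), z_next)
        inv = F.mse_loss(self.inverse_dyn(torch.cat([z, z_next], dim=-1)), action)
        rew = F.mse_loss(self.reward_head(torch.cat([z, action], dim=-1)).squeeze(-1), reward)
        return fwd, inv, rew
```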