Emergence of Grounded Compositional Language in Multi-Agent Populations

12 Sep 2018

Introduction

The paper provides a multi-agent learning environment and proposes a learning approach that facilitates the emergence of a basic compositional language.
The language is rudimentary: essentially a sequence of abstract discrete symbols. It nonetheless has a defined vocabulary and syntax.

The setting is a cooperative, partially observable Markov game (a multi-agent extension of an MDP).
All agents have identical action and observation spaces, use the same policy and receive a shared reward.

Physically simulated 2-D environment in continuous space and discrete time with N agents and M landmarks.
The agents and the landmarks each occupy a location and have some attributes (colour, shape).
Within the environment, the agents can go to a location, look at a location or do nothing. Additionally, they can utter communication symbols c (from a shared vocabulary C). Agents themselves learn to assign a meaning to the symbols.
Each agent has a private internal goal, which the other agents cannot see and which may require interaction with other agents to complete.
The goal for agent i consists of an action to perform, a landmark location at which to perform it, and the agent (which may be a different agent) that should perform it.
Since the agents emit symbols continuously over time, each agent is given a memory module, which is updated with simple additive updates.
For interaction, the agents could use verbal utterances, non-verbal signals (gaze) or non-communicative strategies (pushing other agents).
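
The additive memory update mentioned above can be sketched as follows. This is only an illustration: it assumes the update has the form m_t = tanh(m_{t-1} + delta + noise), and the function name `update_memory`, the noise scale, and the example values are hypothetical rather than taken from the paper's code.

```python
import math
import random


def update_memory(memory, delta, noise_std=0.0, rng=None):
    """Additive memory update (assumed form): m_t = tanh(m_{t-1} + delta + noise)."""
    if len(memory) != len(delta):
        raise ValueError("memory and delta must have the same length")
    rng = rng or random.Random()
    updated = []
    for m, d in zip(memory, delta):
        # Optional Gaussian noise; disabled when noise_std is 0.
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        # tanh keeps every memory cell inside (-1, 1).
        updated.append(math.tanh(m + d + noise))
    return updated


# Hypothetical usage: the policy proposes a delta at every time step.
memory = [0.0, 0.0, 0.0]
for delta in ([0.5, -0.2, 0.0], [0.5, 0.1, 0.0]):
    memory = update_memory(memory, delta)
```

Because the update is a plain sum followed by a squashing function, each memory cell stays bounded while still accumulating information across time steps.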

An end-to-end differentiable model of all agent and environment state dynamics is built over time, and the gradient of the return is computed by backpropagation.
The Gumbel-Softmax distribution is used to obtain the categorical word emission c while keeping the model differentiable.
The policy is modelled as a multi-layer perceptron that returns the action, the communication symbol and the memory update for each agent.
Since the number of agents (and hence the number of communication streams, etc.) can vary across instantiations, an identical processing module is instantiated per agent and per communication stream.
The outputs of the individual processing modules are pooled into feature vectors corresponding to communication and physical observations. These pooled features and the goal vectors are fed to the final processing module, from which actions and categorical symbols are sampled.
In practice, using an additional task (each agent predicts the goal for another agent) encouraged more meaningful communication utterances.
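
As a rough illustration of the Gumbel-Softmax step described above, the sketch below draws a relaxed (soft) one-hot sample over the vocabulary from a vector of logits. The function name `gumbel_softmax`, the temperature value, and the example logits are assumptions made for illustration; the source does not specify these details.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Return a relaxed one-hot sample over len(logits) symbols."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical usage with a three-symbol vocabulary.
rng = random.Random(0)
soft_sample = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
symbol = soft_sample.index(max(soft_sample))  # index of the emitted symbol
```

Lower temperatures push the soft sample towards a one-hot vector, while higher temperatures give smoother samples; in an automatic-differentiation framework the soft sample is what lets gradients flow through the discrete choice.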

The authors recommend using a large vocabulary together with a soft penalty that discourages the use of too many words. Training then uses a large vocabulary in the intermediate stages and converges to a small one.
Following rich-get-richer dynamics, the communication symbols c are modelled as being generated by a Dirichlet process. The resulting reward, shared across all agents, is the log-likelihood of all communication utterances under that Dirichlet process.
Since the agents can only communicate in discrete symbols and do not have a global positioning reference, they need to unambiguously communicate landmark references to other agents.
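
The Dirichlet-process vocabulary reward described above can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes that the probability of symbol k is n_k / (alpha + n - 1), where n_k is the number of times symbol k was uttered and n is the total number of utterances, and that the reward sums the log-probability of every utterance. The name `vocabulary_reward` and the value of `alpha` are hypothetical.

```python
import math
from collections import Counter


def vocabulary_reward(utterances, alpha=1.0):
    """Log-likelihood of the utterances under the assumed rich-get-richer model."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = len(utterances)
    if n == 0:
        return 0.0
    counts = Counter(utterances)
    denom = alpha + n - 1
    # Each of the n_k utterances of symbol k contributes log(n_k / denom).
    return sum(n_k * math.log(n_k / denom) for n_k in counts.values())


# Hypothetical comparison: one reused symbol versus six different symbols.
concentrated = vocabulary_reward([0, 0, 0, 0, 0, 0])
spread = vocabulary_reward([0, 1, 2, 3, 4, 5])
```

Under these assumptions, reusing a few symbols scores higher than spreading the same number of utterances over many symbols, which is the soft pressure towards a small vocabulary.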

In one setting, non-verbal communication is not possible.
When training with just 2 agents, symbols are assigned to each landmark and action.
As the number of agents is increased, additional symbols are used to refer to agents.
If agents of the same colour are asked to perform conflicting tasks, they perform the average of the conflicting tasks. If distractor locations are added, the agents learn to ignore them.

In another setting, agents are allowed to observe other agents' positions, gaze, etc.
A location can then be pointed to using gaze.
If gaze is disabled, an agent can indicate the goal landmark by moving to it.
Even when verbal communication is disabled, the agents can come up with strategies to complete the task.