This page contains additional material for the paper "Expectation maximization for average reward decentralized POMDPs" [1], to be presented at the ECML PKDD 2013 conference. Below you will find descriptions of the two average reward decentralized POMDP (Dec-POMDP) problems introduced in the paper, their problem definitions in the dpomdp file format, and information on applying non-linear programming to average reward Dec-POMDP problems.

Wireless network with overhead (|S| = 64, |A_i| = 2, |O_i| = 6). In the wireless networking problem [2], two wireless agents try to keep their transmit buffers, each modeled with four states, as empty as possible. Each buffer receives data from a two-state source model. Buffer fullness is discretized into a few states at coarse intervals, so an insertion or transmission changes the buffer state only with some probability. If both agents transmit simultaneously, both transmissions fail and no data is removed from the buffers. The world state is the Cartesian product of the transmit buffers and source models, 64 states in total. In [2] the objective corresponded to minimizing delay; in the new problem, successful transmissions are rewarded, corresponding to maximizing throughput. In real wireless networks decisions are made at 10 microsecond intervals; to reflect this, we multiplied the probability of transitioning from one buffer state to another and the probability of inserting data into a buffer by 0.01. Because overhead from packet headers and similar sources is proportionally smaller for larger packets, the new wireless problem allows more data to be transmitted when the buffer is fuller: for buffer fullness x, y = 2x / (x + 1) data units are transmitted (the probability of changing the buffer state is proportional to y).
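As a quick illustration of the overhead model, the minimal sketch below tabulates the transmitted data amount y = 2x / (x + 1) per fullness level; interpreting the four buffer states as fullness levels 1 through 4 is our assumption for illustration only, not part of the problem file.

    # Transmitted data as a function of buffer fullness: y = 2x / (x + 1).
    def transmitted_data(x):
        """Data units moved by one successful transmission at fullness x."""
        return 2.0 * x / (x + 1.0)

    # Hypothetical fullness levels 1..4; larger packets waste proportionally
    # less capacity on headers, so y grows toward 2 as x grows.
    for x in range(1, 5):
        y = transmitted_data(x)
        print(f"fullness {x}: {y:.2f} data units "
              f"(buffer-change probability proportional to {y:.2f})")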

Long fire fight (|S| = 27, |A_i| = 2, |O_i| = 2). In the fire fighting problem [3], two robots try to extinguish fires in three houses and receive negative reward for higher house fire levels (see [3] for details). In the new long fire fighting version, a house can also start burning on its own with probability 0.1. To make a single Dec-POMDP time step correspond to a shorter time in the real application, we multiplied all transition probabilities between fire levels by 0.01. In this version a fire takes longer to put out, and fire levels also take longer to increase.
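The rescaling can be sketched as follows. This is a sketch under our own assumption that the probability mass removed from level-changing transitions goes to the self-transition, so each row still sums to one; the dpomdp file below is authoritative, and the example matrix is illustrative, not taken from the problem.

    import numpy as np

    def slow_down(P, factor=0.01):
        """Scale all fire-level-changing (off-diagonal) transition probabilities
        by `factor`; the removed mass goes to the self-transition so each row
        remains a valid probability distribution (assumed renormalization)."""
        Q = P * factor
        np.fill_diagonal(Q, 0.0)                   # drop scaled self-transitions
        np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))   # self-transition absorbs the rest
        return Q

    # Hypothetical 3-level fire dynamics for one house (illustrative numbers only).
    P = np.array([[0.7, 0.3, 0.0],
                  [0.2, 0.5, 0.3],
                  [0.0, 0.4, 0.6]])
    print(slow_down(P))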

Problem definition for the wireless network with overhead Dec-POMDP problem in the dpomdp file format.
Problem definition for the long fire fight Dec-POMDP problem in the dpomdp file format.

To solve the non-linear program for an aperiodic unichain average reward Dec-POMDP problem, we used the SNOPT solver on the publicly available NEOS server. The NEOS server requires a model file and a data file. The model file specifies the structure of the non-linear average reward Dec-POMDP program. The data file specifies the problem definition, that is, the problem-specific observation and transition probabilities and the reward function. In addition, the data file includes (random) initial guesses for the belief over controller and world states and for the controller action and transition probabilities.
Here are the data files used in the long fire fight and wireless network with overhead Dec-POMDP problems.
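For orientation, the non-linear program has roughly the following shape. This is a sketch in our own notation, assuming a unichain stationary distribution; see the paper [1] and the model file for the exact formulation. The variables are the stationary occupancy x(q, s) over joint controller nodes q and world states s, together with the controller parameters, which in the decentralized case factor per agent as pi(a | q) = prod_i pi_i(a_i | q_i), and similarly for the node transitions lambda.

    \begin{align*}
    \max_{x,\,\pi,\,\lambda}\ & \sum_{q,s,a} x(q,s)\,\pi(a \mid q)\,R(s,a) \\
    \text{s.t.}\ & x(q',s') = \sum_{q,s,a,o} x(q,s)\,\pi(a \mid q)\,P(s' \mid s,a)\,
                   O(o \mid s',a)\,\lambda(q' \mid q,a,o) \quad \forall\, q',s' \\
    & \sum_{q,s} x(q,s) = 1, \qquad x \ge 0, \qquad
      \pi_i,\ \lambda_i \ \text{valid conditional distributions for each agent } i
    \end{align*}

The (random) guesses in the data file initialize x, pi, and lambda for the local solver.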


[1] Pajarinen, J., Peltonen, J. Expectation maximization for average reward decentralized POMDPs. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), to appear, 2013.

[2] Pajarinen, J., Peltonen, J. Efficient planning for factored infinite-horizon DEC-POMDPs. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), pages 325--331, AAAI Press, 2011.

[3] Oliehoek, F., Spaan, M., Whiteson, S., Vlassis, N. Exploiting locality of interaction in factored DEC-POMDPs. In Proceedings of the Seventh International Conference on Autonomous Agents and Multiagent Systems (AAMAS), volume 1, pages 517--524, IFAAMAS, 2008.