This page contains additional material for the paper "Expectation
maximization for average reward decentralized POMDPs" [1], to be
presented at the ECML PKDD 2013 conference. Below you will find a
description of the two average reward decentralized POMDP (Dec-POMDP)
problems introduced in the paper, the corresponding problem
definitions, and information on applying non-linear programming to
average reward Dec-POMDP problems.
Wireless network with overhead (|S| = 64, |A_i| = 2, |O_i| =
6). In the wireless networking problem [2], two wireless agents try to
keep their transmit buffers, modeled with four states each, as empty as
possible. Each buffer receives data from a two-state source model. Buffer
fullness is modeled as a few states at coarse intervals; insertions
and transmissions have a probability of changing the buffer
state. If both agents transmit simultaneously, both transmissions fail
and no data is removed from the buffers. The world state is the cross
product of the transmit buffers and source models, 64 states in
total. In [2] the objective corresponded to minimizing delay; in the
new problem, successful transmissions are rewarded, which corresponds to
maximizing throughput. In real wireless networks, decisions are made
at 10 microsecond intervals; to reflect this, we multiplied the
probability of transitioning from one buffer state to another and the
probability of inserting data into a buffer by 0.01. As overhead from
packet headers and similar sources is proportionally smaller for larger
packets, the new wireless problem allows more data to be transmitted when
the buffer is fuller: for buffer fullness x, y = 2x / (x+1) data units are
transmitted (the probability of changing the buffer state is
proportional to y).
Long fire fight (|S| = 27, |A_i| = 2, |O_i| = 2). In the fire
fighting problem [3], two robots try to extinguish fires in three houses
and receive negative reward for higher house fire levels (see [3] for
details). In the new long fire fighting version, a house can also catch
fire on its own with probability 0.1. To make a single Dec-POMDP
time step correspond to a shorter time in the real application, we
multiplied all transition probabilities between fire levels by
0.01. In this version a fire takes longer to put out, and fire levels
take longer to increase.
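A minimal Python sketch of this slowing-down transformation is below;
the toy matrix values and the assumption that the removed probability
mass stays on the self-transition are ours, not taken from the problem
definition file.

    import numpy as np

    def slow_down(P, factor=0.01):
        """Scale all transitions between fire levels by `factor`.

        The convex combination factor*P + (1-factor)*I multiplies every
        off-diagonal entry by `factor` and puts the removed mass on the
        self-transition, so rows remain stochastic."""
        P = np.asarray(P, dtype=float)
        return factor * P + (1.0 - factor) * np.eye(P.shape[0])

    # Toy 3-level fire chain (hypothetical numbers, not from the paper):
    P = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.7, 0.2],
                  [0.0, 0.1, 0.9]])
    print(slow_down(P).round(4))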
Problem definition for the wireless network with overhead Dec-POMDP
problem in the dpomdp file format.
Problem definition for the long fire fight Dec-POMDP problem in the
dpomdp file format.
To solve a non-linear program for an aperiodic unichain average
reward Dec-POMDP problem, we used the publicly available SNOPT solver
on the NEOS server. The NEOS server requires a model file and a data
file. The model file specifies the structure of the non-linear average
reward Dec-POMDP program. The data file specifies the problem
definition, that is, the problem-specific observation and transition
probabilities and the reward function. In addition, the data file
includes (random) guesses for the initial belief over controller and
world states, and (random) guesses for the controller action and
transition probabilities.
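The exact program is given in the model file above. As a rough sketch
of the general shape of such a program (the notation is ours and the
model file may differ): with a stationary distribution phi(s, q) over
world states s and joint controller nodes q = (q_1, q_2), per-agent
action probabilities pi_i(a_i | q_i), and controller transition
probabilities lambda_i(q_i' | q_i, o_i), the program maximizes the
average reward subject to stationarity and normalization constraints.
For an aperiodic unichain problem these constraints pin down the
stationary distribution of the controlled chain, so the objective
equals the long-run average reward.

    % Sketch (assumed notation) of an average reward Dec-POMDP NLP for
    % two agents with finite state controllers; the model file may differ.
    \begin{align*}
    \max_{\phi,\,\pi_i,\,\lambda_i}\quad
      & \sum_{s,q} \phi(s,q) \sum_{a}
        \Big(\prod_i \pi_i(a_i \mid q_i)\Big) R(s,a) \\
    \text{s.t.}\quad
      & \phi(s',q') = \sum_{s,q,a,o} \phi(s,q)
        \Big(\prod_i \pi_i(a_i \mid q_i)\Big)
        P(s' \mid s,a)\, O(o \mid s',a)
        \Big(\prod_i \lambda_i(q_i' \mid q_i,o_i)\Big)
        \quad \forall s',q' \\
      & \sum_{s,q} \phi(s,q) = 1, \qquad
        \sum_{a_i} \pi_i(a_i \mid q_i) = 1, \qquad
        \sum_{q_i'} \lambda_i(q_i' \mid q_i,o_i) = 1 \\
      & \phi \ge 0, \qquad \pi_i \ge 0, \qquad \lambda_i \ge 0
    \end{align*}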
Here are the data files used in the long fire fight and wireless
network with overhead Dec-POMDP problems.
[1] Pajarinen, J., Peltonen, J. Expectation maximization for average
reward decentralized POMDPs. In Proceedings of the European Conference
on Machine Learning and Principles and Practice of Knowledge Discovery
in Databases (ECML PKDD), to appear, 2013.
[2] Pajarinen, J., Peltonen, J. Efficient planning for factored
infinite-horizon DEC-POMDPs. In Proceedings of the Twenty-Second
International Joint Conference on Artificial Intelligence (IJCAI),
pages 325--331, AAAI Press, 2011.
[3] Oliehoek, F., Spaan, M., Whiteson, S., Vlassis, N. Exploiting
locality of interaction in factored DEC-POMDPs. In Proceedings of the
Seventh International Conference on Autonomous Agents and Multiagent
Systems (AAMAS), vol. 1, pages 517--524, IFAAMAS, 2008.