Article

Multi-Agent Reinforcement Learning Using Linear Fuzzy Model Applied to Cooperative Mobile Robots

by David Luviano-Cruz, Francesco Garcia-Luna, Luis Pérez-Domínguez and S. K. Gadi
1
Department of Industrial Engineering and Manufacturing, Autonomous University of Ciudad Juarez, Ciudad Juarez 32310, Mexico
2
Faculty of Mechanical and Electrical Engineering, Autonomous University of Coahuila, Torreon 27276, Mexico
*
Author to whom correspondence should be addressed.
Symmetry 2018, 10(10), 461; https://doi.org/10.3390/sym10100461
Submission received: 5 September 2018 / Revised: 24 September 2018 / Accepted: 30 September 2018 / Published: 3 October 2018
(This article belongs to the Special Issue Symmetry in Complex Systems)

Abstract

A multi-agent system (MAS) is suitable for addressing tasks in a variety of domains without any pre-programmed behaviors, which makes it well suited to the problems associated with mobile robots. Reinforcement learning (RL) is a successful approach used in MASs to acquire new behaviors; most existing methods store exact Q-values over small, discrete state and action spaces. This article presents a linearly fuzzified joint Q-function for a MAS with a continuous state space, which overcomes the dimensionality problem. The article also provides a proof of the convergence and of the existence of the solution produced by the proposed algorithm, and discusses the numerical simulations and experimental results carried out to validate it.

    1. Introduction

    Multi-agent systems (MASs) are finding application in a variety of fields where pre-programmed behaviors are not a suitable way to tackle the problems that arise. These fields include robotics, distributed control, resource management, collaborative decision making, and data mining [1]. A MAS includes several intelligent agents in an environment, where each one has its own independent behavior and must coordinate with the others [2].
    MASs have emerged as an alternative way to analyze and represent systems with centralized control, where several intelligent agents perceive and modify an environment through sensors and actuators, respectively. At the same time, these agents can learn new behaviors to adapt themselves to new tasks and goals in the environment [3].
    One of the fields where multi-agent systems have emerged is mobile robotics. Most approaches are based on low-level control systems; in [4], a visibility binary tree algorithm is used to generate mobile robot trajectories. This type of approach relies on complete knowledge of the dynamics of the robotic system. In this article, we offer a proposal based on reinforcement learning, which results in high-level control actions.
    Reinforcement learning (RL) is one of the most popular methods for learning in a MAS. The objective of multi-agent reinforcement learning (MARL) is to maximize a numerical reward while the agents interact with and modify the environment [5]. At each learning step, the agents choose an action, which drives the environment to a new state [6]. The reward function assesses the quality of this state transition [7]. In RL, the agents are not told which actions to execute; instead, they must discover which actions yield the best reward. Hence, the RL feedback is less informative than that of a supervised learning method [8].
    MASs are affected by the curse of dimensionality, a term indicating that the computational and memory requirements grow as the number of states or agents in the environment increases. Most approaches require an exact representation of the state-action pair values in the form of lookup tables, making the solution intractable; hence, the application of these methods is restricted to small or discrete tasks [9]. In real-life applications, the state variables can take a large number of possible values, or even continuous values, so the problem becomes manageable only if the value functions are approximated [10].
    Several MARL algorithms have been proposed to deal with this problem: using neural networks to generalize from a Q-table [11], applying function approximation for discrete and large state-action spaces [12], applying vector quantization for continuous states and actions [13], using experience replay for MASs [14], using Q-learning with a normalized Gaussian network as approximator [15], and making predictions in systems with heterogeneous agents [16]. In [17], a pair of neural networks is used to represent the value function and the controller; however, the proposed strategy depends on sufficient exploration, which is a function of the size of the training data of the neural networks. Inverse neural networks have also been proposed to approximate the action policy, starting from an initial policy that is refined through reinforcement learning [18].
    This article presents an approach for MARL in a cooperative problem. It is a modified version of the Q-learning algorithm proposed in [19], which uses a linear fuzzy approximator of the joint Q-function for a continuous state space. An implicit form of coordination is implemented to solve the coordination problem. An experiment with two robots solving a coordination task is conducted to verify the proposed algorithm. In addition, two theorems are presented to ensure the convergence of the proposed algorithm.

    2. MARL with Linear Fuzzy Parameterization

    2.1. Single Agent Case

    In reinforcement learning (RL) for the single-agent case, let us define $x_k$ as the current state of the environment at learning step $k$, and $u_k$ as the action taken by the agent in $x_k$.
    The reward, or numerical feedback, $r_k$ reflects how good the previous action $u_{k-1}$ was in the state $x_{k-1}$. The single-agent RL problem is a Markov decision process (MDP). For the deterministic case, the MDP is given by:
    $$f : X \times U \to X, \qquad \rho : X \times U \to \mathbb{R}$$
    where $X$ is the state space, $U$ is the action space, $f$ is the state transition function (which can be known or unknown), and $\rho$ is the scalar reward function.
    The action policy describes the agent's behavior by specifying how the agent chooses an action in each state. If the action policy $h : X \to U$ does not change over time, it is considered stationary [20].
    The final goal is to find an action policy $h$ such that the long-term return $R$ is maximized:
    $$R^h(x) = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \,\middle|\, x_0 = x, h \right\}$$
    where $\gamma \in (0,1)$ is the discount factor. The policy $h$ is obtained from the state-action value function, called the Q-function.
    The Q-function
    $$Q^h : X \times U \to \mathbb{R}$$
    gives the expected return when following the policy $h$:
    $$Q^h(x,u) = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \,\middle|\, x_0 = x, u_0 = u, h \right\}$$
    The optimal Q-function $Q^*$ is defined as:
    $$Q^*(x,u) = \max_h Q^h(x,u)$$
    Once $Q^*$ is available, the optimal action policy is obtained by:
    $$h^*(x) = \arg\max_u Q^*(x,u)$$
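    As a point of reference before moving to the multi-agent case, the following minimal sketch shows these definitions in their textbook tabular form (this is not the algorithm proposed later in the paper; the toy MDP, its step function, and all sizes are illustrative assumptions):

```python
import numpy as np

# Minimal tabular Q-learning sketch for the single-agent case.
# n_states, n_actions and step() stand in for a small, discrete MDP (assumptions).
n_states, n_actions = 10, 4
gamma, alpha, epsilon = 0.95, 0.1, 0.1

Q = np.zeros((n_states, n_actions))      # exact Q-table: one value per pair (x, u)

def step(x, u):
    """Placeholder for the dynamics f and the reward function rho."""
    x_next = (x + u) % n_states          # toy deterministic transition
    r = 1.0 if x_next == 0 else 0.0      # toy reward
    return x_next, r

x = 0
for k in range(1000):
    # epsilon-greedy exploration
    u = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[x]))
    x_next, r = step(x, u)
    # update toward the Bellman target r + gamma * max_u' Q(x', u')
    Q[x, u] += alpha * (r + gamma * np.max(Q[x_next]) - Q[x, u])
    x = x_next

h_star = Q.argmax(axis=1)                # greedy policy h*(x) = argmax_u Q(x, u)
```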

    2.2. Multi Agent System Case

    In the multi-agent case, there are several heterogeneous agents, each with its own set of actions and tasks in a shared environment. A stochastic game model describes this setting, in which the action performed at any state is the combination of the actions of all the agents [21].
    The deterministic stochastic game is modeled by a tuple $\langle X, U_1, U_2, \ldots, U_n, f, \rho_1, \rho_2, \ldots, \rho_n \rangle$, where $n$ is the number of agents in the environment, $X$ is the state space of the environment, $U_i$, $i = 1, 2, \ldots, n$, are the sets of actions available to each agent, and the joint action set is $U = U_1 \times U_2 \times \cdots \times U_n$. The reward functions are $\rho_i : X \times U \to \mathbb{R}$, $i = 1, 2, \ldots, n$, and the state transition function is $f : X \times U \to X$.
    The joint action $u_k = [u_{1,k}^T, u_{2,k}^T, \ldots, u_{n,k}^T]^T$, $u_k \in U$, $u_{i,k} \in U_i$, taken in the state $x_k$, changes the state to $x_{k+1} = f(x_k, u_k)$. A numerical reward $r_{i,k+1} = \rho_i(x_k, u_k)$ is obtained for each joint action $u_k$. The actions are taken according to each agent's own policy $h_i : X \to U_i$, and together these form the joint policy $h$. As in the single-agent case, the state space and the action space can be continuous or discrete.
    The long-term return depends on the joint policy, $R_i^h(x) = \sum_{k=0}^{\infty} \gamma^k r_{i,k+1}$, because the numerical feedback $r_i$ of each agent depends on the joint action $u_k$. Thereby, the Q-function of each agent relies on the joint action and the joint policy, $Q_i^h : X \times U \to \mathbb{R}$, with $Q_i^h(x,u) = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{i,k+1} \,\middle|\, x_0 = x, u_0 = u, h \right\}$.
    Each agent could have its own goals; however, in this article the agents seek a common goal, i.e., the task is fully cooperative. In this case, the reward for any state transition is the same for all agents, $\rho_1 = \rho_2 = \cdots = \rho_n$, and therefore the returns are also the same for all agents, $R_1^h = R_2^h = \cdots = R_n^h$. Hence, the agents share the same objective, which is to maximize the common long-term performance (or return).
    Determining an optimal joint policy $h^*$ in multi-agent systems is the equilibrium selection problem [22]. Although establishing an equilibrium is difficult in general, the structure of cooperative settings makes this problem manageable. Assuming that the agents know the structure of the game, in the form of the transition function $f$ and the reward functions $\rho_i$, makes the search for the equilibrium point more tractable.
    In a fully cooperative stochastic game, if the transition function $f$ and the reward function $\rho$ are known by each agent, the objective can be accomplished by learning the optimal joint-action values $Q^*$ through the Bellman optimality equation, $Q^*(x_k, u_k) = \rho(x_k, u_k) + \gamma \max_{u'} Q^*\big(f(x_k, u_k), u'\big)$, and then using a greedy policy [23]. Once $Q^*$ is available, a policy $h^*$ is:
    $$h_i^*(x) = \arg\max_{u_i} \max_{u_1, u_2, \ldots, u_n} Q^*(x, u)$$
    When several joint actions are optimal, the agents could choose different actions and degrade the search for the common goal. This problem can be addressed in several ways: coordination-free methods assume that the optimal joint action is unique when a common Q-function is learned [24]; coordination-based methods use coordination graphs to decompose the global Q-function into local Q-functions [25]; and implicit coordination methods assume that the agents learn to choose one joint action by chance and then discard the others [26].

    2.3. Mapping the Joint Q-Function for MAS

    In the deterministic case, the optimal Q-function $Q^*$ can be computed by the iteration
    $$Q_{k+1} = H(Q_k), \qquad [H(Q)](x,u) = \rho(x,u) + \gamma \max_{u'} Q\big(f(x,u), u'\big) \qquad (8)$$
    with an arbitrary initial value for $Q$. The iteration (8) attracts $Q$ to a unique fixed point [27]:
    $$Q^* = H(Q^*).$$
    In [28] it is shown that the mapping $H : \mathcal{Q} \to \mathcal{Q}$ is a contraction with factor $\alpha < 1$; for any pair of Q-functions $Q_1$ and $Q_2$,
    $$\| H(Q_1) - H(Q_2) \| \leq \alpha \| Q_1 - Q_2 \|$$
    and therefore $H$ has a unique fixed point. $Q^*$ is that fixed point, $Q^* = H(Q^*)$, and the iteration converges to $Q^*$ as $k \to \infty$. An optimal policy $h_i^*(x)$ can then be calculated from $Q^*$ by the greedy rule above. To perform this iteration, we need a model of the task in the form of the transition function $f$ and the reward functions $\rho_i$.
    Methods of this kind, based on the Bellman optimality equation, require storing and updating the Q-values for each state-joint action pair. Consequently, only tasks with finite, discrete sets of states and actions are generally treated. Dimensionality problems arise as the number of agents involved in the task grows [29], which increases the computational complexity [30].
    When the state space or the action space is continuous, or discrete with a very large number of variables, the Q-function must be represented in an approximate form [31]. Since an exact representation of the Q-function can be impractical or intractable, we propose an approximate linear fuzzy representation of the joint Q-function through a parameter vector $\phi$.
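    For a finite state set and a finite joint-action set, the mapping of the Bellman optimality equation can be iterated exactly; the sketch below illustrates this baseline and why the table of size $|X| \cdot M^n$ becomes intractable as agents are added (the dynamics f, the reward rho, and the sizes are illustrative assumptions, not the task used later in the paper):

```python
import numpy as np
from itertools import product

# Exact Q-iteration Q_{k+1} = H(Q_k) over a finite joint-action game (sketch).
n_agents, M, n_states = 2, 3, 20                          # illustrative sizes
gamma = 0.95
joint_actions = list(product(range(M), repeat=n_agents))  # M**n joint actions

def f(x, u):                     # placeholder deterministic dynamics
    return (x + sum(u)) % n_states

def rho(x, u):                   # placeholder common reward
    return 1.0 if f(x, u) == 0 else 0.0

Q = np.zeros((n_states, len(joint_actions)))   # one row per state, one column per joint action
for _ in range(500):
    Q_new = np.empty_like(Q)
    for x in range(n_states):
        for a, u in enumerate(joint_actions):
            # [H(Q)](x, u) = rho(x, u) + gamma * max_{u'} Q(f(x, u), u')
            Q_new[x, a] = rho(x, u) + gamma * Q[f(x, u)].max()
    diff = np.abs(Q_new - Q).max()
    Q = Q_new
    if diff < 1e-6:              # fixed point Q* = H(Q*) reached
        break
```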

    2.4. Linear Fuzzy Parameterization of the Joint Q-Function

    In general, if there is no prior knowledge about the Q-function, the only way to represent it exactly is to store a distinct Q-value for each state-action pair. If the state space is continuous, an exact representation of the Q-function would require an infinite number of state-action values. For this reason, the only practical way to overcome this situation is to use an approximate representation of the joint Q-function.
    In this section, we present a parameterized version of the Q-function through a linear fuzzy approximator consisting of a vector $\phi \in \mathbb{R}^z$, which relies on a fuzzy partition of the continuous state space. The principal advantage of this proposal is that we only need to store the state-action Q-value at the center of every membership function.
    There are $N$ fuzzy sets, each described by a membership function:
    $$\mu_d : X \to [0,1], \qquad d = 1, 2, \ldots, N$$
    where $\mu_d(x)$ describes the degree of membership of the state $x$ in the fuzzy set $d$; these membership functions can be viewed as basis functions or features [32]. The number of membership functions increases with the size of the state space, the number of agents, and the degree of resolution sought for the vector $\phi$.
    Triangular fuzzy partitions are used in this paper because they attain their maximum value at a single point: for every $d$ there exists a unique $x_d$ (the core of the membership function) such that $\mu_d(x_d) > \mu_d(x)$ for all $x \neq x_d$. Since all the other membership functions are zero at $x_d$, i.e., $\mu_{\hat{d}}(x_d) = 0$ for all $\hat{d} \neq d$, we assume that $\mu_d(x_d) = 1$, which means that the membership functions are normal. Other membership function shapes could lead to divergence when they take values that are too large at $x_d$ [33].
    We use $N$ triangular membership functions for each state variable $x_e$, $e = 1, 2, \ldots, E$, with $\dim(X) = E$. An $E$-dimensional pyramid-shaped membership function results from the product of the single-dimensional membership functions in the fuzzy partition of the state space.
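    A minimal sketch of such a normal triangular partition is given below; the helper name and the interpolation code are ours, and the cores shown follow the partition described later in Section 3:

```python
import numpy as np

def triangular_memberships(value, cores):
    """Normal triangular fuzzy partition over one state variable:
    mu_d(core_d) = 1, only two neighbouring memberships are nonzero,
    and the memberships of any value sum to 1."""
    cores = np.asarray(cores, dtype=float)
    mu = np.zeros(len(cores))
    v = float(np.clip(value, cores[0], cores[-1]))
    if v <= cores[0]:
        mu[0] = 1.0
    elif v >= cores[-1]:
        mu[-1] = 1.0
    else:
        d = int(np.searchsorted(cores, v)) - 1          # index of the left core
        w = (v - cores[d]) / (cores[d + 1] - cores[d])  # linear interpolation weight
        mu[d], mu[d + 1] = 1.0 - w, w
    return mu

# E-dimensional pyramid: product of the single-dimensional memberships
# (shown here for one position and one velocity variable).
pos_cores = [-6, -3, -0.3, -0.1, 0, 0.1, 0.3, 3, 6]
vel_cores = [-3, -1, 0, 1, 3]
mu_x = np.outer(triangular_memberships(1.2, pos_cores),
                triangular_memberships(-0.5, vel_cores)).ravel()
```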
    We assume that the action space is discrete for all the agents and that they have the same number of available actions:
    $$U_i = \{ u_i^j \}, \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, M$$
    The parameter vector $\phi$ consists of $z = nNM$ elements to be stored; the membership function-action pair $(\mu_d, u_i^j)$ of each agent corresponds to an element $\phi_{[i,d,j]}$ of the parameter vector. The approximator's elements $\phi_{[i,d,j]}$ can first be organized as $n$ matrices of size $N \times M$, filling the $M$ columns with the $N$ elements available. The elements of these $n$ matrices are then stacked into the vector $\phi$ column by column for the first agent, followed by the columns of the next agent, until all $n$ agents are covered.
    The scalar index $[i,d,j]$ of $\phi$ is denoted by:
    $$[i,d,j] = d + (j-1)N + (i-1)NM$$
    where $i = 1, 2, \ldots, n$ is the index of the agent under analysis, $d = 1, 2, \ldots, N$ is the index of the fuzzy partition for each state variable $x_e$, $e = 1, 2, \ldots, E$ (with $\dim(X) = E$), and $j = 1, 2, \ldots, M$ with $\dim(U_i) = M$. In this way, $\phi_{[i,d,j]}$ denotes the approximate Q-value of the $d$-th membership function when agent $i$ performs its $j$-th available action.
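    The indexing rule can be expressed directly as a small helper (the function name and the example values are purely illustrative):

```python
def phi_index(i, d, j, N, M):
    """Scalar position of phi_[i,d,j] using the 1-based indices of the text:
    [i,d,j] = d + (j - 1)*N + (i - 1)*N*M."""
    return d + (j - 1) * N + (i - 1) * N * M

# Example: with N = 5 membership functions and M = 3 actions per agent,
# agent i = 2, membership d = 4 and action j = 2 map to position 24.
assert phi_index(2, 4, 2, N=5, M=3) == 24
```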
    The state $x$ is taken as input by the fuzzy rule base, which produces $M$ outputs for each agent, namely the Q-values corresponding to each action $u_i^j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, M$; these outputs are the elements $\phi_{[i,d,j]}$. The proposed fuzzy rule base can be considered a zero-order Takagi-Sugeno rule base [34]:
    $$\text{if } x \text{ is } \mu_d(x) \text{ then } q_{i,1} = \phi_{[i,d,1]}; \; \ldots; \; q_{i,M} = \phi_{[i,d,M]}$$
    The approximate Q-value can be calculated by:
    $$\tilde{Q}(x,u) = \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x)\, \phi_{[i,d,j]} \qquad (13)$$
    Expression (13) is a linearly parameterized approximation: the Q-value of a given state-action pair is estimated through a weighted sum, where the weights are generated by the membership functions [35]. This approximator can be denoted by a mapping
    $$F : \mathbb{R}^z \to \mathcal{Q}$$
    where $\mathbb{R}^z$ is the parameter space and the parameter vector $\phi$ represents the approximation of the Q-function:
    $$\tilde{Q}(x,u) = [F(\phi)](x,u) \qquad (15)$$
    Thus, we do not need to store a great number of Q-values, one for every pair $(x,u)$; only the $z$ parameters in $\phi$ are needed. The approximation mapping $F$ represents only a subset of $\mathcal{Q}$ [36].
    From the point of view of reinforcement learning, linearly parameterized approximators such as $F$ are preferred because they are more amenable to theoretical analysis. This is the reason for choosing a linearly parameterized approximation $\phi$; in this way, the normalized membership functions can be considered as basis functions [37].
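    A minimal sketch of the approximator mapping F of (13) and (15) is shown below; it assumes the parameter layout of the indexing rule above and a membership vector computed as in the earlier partition sketch (the names and signatures are ours, not the authors' code):

```python
import numpy as np

def Q_tilde(phi, mu, joint_action, N, M):
    """[F(phi)](x, u): the weighted sum of (13)/(15).
    phi          -- parameter vector of length n*N*M, laid out as in the index rule above
    mu           -- membership degrees mu_d(x), d = 1..N (already the product over state dimensions)
    joint_action -- tuple (j_1, ..., j_n) of 0-based action indices, one per agent"""
    value = 0.0
    for i, j in enumerate(joint_action):
        # element [d, j] of this block is phi_[i,d,j]
        block = phi[i * N * M:(i + 1) * N * M].reshape(M, N).T
        value += float(mu @ block[:, j])              # sum_d mu_d(x) * phi_[i,d,j]
    return value
```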
    Expression (15) provides $\tilde{Q}$, an approximate Q-function used in place of the exact Q-function $Q$, so the approximation is supplied to the mapping $H$:
    $$\bar{Q}_{k+1}(x,u) = [(H \circ F)(\phi_k)](x,u) \qquad (16)$$
    Most of the time, the Q-function $\bar{Q}$ cannot be stored explicitly [38]; as an alternative, it has to be represented in an approximate form using a projection mapping $P : \mathcal{Q} \to \mathbb{R}^z$:
    $$\phi_{k+1} = P(\bar{Q}_{k+1}) \qquad (17)$$
    which ensures that $\tilde{Q}(x,u) = [F(\phi)](x,u)$ is as close as possible to $\bar{Q}(x,u)$ [39], in the least-squares sense:
    $$P(Q) = \arg\min_{\phi} \sum_{\lambda=1}^{s} \Big( Q(x_\lambda, u_\lambda) - [F(\phi)](x_\lambda, u_\lambda) \Big)^2 \qquad (18)$$
    where a set of $s$ state-joint action samples $(x_\lambda, u_\lambda)$ is used. Because of the triangular membership function shapes and the linear parameterized approximation mapping $F$, (18) is a convex quadratic optimization problem in which $z = nNM$ samples are used [40], so expression (18) reduces to an assignment of the form:
    $$\phi_{[i,d,j]} = [P(Q)]_{[i,d,j]} = Q(x_d, u_j)$$
    Recapitulating, the approximate linear fuzzy representation of the joint Q-function begins with an arbitrary value of the parameter vector $\phi$ and updates it at each iteration using the mapping:
    $$\phi_{k+1} = (P \circ H \circ F)(\phi_k) \qquad (20)$$
    and stops when the difference between two consecutive parameter vectors falls below a threshold $\xi$:
    $$\| \phi_{k+1} - \phi_k \| \leq \xi$$
    A greedy policy to control the system can be obtained from $\phi^*$ (the parameter vector obtained as $k \to \infty$); for any state, the actions are calculated by interpolation among the best local actions of each agent at every membership function center $x_d$:
    $$h_i(x) = \sum_{d=1}^{N} \mu_d(x)\, u_i^{j_{i,d}^*}, \qquad \text{where } j_{i,d}^* \in \arg\max_{j} \, [F(\phi^*)](x_d, u)$$
    To implement the update (20), we propose a procedure based on a modified version of the Q-learning algorithm [19], with the linear parameterization added; in this way, the algorithm can be extended to multi-agent problems with continuous state space but discrete action space. The algorithm starts with an arbitrary $\phi$ (it can be $\phi = 0$) and iterates until the threshold $\xi$ is reached.
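    Under the same assumptions, the interpolated greedy policy can be read out of the converged parameter vector as in the following sketch, which reuses the Q_tilde helper introduced above (all names are illustrative):

```python
import numpy as np
from itertools import product

def greedy_policy(phi, mu_of, actions, N, M, n, agent_i):
    """Interpolated greedy policy h_i(x) derived from phi* (sketch).
    mu_of(x) -- membership degrees mu_d(x), d = 1..N
    actions  -- list mapping a discrete action index j to agent i's action value"""
    def best_local_action(d):
        # best joint action at the membership core x_d (membership vector = unit vector e_d)
        e_d = np.eye(N)[d]
        u_best = max(product(range(M), repeat=n),
                     key=lambda u: Q_tilde(phi, e_d, u, N, M))
        return actions[u_best[agent_i]]
    def h_i(x):
        mu = mu_of(x)
        # blend the locally optimal actions with the membership weights mu_d(x)
        return sum(mu[d] * best_local_action(d) for d in range(N))
    return h_i
```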

    2.5. Reinforcement Learning Algorithm for Continuous State Space

    The linear fuzzy approximation described by (20) can be implemented by the following algorithm, which uses a modified version of the Q-learning algorithm. To establish the correspondence between the algorithm and expression (20), the right-hand side of the update step can be seen as (16), followed by the projection (17). Here, the dynamics $f$, the reward function $\rho$, and the discount factor $\gamma$ are known in the form of batch sample data.
    • Let $\alpha \in (0,1)$, $\epsilon \in (0,1)$, and set
      $$\phi \leftarrow 0$$
      where $1 - \epsilon$ is the probability of choosing a greedy action in the state $x$, and $\epsilon$ is the probability of choosing a random joint action in $U$.
    • Repeat at each iteration $k$:
      • For the state $x_k$, select a joint action $u_k \in U$ with suitable exploration; at each step, a random joint action is chosen with probability $\epsilon \in (0,1)$.
      • Apply the linear fuzzy parameterization with membership functions $\mu_d$, $d = 1, \ldots, N$, discrete actions $u^j$, $j = 1, \ldots, M$, and threshold $\xi > 0$:
        $$\phi_{[i,d,j]}^{k+1} \leftarrow \phi_{[i,d,j]}^{k} + \alpha_k \big( r_{k+1} + \beta - \Gamma \big)\, \Omega$$
        $$\beta = \gamma \max_{j} \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d\big(f(x_k, u_k)\big)\, \phi_{[i,d,j]}$$
        $$\Gamma = \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x_k)\, \phi_{[i,d,j]}$$
        $$\Omega = \mu_d(x_k)$$
      • Until:
        $$\| \phi_{k+1} - \phi_k \| \leq \xi$$
      • Output:
        $$\phi^* = \phi_{k+1}$$
    A greedy policy to control the system is obtained by:
    $$h_i(x) = \sum_{d=1}^{N} \mu_d(x)\, u_i^{j_{i,d}^*}, \qquad j_{i,d}^* \in \arg\max_{j} \, [F(\phi^*)](x_d, u)$$
    where $j_{i,d}^*$ is the index of the optimal action at the center $x_d$ for agent $i$.
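    A compact sketch of the whole learning loop is given below. It assumes the known deterministic model f and the common reward rho, reuses the Q_tilde helper sketched earlier, and uses the membership degree $\mu_d(x_k)$ as the per-parameter weight (a gradient-style reading of the $\Omega$ term above); it is our interpretation of the algorithm, not the authors' code, and the learning rate is an arbitrary choice:

```python
import numpy as np
from itertools import product

def fuzzy_marl(f, rho, mu_of, x0, N, M, n,
               gamma=0.96, alpha=0.5, eps=0.1, xi=0.05, max_steps=200, seed=0):
    """Modified Q-learning with linear fuzzy parameterization (sketch).
    f(x, u), rho(x, u) -- known deterministic dynamics and common reward (u is a tuple
                          of discrete action indices, one per agent)
    mu_of(x)           -- membership degrees mu_d(x), d = 1..N
    Returns the (approximately) converged parameter vector phi*."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n * N * M)
    joint_actions = list(product(range(M), repeat=n))

    def q(mu, u):                                # [F(phi)](x, u), cf. (13)
        return Q_tilde(phi, mu, u, N, M)

    for _sweep in range(1000):                   # safeguard on the number of sweeps
        phi_prev = phi.copy()
        x = np.array(x0, dtype=float)
        for _ in range(max_steps):
            mu = mu_of(x)
            # epsilon-greedy joint action
            if rng.random() < eps:
                u = joint_actions[rng.integers(len(joint_actions))]
            else:
                u = max(joint_actions, key=lambda a: q(mu, a))
            x_next, r = f(x, u), rho(x, u)
            mu_next = mu_of(x_next)
            # temporal-difference error toward r + gamma * max_u' Q(x', u')
            delta = r + gamma * max(q(mu_next, a) for a in joint_actions) - q(mu, u)
            # update only the parameters active for (x, u), weighted by mu_d(x)
            for i, j in enumerate(u):
                for d in range(N):
                    phi[d + j * N + i * N * M] += alpha * delta * mu[d]
            x = x_next
        if np.abs(phi - phi_prev).max() <= xi:   # stopping rule ||phi_{k+1} - phi_k|| <= xi
            break
    return phi
```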
    Theorem 1.
    The algorithm with linear fuzzy parameterization (20) converges to a fixed vector $\phi^*$.
    Proof of Theorem 1.
    The mapping given by $F$ is
    $$[F(\phi)](x,u) = \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x)\, \phi_{[i,d,j]}$$
    The convergence of the algorithm is guaranteed by showing that the composite mapping $P \circ H \circ F$ is a contraction in the infinity norm. Reference [28] shows that the mapping $H$ is a contraction, so it remains to demonstrate that $F$ and $P$ are non-expansions. Since $F$ is a weighted linear combination of membership functions,
    $$\begin{aligned}
    \big\| [F(\phi)](x,u) - [F(\phi')](x,u) \big\| &= \Big\| \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x)\, \phi_{[i,d,j]} - \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x)\, \phi'_{[i,d,j]} \Big\| \\
    &= \Big\| \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x) \big( \phi_{[i,d,j]} - \phi'_{[i,d,j]} \big) \Big\| \\
    &\leq \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x)\, \big\| \phi_{[i,d,j]} - \phi'_{[i,d,j]} \big\| \\
    &\leq \sum_{i=1}^{n} \sum_{d=1}^{N} \mu_d(x)\, \| \phi - \phi' \| \\
    &\leq \| \phi - \phi' \|
    \end{aligned}$$
    where the last step holds because the sum of the normalized membership functions $\mu_d(x)$ is 1, and the product generated by each agent is also 1. This shows that the mapping $F$ is a non-expansion. Since the mapping $P$ is
    $$[P(Q)]_{[i,d,j]} = Q(x_d, u_j)$$
    and the samples are the centers of the membership functions, where $\mu_d(x_d) = 1$, the mapping $P$ is a non-expansion [33]. Since the mapping $H$ is a contraction with factor $\gamma < 1$, $P \circ H \circ F$ is also a contraction with factor $\gamma$:
    $$\| (P \circ H \circ F)(\phi) - (P \circ H \circ F)(\phi') \| \leq \gamma \| \phi - \phi' \|$$
    Therefore, $P \circ H \circ F$ has a unique fixed vector $\phi^*$, and the algorithm above converges to this fixed point as $k \to \infty$. ☐
    Theorem 2.
    For any choice of the threshold $\xi > 0$ and any initial parameter vector $\phi_0 \in \mathbb{R}^z$, the algorithm with linear fuzzy parameterization terminates in finite time.
    Proof of Theorem 2.
    As shown in Theorem 1, the mapping $P \circ H \circ F$ is a contraction with factor $\gamma < 1$ and fixed vector $\phi^*$:
    $$\| \phi_{k+1} - \phi^* \| = \| (P \circ H \circ F)(\phi_k) - (P \circ H \circ F)(\phi^*) \| < \gamma \| \phi_k - \phi^* \|$$
    Since $\| \phi_{k+1} - \phi^* \| < \gamma \| \phi_k - \phi^* \|$, by induction $\| \phi_k - \phi^* \| < \gamma^k \| \phi_0 - \phi^* \|$ for $k > 0$. By the Banach fixed-point theorem, $\phi^*$ is bounded. Since the vector where the iteration starts is bounded, $\| \phi_0 - \phi^* \|$ is also bounded. Let $G_0 = \| \phi_0 - \phi^* \|$, which is bounded, so that $\| \phi_k - \phi^* \| \leq \gamma^k G_0$ for $k > 0$. Applying the triangle inequality:
    $$\| \phi_{k+1} - \phi_k \| \leq \| \phi_{k+1} - \phi^* \| + \| \phi_k - \phi^* \| \leq \gamma^{k+1} G_0 + \gamma^k G_0 = \gamma^k G_0 (\gamma + 1)$$
    Setting $\gamma^k G_0 (\gamma + 1) = \xi$ gives
    $$\gamma^k = \frac{\xi}{G_0 (\gamma + 1)}$$
    Taking the logarithm with base $\gamma$ on both sides of the expression above,
    $$k = \log_\gamma \frac{\xi}{G_0 (\gamma + 1)}$$
    Since $G_0 = \| \phi_0 - \phi^* \|$ is bounded and $\gamma < 1$, $k$ is finite. Therefore, the algorithm terminates in at most $k$ iterations. ☐
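    For concreteness, the bound of Theorem 2 can be evaluated numerically; the values of $\gamma$, $\xi$, and $G_0$ below are purely illustrative:

```python
import math

# Worst-case number of iterations from Theorem 2: k = log_gamma( xi / (G0 * (gamma + 1)) ).
gamma, xi, G0 = 0.96, 0.05, 10.0     # illustrative values; G0 = ||phi_0 - phi*||
k_bound = math.log(xi / (G0 * (gamma + 1)), gamma)
print(math.ceil(k_bound))            # about 147 iterations for these values
```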

    3. Results

    3.1. Simulation of a Cooperative Task with Mobile Robots

    We perform a simulation in which the linear fuzzy parameterization is applied to a two-dimensional multi-agent cooperative problem with continuous states and discrete actions. Two agents of mass $m$ have to be driven on a flat surface so that they reach the origin at the same time in minimum elapsed time. The information available to the agents consists of the reward function and the transition function of states and joint actions.
    In the simulation environment, the state $x = [x_1, x_2, \ldots, x_8]^T$ contains the two-dimensional coordinates of each agent, $s_{ix}, s_{iy}$, and its two-dimensional velocities, $\dot{s}_{ix}, \dot{s}_{iy}$, for $i = 1, 2$: $x = [s_{1x}, s_{1y}, \dot{s}_{1x}, \dot{s}_{1y}, s_{2x}, s_{2y}, \dot{s}_{2x}, \dot{s}_{2y}]^T$. The continuous state-space model of the simulated system is:
    $$\begin{aligned}
    \ddot{s}_{1x} &= -\frac{\eta(s_{1x}, s_{1y})\, \dot{s}_{1x}}{m_1} + \frac{u_{1x}}{m_1}, & \ddot{s}_{1y} &= -\frac{\eta(s_{1x}, s_{1y})\, \dot{s}_{1y}}{m_1} + \frac{u_{1y}}{m_1}, \\
    \ddot{s}_{2x} &= -\frac{\eta(s_{2x}, s_{2y})\, \dot{s}_{2x}}{m_2} + \frac{u_{2x}}{m_2}, & \ddot{s}_{2y} &= -\frac{\eta(s_{2x}, s_{2y})\, \dot{s}_{2y}}{m_2} + \frac{u_{2y}}{m_2}
    \end{aligned}$$
    where $\eta(s_{ix}, s_{iy})$, $i = 1, 2$, is the friction function, which depends on the position of each agent, the control signal $u = [u_{1x}, u_{1y}, u_{2x}, u_{2y}]^T$ is a force, and $m_i$, $i = 1, 2$, is the mass of each robot.
    The system is discretized with a step of $T = 0.4$ s, and the expressions that describe the dynamics of the system are integrated between sampling instants. In the task, we select the starting points randomly and carry out up to 50 training iterations; if 50 iterations are reached without accomplishing the final goal, the experiment is restarted.
    The magnitudes of the state and action variables are bounded to make the problem more tractable: $s_{ix}, s_{iy} \in [-6, 6]$ m and $\dot{s}_{ix}, \dot{s}_{iy} \in [-3, 3]$ m/s; the force is also bounded, $u_{ix}, u_{iy} \in [-2, 2]$, for $i = 1, 2$. The friction coefficient is taken constant, $\eta = 1$ kg/s, and the mass is the same for both agents, $m = 0.5$ kg.
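    A minimal sketch of the discretized dynamics of this simulation is given below; it uses a simple Euler step over T = 0.4 s for brevity (the paper integrates the continuous dynamics between samples), and the friction sign and the saturation limits reflect our reading of the description above:

```python
import numpy as np

T, eta, m = 0.4, 1.0, 0.5                  # sampling time [s], friction [kg/s], mass [kg]
POS_LIM, VEL_LIM, U_LIM = 6.0, 3.0, 2.0    # bounds taken from the task description

def f(x, u):
    """One discrete step of the two-agent point-mass model (Euler sketch).
    x = [s1x, s1y, s1x_dot, s1y_dot, s2x, s2y, s2x_dot, s2y_dot]
    u = [u1x, u1y, u2x, u2y] (bounded forces)."""
    x = np.asarray(x, dtype=float)
    u = np.clip(np.asarray(u, dtype=float), -U_LIM, U_LIM)
    x_next = x.copy()
    for i in range(2):                                     # agents 1 and 2
        pos, vel, force = x[4*i:4*i+2], x[4*i+2:4*i+4], u[2*i:2*i+2]
        acc = (-eta * vel + force) / m                     # friction opposes the velocity
        x_next[4*i:4*i+2] = np.clip(pos + T * vel, -POS_LIM, POS_LIM)
        x_next[4*i+2:4*i+4] = np.clip(vel + T * acc, -VEL_LIM, VEL_LIM)
    return x_next
```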
    The control actions for each agent are discrete with 25 elements, $U_i = \{-2, -0.2, 0, 0.2, 2\} \times \{-2, -0.2, 0, 0.2, 2\}$ for $i = 1, 2$, corresponding to forces applied diagonally, left, right, up, down, and no force applied. The membership functions used for the position and velocity states have a triangular shape, where the core of each membership function is $x_d$. The cores of the membership functions for the position domain are centered at $[-6, -3, -0.3, -0.1, 0, 0.1, 0.3, 3, 6]$ and the cores of the membership functions for the velocity domain are $[-3, -1, 0, 1, 3]$, for each agent; this is shown in Figure 1. In this way, 50,625 pairs $(x,u)$ are stored for each agent in the parameter vector $\phi$; this amount increases with the number of membership functions. An example of the triangular fuzzy partition is shown in Figure 1.
    The partition of the state space $x$ is determined by the product of the individual membership functions for each agent $i$:
    $$\mu(x) = \prod_{i=1}^{2} \mu(s_{ix}) \prod_{i=1}^{2} \mu(s_{iy}) \prod_{i=1}^{2} \mu(\dot{s}_{ix}) \prod_{i=1}^{2} \mu(\dot{s}_{iy})$$
    The final objective of arriving at the origin at the same time is encoded by a common reward function $\rho$:
    $$\rho(x,u) = \begin{cases} 5 & \text{if } \|x\| < 0.1 \\ 0 & \text{otherwise} \end{cases}$$
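    A direct transcription of this reward, under the reading that the condition is on the norm of the whole joint state (both agents at the origin with near-zero velocity), is:

```python
import numpy as np

def rho(x, u):
    """Common reward of the simulation: 5 when the joint state is within 0.1 of the
    origin, 0 otherwise (the argument u is unused but kept for the rho(x, u) signature)."""
    return 5.0 if np.linalg.norm(x) < 0.1 else 0.0
```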
    Regarding the coordination problem, the agents achieve implicit coordination: they learn to prefer one among equally good solutions by chance and then disregard the other options [41].
    After the algorithm is executed and $\phi^*$ is obtained, a policy can be derived by interpolation among the best local actions for each agent:
    $$h_i(x) = \sum_{d=1}^{N} \mu_d(x)\, u_i^{j_{i,d}^*}, \qquad j_{i,d}^* \in \arg\max_{j} \, [F(\phi^*)](x_d, u)$$
    For the simulation, the learning parameters were set to $\gamma = 0.96$ and $\xi = 0.05$, the initial condition of the experiment was $s_0 = [4, 6, 2, 2, 5, 3, 2, 1]$, and the algorithm converged after 15 iterations. Figure 2 shows the states and the control signal $u_1 = [u_{1x}, u_{1y}]$ for Agent 1, and Figure 3 shows the states and the control signal $u_2 = [u_{2x}, u_{2y}]$ for Agent 2. The final paths followed by both agents are shown in Figure 4.

    3.2. Experimental Set-up

    Two Khepera IV mobile robots are used to perform an experiment in a MAS [42]. Each robot has five sensors placed around its body, positioned and numbered as shown in Figure 5. These five sensors are ultrasonic devices composed of one transmitter and one receiver, and they are used to detect physical features of the environment such as obstacles and other nearby agents.
    The five Khepera sonar readings $l_{a,c}$ are quantized into three levels. They represent the closeness to the nearest obstacle or other agents: 0 indicates obstacles or agents that are near, 1 indicates obstacles or agents at a medium distance, and 2 indicates obstacles or agents that are relatively far from the sensors.
    The parameters $d_a$ (the distance to the target or goal) and $p_a$ (the relative angle to the target or goal) are divided into degrees from 0 to 8, where 0 represents the smallest distance or angle and 8 represents the greatest relative distance or angle from the current Khepera position to the target or goal.
    The actions available for the Khepera robot are:
    • Move forward
    • Turn clockwise
    • Turn counter-clockwise
    • Stand still
    The ultrasonic sensors on the Khepera help the mobile robots determine whether there are obstacles in the environment. The experimental setup reveals that the reinforcement learning algorithm relies strongly on the sensor readings.
    Sensor readings in the ideal simulation setting are based on mathematical calculations, which are accurate and consistent. In the experimental implementation, the readings are inaccurate and fluctuating. During the application of the controller, this effect is minimized by allowing a settling period after performing a joint action, which ensures that the sensor reading is steady before it is recorded. In addition, moving the robots at a relatively slow pace during the learning process reduces collisions with other objects or agents. The quantized readings are sufficient to represent the current location and velocity while the robots are moving [43].

    3.3. Experimental Results

    To validate the proposed algorithm, the linear fuzzy approximator of the joint Q-function is applied to a two-dimensional multi-agent cooperative task. Two Khepera IV mobile robots must be driven on a surface such that both agents reach the origin at the same time in minimum elapsed time, as shown in Figure 6.
    The fuzzy partition and the locations of the cores used for the states were the same as in the simulation section.
    The goal of arriving at the origin at the same moment in minimum elapsed time is encoded by the common reward function $\rho$:
    $$\rho(x,u) = \begin{cases} 5 & \text{if } \|x\| < 0.2 \\ 0 & \text{otherwise} \end{cases}$$
    For this experiment, the learning parameters were set to $\gamma = 0.96$ and $\xi = 0.2$, the initial condition of the experiment was $s_0 = [4, 5, 0, 0, 5, 3, 0, 0]$, and the experiment converged after 27 iterations. Figure 7 shows the states, the control signal $u_1 = [u_{1x}, u_{1y}]$, and the rewards for Agent 1, and Figure 8 shows the states, the control signal $u_2 = [u_{2x}, u_{2y}]$, and the rewards for Agent 2.
    The values of $\gamma$ and $\xi$ were chosen arbitrarily; the vector $\phi$ converged after 27 iterations, when the bound $\| \phi_{l+1} - \phi_l \| \leq \xi$ was reached. The final path is shown in Figure 9. This trajectory is evidently different from the optimal policy, which would drive both agents in a straight line toward the goal from any initial position. However, with the fuzzy quantization used in this implementation and the effect of the damping, the final path obtained is the best that can be accomplished with this fuzzy parameterization. The coordination problem was overcome using an implicit method, where the agents learn to choose a solution by chance.

    4. Comparison with CMOMMT Algorithm for Multi-Agent Systems

    There are other methods for MARL in continuous state spaces, although these proposals are restricted to limited kinds of tasks. One of these methods is "cooperative multi-robot observation of multiple moving targets" (CMOMMT), given in [13], which relies on local information in order to learn cooperative behavior. It allows the application of reinforcement learning in continuous state spaces for multi-agent systems.
    The kind of cooperation learned in this proposal is achieved by implicit techniques; the method reduces the representation of the state space through a mapping of the continuous state space into a finite state space, where every discretized state is considered a region of the continuous state space.
    The action space is discrete, and this method uses a delayed reward in which positive or negative values are obtained at the end of training. Finally, an optimal joint action policy is obtained by a clustering technique in the discrete action space.
    We performed the same cooperative task of two agents arriving at the origin at the same time as in the section above, using the same reward function and the same continuous state space. The initial condition was set to $s_0 = [4, 5, 0, 0, 5, 3, 0, 0]$.
    The final path obtained by the CMOMMT algorithm is shown in Figure 10; the traced path is not straight enough, and a persistent oscillation appears near the origin. The states and the control signals for Agent 1 and Agent 2 are shown in Figure 11 and Figure 12, respectively.
    Table 1 shows a comparison between our proposal and the CMOMMT algorithm, averaged over trials conducted under the same initial conditions.
    One possible reason for the CMOMMT results is that the Q-functions obtained by this method are less smooth than those produced by the fuzzy parameterization. The method proposed in this paper thus shows better performance, in the form of a more accurate representation of the state space and a lower use of computational resources.

    5. Conclusions

    One of the principal research directions in artificial intelligence is to develop autonomous mobile robots with cooperative skills in continuous state spaces. For this reason, in this paper we have presented a linear fuzzy parameterization of the joint Q-function, used with a modified version of the Q-learning algorithm. The proposed algorithm can handle MARL problems with continuous state spaces, reducing the convergence time and avoiding the storage of all the Q-values in a lookup table. Triangular membership functions were used to define the state space of the environment; this shape was selected to simplify the projection mapping.
    The main contribution of our work is a reinforcement learning algorithm for MASs that uses a linear fuzzy parameterization of the joint Q-function. This approximation is carried out by means of a parameter vector that stores only the Q-values at the centers of the triangular membership functions. The Q-values not located at a center are calculated by means of a weighted sum according to the degrees of membership.
    Two theorems were presented to guarantee convergence to a fixed point in a finite number of iterations. The proposed method is an off-line, model-based algorithm with deterministic dynamics, under the assumption that the joint reward function and the transition function are known to all the agents. Since this kind of knowledge can be difficult to obtain in real-life applications, future work could extend this proposal to a model-free method, in which the agents learn the dynamics of the environment by themselves; the extension could also encompass problems with stochastic dynamics.
    The performance of the linear fuzzy parameterization was evaluated first through simulation using Matlab and then through an experiment involving two Khepera IV mobile robots. Finally, the results obtained were compared with another suitable algorithm, CMOMMT, which is capable of dealing with tasks where the state space is continuous.

    Author Contributions

    Formal analysis and Investigation D.L.-C.; Software, F.G.-L.; Validation, L.P.-D.; Review of the methods and editing, S.K.G.

    Funding

    The authors extend their appreciation to the Mexico’s Secretary of Public Education for funding this work through research agreement 511-6/17-7605.

    Conflicts of Interest

    The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

    References

    1. Sen, S.; Weiss, G. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence; MIT Press: Cambridge, MA, USA, 1999.
    2. Stone, P.; Veloso, M. Multiagent systems: A survey from machine learning perspective. Auton. Robots 2000, 8, 345–383.
    3. Wooldridge, M. An Introduction to MultiAgent Systems; John Wiley & Sons: Hoboken, NL, USA, 2002.
    4. Rashid, A.T.; Ali, A.A.; Frasca, M.; Fortuna, L. Path planning with obstacle avoidance based on visibility binary tree algorithm. Robot. Auton. Syst. 2013, 61, 1440–1449.
    5. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285.
    6. Arel, I.; Liu, C.; Urbanik, T.; Kohls, A.G. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intell. Transp. Syst. 2010, 4, 128–135.
    7. Cherkassky, V.; Mulier, F. Learning from Data: Concepts, Theory and Methods; Wiley-IEEE Press: Hoboken, NL, USA, 2007.
    8. Barlow, S.V. Unsupervised learning. In Unsupervised Learning: Foundations of Neural Computation, 1st ed.; Sejnowski, T.J., Hinton, G., Eds.; MIT Press: Cambridge, MA, USA, 1999; pp. 1–18. ISBN 9780262581684.
    9. Zhang, W.; Ma, L.; Li, X. Multi-agent reinforcement learning based on local communication. Clust. Comput. 2018, 1–10.
    10. Hu, X.; Wang, Y. Consensus of Linear Multi-Agent Systems Subject to Actuator Saturation. Int. J. Control Autom. Syst. 2013, 11, 649–656.
    11. Luviano, D.; Yu, W. Path planning in unknown environment with kernel smoothing and reinforcement learning for multi-agent systems. In Proceedings of the 12th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 28–30 October 2015.
    12. Abul, O.; Polat, F.; Alhajj, R. Multi-agent reinforcement learning using function approximation. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2000, 485–497.
    13. Fernandez, F.; Parker, L.E. Learning in large cooperative multi-robots systems. Int. J. Robot. Autom. Spec. Issue Comput. Intell. Tech. Coop. Robots 2001, 16, 217–226.
    14. Foerster, J.; Nardelli, N.; Farquhar, G.; Afouras, T.; Torr, P.H.; Kohli, P.; Whiteson, S. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv 2017, arXiv:1702.08887.
    15. Tamakoshi, H.; Ishi, S. Multi agent reinforcement learning applied to a chase problem in a continuous world. Artif. Life Robot. 2001, 202–206.
    16. Ishiwaka, Y.; Sato, T.; Kakazu, Y. An approach to pursuit problem on a heterogeneous multiagent system using reinforcement learning. Robot. Auton. Syst. 2003, 43, 245–256.
    17. Radac, M.-B.; Precup, R.-E.; Roman, R.-C. Data-driven model reference control of MIMO vertical tank systems with model-free VRFT and Q-Learning. ISA Trans. 2017.
    18. Pandian, B.J.; Noel, M.M. Control of a bioreactor using a new partially supervised reinforcement learning algorithm. J. Process Control 2018, 69, 16–29.
    19. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
    20. Nguyen, T.; Nguyen, N.D.; Nahavandi, S. Multi-Agent Deep Reinforcement Learning with Human Strategies. arXiv 2018, arXiv:1806.04562.
    21. Boutilier, C. Planning, Learning and Coordination in Multiagent Decision Processes. In Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge (TARK96), De Zeeuwse Stromen, The Netherlands, 17–20 March 1996; pp. 195–210.
    22. Harsanyi, J.C.; Selten, R. A General Theory of Equilibrium Selection in Games, 1st ed.; MIT Press: Cambridge, MA, USA, 1988; ISBN 9780262081733.
    23. Busoniu, L.; De Schutter, B.; Babuska, R. Decentralized Reinforcement Learning Control of a Robotic Manipulator. In Proceedings of the International Conference on Control, Automation, Robotics and Vision, Singapore, 5–8 December 2006.
    24. Littman, M.L. Value-function reinforcement learning in Markov games. J. Cogn. Syst. Res. 2001, 2, 55–66.
    25. Guestrin, C.; Lagoudakis, M.G.; Parr, R. Coordinated reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), Sydney, Australia, 8–12 July 2002; pp. 227–234.
    26. Bowling, M.; Veloso, M. Multiagent learning using a variable learning rate. Artif. Intell. 2002, 136, 215–250.
    27. Bertsekas, D.P. Dynamic Programming and Optimal Control, Vol. 2, 4th ed.; Athena Scientific: Belmont, MA, USA, 2017; ISBN 1-886529-44-2.
    28. Istratesku, V. Fixed Point Theory: An Introduction; Springer: Berlin, Germany, 2002; ISBN 978-1-4020-0301-1.
    29. Melo, F.S.; Meyn, S.P.; Ribeiro, M.I. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), Helsinki, Finland, 5–9 July 2008; pp. 664–671.
    30. Szepesvari, C.; Smart, W.D. Interpolation-based Q-learning. In Proceedings of the 21st International Conference on Machine Learning (ICML-04), Banff, AB, Canada, 4–8 July 2004; pp. 791–798.
    31. Sutton, R.S.; McAllester, D.A.; Singh, S.P.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; pp. 1057–1063.
    32. Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-Dynamic Programming, 1st ed.; Athena Scientific: Belmont, MA, USA, 1996; ISBN 1-886529-10-8.
    33. Tsitsiklis, J.N.; Van Roy, B. Feature-based methods for large scale dynamic programming. Mach. Learn. 1996, 22, 59–94.
    34. Kruse, R.; Gebhardt, J.E.; Klowon, F. Foundations of Fuzzy Systems, 1st ed.; John Wiley & Sons: Hoboken, NL, USA, 1994; ISBN 047194243X.
    35. Gordon, G.J. Reinforcement learning with function approximation converges to a region. Adv. Neural Inf. Process. Syst. 2001, 13, 1040–1046.
    36. Tsitsiklis, J.N. Asynchronous stochastic approximation and Q-learning. Mach. Learn. 1994, 16, 185–202.
    37. Berenji, H.R.; Khedkar, P. Learning and tuning fuzzy logic controllers through reinforcements. IEEE Trans. Neural Netw. 1992, 3, 724–740.
    38. Munos, R.; Moore, A. Variable-resolution discretization in optimal control. Mach. Learn. 2002, 49, 291–323.
    39. Chow, C.S.; Tsitsiklis, J.N. An optimal one-way multi grid algorithm for discrete-time stochastic control. IEEE Trans. Autom. Control 1991, 36, 898–914.
    40. Busoniu, L.; Ernst, D.; De Schutter, B.; Babuska, R. Approximate dynamic programming with fuzzy parametrization. Automatica 2010, 46, 804–814.
    41. Vlassis, N. A Concise Introduction to Multi Agent Systems and Distributed Artificial Intelligence. In Synthesis Lectures in Artificial Intelligence and Machine Learning; Morgan & Claypool Publishers: San Rafael, CA, USA, 2007.
    42. K-Team Corporation. 2013. Available online: http://www-k-team.com (accessed on 15 January 2018).
    43. Ganapathy, V.; Soh, C.Y.; Lui, W.L.D. Utilization of Webots and Khepera II as a platform for neural Q-learning controllers. In Proceedings of the IEEE Symposium on Industrial Electronics and Applications, Kuala Lumpur, Malaysia, 25–27 May 2009.
    Figure 1. Triangular fuzzy partition for the velocity state.
    Figure 2. States and control signal for Agent 1.
    Figure 3. States and control signal for Agent 2.
    Figure 4. Final path of Agent 1 and Agent 2.
    Figure 5. Position of the Khepera's ultrasonic sensors.
    Figure 6. Experimental test.
    Figure 7. States and rewards for Agent 1 in the experimental implementation.
    Figure 8. States and rewards for Agent 2 in the experimental implementation.
    Figure 9. Final path of Agent 1 and Agent 2 in the experimental implementation.
    Figure 10. Final path obtained by the CMOMMT algorithm.
    Figure 11. States and control signal for Agent 1 with the CMOMMT algorithm.
    Figure 12. States and control signal for Agent 2 with the CMOMMT algorithm.
    Table 1. Comparison between fuzzy partition and CMOMMT.

    Training        CMOMMT    Fuzzy Parameterization
    Error (cm)      18        6
    Time (s)        25        12
    Iterations      34        27

    © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

