Fault-Tolerant Distributed Cyber-Physical Systems: Two Case Studies

Johnson, Taylor T.

Abstract:

Fault-tolerance in distributed computing systems has been investigated extensively in the literature and has a rich history and detailed theory. This thesis studies fault-tolerance for distributed cyber-physical systems (DCPS), where distributed computation is combined with dynamics of physical processes. Due to their interaction with the physical world, DCPS may suffer from failures that are qualitatively different from the types of failures studied in distributed computing. Failures of the components of DCPS which interact with the physical processes, such as actuators and sensors, must be considered. Failures in the cyber domain may interact with failures of sensors and actuators in adverse ways.

This thesis takes a first step in analyzing fault-tolerance in DCPS through the presentation of two case studies. In each case study, the DCPS are modeled as distributed algorithms executed by a set of agents, where each agent acts independently based on information obtained from its communication neighbors and agents may suffer from various failures. The first case study is a distributed traffic control problem, where agents control regions of roadway to move vehicles toward a destination, in spite of some agents' computers crashing permanently. The second case study is a distributed flocking problem, where agents form a flock, or a roughly equally spaced distribution in one dimension, and move towards a destination, in spite of some agents' actuators becoming stuck at some value.

Each algorithm incorporates self-stabilization in order to solve the problem in spite of failures. The traffic algorithm uses a local signaling mechanism to guarantee safety and a self-stabilizing routing protocol to guarantee progress. The flocking algorithm uses a failure detector combined with an additional control strategy to ensure safety and progress.

Reference:

Taylor T. Johnson, "Fault-Tolerant Distributed Cyber-Physical Systems: Two Case Studies", Master's thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA, May 2010. (doi)

Bibtex Entry:

@mastersthesis{johnson2010msthesis,
    author      =   {Taylor T. Johnson},
    title       =   {Fault-Tolerant Distributed Cyber-Physical Systems: Two Case Studies},
    school      =   {Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign},
    address     =   {Urbana, IL 61801, USA},
    month       =   may,
    year        =   2010,
    gsid        =   {9683503763763426403},
    abstract    =   {Fault-tolerance in distributed computing systems has been investigated extensively in the literature and has a rich history and detailed theory.  This thesis studies fault-tolerance for distributed cyber-physical systems (DCPS), where distributed computation is combined with dynamics of physical processes.  Due to their interaction with the physical world, DCPS may suffer from failures that are qualitatively different from the types of failures studied in distributed computing.  Failures of the components of DCPS which interact with the physical processes---such as actuators and sensors---must be considered. Failures in the cyber domain may interact with failures of sensors and actuators in adverse ways.
                 This thesis takes a first step in analyzing fault-tolerance in DCPS through the presentation of two case studies.  In each case study, the DCPS are modeled as distributed algorithms executed by a set of agents, where each agent acts independently based on information obtained from its communication neighbors and agents may suffer from various failures.  The first case study is a distributed traffic control problem, where agents control regions of roadway to move vehicles toward a destination, in spite of some agents' computers crashing permanently.  The second case study is a distributed flocking problem, where agents form a flock, or a roughly equally spaced distribution in one dimension, and move towards a destination, in spite of some agents' actuators becoming stuck at some value.
                 Each algorithm incorporates self-stabilization in order to solve the problem in spite of failures.  The traffic algorithm uses a local signaling mechanism to guarantee safety and a self-stabilizing routing protocol to guarantee progress.  The flocking algorithm uses a failure detector combined with an additional control strategy to ensure safety and progress.},
    comment     =   {<a href="https://www.ideals.illinois.edu/handle/2142/16191">doi</a>},
    pdf         =   {http://www.taylortjohnson.com/research/johnson2010msthesis.pdf},
}