TECHNICAL DIAGNOSIS AND FAULT TOLERANCE
Pavel P. Parkhomenko, Founder and first Head of Laboratory No. 27
The Laboratory was born in 1964 in the depths of Laboratory No. 3, headed by Mikhail A. Gavrilov, Corresponding Member of the USSR Academy of Sciences. It was originally called the Laboratory of Logic Machines. The name reflected pioneering R&D on a logic analyzer for relay-contact circuits and on several program-controlled machines for the automated testing of technical objects (telephone equipment, electric locomotives, aircraft, missile control systems, etc.). That work attracted wide attention and accelerated the solution of control automation tasks in different areas of the national economy.
The new Laboratory was headed by Cand. Sci. (Eng.) Pavel P. Parkhomenko (later Corresponding Member of the RAS, Dr. Sci. (Eng.), and Prof.). Enthusiasts of the new area in technical cybernetics—Cand. Sci. (Eng.) V.V. Karibsky, Cand. Sci. (Eng.) Yu.L. Tomfeld, and E.S. Sogomonyan (later Dr. Sci. (Eng.), Prof.)—joined the Laboratory. In addition, the novelty of the research attracted many young engineers to the Laboratory.
Interesting theoretical results in the initial period of the Laboratory’s life include the following: the concept of a one-cycle equivalent of a multi-cycle circuit; methods for designing logical circuits from different-base elements (the method of replacing input variables and the method of replacing output functions); fundamental results on racing in logical circuits and the recognition of classes of finite-state machines.
The Laboratory translated A. Gill’s famous monograph Introduction to the Theory of Finite-State Machines into Russian. It became a desk reference for many researchers and prompted the Laboratory’s employees to write the fundamental two-volume monograph Vvedenie v tekhnicheskuyu diagnostiku (Introduction to Technical Diagnosis).
Technical diagnosis, a new discipline at that time, became the research area and name of the Laboratory in the late 1960s. In the early 1970s, the Laboratory started extensive work on the theory and practice of combinational and sequential circuit testing, fault localization, built-in control and testing systems, test automation, and reliability calculation and optimization. Most of those application-relevant problems were posed for the first time. This research area is still topical today.
The scientific and organizational role of the Laboratory proved to be very significant. The annual schools on technical diagnosis led by P.P. Parkhomenko strengthened interest in these problems and earned authority and recognition among Soviet scientists and engineers involved in the development of computing and control equipment. Starting in 1973, 18 schools were organized in total. Almost 100 students defended their candidate’s dissertations, and over 20 became Doctors of Science. Also, six All-Union Meetings on Technical Diagnosis and Fault Tolerance were held, which aroused the interest of researchers from neighboring and more distant countries. A solid scientific foundation of important technological knowledge on design automation, testing, diagnosis, functional control, and fault tolerance was created. This foundation was not completely destroyed even during the perestroika era.
In the 1970s, the Laboratory de facto became a coordinating research center on technical diagnosis in the USSR. The Laboratory’s employees were closely involved in the R&D of many enterprises: the Research Center for Electronic Computing (NITsEVT), the Research Institute “Scientific Center” (NIINTs), the Research Institute of Semiconductor Engineering (NIIPM), the Research Institute of Instrument Engineering (NIIP), the VEGA Scientific and Production Association (NPO VEGA), the Impulse Research Institute, the Elektropribor Design Bureau (Kharkov), and others. The Laboratory honorably fulfilled the task set by the USSR Academy of Sciences to diagnose, restore, and maintain the control and computer equipment of a new series of imported fishing super trawlers at stationary bases and in the open ocean.
In the mid-1970s, following the Institute’s work on PS-2000, the Laboratory started designing fault-tolerant multiprocessor control systems. The work was carried out for spacecraft (Elektropribor) and next-generation long-range reconnaissance flying laboratories (NPO VEGA). This range of problems is still relevant today.
Mikhail F. Karavay, Head of Laboratory No. 27
Since 1995, the Laboratory has been headed by Cand. Sci. (Eng.) Mikhail F. Karavay (currently Dr. Sci. (Eng.) and Honorary Man of Science and Technology of the Russian Federation).
In 2006 and 2007, the Laboratory was joined by new researchers in the area of microelectronics reliability (Cand. Sci. (Eng.) B.P. Petrukhin and his colleagues) and experts in networks and switching (Dr. Sci. (Eng.), Prof. G.G. Stetsyura and Dr. Sci. (Eng.) V.S. Podlazov). In 2010, the Laboratory merged with Laboratory No. 4, previously headed by Dr. Sci. (Eng.), Prof. V.V. Ignatushchenko. Laboratory No. 4 studied the reliability of complex software systems in multiprocessor environments. Also, Cand. Sci. (Eng.) A.M. Mikhailov, a developer of the neural cortex model, joined the Laboratory. (This area is new for the Laboratory.)
Vladislav V. Ignatushchenko
In recent years, the Laboratory has been conducting research on several theoretical lines:
– highly reliable and survivable control information systems;
– reliability analysis and methods for calculating the reliability of systems built on the modern microelectronic base;
– the neural cortex model for recognition problems that involve processing very large data arrays.
In the first area (highly reliable and survivable control information systems), note the following results:
- A theory of fault tolerance was developed based on the invariant-group study of system structures. An effective analytical approach to the fault tolerance problem was proposed to design optimal fault-tolerant systems of different architectures. For the first time in the literature, it was established that fault tolerance mathematically rests on the symmetry properties (the automorphism group) of the system structure (M.F. Karavay). Several problems on system diagnosis and optimal resource allocation in multiprocessor systems with the hypercube and homogeneous graph architectures were solved (P.P. Parkhomenko).
- The Laboratory’s theoretical results on fault tolerance and survivability provide a new look at the design of systems-on-chip (SoC). Numerous switching and logical resources-on-chip can be used to implement cost-effective and efficient structural fault tolerance methods. Such methods were developed in the Laboratory. They are based on a virtual representation of a circuit-on-chip as a set of logical blocks, with one or more complex logical blocks (CLBs) on the given level doubled on each subsequent level. For example, the blocks may span 1/128 of the entire circuit, then 1/64, and so on up to 1/2 of the circuit. A chip packing algorithm for CAD was developed. This algorithm uses the natural redundancy-on-chip to map the failed CLB onto the redundant space-on-chip (Cand. Sci. (Eng.) S.S. Uvarov).
- Several fundamental problems were solved for built-in test and functional diagnosis systems of digital equipment with system decomposition and testing at limiting operating frequencies. The results yield a new constructive approach to testable devices when designing systems-on-chip (Cands. Sci. (Eng.) G.P. Aksenova and V.F. Khalchev). Studies of built-in self-recovery mechanisms continue for the systems with redundant structural and functional resources. For systems-on-chip, such mechanisms are investigated using the causal approach to the consideration of damaging factors by analogy with the survival principles in a hostile (adverse) environment of biological organisms, symbioses, and highly organized communities (E.A. Adoyan and Cand. Sci. (Eng.) Yu.L. Tomfeld).
- Research was carried out to develop new approaches to the organization of reliable numerical calculations. The premise is that they should involve a new standard in which calculations are performed simultaneously with the reliability assessment of their results. Such an approach should sharply reduce the possibility of unpredicted incorrect results in the operation of highly reliable systems (S.I. Uvarov).
- Significant current efforts are aimed at solving fundamental problems of switching networks (P.P. Parkhomenko, M.F. Karavay, and V.S. Podlazov). Switching problems in supercomputers, multicore chips, and real-time system area networks are now important in computer science. They determine success in increasing the performance and fault tolerance (in general, reliability) of modern systems based on parallelism. As shown by the previous studies of invariant-group properties of system structures, a fault-tolerant structure of acceptable redundancy can be obtained from an arbitrary structure in some cases only (even if this solution is minimal). To get beyond these seemingly insurmountable bounds, it was proposed to map the initial structure into the complete graph structure. When choosing and designing switching systems, several issues are often analyzed: performance, capacity, complexity, scalability, parallelism, fault tolerance, operation in a heterogeneous environment, simple control, conflict-free data transmission, acceptable frequency ranges, noise resistance, continuity with previous solutions, and others. Unfortunately, a complete graph (or crossbar) is an unacceptable environment for mapping the original graph due to its high complexity. At the same time, the other characteristics of a complete graph (those listed above) are very attractive for the designed systems.
- The Laboratory develops new switching structures for microelectronics, computer engineering, and real-time system area networks. This research aims at constructing a mathematical framework for a clear answer to all of the issues mentioned above, including the complexity problem of complete graphs. As discovered, the symmetric balanced block design, an element of combinatorial mathematics little known in the engineering community, contains great opportunities in creating network switching facilities for high-performance fault-tolerant (particularly heterogeneous) control and computing systems. Block designs have a graph equivalent, the bipartite graph.
Block designs can be interpreted as quasi-complete switching structures: graphs whose vertices are connected not by the point-to-point principle but through a simple switch with almost zero delay in signal transmission. In this case, the number of communication channels and ports of an n-node network decreases by a factor proportional to √n compared to the complete graph. This is their main advantage over the switching structures modeled by complete graphs.
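The port saving can be illustrated on a small case. The sketch below is only an illustration under stated assumptions, not the Laboratory's construction: it builds a symmetric (13, 4, 1) block design by shifting the difference set {0, 1, 3, 9} modulo 13, treats each block as one hypothetical switch, and checks that any two nodes meet in exactly one switch. All function names are chosen for this example.

```python
from itertools import combinations


def cyclic_design(base_block, n):
    """Shift a difference set modulo n to obtain n blocks (one block per switch)."""
    return [frozenset((x + shift) % n for x in base_block) for shift in range(n)]


def is_quasi_complete(blocks, n):
    """Check that every pair of nodes shares exactly one block (switch)."""
    return all(
        sum(1 for block in blocks if u in block and v in block) == 1
        for u, v in combinations(range(n), 2)
    )


def ports_per_node(blocks, node):
    """Count the switches, and hence the ports, that a node is attached to."""
    return sum(1 for block in blocks if node in block)


# Assumed example: 13 nodes, 13 switches, 4 nodes per switch.
N = 13
BLOCKS = cyclic_design({0, 1, 3, 9}, N)

if __name__ == "__main__":
    print("every pair shares one switch:", is_quasi_complete(BLOCKS, N))
    print("ports per node in the design:", ports_per_node(BLOCKS, 0))
    print("ports per node in the complete graph:", N - 1)
```

In this example each node needs 4 ports instead of the 12 required by a complete graph on 13 nodes, which is consistent with the √n scaling described above.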
For the first time, it was observed that bipartite graphs, balanced block designs, and switching networks are not separate concepts but “brothers-in-law.” This result turned out to be most important because researchers received a strong mathematical apparatus and formulated the problem of designing high-performance fault-tolerant network switching systems. Also, it was realized that this research could make a technological breakthrough in creating ultra-large integrated circuits (FPGA and SoC): the number of necessary connections in the switching network is reduced by orders of magnitude.
The proposed topology is essentially a two-stage switch representing an “almost” complete graph: it can be treated as a complete graph for applications. It was called the “quasi-complete graph.”
Why is the quasi-complete graph of such interest? Mainly because it has all the positive characteristics of a complete graph and is much simpler. Moreover, any topology can be mapped into a quasi-complete graph (an invaluable property for performance and fault tolerance). Enough work has already been done for practical results. It is clear how to design clusters of up to 1500 users. It is also clear how to cascade these networks and build combinations of them containing tens of thousands of users.
Another research area is related to G.G. Stetsyura’s many years of work on combining calculations and data exchange in data transmission channels. A group of channel nodes performs distributed computations (logical and arithmetic: addition, subtraction, multiplication, max and min operations) over data during their bitwise transmission along the channel. When leaving the last group node, a data packet moving through the channel contains the result of the group operation. This approach has an extensive range of applications: accelerating collective operations in computers (by at least a factor of log n for n processors in polynomial calculation, convolution, discrete Fourier transform, and sorting); reducing the active chip area for data exchange operations; quickly detecting and eliminating faulty components, etc.
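The idea of computing on a packet while it passes through the nodes can be sketched in software. The example below is a simplified model under assumptions of this illustration, not the actual channel hardware or protocol: the packet is a least-significant-bit-first stream, and each hypothetical node adds its local operand to the passing stream one bit at a time, keeping only a one-bit carry. The function names and the fixed packet width are invented for the example.

```python
def to_bits(value, width):
    """Return the LSB-first bit list of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from an LSB-first bit sequence."""
    return sum(bit << i for i, bit in enumerate(bits))


def adder_node(stream, local_value):
    """Add local_value to a passing LSB-first bit stream, one bit at a time."""
    carry = 0
    for i, bit in enumerate(stream):
        total = bit + ((local_value >> i) & 1) + carry
        carry = total >> 1
        yield total & 1  # the bit leaves the node immediately


def channel_sum(local_values, width):
    """Send an all-zero packet through every node; the last node emits the sum."""
    stream = iter(to_bits(0, width))
    for value in local_values:
        stream = adder_node(stream, value)  # chain the nodes along the channel
    return from_bits(list(stream))  # result is taken modulo 2**width


if __name__ == "__main__":
    print(channel_sum([5, 17, 42, 100], width=8))
```

Because the nodes are chained generators, each bit passes through all nodes before the next bit enters, so no node ever stores the whole packet. Other group operations such as max or min would need a different per-node rule and a different bit order.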
These approaches are developed as methods to support the autonomy of hard real-time control systems. Autonomy is understood as self-control means in the system, including configurability, optimization, self-recovery, and self-protection against hostile interference. At the 2018 National Supercomputer Forum in Pereslavl-Zalessky, Intel representatives noted the start of their R&D works on combining calculations and data exchange with memory in channels as a promising line to increase system performance.
In 2009–2013, the Laboratory analyzed and compared several methods for calculating the reliability of integrated circuits based on testing results by different firms with different methodologies to develop a predictive model for the reliability indicators of modern CMOS chips (B.P. Petrukhin and his colleagues).
Today, the main elements of digital technology in the world are CMOS (complementary metal-oxide-semiconductor) chips. They include programmable logic arrays, microprocessors, various memory elements, etc. The main suppliers of large and very large CMOS chips are Altera, Xilinx, Atmel, and others. In accordance with the ISO 9000 standard, all manufacturers confirm the quality of their products, particularly their reliability indicators.
These elements belong to the class of highly reliable products: the failure rate is one failure per one hundred million device-hours or less. Such indicators can therefore be confirmed only through evaluation tests in forced modes and conditions (although the manufacturers do not recommend using failure rates based on evaluation tests to estimate the reliability of products containing such elements). At the same time, it is practically impossible to get reliable information about failures during operation because FPGA failures under normal conditions are very rare. Therefore, the aim of the studies was to assess the possibility of using the results of evaluation tests by Altera and Xilinx over the past five years or more (a significant equivalent operating time). A critical analysis of the failure modes was performed, and the failure mechanisms and the effect of various external factors were considered. The American military handbook MIL-HDBK-217F (Notice 2) gives a more pessimistic assessment than the French UTE handbook (CNET 93). The analysis shows that the failure rates obtained by both techniques are generally higher than those of the evaluation tests. According to the test results, FPGA failure rates are almost independent of the characteristic size and degree of integration. At present, FPGAs are widely used in consumer products and aerospace equipment, and their sensitivity to radiation effects has become apparent. Their real characteristics are necessary to design reliable electronic devices.
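To show why a large equivalent operating time is needed to confirm such low failure rates, the sketch below converts a failure rate to FIT (failures per 10^9 device-hours) and computes an upper confidence bound for a test with no observed failures. It assumes a constant failure rate (exponential model) and a known acceleration factor for the forced test mode; the sample size, duration, acceleration factor, and confidence level are hypothetical numbers, not data from the studies described above.

```python
import math

FIT_PER_HOUR = 1e9  # 1 FIT = one failure per 10**9 device-hours


def to_fit(rate_per_hour):
    """Convert a failure rate in 1/hour to FIT."""
    return rate_per_hour * FIT_PER_HOUR


def equivalent_device_hours(devices, hours, acceleration_factor):
    """Operating time under normal conditions equivalent to an accelerated test."""
    return devices * hours * acceleration_factor


def zero_failure_upper_bound(device_hours, confidence):
    """Upper bound on a constant failure rate when the test saw no failures."""
    return -math.log(1.0 - confidence) / device_hours


if __name__ == "__main__":
    # One failure per one hundred million device-hours, expressed in FIT.
    print("rate from the text, FIT:", to_fit(1e-8))

    # Hypothetical test: 1000 devices, 2000 hours, acceleration factor 100.
    t_eq = equivalent_device_hours(1000, 2000, 100)
    bound = zero_failure_upper_bound(t_eq, confidence=0.60)
    print("equivalent device-hours:", t_eq)
    print("60% upper bound, FIT:", to_fit(bound))
```

With these assumed numbers, the test accumulates 2·10^8 equivalent device-hours and bounds the failure rate at roughly 4.6 FIT. Without the acceleration factor, the same bound would require a hundred times more devices or test time.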
M.F. Karavay is actively involved in the work of the Moscow Experimental Design Bureau Mars (MEDB Mars) and the Central Research Institute for Machine Building (TsNIIMash) on the design and production of the latest fault-tolerant control and computing systems for upper stages and small satellites. Where possible, the Laboratory’s many years of experience in creating diagnostic support are applied in the latest developments of MEDB Mars. Jointly with MEDB Mars, a new architecture of fault-tolerant memory for operation under enhanced ionizing radiation was patented in the Russian Federation. The memory withstands (parries) up to several hundred permanent failures that were previously considered irrecoverable, with no noticeable loss in performance. Another line of joint research is creating two-sided (two-channel) fault-tolerant systems with characteristics close to those of modern four- and three-sided systems. Two-sided systems are intended for upper stages, small satellites, and aircraft produced by the State Experimental Design Bureau Raduga (SEDB Raduga). Onboard control computers are designed based on domestic systems-on-chip manufactured by Elvis (Zelenograd).
In 2016, the Laboratory was joined by employees of the former Laboratory No. 5 (Analysis of Properties of Complex Structure Systems), high-class experts in reliability: Chief Researcher, Dr. Sci. (Eng.) V.S. Viktorova; Leading Researcher, Cand. Sci. (Eng.) N.V. Lubkov; Senior Researcher, Cand. Sci. (Eng.) A.V. Antonov; Senior Researcher, Cand. Sci. (Eng.) G.L. Polyak; Senior Researcher A.S. Stepanyants; and Junior Researcher M.Yu. Vorobyova, a postgraduate student of Moscow State University. In 2016, some employees of the former Laboratory No. 13 (Functional Safety) also joined Laboratory No. 27: its Head, now Chief Researcher, Dr. Sci. (Eng.) E.V. Yurkevich; Senior Researcher, Cand. Sci. (Law) B.V. Kolosov; and Researcher L.N. Kryukova.
Boris G. Volik, Founder and first Head of Laboratory No. 5
Laboratory No. 5 was created in 1972 by merging two scientific groups engaged in the complex automation of a new class of nuclear-powered submarines (Project 705). One group, led by Dr. Sci. (Eng.), Prof. Sergey M. Domanitsky (1927–1971), developed reliability analysis and assurance methods for logical control systems. The second group (within Laboratory No. 49) dealt with problems that were new at the time: it developed methods for analyzing the properties of complex structure systems and selecting their optimal levels based on simulation modeling of system operation. The leader of this team, Boris G. Volik, became Head of Laboratory No. 5.
The research of Laboratory No. 5 covered two areas.
The first area concerned the analysis of reliability, survivability, efficiency, and technogenic safety of complex structure systems. These problems have a common methodology for building calculation models and a common mathematical apparatus, including probability theory, algebra of logic, combinatorial analysis, and mathematical statistics.
The second area included the development of methods and tools for simulation modeling of complex-structure organizational systems. When creating models, a researcher may consider the interaction of systems, including conflicts. In this case, the simulation system is complemented by game models to find the losses and payoffs of the parties to the conflict. The simulation models can be used to develop best-behavior algorithms and find the balanced values of system indicators. In the most complex game models (e.g., bilateral conflicts), different situations are played out with a human operator. In these cases, the simulation system is equipped with a decision support system (DSS), which includes a simulation model of the situation. The simulation model makes it possible to assess different decision options (alternatives) in terms of a vector efficiency criterion. The decision-maker then uses DSS algorithms to determine the best alternative according to their preferences.
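A minimal sketch of choosing among alternatives scored by a vector criterion is given below. It is an assumption of this illustration rather than the DSS algorithms of Laboratory No. 5: dominated alternatives are discarded first (Pareto filtering), and the decision-maker's preferences are then modeled as hypothetical weights in a weighted sum. The alternatives, scores, and weights are invented, and higher scores are taken to be better.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(alternatives):
    """Keep only the alternatives that no other alternative dominates."""
    return {
        name: scores
        for name, scores in alternatives.items()
        if not any(dominates(other, scores) for other in alternatives.values())
    }


def best_by_preferences(alternatives, weights):
    """Pick the non-dominated alternative with the highest weighted score."""
    front = pareto_front(alternatives)
    return max(
        front,
        key=lambda name: sum(w * s for w, s in zip(weights, front[name])),
    )


# Hypothetical alternatives scored on three efficiency indicators.
ALTERNATIVES = {
    "A": (0.9, 0.4, 0.5),
    "B": (0.6, 0.8, 0.6),
    "C": (0.5, 0.7, 0.5),
    "D": (0.7, 0.6, 0.9),
}

if __name__ == "__main__":
    print("non-dominated:", sorted(pareto_front(ALTERNATIVES)))
    print("preferred:", best_by_preferences(ALTERNATIVES, (0.5, 0.3, 0.2)))
```

Here alternative C is dropped because B is at least as good on every indicator, and the assumed weights then select D from the remaining three. A different set of weights, or a non-additive preference model, could select another non-dominated alternative.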
During its existence, Laboratory No. 5 performed several applied projects highly appreciated by the customers.
Under the leadership of Cand. Sci. (Eng.) B.B. Buyanov, the variants of the navigation and piloting complex for the Il-62M aircraft were analyzed to obtain an international certificate for flights over the North Atlantic (the 1970s); also, the reliability of numerical control systems for machine tools was studied. He proposed the principle of comparing design solutions by vector estimates in multicriteria decision problems and developed algorithms for selecting preferable options considering the decision-maker’s preferences.
From the 1980s, Laboratory No. 5 carried out work on the reliability analysis of ship power plants, life support systems of nuclear submarines, and subsystems of automated process control systems (APCSs) for nuclear power plants (NPPs). That work was led by Cand. Sci. (Eng.) N.V. Lubkov. He was also responsible for assessing the technical condition of power supply facilities within the comprehensive project on developing automated control systems for the municipal economy.
In the 1990s, Laboratory No. 5 developed reliability models for onboard fault-tolerant computing complexes designed in the Research Institute “Scientific Center” (Zelenograd). In the 2000s, Laboratory No. 5 participated in the Russian-American project on the RAM (reliability, availability, and maintainability) analysis of chemical weapons destruction facilities. A.S. Stepanyants was in charge of that work.
Starting from the mid-2000s, Cand. Sci. (Eng.) A.V. Antonov and his colleagues were engaged in the following activities:
- software testing and verification for the APCSs of NPPs (the unit and station control levels) located in Bushehr and Kudankulam;
- preparation for the safety certification of a highly-reliable hardware and software complex for NPPs;
- algorithmic motion control for complex technical objects during their operation and software tools for processing motion parameters measurements and recovering the forces and moments;
- ultra-large databases within a specialized monitoring and control complex for process parameters to prevent accidents when testing spacecraft installations.
Valentina S. Viktorova
In 2006, employees of Laboratory No. 5 initiated a new research area, i.e., the testability of aircraft systems. They were involved in large-scale projects of the leading aircraft engineering organizations (the Sukhoi Civil Aircraft Company, the IRKUT Corporation, and the State Research Institute of Aviation Systems (GosNIIAS)). The Laboratory performed the following contractual works: developed models, methods, and algorithmic software for the automated testability analysis of MS-21 aircraft; studied testability and maintenance models for onboard avionics and the impact of these factors on reliability indicators. Testability analysis and automation projects for aircraft systems were led by Dr. Sci. (Eng.) Valentina S. Viktorova.
In July 2013, she became Head of Laboratory No. 5.
Since the beginning of the 21st century, the theoretical results within the second research area have been implemented under the supervision of Cand. Sci. (Eng.) G.L. Polyak:
– The work “Computer Simulation Systems as a Tool to Elaborate the National Policy in Science and Technology under the Military Reform” was listed among the most important RAS results for the national defense and security.
– Demonstration programs for two maritime operations in the North Atlantic and the Caspian Sea were developed. Based on these programs, a software complex of the simulation system was created and implemented in the training process of the Military Academy of the General Staff of the Russian Federation.
At the end of 1993, the Institute’s Scientific Council established Laboratory No. 13 (Functional Safety). Cand. Sci. (Eng.) E.V. Yurkevich was elected its Head. (Now, he is Dr. Sci. (Eng.), Prof., Academician of the Russian Academy of Natural Sciences, and Chief Researcher of Laboratory No. 27.)
Evgeny V. Yurkevich
Laboratory No. 13 was formed as a scientific unit for developing the new generation of the State System of Devices and Automation Means (GSP-2). This explains the close connection of its research in the 1990s with the orders of the Russian Committee on Machine Building (the successor of Minpribor) and Rosstandart (now the Federal Agency for Technical Regulation and Metrology). The first results obtained in Laboratory No. 13 were models of operational impacts to improve the activities of public authorities in the new market conditions.
For over 10 years, the Certification Body for Electrical Equipment, accredited by Rosstandart, operated on the basis of Laboratory No. 13; its employees were accredited as experts in the field of information technology and electrical engineering. IBM, DUX, Samsung, and other major corporations were among the applicants of the Certification Body.
Further research was determined by another topical problem: the reliable control of man-machine industrial automation systems. A peer-reviewed journal, Reliability, was founded. E.V. Yurkevich was its first Editor-in-Chief for over 10 years.
For improving international standards on the functional safety of software and hardware means, Laboratory No. 13 studied the functional reliability of the combination of technical and organizational-economic systems. A situational principle-based methodology was proposed for managing the standardization of software and hardware complexes (E.V. Yurkevich and L.N. Kryukova).
E.V. Yurkevich and L.N. Kryukova, experts in the functional reliability of technical systems, participated in the standardization works of Technical Committee 65 (Industrial-Process Measurement, Control and Automation) of the International Electrotechnical Commission (IEC).
For the innovative development of the rocket and space industry, Laboratory No. 13 formed a methodology for ensuring the stability of onboard spacecraft systems to external impacts jointly with Roscosmos enterprises (TsNIIMash and the VNIIEM Corporation). Within that work, an important line was building expert systems. At the International Exhibition of Scientific, Technical and Innovative Developments “Measurement, World, Man,” the expert forecasting system for ensuring spacecraft stability to electrophysical impacts was awarded a gold medal in the category “Information and Analytical Systems” (2015); the information and analytical system for assessing the quality of rocket and space equipment, developed jointly with the VNIIEM Corporation, was awarded a gold medal in the category “Production Automation and Computerization.”
Presently, employees of Laboratory No. 27 are studying the functional reliability of onboard spacecraft systems. Jointly with Laboratory No. 31, a simulation optimization methodology was proposed for the mechanisms to control the resistance of functional modules to external impacts. The information support peculiarities of computer multi-agent technologies were considered for the intersubjective interaction of experts. Regular discussion of new results at the Moscow Seminar on Systems Theory and Control Problems, headed by E.V. Yurkevich, is an effective tool for research development.
The works of Laboratory No. 27 were presented in the form of journal papers and author’s certificates. Also, note the following monographs: Metody analiza i sinteza struktur upravlyayushchikh sistem (Methods to Analyze and Design Control System Structures), Moscow: Energoatomizdat, 1988; Modeli i metody rascheta nadezhnosti tekhnicheskikh sistem (Models and Methods to Calculate the Reliability of Technical Systems), Moscow: URSS, 2014. Many works were carried out by the decisions of the Government and were awarded diplomas of the Institute and government prizes. In 2017, the Laboratory was entrusted with important work to diagnose the technical condition of the Mission Control Center systems in Korolev. This work was included in the 4-year State Program of Upgrading the MCC of Roscosmos (M.F. Karavay, V.F. Khalchev, N.V. Lubkov, and A.V. Antonov).