A Bayesian network structure for operational risk modelling in structured finance operations
This paper is an abbreviated version of the paper published as Journal of the Operational Research Society (2012) 63, 431–444, authored by A D Sanford and I A Moosa. The full paper is available to members on this site.
This paper is concerned with the design of a Bayesian network structure that is suitable for operational risk modelling. The model's structure is designed specifically from the perspective of a business unit operational risk manager whose role is to measure, record, predict, communicate, analyse and control operational risk within their unit. The problem domain modelled is a functioning structured finance operations unit within a major Australian bank. The network model design incorporates a number of existing human factor frameworks to account for human error and operational risk events within the domain. The design also supports a modular structure, allowing for the inclusion of many operational loss event types, making it adaptable to different operational risk environments.
Banks and non-bank financial institutions have become increasingly complex in terms of size and scope, global reach, and product and technological complexity. In addition to their more traditional focus on credit and market risk, these factors have driven the need for financial institutions to become more aware of the operational risks they face. Although severe operational loss events are very rare, such events have demonstrated their potential to bankrupt an organization.
Operational risk is defined by the Basel Committee on Banking Supervision (BCBS) as ‘the risk of loss resulting from inadequate or failed internal processes, people and systems or from external events’. This definition covers legal risk (the risk associated with legal action) but it does not include reputational risk (the risk of loss due to a decline in a firm's reputation) or strategic risk (the risk of loss associated with an improper strategic decision). The BCBS identifies seven types of operational loss events: external fraud; internal fraud; damage to physical assets; clients, products and business practices; business disruption and system failure; execution, delivery and process management; and employment practices and workplace safety. The type of operational risk modelled in this case study falls under ‘execution, delivery and process management’.
A detailed presentation of the Bayesian network technology seems unnecessary, given the large number of very good references currently available. Therefore, we provide only a brief description of the technology as a background for the remainder of the paper. A Bayesian network consists of nodes and directional arcs or arrows. In earlier forms of the Bayesian network technology, in order to use efficient inferential algorithms, each node was restricted to being either discrete, having at least two states, or continuous, having a Gaussian distribution over the real line. To work around this restriction, Bayesian network developers have had to use static discretized nodes over the ranges of non-Gaussian distributed nodes. Alternatively, to avoid these distributional restrictions, slower, less efficient inferential algorithms, such as simulation-based methods, have been used. Such distributional constraints have until recently restricted the application of Bayesian networks in modelling real-world environments. Fortunately, techniques for creating efficient hybrid Bayesian networks, consisting of both discrete and continuous nodes, have now been developed. Inferential efficiency in hybrid Bayesian networks has been achieved by the development of dynamic discretization techniques, which allow the inclusion of a wide variety of continuous distributions as discrete form approximations. Approximating continuous nodes in this manner, using dynamic discretization, means that the fast and efficient inferential algorithms can still be employed for probability propagation.
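To make the idea of a static discretized node concrete, the following minimal sketch converts a continuous, non-Gaussian variable into a discrete node with fixed bins. It is an illustration only: the lognormal loss-severity distribution, its parameters (MU, SIGMA) and the bin boundaries (EDGES) are assumptions chosen for the example and do not come from the SFO model. Dynamic discretization, which adapts the bin boundaries during inference, is not shown.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal distribution, with the limits at 0 and infinity handled."""
    if x <= 0:
        return 0.0
    if math.isinf(x):
        return 1.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def discretize(cdf, edges):
    """Probability mass falling in each interval [edges[i], edges[i + 1])."""
    return [cdf(hi) - cdf(lo) for lo, hi in zip(edges, edges[1:])]


# Hypothetical loss-severity node; parameters are illustrative assumptions.
MU, SIGMA = 9.0, 1.5
EDGES = [0.0, 1e3, 1e4, 1e5, math.inf]  # fixed (static) bin boundaries

bin_probs = discretize(lambda x: lognormal_cdf(x, MU, SIGMA), EDGES)
for lo, hi, p in zip(EDGES, EDGES[1:], bin_probs):
    print(f"[{lo:g}, {hi:g}): {p:.4f}")
```

The resulting list can serve as the probability table of a discrete node. The weakness of the static approach is visible here: the bin boundaries must be fixed in advance, whatever the shape of the distribution turns out to be.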
Within the Bayesian network, each node represents some variable of interest within the domain to be modelled. Behind each node is a function that represents the probability distribution of the states of that node. That function is often represented as a table, which is called the conditional probability table (CPT) or node probability table. Given the semantics of the nodes and states, we see that in modelling an environment, the modeller must decide on what variables are of interest to the user or decision maker. They must also decide on what measures are used to determine the state of these variables and what state descriptors provide the most value to the user or decision maker. In determining the number of states for a discrete node, the modeller should be aware that increasing the states improves the granularity of the measure, but makes probability elicitation potentially more complex. Therefore, a trade-off between the number of states and the additional complexity needs to be considered when developing the network.
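As an illustration of a CPT and of the trade-off between granularity and elicitation effort, the sketch below stores a table as a nested dictionary and counts the probabilities an expert would need to supply. The node names, states and numbers are hypothetical and are not taken from the SFO model.

```python
from math import prod

# Hypothetical CPT for a "payment_failure" node with one parent, "stress_load".
# Each key is a parent state; each row is a distribution over the child's states.
cpt_payment_failure = {
    "low": {"no_failure": 0.99, "failure": 0.01},
    "medium": {"no_failure": 0.97, "failure": 0.03},
    "high": {"no_failure": 0.92, "failure": 0.08},
}


def rows_are_normalized(cpt, tol=1e-9):
    """Check that every row of the CPT sums to one."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in cpt.values())


def cpt_size(child_states, parent_state_counts):
    """Number of entries in a CPT: child states times parent configurations."""
    return child_states * prod(parent_state_counts)


print(rows_are_normalized(cpt_payment_failure))
print(cpt_size(2, [3, 3]))  # binary child, two 3-state parents
print(cpt_size(5, [5, 5]))  # finer 5-state measures for the same three nodes
```

Moving the same three nodes from coarse to 5-state measures raises the table from 18 to 125 entries, which is the elicitation burden the modeller has to weigh against the extra granularity.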
More contentious is the meaning of the directed arcs within a Bayesian network model. It is usually the case, however, that the arrows and their direction encode a concept of ‘influence’ or ‘cause’ within the domain. An arrow extending from node X to node Y indicates that a change in, or manipulation of, the state of node X will cause changes in the state of node Y. In this paper, we take the firm view that the arrow directions are causal.
Architecturally, although Bayesian networks can take on an arbitrary level of complexity, all models consist of three basic structures. These are the serial, convergent and divergent structures, as illustrated in Figure 1. Although networks can be constructed in such a way as to show that all nodes are connected to each other, Bayesian networks are more suitable as a modelling tool in situations where connections between nodes are sparse rather than saturated.
Figure 1. Serial, divergent and convergent network structures.
Once the nodes and causal relations are identified, and the Bayesian network constructed, the state probabilities can be elicited and incorporated into each node's CPT. These state probabilities can be elicited from human experts, statistical analysis of historical data or, in some situations, learned directly by the Bayesian network.
The process of propagating evidence through the network in order to evaluate the prior marginal and posterior distributions of nodes is referred to as inference. Inference can be readily performed by generating the joint probability distribution encoded within the network, and then summing out all nodes other than the node of interest. The problem with this approach is that the size of the joint distribution, and hence the inference task, grows exponentially with the number of nodes. It has long been recognized that, under worst-case conditions, inference in Bayesian networks is NP-hard, for both exact and approximate inference. Despite this being the case, a large number of algorithms have been developed that, by taking advantage of the probability structures encoded within the network, achieve efficient probability propagation.
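The brute-force approach described above can be sketched for a small serial network. This is a minimal illustration, assuming a hypothetical three-node chain A -> B -> C with binary states and made-up probabilities. It is not one of the efficient propagation algorithms referred to in the text.

```python
from itertools import product

# Hypothetical serial network A -> B -> C with binary states 0/1.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
p_c_given_b = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.5, 1: 0.5}}


def joint(a, b, c):
    """Joint probability from the chain-rule factorization of the network."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def marginal_c(c, evidence_a=None):
    """P(C=c), or P(C=c | A=evidence_a), by summing out the other nodes."""
    a_values = [evidence_a] if evidence_a is not None else [0, 1]
    numerator = sum(joint(a, b, c) for a, b in product(a_values, [0, 1]))
    normalizer = sum(
        joint(a, b, cc) for a, b, cc in product(a_values, [0, 1], [0, 1])
    )
    return numerator / normalizer


print(round(marginal_c(1), 4))                # prior marginal P(C=1)
print(round(marginal_c(1, evidence_a=1), 4))  # posterior P(C=1 | A=1)
print(2 ** 3, 2 ** 30)  # joint table sizes for 3 and 30 binary nodes
```

With three binary nodes the joint table has only 8 entries, but with 30 binary nodes it already exceeds a billion, which is why enumeration is not practical for realistic networks.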
An advantage of using Bayesian networks as a modelling tool is that they can provide answers for both predictive and diagnostic queries. For example, a predictive query would be ‘what is the probability of a payment failure, given that a loan is being processed?’, while a diagnostic query would be ‘what is the most probable transaction type processed given that a payment failure occurred?’ Having observed the state of an effect node, such as a payment failure, inference can be carried out to show the probable states of the causal nodes within the network.
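The two kinds of query can be illustrated on a two-node fragment in which a transaction type node is the parent of a payment failure node. The transaction types, prior and failure probabilities below are hypothetical and serve only to show the mechanics.

```python
# Hypothetical numbers, for illustration only.
prior_type = {"loan": 0.5, "deposit": 0.3, "option": 0.2}
p_fail_given_type = {"loan": 0.02, "deposit": 0.01, "option": 0.06}


def predictive(tx_type):
    """Predictive query: P(payment failure | transaction type)."""
    return p_fail_given_type[tx_type]


def diagnostic():
    """Diagnostic query: P(transaction type | payment failure) by Bayes' rule."""
    joint = {t: prior_type[t] * p_fail_given_type[t] for t in prior_type}
    evidence = sum(joint.values())  # P(payment failure)
    return {t: value / evidence for t, value in joint.items()}


posterior = diagnostic()
most_probable = max(posterior, key=posterior.get)

print(predictive("loan"))
print({t: round(p, 2) for t, p in posterior.items()})
print(most_probable)
```

Under these assumed numbers, loans are the most common transaction a priori, yet once a payment failure has been observed the most probable transaction type is the option, because its failure rate is higher.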
Structured finance operations (SFOs)
The institution that participated in this research is one of Australia's largest banks. Included within the group is the bank's wholesale banking division, which itself also includes two business units: the Structured Finance and the Corporate Finance units. It is the responsibility of these two units to develop and market structured finance products to the bank's wholesale corporate customers. Structured products are created by bundling individual transaction products together to provide a tailored financial solution to meet an individual client's needs. The different individual products that may be included within a structured product can range from simple vanilla loans and deposits, to the more complex risk management tools such as over-the-counter options and credit derivatives. These structured products may have overall terms lasting as little as 2 weeks to as long as 5 years. They may be comprised of only a few individual cash flows involving only a single currency, or a large number of flows involving different currencies, values and timings. These structured products may also involve the establishment of alternative legal structures, or special purpose vehicles (SPVs), which are necessary to make the transactions tax-effective. To manage these complex transactions during the life of each structured product, a separate business unit has been established within the wholesale banking division. This unit, known as SFO, is responsible for the successful implementation of each structured product.
The management and staffing of SFO consists of a single director, who has overall responsibility for the leadership of SFO, its budget, workflow and personnel. A number of associate directors, reporting directly to the director, are responsible for the day-to-day management of SFO deal teams. They also have direct responsibility for individual deals of a highly complex nature. Below them are the associates, or line managers, who take responsibility for a number of outstanding individual deals and direct the analysts, who are junior staff within the deal team. It is the analysts who perform most of the activities related to individual transactions. Given the diversity of transaction types within the domain, SFO operators tend to be generalists rather than specialists. For this reason, it may be difficult to replace staff once they move on to other areas of the bank. This creates further potential risks brought about by staff turnover, inadequacy of training and the level of task instruction necessary.
Given the heterogeneity of the different transaction types making up a structured product, it is difficult to develop a process-based model to manage them. Instead, the SFO implements a ‘deal team’ or jobbing arrangement. Such a structure provides the flexibility to implement the actions necessary to support these products. This flexibility is not costless, however, with the potential for more frequent operational loss events resulting from human error. Furthermore, many of the bank's existing automated legacy systems lack the specific functionality needed to support these unique products. Therefore, greater reliance on both manual and spreadsheet-based solutions has resulted, more so than in a homogeneous transaction environment.
In the process of creating a new structured product for a client, the Structured Finance and Corporate Finance units produce a physical document known as the ‘deal document’. Within this document, all details (such as transaction types, cash flows, timings, currencies and legal structures) are specified and described. The deal document is the ‘blueprint’ that specifies the structured product from the initial setup to termination. It is the deal document that is passed on to SFO, and it is the deal document on which the SFO staff rely to guide them through each transaction. Given the potential complexities and risks involved in structured products, considerable due diligence is carried out in the authoring of this document. To ensure that errors are removed, the document passes through a large number of oversight hurdles prior to its release to SFO.
From an operational risk perspective, and as defined under Basel II, SFO exposes the bank to ‘execution, delivery and process management’ risk. SFO has identified its major operational risks as:
- Payments made to incorrect beneficiaries, and/or for an incorrect amount, and/or for an incorrect value date.
- Regulatory breach such as regulatory reporting or account segregation.
- Failure to enforce its rights or meet its obligations to counterparties over the life of a deal.
- Exposure capture. This is the risk that the terms of a transaction or details of a counterparty/security are not recorded accurately in the Bank's systems, resulting in a misunderstanding of the risk profile.
As part of the existing operational risk oversight and assessment of SFO, business environment scorecards are prepared regularly to identify generic key risk drivers (KRDs) common to all units within the wholesale banking division and score individual units against each dimension. Generic risk drivers include: (i) whether the unit is large and distributed or small and centralized (scale measure); (ii) whether transactions processed within the unit are of large wholesale dollar amounts or low retail dollar values (transaction measure); (iii) whether the products are complex and large in volume or simple and low in volume (product measure); (iv) whether the operational processes within the unit are complex, manual and outsourced, or simple, automated and carried out in-house (process measure); (v) whether the unit's operational and systems technology is legacy, disparate and with multiple interfaces, or is modern, integrated and with fewer interfaces (technology measure); (vi) whether unit staff are incentive driven, with high turnover, or wage and salary remunerated with low turnover (staff measure); (vii) whether the unit is undergoing rapid, large scale and complex change, or slow, small scale and simple change (change measure); (viii) whether the unit is experiencing new aggressive and competitive entrants, or competition is benign and stable (competition measure); (ix) whether the unit operates in a tight regulatory environment, with multiple legal entities and global reach, or a loose regulatory environment, with few legal entities and a local or national focus (regulatory/legal/geographic measures). For each unit, the operational risk drivers are assigned a number ranging from 1 (highest risk) to 9 (lowest risk). Although scorecards are a popular and valuable tool, unlike Bayesian networks they do not make explicit the causal relations between various risk drivers.
The SFO senior management views the Bayesian network model as offering a number of practical features. First, risk communications within and across business units can be improved, as the model makes explicit the risk drivers and their causal relations. Furthermore, auditable justifications for risk decisions can be made explicit and accessible to external parties. For example, the model could be used to support the bank's internal audit team in assessing the risk profile of SFO. The model also provides supporting evidence for management decisions on reducing and mitigating potential operational risks within the business unit. Although the model presented in this paper does not incorporate decision nodes, the Bayesian network technology allows for the inclusion of decision and utility nodes, which may be added in future versions of the model.
While efficient algorithms for inference in Bayesian networks have been available for some time, construction and development of these networks is still as much an art as it is a science. In modelling a problem domain, the developer must use their judgement in determining what level of detail is appropriate, which nodes should be included, and what causal relations may exist. Supporting these choices is the ultimate purpose of the model. As a simple modelling rule, the choices made have been informed by our desire to provide potential users with useful insights into the domain, while avoiding a model whose complexity makes it cumbersome to use and difficult to maintain and improve. Pragmatically, the rule is ‘simple enough to be used and complex enough to be useful’.
Reliance on human expert judgement in the construction and elicitation phases presents a number of difficulties. These include situations in which the available domain experts do not have sufficient breadth of knowledge to cover all facets of the domain, in which domain experts have difficulty specifying the correct causal ordering of events, or in which problems arise in combining the probabilities provided by the individual experts. A potential remedy for such difficulties is the use of automated machine learning techniques, which have been an active area of Bayesian network research, motivated particularly by the desire to overcome the bottleneck associated with using expert judgement. Many Bayesian network tools incorporate such machine learning algorithms, including the tool used in this paper. Given the lack of available hard observational data, the use of domain expert input appears unavoidable here. More generally, this may always be somewhat true, given the very nature of operational risk. Despite these difficulties, it is our view that, with respect to operational risk management, the involvement of domain expert judgement is highly desirable. Incorporation of expert knowledge is therefore a strength of Bayesian network modelling, but it is also a weakness because of the difficulties expert elicitation presents.
The Bayesian network construction process is iterative, proceeding in steps and cycles.
- Structural development and evaluation: initial development proceeds by identifying all of the relevant risk driver events, their causal relations, and the query, hypothesis or operational loss event variables.
- Probability elicitation and parameter estimation: this step involves defining the probability distributions of the nodes and setting their parameter values.
- Model validation: this step is probably the most problematic component of Bayesian network construction, especially when historical data is sparse.
In this paper, we are only concerned with the first stage of the development cycle, structural development and evaluation.
In the process of constructing the network model, we sought to answer the following broad-based questions:
- What operational risk queries should the model be able to answer?
- What operational risk categories and events should be included in the model?
- What are the main risk drivers in SFO for operational risk events?
- What are the causal relations between risk drivers and risk events?
- What are the key performance indicators (KPIs) for the SFO domain?
Acquiring answers to these questions initially proceeded through a review of SFO's internal documentation to gain an appreciation of the business and operational environment. The existing operational risk documents pertaining to SFO, provided by senior operational risk staff, were of particular value. Included within this material were the most recent business environment score ratings for SFO. These scorecard assessments proved invaluable in providing an overview of the unit's operational risk profile. SFO's ratings were: scale (8), transactions (2), product (2), process (2), technology (4), people (8), change (4), competition (8), and regulatory/legal/geography (7). Clearly, transactions, products and processes, as well as technology and change, present the greatest challenges to SFO. It has low transaction volumes but with relatively high dollar values. Its products are relatively complex while its processes rely on human performance and manual intervention. It also faces a business environment that is dynamic and changing due to business growth. Existing internal documentation also revealed that the KPI used by SFO senior management was the number of transactions processed per month.
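The reading of the scorecard given above can be reproduced with a few lines of code. The ratings are those reported for SFO. The cut-off used to flag the main challenges (THRESHOLD) is an assumption made for this sketch and is not part of the bank's scorecard method.

```python
# Scorecard ratings reported for SFO (1 = highest risk, 9 = lowest risk).
sfo_scores = {
    "scale": 8,
    "transactions": 2,
    "product": 2,
    "process": 2,
    "technology": 4,
    "people": 8,
    "change": 4,
    "competition": 8,
    "regulatory/legal/geography": 7,
}


def rank_drivers(scores):
    """Sort drivers from highest risk (lowest score) to lowest risk."""
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


def drivers_at_or_below(scores, threshold):
    """Names of drivers whose rating is at or below the threshold."""
    return [name for name, score in rank_drivers(scores) if score <= threshold]


THRESHOLD = 4  # hypothetical cut-off, not taken from the scorecard method
main_challenges = drivers_at_or_below(sfo_scores, THRESHOLD)
print(main_challenges)
```

The flagged drivers are process, product, transactions, change and technology, matching the assessment in the text. As with the scorecard itself, this ranking says nothing about how the drivers influence one another.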
Following on from the document reviews, a number of face-to-face unstructured and semi-structured interviews were carried out with the Director of Quality Assurance for SFO. For future reference, we refer to this domain expert as the ‘risk manager’. The risk manager has over 20 years of banking experience, and was responsible for the monitoring of SFO's operational risk events. For this reason, they had considerable interest in the project and the development of the Bayesian network tool, as it directly affects their own responsibilities. The risk manager has considerable detailed knowledge of the SFO domain and was very familiar with the operational processes involved, potential loss events and their drivers. They do not, however, have any experience with Bayesian networks as a decision support or risk management tool.
Taking a user's perspective, the risk manager saw the model as providing probability outputs for various operational loss events, conditional on the underlying characteristics of each type of transaction performed within SFO. It was the risk manager's view that SFO management required a more formal method of assigning operational risk capital allocations for each transaction. The current methods for doing this are somewhat ad hoc, opaque and reliant on the judgement of senior management. By introducing the Bayesian model, a more formal and transparent decision-making approach would be available, based on hard evidence, as well as professional judgement. The model would, at the very least, make the cognitive causal models used implicitly by SFO staff accessible to internal and external parties. The network model would also be useful as a baseline negotiating position in discussions between SFO and any of the other transaction originating business units.
The data capture mode driver is the manner in which transaction information is captured and stored, which has implications for the risk profile of any one transaction. This driver is closely associated with the existence, or otherwise, of an SPV that pertains to the transaction being processed. Transactions processed with an associated SPV are more likely to have their data capture performed using manual and spreadsheet-based solutions than via the bank's existing information system infrastructure, with its attendant automated features and error and reconciliation controls. This is because the use of SPVs can make a transaction more complex, requiring processing that is unique for that transaction.
Furthermore, transaction-based characteristic risk drivers also include the transaction size or principal value, and the payment or cash flow sizes associated with that transaction. These characteristics have particular influence on the impact of an operational loss event, as potential losses can be closely linked to these variables. Taken together, transaction type, data capture mode, SPV, transaction size and payment size constitute what we broadly categorize as the transaction characteristics that drive operational risk events.
The next set of drivers, actively managed transaction volumes, and quality and quantity of operational staff, which we broadly classify as ‘skills and experience’, are meant to capture the important states of SFO's working environment. Actively managed transaction volumes provide a proxy measure for the KPI used in SFO, the number of transactions processed. By including the actively managed transaction volumes within the model, we also gain some measure of the demands of the working environment on operational staff. It is the interaction of external factors, as found in the social-working environment and the technical environment, which drives the internal psychological states of the SFO operators. We consider the internal and external states of the operational staff to be of particular importance in driving operational loss events, because of SFO's reliance on the human factor.
The final categories are similar to each other in structure and represent the categories of operational loss events. These include regulatory/legal/tax, exposure management and payment failure events. Divorcing is also used between error types and loss events, producing intermediate nodes (for example, the transaction implementation Error => Exposure Failure event). Once again, divorcing is used to ease complexity and aid, at a later stage of development, expert elicitation and probability adaption.
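The effect of divorcing on elicitation effort can be shown by counting CPT entries before and after intermediate nodes are introduced. The sketch below assumes a hypothetical loss-event node with two states and four three-state error-type parents. These counts are illustrative and do not describe the actual SFO network.

```python
from math import prod


def cpt_entries(child_states, parent_states):
    """Number of probabilities in a node's CPT."""
    return child_states * prod(parent_states)


# Without divorcing: one 2-state loss-event node with four 3-state parents.
direct = cpt_entries(2, [3, 3, 3, 3])

# With divorcing: the parents are grouped in pairs behind two 3-state
# intermediate nodes, which then become the only parents of the loss event.
intermediate_a = cpt_entries(3, [3, 3])
intermediate_b = cpt_entries(3, [3, 3])
loss_event = cpt_entries(2, [3, 3])
divorced = intermediate_a + intermediate_b + loss_event

print(direct, divorced)
```

In this example the number of probabilities to elicit falls from 162 to 72, and each individual table has at most two parents. The saving comes at a price: the intermediate nodes summarize their parents, so the divorced model can only represent interactions that pass through those summaries.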
Other operational loss events previously identified by SFO, such as approval compliance and documentation, are not explicitly included in the final network model, although they may be included within regulatory/legal/tax failure, at the discretion of the risk manager. However, the subsequent modular design of the final network would make their inclusion relatively straightforward. The remaining category, theft, is considered to be more problematic and requires its own separate causal model.
Two approaches to structure evaluation are used during and at the end of the construction phase. The first approach involves revisions, walkthroughs and feedback from operational risk and SFO staff who are not involved in model construction. This is a somewhat informal approach, involving familiarity with the model, questioning and suggesting alternatives. The second is a more formal approach, based on the fact that a Bayesian network encodes, via its directed acyclic graph, assumptions regarding the conditional dependence and independence relations between nodes or variables within the environment. Independence between nodes is referred to in the Bayesian network literature as d-separation. Two nodes or two groups of nodes are said to be d-separated if they are independent or conditionally independent of each other. For example, nodes in group Y are said to be d-separated from nodes in group X, given evidence, ε, instantiated on the remaining nodes Z when the following is true:
P(X | Y, Z = ε) = P(X | Z = ε)
This reads as follows: X and Y are independent, given evidence of Z=ε. The d-separation property between nodes is also symmetrical: if X is independent of Y, given Z=ε, then Y is independent of X, given Z=ε. Furthermore, it is possible for a node or a group of nodes that were previously d-separated to become d-connected when evidence is added to the network.
What does d-separation mean from a network modeller's perspective? If two nodes or groups of nodes are d-separated, then information about the first node or group of nodes will not provide any further information on the state or states of the second node or group of nodes. Therefore, in constructing a network model, the modeller should ensure that the network structure does not d-separate any nodes that a domain expert would consider to carry information about each other.
In developing the network model, a major design concern is to ensure that the transaction type node remains d-connected to the operational loss event nodes. One of the most important outputs required of the model, from the risk manager's perspective, is to provide predictive probabilities of operational loss events, conditional on the different types of transactions processed and the working conditions within the SFO environment. The transaction type node represents a considerable amount of implicit information. Contained within that node is an implicit representation of the process detail, and how that transaction type process interacts with the working conditions and operator loadings. It would undermine the model's performance if this information were not available to the operational loss event nodes. It is not necessary to have a direct causal link between the transaction type and the operational loss event nodes, so long as at least one pathway is available. It might be argued, however, that the extra directed causal link from transaction type to operational loss event nodes would represent all other causes that are not related to human errors, or system error events, etc. This point is really moot, as ultimately the transaction type node is d-connected to all operational loss event nodes, provided that not all of the human error type nodes are instantiated, and they would not be instantiated when using the model for operational risk predictions.
Likewise, the other important nodes related to operational loss events are those that model the working environment, and the internal state of operators within SFO. These nodes include skills and experience, active transaction load, time load, mental effort load and stress load. They may well be instantiated with evidence, as part of the process of predicting operational loss events. Once again, the transaction type node remains d-connected to the operational loss event nodes with all of the working environment and internal operator state nodes instantiated.
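The d-separation checks discussed above can be sketched with the standard ancestral moral graph test: restrict the graph to the query and evidence nodes and their ancestors, link parents that share a child, drop edge directions, remove the evidence nodes, and test whether a path remains. The four-node graph below is a hypothetical, heavily simplified fragment that borrows node names from the text. It is not the actual SFO network structure.

```python
from itertools import combinations


def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, xs, ys, zs):
    """True if the nodes in xs are d-separated from those in ys given zs."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)

    keep = ancestors_of(xs | ys | zs, parents)

    # Moralize the ancestral graph: undirected edges plus co-parent links.
    neighbours = {node: set() for node in keep}
    for child in keep:
        pars = parents.get(child, set())
        for p in pars:
            neighbours[p].add(child)
            neighbours[child].add(p)
        for p, q in combinations(pars, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)

    # Remove the evidence nodes, then test whether xs can still reach ys.
    reached, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in reached:
            continue
        reached.add(node)
        stack.extend(neighbours[node] - zs)
    return reached.isdisjoint(ys)


# Hypothetical fragment: two causes converge on a human error node,
# which in turn drives a loss event.
EDGES = [
    ("transaction_type", "human_error"),
    ("stress_load", "human_error"),
    ("human_error", "payment_failure"),
]

tt, pf = {"transaction_type"}, {"payment_failure"}
print(d_separated(EDGES, tt, pf, set()))            # no evidence
print(d_separated(EDGES, tt, pf, {"stress_load"}))  # working environment known
print(d_separated(EDGES, tt, pf, {"human_error"}))  # error node instantiated
print(d_separated(EDGES, tt, {"stress_load"}, set()))
print(d_separated(EDGES, tt, {"stress_load"}, {"human_error"}))
```

In this fragment the transaction type stays d-connected to the loss event with no evidence and with the working environment node instantiated, and is cut off only when the human error node itself is instantiated. The last two checks show the convergent case: the two causes are d-separated a priori but become d-connected once evidence is entered on their common child.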
Conclusions and future research
We have developed a network structure for the modelling of operational risk based on a functioning SFO unit within a major Australian bank. The dominant perspective used in developing this model structure is that of human error and its role in contributing to operational losses. Within the unit under investigation, human action plays a dominant role in the transaction processes, which makes it logical to emphasize human error. The model is designed to generate probabilities of operational loss events by consideration of interaction between the working environment, transaction processes and their effect on the generation of human errors. A valuable feature of our model is its modularity, which provides the opportunity to add other types of operational loss events as necessary.
Future development of the model involves the elicitation of prior probabilities for each of the states for each node and their parent node configurations. This will allow a priori event probabilities to be generated and evaluated against the actual probabilities experienced within the unit. Further research will also consider how the model will adapt, as new information arises in the unit. The Bayesian network technology allows for the learning of model structure and parameters directly from observed data. We see this model adaption functionality as an important future development, making the model responsive to changes within the environment. We also see this adaption feature as being a valuable addition for the purposes of organizational learning. Operational staff will have the facility to compare their a priori assumptions and beliefs against the adapting model, as new risk events unfold.
Operational risk involves the failure of people, processes and systems. Although the ‘people’ component is well covered by the model, we would like to include more features that cover processes and systems as well. This, we feel, will give the model a greater range of applications in banking. An important further addition will be to introduce more nodes that help to identify control failures, particularly as they relate to internal and external fraud.
Although we focused on human errors in our model, it is worth reminding ourselves that it is people, and people alone, who are the source of agency within an organization. Although processes and systems may fail, whether due to poor design or function, it is ultimately people who must take responsibility. It is for this reason that, by focusing on human errors, the model provides a good foundation for future development.