Graph methods are more general than you may think
Graph data science methods are usually applied to data that has some inherent graphical nature, e.g., molecular structure data, transport network data, etc. However, graph methods can also be useful on data which does not display any obvious graphical structure, such as the tabular datasets used in machine learning tasks. In this article I will demonstrate simply and intuitively — without any mathematics or theory — that by representing tabular data as a graph we can open new possibilities for how to perform inference on this data.
To keep things simple, I will use the Credit Approval dataset below as an example. The objective is to predict the value of Approval based on the values of one or more of the other attributes. There are many classification algorithms that can be used to do this, but let’s explore how we might approach this using graphs.
The first thing to consider is how to represent the data as a graph. We would like to capture the intuition that the greater the number of attribute values that are shared between two instances, the more similar are those instances. We will use one node to represent each instance (we refer to these nodes as instance nodes), and one node for each possible attribute value (these are attribute value nodes). The connections between instance nodes and attribute value nodes are bi-directional and reflect the information in the table. To keep things really simple, we will omit attributes Employment and History. Here is the graph.
Given attribute values for some new instance, there are several ways in which we might use graph methods to predict some unknown attribute value. We will use the notion of message passing. Here is the procedure we will use.
Initiate a message with value 1 at the starting node, and let this node pass the message to each node to which it is connected. Let any node that receives a message then pass on the message (dilated by a factor k, where 0 <k<1) to each other node to which it is connected. Continue message passing until either a target node is reached (i.e., a node corresponding to the attribute whose value is being predicted), or there are no further nodes to pass the message to. Since a message cannot be passed back to a node from which it was received the process is guaranteed to terminate.
When message passing has completed, each node in the graph will have received zero or more messages of varying value. Sum these values for each node belonging to the target attribute, and then normalize these (sum) values so that they themselves sum to 1. Interpret the normalized values as probabilities. These probabilities can then be used to either predict the unknown attribute value or to impute a random value drawn from the distribution. Dilating message values at each pass reflects the intuition that longer paths should contribute less to the probability estimate than shorter paths.
Suppose that we wish to predict the value of Approval given that Income is Low. The arrows on the graph below illustrate the operation of the message-passing procedure, with the thickness of each arrow representing the message value (dilated by factor k = 0.5 at every hop).
The message is initiated at node Income:Low (colored green). This node passes messages of value 1 to both Instance 1 and Instance 2, which then each pass the message (with a dilated value of 0.5) to nodes Education:College and Approval:No. Note that since Education:College receives messages from both Instance 1 and Instance 2, it must pass each of these messages to Instance 3, with a dilated value of 0.25. The numbers at each node of the target variable show the sum of message values received and (in parentheses) the normalized values as percentages. We have the following probabilities for Approval, conditional on Income is Low:
- Prob (Approval is ‘Yes’ | Income is Low) = 20%
- Prob (Approval is ‘No’ | Income is Low) = 80%
These probabilities are different to what we would have obtained from a count-based prediction from the table. Since two of the five instances have Income Low, and both of these have Approval No, a count-based prediction would lead to a probability of 100% for Approval No.
The message-passing procedure has taken into account that the attribute value Education College, possessed by Instances 1 and 2, is also possessed by Instance 3, which has Approval Yes, thus contributing to the total message value received at node Approval:Yes. If we had incorporated the additional attributes Employment and History in our graph, this would likely have further increased the number of paths connecting the start node to the target nodes, thereby utilizing additional contextual information, and improving the estimate of the probability distribution.
The message-passing procedure can also be used when conditioning on more than one attribute. In such a case we simply initiate a message at each of the nodes corresponding to the attribute values we are conditioning on and follow the same procedure from each. The graph below shows the result of predicting the value of Approval given Income is Low and Education is Graduate. The sequence of messages originating from each starting node are shown in different colors.
Instance 4 has Education value Graduate, and therefore contributes to the sum of message values received at node Approval:Yes. A further contribution to Approval:Yes is made by Instance 5, which shares Income High with Instance 4.
[Note: in each of these examples we have used Approval as our target variable; however, we could have estimated the probability distributions for Income or Education in exactly the same way].