When the Network becomes Social: Small World Graphs and O.R.

Mathematicians have been studying graphs for a long while. Sociologists found out that some of them explain how we interact. Indeed, social networks just make the connection more evident to anyone. In the middle of that, some researchers have been wondering about the following question: can we make optimal decisions based on our local information into a social network?

A world of lines and dots…

Dots and lines connecting pairs of dots – that’s a graph (but we usually say vertices and edges – or nodes and arcs – when talking about them). Mathematicians study graphs because they are structures capable of modeling lots of relationships among entities. Sometimes they wonder if a property found in a certain graph implies another one, developing statements to the Graph Theory. Other times they want to leverage those properties when designing an algorithm that manipulates certain types of graphs, like in Combinatorial Optimization algorithms. As a matter of fact, that is not an isolated case – many researchers handling real-world problems aim at designing algorithms with an outstanding performance for the most common instances they expect to solve.

… and the world of people!

Many people have already heard about the “six degrees of separation” principle, which states that – on average – you can reach any person in the world through a chain of six people that know one another. Such “magical number” emerged from experiments of Stanley Milgram and others during the 1960’s, in which they asked a person in the U.S. to deliver a letter to another person by submitting it to someone that he/she knew and who he/she supposed to be closer to such person. Theoretical results also point something similar: for a random graph, the average shortest distance among pairs of vertices is proportional to the logarithm of the number of vertices, what means a very slow pace of increment as graphs get bigger and bigger. However, that is not truefor any graph. Instead, people started looking to a more specific class of graphs called Small World Graphs, which are supposed to be representative of a number of situations.

Small World Graphs to be explored everywhere

Small World Graphs can be though as a combination of lattices (grids of edges) and clusters or quasi-clusters (groups in which almost all edges exist among vertices) with a small average degree (number of edges from each vertex). The former property ensures that the graph is connected and it is possible to find a path among any pair of vertices. The later has to do with the fact that two vertices sharing an edge with a third one are more likely to share an edge among them. Think about it: you might know some people from your university, almost everyone from your department, whereas each of your colleagues is more likely to have long range connections with researchers sharing a common interest worldwide; and all of that together means that you don’t need many steps to reach most of the researchers in the world. The same goes valid for airports: your local airport might be connected to a number of airports in other cities of your country and some of them are connected to airports worldwide in a way that you can attend to your meetings everywhere without worrying too much about how to get there. However, if you need to think about it, you might probably come up with a very good answer, isn’t it?

Do we always have optimal answers from local network information?

That’s the question that Jon Kleinberg tries to answer in the article “The Small-World Phenomenon: An Algorithmic Perspective”. He claims to have found the class of graphs for which such local information ensures an optimal decision. To be honest, I didn’t read the entire paper (I’m in really busy times) but it sounds really interesting and I left to the curious reader such task (let me know about it after).

This post was prompted by the INFORMS Blog Challenge of July: OR and Social Networking. You can check all the submitted entries by August 4th.

Drug discovery optimization: a meeting point for data mining, graph theory and operations research

How can I help finding the best compound to save a life?

In a glance, the answer is what research in optimizing drug design is about. Its main goal is to cheapen and speed-up drug development whilst producing more effective and less toxic pharmaceuticals. It is an interesting topic that involves much of what business analytics and optimization professionals use daily to solve problems in many other areas. For that reason, I decided to write about that for April’s INFORMS Blog Challenge: O.R. in Health Care.

Drug design basics: in vitro vs. in silico methods and QSAR analysis

Before a drug can be prescribed to humans, it must pass through a careful evaluation. However, it would not be reasonable or even feasible to test every possible compound for each desired effect, especially if such tests involve animals. Pharmaceutical researchers are interested in screening molecules on a huge data set, so that only those molecules that are most likely to work are ultimately tested in vitro, in animals and ultimately in humans. They accomplish that with computer-aided predictive analysis commonly called in silico tests.

In silico tests are designed in the belief that there exists a valid theory to extrapolate quantitatively the activity of a compound based on similar ones, using tools like Quantitative Structure-Activity Relationship (QSAR) analysis. The goal of a QSAR analysis is to find a function capable of predicting the activity of the main molecule of a compound based on the presence and quantity of certain molecular substructures. We are then lead to the quest for a way of representing molecules and to use available data about an activity to create trustable predictions for similar molecules.

The role of data mining and molecules as graphs

For many purposes, a molecule is just a graph with vertices and edges. However, the fact that molecular data are more structured does not mean that the problem is easier. Roughly* deciding whether one graph can be found inside another one is know as the Sub-Graph Isomorphism problem. That problem is NP-Complete, what means that there is not know an efficient algorithm to solve it.

(* By roughly, I ask you to forgive the lack of formality in which I defined sub-isomorphism: graphs are not said to be inside each other. A one-to-one correspondence between one graph or sub-graph and another one is called an isomorphism. However, if I was to tell it at first, I would lose most of the audience.)

More than just finding a sub-graph isomorphism, one is usually interested in find the most frequent ones. One interesting approach to Frequent Sub-Graph (FSG) mining is gSpan, a DFS-based mining algorithm proposed by Yan and Han. It consists of defining a unique lexicographic encoding for each sub-graph so that it can be counted more easily while exploring molecular data. There is also a controversy about the validity of 2D models such as graphs for modeling molecules, specially because some geometric subtleties differ a innocuous molecule from a a carcinogenic one. Thus, it is worth of notice that predictive models are not intended to skip in vitro tests but just point out which molecules are most likely to work out.

How can operations research fit in?

There are a number of O.R. applications for what we have been discussing here.

I would like to mention one interesting application of Linear Programming (LP) to QSAR analysis by Saigo, Kodawaki and Tsuda. Their approach consisted of using LP to build a regression function for prediction in which error is minimized. It is based on positive and negative relations between molecules and activities, that is, a quantitative relation among each molecule and the activity or a constraint imposing that such relation is not relevant. The decision variables of the problem are the coefficients associated with every possible substructure. Since there can be a lot of possible substructures, they start with a small problem and there is a column generation algorithm that tries to find a new variable whose addition to the problem might improve the results.

Final remarks: OR and the invisible good

The fun of many professions is to see the benefit of what you are doing. That is even more present in health care, since you are dealing with people and their lives. On the other hand, OR is the kind of invisible science, which is on its best use when the average people can not sense its presence. Notwithstanding, OR practitioners enjoy numbers and are glad to know that things are a bit better because of their work behind the scenes. Despite that, blogging about OR can help people notice the importance of the area to support others that are commonly visible to everyone.

I would like to thank my mother-in-law for some helpful advises to write about QSAR.