The pursuit of asymmetrical models in CP: Backtracking hardness is, after all, a tricky business!

Last week I wrote a post on the effectiveness of using backtracking alone to solve combinatorial problems. It was an interesting exercise to develop the discussion from theory to practice and back to theory again. However, there was a mistake in that development. As pointed by Jean-François Puget in his comment, it is possible to add a small number of symmetry breaking constraints to efficiently refute that class of infeasible instances of the pigeonhole principle used as example. After all, what was actually claimed by the theoretical development that I shortly mentioned is that the refutation of a specific formulation of the problem has an exponential lower-bound. While that could be a case for using a global constraint, it could also be addressed by symmetry breaking.

Symmetries arise naturally in combinatorial problems, but the extent to which they cause concern varies with the solving method used. They matter a lot for backtracking because of their impact in the size of the search space to be explored. We have a formulation with a symmetry breaking constraint if, compared to the formulation without such a constraint, only solutions differing by symmetry from the solutions of the more restricted formulation are feasible for one and not for the other.

There is an intuitive notion that symmetry breaking is a good thing, but care must be taken: what if it makes an easy problem harder? It might be the case that the original problem was underconstrained and additional constraints would rule out exactly the solutions that are easier to find, therefore making the search longer than without them. However, that was not the case of the problem in the previous post and is not likely to be the case in most applications. It is even interesting to ask if there are conditions in which this can be in fact observed.

Nevertheless, symmetry breaking is a bit challenging because it requires taking into account the whole formulation instead of reasoning about each constraint separately. For each symmetry identified in a subproblem, we may come up with an additional constraint that introduces information about the problem context that breaks it. As a consequence, symmetry breaking inference mechanisms cannot be encapsulated within global constraints and we are left with two options: expect good formulations from the modeler or develop some sort of general mechanism to find them.

The current practice is based on the former option, and thus the expertise of the modeler plays an important role in the process. However, the quest for a truly autonomous solver of broader use would demand further advances in the latter possibility. Honestly, I was not aware of the importance of reformulation until thinking about all of this, and it made me look again for Barry O’Sullivan’s paper at AAAI-10 discussing the automation of both modeling and solving. Among recent papers, it is a good start on the topic.

In conclusion, proving that backtracking cannot suffice is a bit trickier that I thought at first. Unless P=NP, there must exist an asymmetrical CSP class (or at least one for which breaking its symmetries would not be an easy problem) with lower-bound refutation asymptotically worst than any polynomial function in the size of the problem. If such a class is yet unknown, the interesting part of looking for it would be to find a number of general inference rules for symmetry breaking that we could use to solve other problems.

Why backtracking alone cannot suffice to solve combinatorial problems: CP perspective, empirical hardness models evidence, and proof complexity results

Constraint Programming (CP) is a technique with particular strength to solve classes of combinatorial problems for which global constraints were identified and exploited in the design of specialized algorithms. However, I have been convinced that CP is, above all, a general model to understand combinatorial problems and their solution – using the term ‘model’ in a broader sense, as explained in a previous post and based on John Hooker’s article at CP Letters.

Regardless of the solving method used, the classic aim of CP is to tackle the Constraint Satisfaction Problem (CSP), which ultimately encodes any problem involving variables with discrete and finite values as the intersection of the regular languages defined by each of its constraints. When special properties are observed in a class of such languages that would allow us to make strong and efficient inferences among those variables, we have found one of those so-called global constraints. Not surprisingly, the concept is in fact a formalization of the widespread practice of exploiting problem structure, which was also discussed in a previous post linking the implementation of local search operators to CP concepts.

In spite of such claimed ‘agnosticism’ regarding solving methods, CSP solving is often associated with backtracking search algorithms. The rationale is that, if compared to other general solving techniques like dynamic programming, backtracking scales better in memory use with the size of the problem being solved. As a consequence, CP lectures and many lines of research are developed with the premise of using backtracking-based search. In practice, however, that post on CP and local search also points that the converse is true: local search methods like variable neighborhood search have been successfully used in commercial CP solvers.

One of such lines in which backtracking was extensively studied is the adaptive selection of algorithms, which is aimed to select the most suitable algorithm to solve a given instance according to predictive models of performance. From the characterization of CSP instances by means of properties like constrainedness, which is the ratio of pairwise constraints to variables, empirical hardness models were developed that also documented easy-hard-easy phase transitions. In short, it is easy to solve an unconstrained problem or to infer that an overconstrained one is unfeasible, but those in the ‘feasibility frontier’ are much harder to solve. A short reference about the topic is the 2005 article of Carla Gomes and Bart Selman in Nature entitled ‘Can get satisfaction’, and a precise figure of such hardness can be seen in the graph at slide number 6 of Kevin Leyton-Brown’s talk at the Workshop on Beyond Worst-Case Analysis.

Although a well-documented phenomenon, until very recently it was yet unclear to me why this happens. I was only convinced of it this week, when I came across some lecture notes of KTH Professor Jakob Nordstrom on proof complexity that discussed the resolution of formulas for the pigeonhole principle. Proof complexity is a discipline concerned with understanding the hardness of theorem proving, and comprises the study of characteristics and strengths of different proof systems as well as their equivalence. Among such systems is resolution, which is equivalent to solving through backtracking a CSP in which variables can only assume two values – true or false, or 0 or 1 – and constraints only use the logical operators NOT, OR, and AND in Conjunctive Normal Form (CNF): a set of clauses using NOT and OR joined by AND. This is a special form of the satisfiability problem, or simply SAT, for which the backtracking search with only local inference corresponds to the well-known DPLL algorithm.

The pigeonhole principle states that it is not possible to assign n pigeons into m pigeonholes without putting two of them in the same pigeonhole if n > m. It turns out that encoding an instance of this problem in polynomial size in n and m such that n = m+1 as a satisfiability problem (SAT), i.e., a CSP with variable domains of size two and constraints built of logical operators in CNF, results in an unsatisfiable problem that requires an exponential lower-bound resolution. That happens because a valid refutation requires exploring a large number of combinations of assignments to prove that none would lead to the satisfaction of the problem constraints. [check the erratum at the end of the post]

In short, backtracking really needs global constraints!

Erratum

According to the first comment to this post, there is an enconding for which the mentioned refutation is efficiently found. The discussion is continuned in the sequel of this post.

The problem with the problem

Operations research is a field with many problematic words, of which ‘problem’ is an interesting example. The word is generally used to describe a difficulty, but it has found an additional meaning in mathematics as a well-defined question that one may use an algorithm to answer. That would not be bad if it was not by a problem (of the first kind): both meanings are often interchangeably used.

The unavoidable consequence is that new developments targeting a given application sometimes present a meaningless criticism against mathematical problems, which are abstract entities that exist in spite of their use. Such a concern may sound pedantic because the mathematical problem is usually only serving as a proxy to its interpretation as the solution to the problem (of the first kind) associated with a given application. However, this bothers me because it is the tip of an iceberg of misguided conceptualization. And, of course, that is not how one should treat guys like the poor traveling salesman and his problem, who did not intend to do any harm against the problems of others.

Even though this is not a common mistake among reputed researchers in optimization, it happens pretty often among those in fields that need optimization as a means but do not delve further into theoretical concerns. This is bad because many of them go to the industry and will eventually come back to hire an expert in operations research to solve their most challenging problems, which will also involve detaching their needs from fixed ideas on how to model them.

I have already written previously about the meaning of problem, model, and solution in this blog, but coming across yet another bunch of writings and sayings with similar issues made me do it again. Personally, I only realized the problem with that after talking about my work with people in other fields, to which a problem has only one meaning (the full story is on that previous post). If I ever have the opportunity to teach an introductory course on OR, I would try to make this point clear in the most funny way that I could conceive (the funny part takes longer to be forgotten). As a matter of fact, this issue reminds me of a poem of my favorite writer. It is entitled XXVIII and is from The Keeper of Flocks, a book written by the Portuguese poet Fernando Pessoa under the name Alberto Caeiro (translation from the blog Fernando Pessoa: Alberto Caeiro: Complete Poems):  

 

Today I read almost two pages
In a book by a mystical poet
And I laughed like someone who’d cried a lot.

 

Mystical poets are sick philosophers
And philosophers are crazy.

 

Mystical poets say flowers feel
And they say stones have a soul
And they say rivers have ecstasies in the moonlight.

 

But flowers wouldn’t be flowers if they felt,
They’d be people;
And if stones had a soul, they’d be living things, they wouldn’t be stones;
And if rivers had ecstasies in the moonlight,
Rivers would be sick people.

 

You need to not know what flowers and stones and rivers are
To talk about their feelings.

Talking about the soul of stones, of flowers, of rivers,
Is talking about yourself and your false thoughts.
Thank God stones are only stones,
And rivers are nothing but rivers,
And flowers are just flowers.

Me, I write the prose of my poems
And I’m at peace,
Because I know I comprehend Nature on the outside;
And I don’t comprehend Nature on the inside
Because Nature doesn’t have an inside;
If she did she wouldn’t be Nature.

Continue reading “The problem with the problem”

The unsustainable lack of analytical skills for a sustainable environment

Developing an analytical model to support decision-making is a really tricky task. I did many of them in order to decide whether it was a good idea to buy an apartment or to keep renting. What differed among them was the level of detail taken into account, which progressed slowly as I learned from my previous mistakes. While most of my models pointed out that it was better to keep money on a savings account and do not lend any instead of buying an apartment for the rent that can be earned (or that I would avoid paying since I had to live somewhere), I often forgot to regard the probability that apartments would appreciate considerably in a near future. Looking back, I see that my renter did an excellent investment by renting to me and waiting for what would come. Nevertheless, that moment is gone and it does not make any sense to bet that real estate in São Paulo will keep rising in price as it did in recent years.

When it comes to a trending topic like sustainability, that is supposed to represent the concern of people with protecting the environment for future generations while exploiting it for their own survival, I have been often told of misconceptions and lost opportunities that made many initiatives on that regard produce the opposite effect. And what is worst: as conceived, sustainability is a great but sometimes overlooked application domain to leverage analytical tools.

For instance, I was told of a workplace where someone had the idea of asking for shelves above the toilet washbasins. Since the washbasins were often wet, people got used to place a paper towel below their toiletries bags. As an experiment, such shelves were placed in half of the toilets of a floor. Long after, people started to ask when those shelves would be placed in the other half, since the experiment seemed to be successful. However, they were informed that someone had to collect data to assess if less paper towels were used in the first half of the toilets or not. As many people did not bother to walk more to use the toilets with shelves, it is very likely that the person undertaking such experiment will be surprised after realizing that the use of paper towels actually increased in that half and decreased in the other half. I hope that the person manages to sum up the use on both halves to realize what happened instead of concluding that shelves increase the use of paper towels.

It is attributed to Ronald Fisher the following quote: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”. While it would be interesting to count with professional help, I believe that we would be better off if people working with sustainability were aware of the importance of developing analytical skills to avoid working against their own goals.

Revisiting operations research elements – part II: results, research… and failure!

This post resumes the discussion about key concepts of operations research started previously in this blog. However, instead of focusing only on the meaning of terms, this time I’d like to discuss about which results are worth of publishing and the unacknowledged importance of failure.

Are you the “brand new champion”?

“I never knew a soul who ever took a licking.
My friends have all been champions at everything.”
(Fernando Pessoa, Portuguese poet)

For a while, I thought that the idea described by Fernando Pessoa explained what happens in O.R.: an article is only worth of being published if it has noticeable results. However, which results are we talking about? In “Testing Heuristics: We Have It All Wrong”, John Hooker stresses many concerns about how results-driven approaches have been transforming research into plain development. He summarized most of them by the maxim “The tail wags the dog as problems begin to design algorithms”. In fact, such concerns were wide spreading by that time. In the article “Generating Hard Satisfiability Problems”, Bart Selman and others warned that bad benchmarks may lead to wrong conclusions about how hard a problem is and, conversely, about how good an algorithm to such problem is. Moreover, theoretical results such as the No Free Lunch theorems state that there is no best overall algorithm and there are always weaknesses to be acknowledged. Otherwise, it would not make sense to talk about a subject called meta-learning, which aims at selecting the most appropriate algorithm to solve each problem instance.

I remember to hear about the importance of reporting bad results in experimental physics classes, but that was quite lost in my mind after my junior and sophomore years. The big picture suggested by Selman’s group and Hooker was that many approaches being done in the area strive to be claimed the best one to solve a given problem at the cost of biasing the tests towards situations that are better suited to what is being proposed, intentionally or not. Besides, hardware differences, coding abilities and several other issues might influence what is being presented. Minding that metaphor that I used in the previous post, one must not confuse the power of an idea with the performance of a source code when dealing with a given benchmark set at a given machinery architecture. For instance, if someone is proposing an algorithm that explores fewer branches during search, it would be better to measure the number of branches instead of the running time, since it might be the case that a bad choice of data structures might hinder an adequate comparison of different techniques. Hence, one must strive to write an article that does recognize when its solution fails along with when it succeeds, whilst striving to diminish perturbation factors on its results.

After all, operations research is research, right?

What would you answer if asked that? I let it go unanswered last time. Most often, O.R. does involve scientific research: if the problem being solved was quite simple to skip that part, it would be unnecessary to have someone thinking about it. Since I was asked that while working on a long-term project, that was quite true. Notwithstanding, even if there is no scientific research involved, O.R. is still sort of a research: one wants to find the most appropriate way of performing certain operations. Therefore, some of the subtleties and misconceived concepts mentioned previously come from the differences among scientific research and operations research: research involves a problem, solutions and it may fail sometimes. Indeed, the word research in O.R. is also the same used to denote research in terms like R&D in Brazil (pesquisa), Portugal (investigação) and France (recherche). Not surprisingly, the list of ambiguous terms is far from complete. For instance, programming means both planning something or the act of producing computer code, as stressed by Lusting and Puget in the article “Program Does Not Equal Program: Constraint Programming and Its Relationship to Mathematical Programming”, in which they explain why it means both things for mathematical programming but just the later for constraint programming.

Yet another conclusion

I have decided to place the most important conclusions along with the most confusing term of the previous post – solution, but I can repeat them once more. Being aware of the differences between the common O.R. terminology and what researchers adopt anywhere else is important to be more effective when spreading the word about what you did beyond your close peers. In these two posts, I aimed at sharing some thoughts that I would be glad to have heard before. In order to succeed with that, I had the valuable help of my fiancée Sabrina, which helped me by revising this text in order to avoid that I committed the same mistakes that I was pointing. It might still not be completely accurate or clear cut, since I’m just a graduate student striving to figure out my area.

Drug discovery optimization: a meeting point for data mining, graph theory and operations research

How can I help finding the best compound to save a life?

In a glance, the answer is what research in optimizing drug design is about. Its main goal is to cheapen and speed-up drug development whilst producing more effective and less toxic pharmaceuticals. It is an interesting topic that involves much of what business analytics and optimization professionals use daily to solve problems in many other areas. For that reason, I decided to write about that for April’s INFORMS Blog Challenge: O.R. in Health Care.

Drug design basics: in vitro vs. in silico methods and QSAR analysis

Before a drug can be prescribed to humans, it must pass through a careful evaluation. However, it would not be reasonable or even feasible to test every possible compound for each desired effect, especially if such tests involve animals. Pharmaceutical researchers are interested in screening molecules on a huge data set, so that only those molecules that are most likely to work are ultimately tested in vitro, in animals and ultimately in humans. They accomplish that with computer-aided predictive analysis commonly called in silico tests.

In silico tests are designed in the belief that there exists a valid theory to extrapolate quantitatively the activity of a compound based on similar ones, using tools like Quantitative Structure-Activity Relationship (QSAR) analysis. The goal of a QSAR analysis is to find a function capable of predicting the activity of the main molecule of a compound based on the presence and quantity of certain molecular substructures. We are then lead to the quest for a way of representing molecules and to use available data about an activity to create trustable predictions for similar molecules.

The role of data mining and molecules as graphs

For many purposes, a molecule is just a graph with vertices and edges. However, the fact that molecular data are more structured does not mean that the problem is easier. Roughly* deciding whether one graph can be found inside another one is know as the Sub-Graph Isomorphism problem. That problem is NP-Complete, what means that there is not know an efficient algorithm to solve it.

(* By roughly, I ask you to forgive the lack of formality in which I defined sub-isomorphism: graphs are not said to be inside each other. A one-to-one correspondence between one graph or sub-graph and another one is called an isomorphism. However, if I was to tell it at first, I would lose most of the audience.)

More than just finding a sub-graph isomorphism, one is usually interested in find the most frequent ones. One interesting approach to Frequent Sub-Graph (FSG) mining is gSpan, a DFS-based mining algorithm proposed by Yan and Han. It consists of defining a unique lexicographic encoding for each sub-graph so that it can be counted more easily while exploring molecular data. There is also a controversy about the validity of 2D models such as graphs for modeling molecules, specially because some geometric subtleties differ a innocuous molecule from a a carcinogenic one. Thus, it is worth of notice that predictive models are not intended to skip in vitro tests but just point out which molecules are most likely to work out.

How can operations research fit in?

There are a number of O.R. applications for what we have been discussing here.

I would like to mention one interesting application of Linear Programming (LP) to QSAR analysis by Saigo, Kodawaki and Tsuda. Their approach consisted of using LP to build a regression function for prediction in which error is minimized. It is based on positive and negative relations between molecules and activities, that is, a quantitative relation among each molecule and the activity or a constraint imposing that such relation is not relevant. The decision variables of the problem are the coefficients associated with every possible substructure. Since there can be a lot of possible substructures, they start with a small problem and there is a column generation algorithm that tries to find a new variable whose addition to the problem might improve the results.

Final remarks: OR and the invisible good

The fun of many professions is to see the benefit of what you are doing. That is even more present in health care, since you are dealing with people and their lives. On the other hand, OR is the kind of invisible science, which is on its best use when the average people can not sense its presence. Notwithstanding, OR practitioners enjoy numbers and are glad to know that things are a bit better because of their work behind the scenes. Despite that, blogging about OR can help people notice the importance of the area to support others that are commonly visible to everyone.

I would like to thank my mother-in-law for some helpful advises to write about QSAR.

 

Revisiting operations research elements – part I: problem, model and solution

People often admit that they have just discovered to know nothing about something they thought they knew. I felt that about terms like “problem”, “model” and “solution” during the last year. At a first sight, those words seem quite simple to anyone: a problem is a quest one wants to resolve, a model is an insightful description of such problem and a solution is an attempt of resolving the problem. As I want to show next, big messes are made of such simple things, especially in OR.

What is your problem?

Once, when I practiced a presentation at home saying “My problem is that I have those decision variables Xi, Yi and Zi”, I was interrupted by my mother-in-law with a “No! Your problem is that you have expensive machinery, oil wells to develop and you want to maximize the profit of the company”, as I indeed had! She might not know much about combinatorial optimization problems or other related class of optimization problems, but after two daughters, Ph.D. and Post-Doc in organic chemistry and a published book, she definitely knows better what a problem is.

I used to consider the “problem” as a class of purely abstract formulations or a domain-targeted specialization of such a class. However, I was ignoring the actual problem – something that must be fixed or enhanced –, which was occasionally mentioned as a motivation for the study. Indeed, such mistake can be the first step to detach a promising subject from its context and devalue its importance. As a matter of fact, I should have done the converse path: to consider my problem as a real matter from which disclosure is motivated by the resemblance with other problems of practical importance.

What does your model explain?

For a long time, I’ve been used to denote as a “model” what I write as input to an optimization solver. However, after reading “Good and Bad Futures for Constraint Programming (and Operations Research)” by John N. Hooker, I decided to get a step back about that. In his article, Hooker emphasizes the importance of models to leverage scientific progress according to their original meaning: an elegant description of a natural phenomenon. Under that assumption, a model is as valuable as its ability to explain how a given system works and why it works that way. For sure, that is not achievable by simply stating a bunch of equations.

The input of an optimization solver is always computer source code, even if it is at a very high level of abstraction when compared to generic programming languages. Nevertheless, what you are able to describe with such programming language – its syntax – can be considered as the “model” for some sort of class of abstract formulations, and my code was just a specialization of it. Therefore, it would be more accurate to say that I’ve spent much more time coding upon a model of the problem than modeling the problem. Metaphorically, the model is an ethereal inspiration to a substantial source code (hold that thought for the next post).

What are you solving?

The term “solution” has been so often associated with variable assignments of a given problem formulation that its use in its original assertion sounds like an invitation to confusion. I felt that when reading an article about Case-Based Reasoning (CBR) applications to constraint programming (CP), in which the authors had to manage such nomenclature conflict: in CBR, a solution is understood as a technique for solving a given problem and was forcefully redefined as a “resolution”, given that, in CP – and OR in general –, a solution is the output of a technique that solves some problem. I can’t imagine a better way of handling it than theirs: the ambiguity was acknowledged and possibly taught to some unaware readers, and the article moved forward.

Maybe it is a lost battle trying to associate the term “solution” in the OR field with what the general public would expect it to be. However, if that is not possible, one must emphasize the difference, since the benefit of succeeding on that – as well as in the other cases exhibited – is to present an original work in a more accessible language to researchers of other fields. Therefore, the reward for that is a higher probability of having your effort being recognized and used somewhere else.

Why to write about this?

Operations research possesses a powerful toolset, which can be applied almost everywhere. However, many authors acknowledge that OR is in a constant danger of extinction for being self-centered and detached from the outside world. I believe that one of the reasons for such belief is the misconception of terms. For that reason, I aimed to make my point as clear as possible to someone outside the field, what demanded a great number of revisions by my fiancée Sabrina. Doing that was much harder than writing the post at once and I am very grateful for her patience with my attempt to escape my own ivory tower.