A few days ago, I attended a very interesting conference by Emery Berger , a Professor in the School of Computer Science at the University of Massachusetts Amherst, the flagship campus of the UMass system.
In 2010, the economists Carmen Reinhart and Kenneth Rogoff, both now at Harvard, presented results of an extensive study of the correlation between indebtedness (debt/GDP) and economic growth (the rate of change of GDP) in 44 countries and over a period of approximately 200 years. The authors argued that there was an “apparent tipping point”: when indebtedness crossed 90%, growth rates plummeted. It looks that the results of this study were widely used by politicians to justify austerity measures taken to reduce debt loads in countries around the world.
What programming language did they use to develop the model?
C++? Nope…. even if there are aprox. 3.5 million users of this programming language around the world…
Was it Java? Nope… even if there are some 9 million users of Java around the world…
What did they use then? Like many others in Social and Bio sciences they used … Microsoft EXCEL …. It is estimated that there are around 500 million users around the world ( aprox. 7% of the World population) and…. Yes, EXCEL is a very powerful programming language with the formulas/macros and the embedded execution method…
A friend of mine, Oscar Pastor, who is Professor of Computer Science at the Polytechnic University of Valencia, is trying to apply Information Systems Technologies to genetics and together with his research group has being dong research on that area of Biology. He tries to organize DNA sequences in such a way that discovering patterns for illness is made easier and quicker.
Why I refer to this? … it looks that hundred of gigabytes of information on the subject gathered and processed by research groups are in flat file and EXCEL Worksheets and therefore checking the correctness of models and data developed with EXCEL is extremely important.
As stated by Emery and his team, program correctness has been an important programming language research topic for many years. A lot of research has been carried out ( and still is !!!) to find techniques to reduce program errors . They range from testing and runtime assertions to dynamic and static analysis tools that can discover a wide range of bugs. These tools enable programmers to find programming errors and to reduce their impact, improving overall program quality.
The Holy Grail in this area is to achieve “program proofing”, i.e. being able to find a mathematical (formal) representation of a program that allows it to be “proofed” very much in the same way one “proofs a theorem”.
Nonetheless, a computation is not likely to be correct if the input data are not correct. The phrase “garbage in, garbage out,” long known to programmers, describes the problem of producing incorrect outputs even when the program is known to be correct. Consequently, the automatic detection of incorrect inputs is at least as important as the automatic detection of incorrect programs. Unlike programs, data cannot be easily tested or analyzed for correctness.
There a variety of reasons why “data errors exist”: they might be Data Entry Errors (typos or false transcription), Measurement errors (acquisition devices is faulty), Data Integration errors (mixing different data types or measurement units….. ).
On Data Integration errors, remember the Mars Climate Orbiter loss in 1999 “because spacecraft engineers failed to convert from English to metric measurements when exchanging vital data before the craft was launched”.
By contrast with the proliferation of tools at a programmer’s disposal to find programming errors, few tools exist to help find data errors.
There are some automatic approaches to finding data errors such as data cleaning (cross-validation with ground truth data) and statistical outlier detection (reporting data as outliers based on the relation ship to a given distribution (e.g. Gaussian). However, identifying a valid input distribution is at least as difficult as designing a correct validator…. and, as stated by the authors of the research “even when the input distribution is known, outlier analysis often is not an appropriate error-finding method. The reason is that it is neither necessary nor sufficient that a data input error be an outlier for it to cause program errors !!!”.
While data errors pose a threat to the correctness of any computation, they are especially problematic in data-intensive programming environments like spreadsheets. In this setting, data correctness can be as important as program correctness. The results produced by the computations—formulas, charts, and other analyses— may be rendered invalid by data errors. These errors can be costly: errors in spreadsheet data have led to losses of millions of dollars. …….. and here comes the relationship between Reinhart and Rogoff and Programming Language Technology.
Although Reinhart and Rogoff made the original data available that formed the basis of their study, they did not make public the instrument used to perform the actual analysis: an Excel spreadsheet. Herndon, Ash, and Pollin, economists at the University of Massachusetts Amherst, obtained the spreadsheet. They discovered several errors, including the “apparently accidental omission of five countries in a range of formulas “. After correcting for these and other flaws in the spreadsheet, the results invalidate Reinhart-Rogoff’s conclusion: no tipping point exists for economic growth as debt levels rise.
Now , could this kind of “accidental error “ have been detected with the help of Programming Language Technology?
Emery and his team have carried out a research whose key finding is that, “with respect to a computation, whether an error is an outlier in the program’s input distribution is not necessarily relevant. Rather, potential errors can be spotted by their effect on a program’s output distribution. An important input error causes a program’s output to diverge dramatically from that distribution. This statistical approach can be used to rank inputs by the degree to which they drive the anomalousness of the program”.
In fact, they have presented “Data Debugging”, an automated technique for locating potential data errors. Since it is impossible to know a priori whether data are erroneous or not, data debugging does the next best thing: locating data that have an unusual impact on the computation. Intuitively, data that have a high impact on the final result are either very important or wrong. By contrast, wrong data whose presence have no particularly unusual effect on the final result do not merit special attention.
Based on the theoretical research, they have developed a tool called CHECKCELL , a data debugging tool designed as an add-in for Microsoft Excel and for Google Spreadsheets.
It highlights all inputs whose presence causes function outputs to be dramatically different than the function output were those outputs excluded. CHECKCELL guides the user through an audit one cell at a time. CHECKCELL looks to be empirically and analytically efficient.
CHECKCELL’s statistical analysis is guided by the structure of the program present in a worksheet. In the first place, it identifies the inputs and outputs of those computations; it scans the open Excel workbook and collects all formula strings. The collected formulas are parsed using an EXCEL grammar expressed with the FParsec parser combinatory library. CHECKCELL uses the Excel formula’s syntax tree to extract references to input vectors and other formulas, resolves references to local, cross-worksheet and cross-workbook cells.
One interesting approach was that, in order to generate possible input errors to test the tool, Emery and his team used human volunteers via Amazon’s Mechanical Turk crowdsourcing platform to copy series of data to generate typical human transcription errors. According to Emery, on average, 5% of data copied are erroneous.
Emery and his team obtained the Excel spreadsheet directly from Carmen Reinhart and ran CHECKCELL on it. The tool singled out one cell in bright red, identifying it as “a value with an extraordinary impact on the final result”.
They reported this finding to one of the UMass economists (Michael Ash). He confirmed that this value, a data entry of 10.2 for Norway, indicated a key methodological problem in the spreadsheet. The UMass economists found this flaw by careful manual auditing after their initial analysis of the Spreadsheet.
Due to the extraordinary growth (more that 10%) of Norway in a single year ,1946 , out of the 130 years registered. Such a high growth in one year has an enormous impact on the model since Norway’s one year in the 60-90 percent GDP category receives equal weight to, for example, Canada’s 23 years in the category, Austria’s 35, Italy’s 39, and Spain’s 47 !!!!!
I asked Emery about the reaction of Reinhart and Rogoff, when the flaws in the model were discovered…. He answered that both economist maintained their conclusions… and in any case , according to him, they said that theirs was a “working paper” …. and, of course, it was not subject to a rigorous “peer review”…..
Yu can find the opinion of the prestigious economist and Nobel Prize winner Paul Krugman on EXCEL errors and the Reinhart-Rogoff model at http://www.nytimes.com/2013/04/19/opinion/krugman-the-excel-depression.html?_r=0
I found an interesting FAQ on this famous (or perhaps infamous…) flaw at http://www.businessweek.com/articles/2013-04-18/faq-reinhart-rogoff-and-the-excel-error-that-changed-history for those who are interested in the subject.
I will come back on Emery’s works when I will speak about SurveyMan in one of my next posts.
In the mean time, I wish all the readers a Happy 2015 !!!
Stay tuned for more in the New Year !!!