
Archive for December, 2014

The last post of 2014: The Reinhart and Rogoff Spreadsheet, Austerity Policies and Programming Language Technology…..

Wednesday, December 31st, 2014

A few days ago, I attended a very interesting talk by Emery Berger, a professor in the School of Computer Science at the University of Massachusetts Amherst, the flagship campus of the UMass system.

In 2010, the economists Carmen Reinhart and Kenneth Rogoff, both now at Harvard, presented the results of an extensive study of the correlation between indebtedness (debt/GDP) and economic growth (the rate of change of GDP) across 44 countries and over a period of approximately 200 years. The authors argued that there was an “apparent tipping point”: when indebtedness crossed 90%, growth rates plummeted. It appears that the results of this study were widely used by politicians to justify austerity measures taken to reduce debt loads in countries around the world.

What programming language did they use to develop the model?

C++? Nope… even though there are approximately 3.5 million users of this programming language around the world…

Was it Java? Nope… even though there are some 9 million Java users around the world…

What did they use then? Like many others in the social and biological sciences, they used… Microsoft Excel. It is estimated that there are around 500 million Excel users around the world (approximately 7% of the world’s population), and yes, Excel is a very powerful programming language, with its formulas, macros, and embedded execution model…

A friend of mine, Oscar Pastor, Professor of Computer Science at the Polytechnic University of Valencia, is applying information systems technology to genetics; together with his research group, he has been doing research in that area of biology, organizing DNA sequences in such a way that discovering disease-related patterns becomes easier and quicker.

Why do I mention this? It appears that hundreds of gigabytes of information gathered and processed by research groups in this field sit in flat files and Excel worksheets, so checking the correctness of models and data developed with Excel is extremely important.

As stated by Emery and his team, program correctness has been an important programming language research topic for many years. A great deal of research has been carried out (and still is!) on techniques to reduce program errors, ranging from testing and runtime assertions to dynamic and static analysis tools that can discover a wide range of bugs. These tools enable programmers to find programming errors and to reduce their impact, improving overall program quality.

The Holy Grail in this area is program proving: finding a mathematical (formal) representation of a program that allows it to be proved correct, very much in the same way one proves a theorem.

Nonetheless, a computation is not likely to be correct if the input data are not correct. The phrase “garbage in, garbage out,” long known to programmers, describes the problem of producing incorrect outputs even when the program is known to be correct. Consequently, the automatic detection of incorrect inputs is at least as important as the automatic detection of incorrect programs. Unlike programs, data cannot be easily tested or analyzed for correctness.

There are a variety of reasons why data errors exist: they might be data entry errors (typos or faulty transcription), measurement errors (a faulty acquisition device), or data integration errors (mixing different data types or measurement units…).

Regarding data integration errors, remember the loss of the Mars Climate Orbiter in 1999, “because spacecraft engineers failed to convert from English to metric measurements when exchanging vital data before the craft was launched”.

By contrast with the proliferation of tools at a programmer’s disposal to find programming errors, few tools exist to help find data errors.

There are some automatic approaches to finding data errors, such as data cleaning (cross-validation with ground-truth data) and statistical outlier detection (reporting data as outliers based on their relationship to a given distribution, e.g. a Gaussian). However, identifying a valid input distribution is at least as difficult as designing a correct validator… and, as the authors of the research put it, “even when the input distribution is known, outlier analysis often is not an appropriate error-finding method. The reason is that it is neither necessary nor sufficient that a data input error be an outlier for it to cause program errors”!!!
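To make that baseline concrete, here is a minimal Python sketch of the kind of z-score outlier test being discussed. This is my own illustration, not the authors’ code; the sample values and the threshold are invented.

```python
# A minimal, illustrative z-score outlier detector. The data and the
# threshold below are made-up examples, not anything from the actual study.
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if stdev > 0 and abs(v - mean) / stdev > threshold]

# 28.0 is an obvious transcription slip (say, 2.8 with a misplaced decimal point):
growth = [2.1, 1.8, 2.5, -0.4, 3.0, 2.2, 1.9, 28.0]
print(zscore_outliers(growth))          # -> [28.0]

# But a slip that lands inside the bulk of the distribution (2.8 typed as 2.3)
# raises no flag at all, even though it can still change a downstream average:
growth_subtle = [2.1, 1.8, 2.5, -0.4, 3.0, 2.2, 1.9, 2.3]
print(zscore_outliers(growth_subtle))   # -> []
```

The second case is exactly the “neither necessary nor sufficient” problem: an error does not have to be an outlier to distort the result.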

While data errors pose a threat to the correctness of any computation, they are especially problematic in data-intensive programming environments like spreadsheets. In this setting, data correctness can be as important as program correctness. The results produced by the computations (formulas, charts, and other analyses) may be rendered invalid by data errors. These errors can be costly: errors in spreadsheet data have led to losses of millions of dollars… and here comes the relationship between Reinhart and Rogoff and Programming Language Technology.

Although Reinhart and Rogoff made available the original data that formed the basis of their study, they did not make public the instrument used to perform the actual analysis: an Excel spreadsheet. Herndon, Ash, and Pollin, economists at the University of Massachusetts Amherst, obtained the spreadsheet. They discovered several errors, including the “apparently accidental omission of five countries in a range of formulas”. Once these and other flaws in the spreadsheet were corrected, the results invalidated Reinhart and Rogoff’s conclusion: no tipping point exists for economic growth as debt levels rise.

Now, could this kind of “accidental error” have been detected with the help of Programming Language Technology?

Emery and his team have carried out research whose key finding is that, “with respect to a computation, whether an error is an outlier in the program’s input distribution is not necessarily relevant. Rather, potential errors can be spotted by their effect on a program’s output distribution. An important input error causes a program’s output to diverge dramatically from that distribution. This statistical approach can be used to rank inputs by the degree to which they drive the anomalousness of the program”.

In fact, they have presented “Data Debugging”, an automated technique for locating potential data errors. Since it is impossible to know a priori whether data are erroneous or not, data debugging does the next best thing: it locates data that have an unusual impact on the computation. Intuitively, data that have a high impact on the final result are either very important or wrong. By contrast, wrong data whose presence has no particularly unusual effect on the final result do not merit special attention.
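To convey the intuition, here is a toy Python sketch of the leave-one-out idea. This is only my illustration of the principle, not the actual technique described in the paper, which relies on a statistical, resampling-based analysis rather than a naive loop.

```python
# Toy illustration of "rank inputs by their impact on the output":
# recompute the result with each input left out and see how far it moves.

def impact_ranking(inputs, compute):
    """Rank inputs by how far the output moves when each one is excluded."""
    baseline = compute(inputs)
    impacts = []
    for i, value in enumerate(inputs):
        without = inputs[:i] + inputs[i + 1:]
        impacts.append((abs(compute(without) - baseline), value))
    return sorted(impacts, reverse=True)

def mean(values):
    return sum(values) / len(values)

# A made-up column of growth rates with one suspicious entry (10.2):
data = [2.3, 1.9, 2.8, 10.2, 2.1, 2.5]
for impact, value in impact_ranking(data, mean):
    print(f"value {value:>5}: shifts the mean by {impact:.2f}")
```

The suspicious entry comes out at the top of the ranking, which is all the technique promises: it points the auditor at the data most worth checking.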

Based on this theoretical research, they have developed CHECKCELL, a data debugging tool designed as an add-in for Microsoft Excel and Google Spreadsheets.

It highlights all inputs whose presence causes a function’s output to be dramatically different from what it would be were those inputs excluded. CHECKCELL guides the user through an audit one cell at a time, and it appears to be both empirically and analytically efficient.

CHECKCELL’s statistical analysis is guided by the structure of the program present in a worksheet. First, it identifies the inputs and outputs of the computations: it scans the open Excel workbook and collects all formula strings. The collected formulas are parsed using an Excel grammar expressed with the FParsec parser combinator library. CHECKCELL uses each formula’s syntax tree to extract references to input vectors and other formulas, and it resolves references to local, cross-worksheet, and cross-workbook cells.
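Just to illustrate what “extracting references to input vectors” means in practice, here is a crude Python approximation; CHECKCELL itself uses a full Excel grammar written with FParsec (in F#), so treat the regular expression below purely as my own simplified stand-in.

```python
# Crude extraction of cell and range references from an Excel formula string.
# Real Excel syntax (quoted sheet names, R1C1 style, external links, ...) is
# far richer than this regular expression covers.
import re

REF_PATTERN = re.compile(
    r"(?:(?P<sheet>[A-Za-z0-9_]+)!)?"        # optional Sheet! prefix
    r"(?P<start>\$?[A-Z]{1,3}\$?\d+)"        # first cell, e.g. B5 or $C$2
    r"(?::(?P<end>\$?[A-Z]{1,3}\$?\d+))?"    # optional end of a range, e.g. :B49
)

def extract_references(formula):
    """Return the cell and range references mentioned in a formula string."""
    refs = []
    for m in REF_PATTERN.finditer(formula):
        ref = m.group("start")
        if m.group("end"):
            ref += ":" + m.group("end")
        if m.group("sheet"):
            ref = m.group("sheet") + "!" + ref
        refs.append(ref)
    return refs

print(extract_references("=AVERAGE(Growth!B5:B49)/SUM($C$2,D7)"))
# -> ['Growth!B5:B49', '$C$2', 'D7']
```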

One interesting aspect was that, in order to generate plausible input errors to test the tool, Emery and his team used human volunteers on Amazon’s Mechanical Turk crowdsourcing platform to copy series of data, thereby generating typical human transcription errors. According to Emery, on average about 5% of the copied data are erroneous.

Emery and his team obtained the Excel spreadsheet directly from Carmen Reinhart and ran CHECKCELL on it. The tool singled out one cell in bright red, identifying it as “a value with an extraordinary impact on the final result”.

They reported this finding to one of the UMass economists (Michael Ash), who confirmed that this value, a data entry of 10.2 for Norway, pointed to a key methodological problem in the spreadsheet. The UMass economists had found this flaw only through careful manual auditing after their initial analysis of the spreadsheet.

The issue is Norway’s extraordinary growth (more than 10%) in a single year, 1946, out of the 130 years recorded. Such high growth in one year has an enormous impact on the model, since Norway’s single year in the 60-90 percent debt/GDP category receives the same weight as, for example, Canada’s 23 years in the category, Austria’s 35, Italy’s 39, and Spain’s 47!!!
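To see why that equal weighting matters, here is a back-of-the-envelope Python sketch. The 10.2 figure for Norway and the year counts come from the passage above; the 2.0% average growth I assign to the other countries is purely invented to show the arithmetic.

```python
# Back-of-the-envelope comparison of country-weighted vs. year-weighted means.
# Only the year counts and Norway's 10.2 come from the critique; the other
# countries' 2.0% averages are placeholders, not the real data.
country_years = {          # (avg growth in the 60-90% debt/GDP bucket, #years)
    "Norway":  (10.2, 1),  # one exceptional year, 1946
    "Canada":  (2.0, 23),
    "Austria": (2.0, 35),
    "Italy":   (2.0, 39),
    "Spain":   (2.0, 47),
}

# Country-weighted mean: every country counts once, so Norway's single year
# carries as much weight as Spain's 47 years.
country_mean = sum(avg for avg, _ in country_years.values()) / len(country_years)

# Year-weighted mean: every country-year counts once.
total_years = sum(n for _, n in country_years.values())
year_mean = sum(avg * n for avg, n in country_years.values()) / total_years

print(f"country-weighted mean growth: {country_mean:.2f}%")  # 3.64%
print(f"year-weighted mean growth:    {year_mean:.2f}%")     # 2.06%
```

One anomalous country-year is enough to pull the headline number well away from what almost all of the underlying observations say.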

I asked Emery about the reaction of Reinhart and Rogoff when the flaws in the model were discovered… He answered that both economists maintained their conclusions… and that, in any case, according to him, they said theirs was a “working paper”… which, of course, had not been subject to a rigorous “peer review”…

You can find the opinion of the prestigious economist and Nobel Prize winner Paul Krugman on Excel errors and the Reinhart-Rogoff model at http://www.nytimes.com/2013/04/19/opinion/krugman-the-excel-depression.html?_r=0

I found an interesting FAQ on this famous (or perhaps infamous…) flaw at http://www.businessweek.com/articles/2013-04-18/faq-reinhart-rogoff-and-the-excel-error-that-changed-history for those who are interested in the subject.

I will come back to Emery’s work when I talk about SurveyMan in one of my next posts.

In the meantime, I wish all readers a Happy 2015 !!!

Stay tuned for more in the New Year !!!

Best

Paco

On entrepreneurship …..

Saturday, December 6th, 2014

A few days ago, a delegation from Spain visited UC Berkeley for a meeting organized by the Spain-USA Chamber of Commerce, with the participation of the Universities of Valencia, Sevilla, and Malaga, representatives of the public administrations of the Spanish central government and of the regional governments of Andalucía and Valencia, and six Spanish start-ups.

The two most interesting parts of the meeting were a presentation by J. Miguel Villas-Boas, Professor of Marketing Strategies at the UC Berkeley Haas School of Business, and a panel with the six start-ups chaired by Ken Singer of the Center for Entrepreneurship and Technology (http://cet.berkeley.edu/), who has himself been a serial entrepreneur.

Prof. Villas-Boas spoke about “Branding and Pricing Strategies in the Digital Economy”.

He set the scene by speaking about consumer behaviour on the web when searching for products to purchase.

It was expected that the digital world would lead to “low prices and a lesser role for brands”… in reality there is a lot of variability in prices and brands remain very powerful… and, not to be forgotten, “search is costly for customers in terms of mental effort”! Therefore, vendors, like customers, have to strike a balance between the information to be gained by searching and the search effort; research has found that “the optimal purchase threshold increases with the informativeness of search and decreases with search costs”.

If that is the case, how much information should a vendor provide? Too little, and customers do not have enough information to buy and may go somewhere else; too much (information overload) raises the cost of searching for the product. What about deliberate product obfuscation, so that it is more difficult to compare prices? It appears that structuring the product information to ease consumer search while avoiding information overload is the best approach.

On pricing, the question is how to price given consumer search effort and behavior… a low price means low margins, but consumers do not have to search much to buy; higher prices mean higher profits, but consumers will likely search more for better deals… how to strike the balance?

Another interesting part was about “personalization”; the central question is how to learn about customers in order to price differently, personalize products, and target the message. There are strategies to attract new customers, for instance offering discounts, and to price strategically at the beginning of campaigns to separate out high-valuation customers… the problem is that consumers may realize this is happening and refrain from buying in order to get discounts later…

For known customers, product targeting based on their previous preferences is a good strategy, but it is important to strike the right balance: look for the best fit between advertising and customers’ preferences while avoiding annoying them with too much advertising…

Another part of the presentation was about “Brands in the Digital World”. Based on market research, he presented “the most valuable global brands in 2013”, with Apple, Google, Coca-Cola, IBM, and Microsoft leading the pack. However, looking at social media mentions, the top brands for 2012 were Coke, Gatorade, Apple, Google, and Starbucks, while if one looks at fan pages, Facebook, YouTube, Coca-Cola, Disney, and MTV lead the pack.

The final part was about co-creation. He spoke about how Mountain Dew, which sells soda drinks, and Starbucks, which as you know sells coffee and food, engage with customers to create new flavors (Mountain Dew called its campaign “DEWmocracy”!) or to look for new product ideas. According to market research, internet users, at least in the United States, like brands to listen to them; the proportion is higher among those aged 25-34 and 35-44, and the main reasons are supporting a brand they like, receiving regular updates about products, or getting coupons.

I have recently had the opportunity to have a coffee with Professor Villas-Boas and we spoke about data privacy issues when identifying the customer (see my previous posts on this matter) and co-creation.

I wanted to see how to apply the approach and techniques mentioned above to the public sector, or what the public sector should do to engage better with citizens and civil society using the internet. He has introduced me to a Hungarian researcher, Zsolt Katona, who also works on co-creation at the Haas School of Business, and I will meet him soon. I am sure the collaboration will be very useful for my research project.

The second interesting item of the meeting was the panel with the Spanish entrepreneurs, chaired in a very professional manner by Ken Singer. The participants were the CEOs of Brave Zebra (a game outsourcing company), Closca (which manufactures foldable bike helmets), Imegen (DNA sequencing), GPTech (technologies for renewable energy), and Melomics Media (which applies artificial intelligence to music composition). They were asked about their personal histories as entrepreneurs and their reasons for starting out, the challenges they faced, their relationships with the academic world, if any, and what they would ask the government to do to make their lives easier.

I think these guys are real heroes (particularly in Spain…) who have sacrificed many hours of their lives to make their ideas become a reality… and all of them said that money was not their first reason for starting a company!!!

Stay tuned.. more to follow this week !!

Best

Paco