David MacKay

I was deeply saddened to learn last week that Sir David MacKay has passed away. Over the last nine months, he struggled with cancer, and documented this journey in great detail on a personal blog. He is survived by his wife and two kids, aged one and four. He was just a few days short of his 49th birthday. 

I first met David MacKay at Princeton in 2006 when he gave a talk about a piece of HCI software called Dasher that he had designed. I immediately loved his presentation style, and I loved how with Dasher, he had blended theory with practical impact. He had a way to infuse a serious technical talk with wit and humor. Dasher was born out his realization that natural language can be compressed and that this compression can be presented visually as a form of text entry. From what I later read, Dasher went on to prove tremendously useful to the disabled. Soon after I got hold of a copy of David's book Information Theory, Inference and Learning Algorithms, considering it good enough to own in print, despite the fact that had made it available online for free. The amount of detail in the book in the way of examples is truly astonishing. It is a great book to learn ideas in the information sciences from first principles. His book was influential in giving me the confidence to study machine learning seriously towards the end of my PhD, because it served as a bridge from information and coding theory, which I was comfortable with. Around 2013 I learnt that dasher was available as an iOS app and I immediately installed it. A few years later I came across David's work on Gaussian processes and Bayesian learning. I remember enjoying watching this video on videolectures.net in which he explains Gaussian processes in his characteristic style -- very clear on technicals, witty and humorous, and loaded with demos and code snippets in Matlab. A while later I learnt of his book on climate change and remember browsing through the free copy he had put online.

Although I did not get too many opportunities to interact with him, David's work was, and continues to be, hugely inspiring. The world is a lesser place without him.

Deep learning and Bayesian Nonparametrics

Deep learning thrives on problems where automatic feature learning is crucial and lots of data are available. Large amounts of data are needed to optimize the large number of free parameters in a deep network. In many domains, that would benefit from automatic feature learning, large amounts of data are not available, or not available initially. This means that models with low capacity must be used with hand crafted features, and if large amounts of data become available at a later time, it may be worthwhile to learn a deep network on the problem. If deep learning really reflects the way the human brain works, then I imagine that the network being trained by a baby must not be very deep, and must have few parameters than networks that are trained later in life, as more data becomes available. This is reminiscent of Bayesian Nonparametric models, wherein the model capacity is allowed to grow with increasing amounts of data. 

Aesthetic pleasure requires partial familiarity

The toy plot below is inspired by Jurgen Schmiduber’s thesis on what we as humans seek when we seek aesthetic pleasure from information objects (visual art, music, movies, stories, etc.).

Partial familiarity is necessary in order to enjoy aesthetic pleasure from an object. At some level, this is pretty obvious. For e.g, a Mathematician may experience aesthetic elegance when reading a new theorem that a layman cannot, because the layman just isn’t familiar enough to be able to interpret it. At the other extreme, residents of even the most beautiful places on the planet cease to be overcome by the beauty of their surroundings because they grow accustomed to it. I think aesthetic pleasure therefore, is simply a state where the mind understands part of what it is shown, just enough, not too much, not too little.

  1. This explains why designers (fashion designers, graphic designers, industrial designers like those who decide what the next model of a car should look like, etc.) need to constantly innovate. People eventually get bored with designs that have been around.
  2. This is also the reason that retro designs sometimes make a comeback. Bell bottoms eventually fade from memory, and aren’t even familiar to those born in the 90s. A design is called ‘retro’ if has faded just enough from public memory to be partially familiar. Something that has completely faded from public memory is unlikely to make a comeback.
  3. One somewhat depressing consequence that the thesis above suggests is that we are constantly driven by a desire to find and consume new things that we are partially familiar with. That is, we want to move from the right edge of the plot to the middle. We cannot keep watching the same kind of movie, or TV shows, or wearing the same kind of clothes, etc. This suggests that the pursuit of aesthetic pleasure seems somewhat meaningless by itself. I told you it was depressing.
  4. There is some interesting work on why we find jokes funny, or, what makes a joke funny. Several jokes take two seemingly unrelated things and develop a new connection between them, for e.g. a pun. Here, the listener is familiar with part of the context in the joke, but there is something new. Once the joke has been heard, it looses its humor value, because it slowly moves towards the right edge of the plot above. Edward de Bono suggests about jokes:

The mind is a pattern-matching machine, and that it works by recognising stories and behaviour and putting them into familiar patterns. When a familiar connection is disrupted and an alternative unexpected new link is made in the brain via a different route than expected, then laughter occurs as the new connection is made.

Ratio of unbiased mean estimates

Suppose \(X\) and \(Y\) are two quantities and we are interested in estimating the ratio \(X/Y\) but we need to estimate \(X\) and \(Y\) using samples (e.g. survey samples). For example, \(X\) may be the the amount of money spent by the population on leisure in year \(T\) and \(Y\) may be the amount of money spent in year \(T+1\). It is tempting to estimate \(X\) and \(Y\) separately and divide the estimates to get an estimate of \(X/Y\). If \(\hat{\mu}_x\) and \(\hat{\mu}_y\) are empirical means of \(X\) and \(Y\) as computed from the samples, the ratio \(r = \hat{\mu}_x/\hat{\mu}_y\) is not an unbiased estimate of \(X/Y\). This can be easily seen as follows: $$E\left[\frac{X}{Y}\right] = E\left[X\right] E\left[\frac{1}{Y}\right] \geq \frac{E\left[X\right]}{E\left[Y\right]},$$where the equality follows the independence of \(X\) and \(Y\), and the inequality follows from Jensen's inequality, noting that \(1/Y\) is a convex function of \(Y\).

If samples of \(X\) and \(Y\) are available, an unbiased estimate of \(X/Y\) can be obtained by generating samples of \(X/Y\) via statistical bootstrap. 

Incidentally, if \(X\) and \(Y\) are independent zero mean Gaussian random variables, then their ratio has a Cauchy distribution. However, if \(X\) and \(Y\) are not zero mean, then their ratio had a distribution that isn't well known and has a pretty complex distribution. 

10th Annual NYAS Machine Learning Symposium

On March 4, I attended the 10th annual Machine Learning Symposium organized by the New York Academy of Sciences. This was the third time I attended. It is a nice day-long event with a 3-4 keynote talks and several posters and 5-minute talks by selected poster presenters. It was at this venue that I first heard David Blei talk about topic modeling and LDA in 2010. Below are the talks I enjoyed most this time. 


Alex Graves, Google DeepMind
This was the best talk of the day in my opinion. Alex presented a set of results on the role of specifying which parts of the input to pay attention to in deep neural networks. Deep networks naturally learn where to focus their attention for a particular task (e.g. which parts of an image are responsible for detecting the desired output), but being able to provide some side information to the network pertaining to where to focus can improve learning tremendously. The mathematical development of this idea makes use of smooth, differentiable operators for specifying where to focus attention, so that these operators can be included in gradient descent learning. Alex is famous for his work on handwriting recognition using Recurrent Neural Networks, and has a cool widget on his website where you can translate typed text into cursive handwriting in a chosen style. 

Sasha Rakhlin, UPenn
This keynote was more theoretical than Alex’s talk. The setting for Sasha’s talk was predicting attributes of a node in a graph (e.g. a social network), when the input predictors include the graph structure as well as non-graph features of the node, and data about nodes arrives in an online fashion. Training a machine in the offline setting (e.g. a classifier for whether a user in a social network will click on a given ad) using conventional methods runs into a combinatorial explosion. Sasha’s work shows that  the online setting allows for a relaxation that allows for polynomial time learning. 

Notable Posters: 

Multitask Matrix Completion for Learning Protein Interactions across Diseases
Meghana Kshirsagar, IBM Research
The matrix completion problem, exemplified by the Netflix Challenge is that a large matrix with a low rank has very few entries available and the task is to estimate the missing entries. The problem also arises in the context of protein-protein interactions between the proteins of disease causing germs and proteins of a host organism. While there are several known methods for matrix completion (e.g. nuclear norm minimization, Candes et al. ), this setting can also benefit from multi-task learning because often several diseases have similar responsible proteins. Multi-task learning, which is a general idea (not specific to matrix completion) that basically proposes that when multiple problems have portion of the signal in common, then using a combination of a shared representation and a specific representation can lead to better generalization. 

Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS
Boxiang Wang, University of Minnesota
Distance Weighted Discrimination was proposed as a superior alternative to the SVM, because unlike the SVM, the DWD cares about the distances of *all* data points from the separating hyperplane, rather than just the points closest to it. However, the DWD hasn’t yet enjoyed the popularity of the SVM because it require solving a SOCP rather than a quadratic program (SVM). This paper extends theoretical results on DWD (established Bayes risk consistency) and proposes an algorithm faster than second order cone programming. 


Comparing Cybersecurity and Trading in Financial Markets

  1. Both fields benefit from rigorous quantitative modeling for tasks such as detecting patterns and anomalies.

  2. Both financial trading and security operate in an adversarial environment. In finance, the adversary is the ‘market’, i.e. the rest of the participants. In security, the adversary is the set of attackers.

  3. Because of the adversarial environments, simple models do not work for very long, because the adversary evolves in response to actions taken by the other party, and it is hard for a model to capture the direction in which the adversary will evolve. Therefore both fields require that models be updated. Models are built using training data . The challenge therefore is to use the right mix of recent data and longer term historical data. How to do this is not obvious in both fields..

  4. Game theory is a useful tool for modeling problems in both fields because of the adversarial nature of the field.

  5. Modern financial instruments, apart from serving as investment vehicles, are designed for mitigating risk. The business of selling information security products or services is also one of mitigating risk — risk of information loss. There is no such thing as perfect security. Instead, organizations allocate an amount of money they consider appropriate to meet their risk appetite.

  6. One difference is that trading in financial markets often has the ‘self fulfilling prophesy’ property, i.e. it can act as a positive feedback system (if a certain fraction of people with opinion that the price of a security is likely to fall, sell their holdings, the rest start thinking its probably a good idea to sell, and the end result is that the price does fall), whereas there doesn’t seem to be an equivalent in cybersecurity.

A probabilistic Escher waterfall

If a fair coin is repeatedly tossed till either the sequence HHHHH or HTHHT occurs, which of the two sequences is likely to occur first? If you thought the two sequences are equally likely, you’d be wrong. The correct answer is HTHHT. A quick way to see this is that HHHHH requires five H’s to occur in succession and the occurrence of a single T destroys any progress made towards obtaining the sequence. On the other hand, for HTHHT, if after obtaining HTH, the 4th toss results in a T instead of an H, then the HT formed by the 3rd and 4th tosses can serve as the starting ‘HT’ of the same sequence.  

[Taken from:  Konold, Clifford. 1995. Confessions of a coin flipper and would-be instructor, American Statistician 49: 203–209]

One can analyze such problems graphically using state diagrams. For example, suppose we repeatedly toss a fair coin till either HHH or THH appear. Which is likely to appear first? The state transition diagram below represents the ‘race’ between the two sequences

It can be seen why THH is more likely to occur first: once the sequence gets to the state ‘T’, there is no path that leads to the state HHH.

Walter Penney, in his article in the Journal of Recreational Mathematics (October 1969, p. 241) showed a curious property of sequences of three (or more) tosses of fair coins: for any sequence of length three, one can find another sequence of length three, that has a higher probability of occurring first in repeated tosses of a fiar coin. This is fascinating, and not at all intuitive, because it leads to circular loops like the following: THH is less likely to occur before TTH (2 to 1), which is less likely than HTT (3 to 1), which is less likely than HHT (2 to 1), which is less likely than THH (3 to 1)!

Random walks in Manhattan

"Random Walks in Manhattan", pixels in Python and NumPy.

The black curve is a pure random walk in two dimensions with independent zero mean Gaussian increments of equal variance in each dimension. The red curve is an Ornstein Uhlenbeck process in two dimensions with mean at (0,0). Unlike a pure random walk, an Ornstein Uhlenbeck process experiences a pull back towards the origin whenever it wanders away from it in any direction, and therefore the OU process does not travel as far and wide as the random walk. Each image shows a trajectory that is a subset of the previous, showing \(2 \times 10^n\) steps, with \(n = 1,\ldots,6\). Notice how, when the number of steps is small, the RW and the OU processes look very similar, but with a larger number of steps, the mean reverting property of the OU process makes its trajectory look very different. 


"Gaussian script"

The piece above represents a new random alphabet. It was drawn by generating a sequence of 20 sample paths of a Gaussian Process in two dimensions with covariance kernel given by a squared exponential \(e^{-d^2/\lambda}\) where d is the distance in input space and \(\lambda\) is a relaxation constant. Note that the Ornstein Uhlenbeck process is also a Gaussian Process, with a covariance kernel given by the related absolute value exponential \(e^{-|d|/\lambda}\).


An afternoon with Geostatistics

Problem: Given a spatial field in 2 dimensions and time in one dimension, measurements are made at a few points in the three dimensions, and values at certain other points in the spatio-temporal field are required. 

Spatial Kriging. The 2-dimensional spatial version of the problem arises in several contexts. An oil company drills a few holes in the Earth’s crust for oil exploration and acquires a random variable that indicates the drilling potential at that spot. Each hole drilled costs a large sum of money. How best can the company estimate the drilling potential at other nearby spots without any further drilling? One way is to employ Gaussian Process Regression, where one places as Gaussian Process with a selected covariance kernel as a prior over the 2D space, adds the information gained by drilling at the few spots, and then computes a posterior distribution over the entire 2D field. This idea is termed kriging in geostatistics, after Daniel G. Krige, a South African mining engineer, who used similar techniques (weighted averaging, not GP regression) for measuring Gold grades in mines in South Africa, and whose work was taken up and formalized later by the French Mathematician Georges Matheron. 

The squared exponential Kernel is a popular choice for modeling covariance that depends only upon distance between two points and decays quickly. $$k(x_1, x_2) = \exp{\frac{(x_1-x_2)^2}{h^2}}$$

Spatio-temporal Kriging. The original problem above applies to a time varying spatial field. The idea of spatial kriging is naturally extensible to include the time dimension, if care is taken to ensure that the covariance kernel of the Gaussian Process degrades at an appropriate rate in the time dimension, which is chosen carefully, and separately from the the spatial dimension. One convenient alteration is to simply alter the bandwidth parameter of the temporal dimension. 

$$\exp{\frac{(x_1 - x_2)^2 + \alpha(t_1 - t_2)^2}{h^2}}$$ which of course turns out to be equivalent to simply using a product of two independent covariance kernels, one for space and one for time: $$\exp{\frac{(x_1 - x_2)^2}{h_x^2}} * \exp{\frac{(t_1 - t_2)^2}{h_t^2}}$$There is the question of whether such a kernel that is a product of two kernels, is guaranteed to product a positive-semidefinite matrix as the covariance metrix when arbitray values of \(x_1, x_2, t_1, t_2\) are plugged in. It is easy to show that the product of two valid kernel functions \(k = k_xk_t\) is also a valid positive-semidefinite kernel. This follows from observing that the covariance matrix for the kernel \(k\) is the Hadamard product (element-by-element product) of the covariance matrices resulting from the two component kernels. If \(K_x\) is the covariance matrix of a random vector \((x_1, x_2, \ldots, x_n)\) and \(K_t\) is the covariance matrix of a random vector \((t_1, t_2, \ldots, t_n)\), then the Hadamard product of \(K_x\) and \(K_t\) must be the covariance matrix of the random vector \((x_1t_1, \ldots, x_nt_n)\), and must therefore be positive semidefinite and hence \(k = k_x k_t\) must be a valid kernel.

In fact the product form of the kernel need not be limited to both kernels being of the squared exponential type. The temporal kernel can for instance be taken to be one where longer term correlations at the annual, monthly or weekly level can be exploited by the GP regression. This can be achieved via a sinusoidal temporal kernel or a product of a sinusoid with a slowly decaying exponential. Here is a cookbook of kernel functions from David Duvenaud, and there are several others in chapter 4 of the GPML book of Rasmussen & Williams. 

Notes on non-parametric regression & smoothing

Ordinary least squares regression analysis makes a few strong assumptions about the data:

• A linear relationship of y to the x’s, so that $$y | x = f(x) + \epsilon$$ • That the conditional distribution of y is, except for its mean, everywhere the same, and that this distribution is a normal distribution $$y \sim \mathcal{N}(f(x), \sigma)$$ • That observations are sampled independently, so the \(y_i\) and \(y_j\) are independent for \(i \neq j\).

There are a number of ways in this things can go wrong, for example:
• The errors may not be independent (common in time series data)
• The conditional variance of the residual error may not be constant in \(x\)
• The conditional distribution of \(y\) may be very non-Gaussian.

There are a few ways out, centering around non-parametric methods.  

Kernel Estimation: This is really a non-parametric smoothing method. The y value estimated for a given input x depends upon the weight of each labelled data point which in turn depends upon the distance between the input x and the labelled data points. 

$$\hat{y} = \frac{\sum_i w_i y_i}{\sum_i w_i},$$ where \(w_i = K(\frac{x - x_i}{h})\). \(K(\cdot)\) is a Kernel function, commonly taken to be the Gaussian Kernel, and \(h\) is a bandwidth parameter that dictates the extent to which labelled data points exert their influence in the weighted average. The weighted estimator is also called a Nadaraya-Watson Estimator. Unlike LOESS however, there is no fitting of parameters involved. The y value is simply the weighted average of the y values of all other labelled data points.

LOWESS / LOWESS: A non-parametric method (i.e. the mount of data one needs to keep around for a model specification grows linearly with the amount of training data) where, for every new input x for which the value y is desired, one computes a weighted version of the quadratic loss used in OLS linear regression. A weight is attached to each training data point, and the weight of a training point depends upon its the distance from the new input x.

SVM regression: Transform data to a higher dimension using the kernel trick and then perform linear regression using a specific type of loss function (epsilon - insensitive) rather than the usual quadratic (squared loss) function. The epsilon-insensitive loss function encourages the support vectors to be sparse. Imagine doing polynomial regression in the original space by including polynomial terms of the data. The SVM regression achieves a similar effect via the non-linear kernel transform. 

Gaussian Process regression: Also a non-parametric method, somewhat similar to LOESS in spirit in that all the data points need to be kept around (as in any non-parametric method), but this is a Bayesian method in that it allows not just a point estimate of y for a new input x, but a complete posterior distribution. This is achieved by placing a prior distribution on the values of a Guassian Process (A GP is simply a collection of random variables, any subset of which have a joint Gaussian distribution) via a covariance kernel function that only specifies the covariance between two Guassian Random variables in the GP given their x values (typically, only the distance between their x values matters). The posterior distribution of Y values for a given set of input X values also turns out to be a Gaussian distribution with an easily computable mean and covariance matrix. 

Unemployment Rate: The Devil is in the Details

The recent recession seems to have had a revealing effect on the economy. While the unemployment rate as of Dec 2014 is back to what it was in June 2008 (5.6%), the employed to population ratio as of Dec 2014 (59.2) is not what is was in June 2008 (62.2). Also, wile the former has dropped by about 5% since the official end of the recession, the latter ha barely moved up by 1%.

What explains the discrepancy? I think it reflects the manner in which the commonly reported unemployment rate is measured -- i.e. based on the people seeking work. If the number of people seeking work decreases, then the unemployment rate decreases, even if those people chose to remain unemployed, but the employed-to-population ratio still falls. 

Detecting Anomalies

In several fields, it is often required to detect the occurrence of an anomalous event while monitoring a stream of data. I first came across this problem in the context of network security during my time at AT&T Labs. In network security, the goal is to quickly identify suspected malicious activity, but it is difficult to know in advance what a problematic pattern will look like, especially when dealing with high dimensional data. On the other hand, it is typically easy to characterize non-abnormal data. 

I've found people sometimes mistakenly treat this problem as a 2-class classification problem. The 2-class approach is unsuitable because data corresponding to anomalous events may not arise from the same distribution each time an anomalous event occurs, and/or one is unlikely to have sufficient labelled samples of anomalous events to estimate the distribution of anomalous data properly. Therefore if one tries to apply a traditional classifier to this problem, one is likely to experience an unacceptably high missed detection rate (think of a SVM margin-maximizing hyperplane that positions itself between the nearest points from the non-anomalous class and a few points from the non-anomalous class. Since points from the anomalous class are few and/or they are drawn from different distributions, the hyperplane resulting from the SVM optimization is positioned closer to the anomalous points than it should be, and hence, an anomalous point may be wrongly classified as an anomalous point).

One approach to dealing with this problem is to focus on the distribution of the non-anomalous data. The problem then boils down to determining whether a given new test data point is drawn from the distribution of non-anomalous data or not. Except for the simplest of problems, the distribution of non-anomalous data may well be multi-modal, and we may not know the number of modes. The one-class SVM comes in handy here. The 1-class SVM basically attempts to estimate the density of the non-anomalous data non-parametrically by discovering a region in feature space that captures most (a user-supplied fraction) of the probability mass. It does this by finding a hyperplane (which is curved in the original space but linear in the original space -- thanks to the standard SVM kernel trick) that forces the specified fraction of data points to lie inside closed regions formed by the hyperplane in the original space. Happily, the 1-class SVM does this non-parametrically and can discover multi modal densities. The anomaly detection problem then boils down to determining whether a given new test data point lies inside or outside the regions of high-density of the non-anomalous data points.

The folks at scikit-learn have a good implementation of the 1-class SVM.  

Constructing a good anomaly detector is still a bit of an art because it requires proper selection of features to monitor. In a very high dimensional space, the curse of dimensionality is likely to make it difficult to build an anomaly detector with a small enough false positive rate. 

Religion, GDP and Happiness

The percentage of a country's population that feels religion plays an important role in their lives (x-axis) is negatively correlated with the GDP per capita (Y-axis, in log-scale):

Here are the heatmaps, first for religion:

and per capita GDP (PPP):

From the heat maps above, several countries in Africa and Asia stand out as having low per capita GDP and high importance of religion, while all of Scandinavia and Australia stand out as places with high per capita GDP and low importance of religion. The US is somewhat of an outlier with with highest per capita GDP in the world of ~$35k and about 65% of the population reporting that religion plays an important role in their lives.

For completeness sake, here is the heat map for life satisfaction, or long-term happiness:

Combining data on religion and GDP with data with data on long term happiness, we get roughly the following picture:

  1. Long term happiness correlates positively with per capita GDP.
  2. Long term happiness correlates negatively with importance of religion.
  3. Importance of religion correlates negatively with per capita GDP.

My guess is that the causal structure between the 3 variables: Religion (R), Happiness (H) and per capita GDP is the following:

When per capita  GDP is high, the basic needs of life are met, people are relatively comfortable, and therefore long term happiness is high, but for the same reason, very few people feel compelled to ask fundamental questions of the type that people turn to religion for. This has the effect of creating a negative correlation between Religion and Happiness. In the causal structure above, religion and happiness are related, but given per capita GDP, they are independent.

Creativity is Just Connecting Things

"Creativity is just connecting things. When you ask creative people how they did something, they feel a little guilty because they didn’t really do it, they just saw something. It seemed obvious to them after a while. That’s because they were able to connect experiences they’ve had and synthesize new things. And the reason they were able to do that was that they’ve had more experiences or they have thought more about their experiences than other people. (...)

Unfortunately, that’s too rare a commodity. A lot of people in our industry haven’t had very diverse experiences. So they don’t have enough dots to connect, and they end up with very linear solutions without a broad perspective on the problem. The broader one’s understanding of the human experience, the better design we will have."

- Steve Jobs

Lack of price discovery in the Healthcare Market

Price discovery is

... the process of determining the price of an asset in the marketplace through the interactions of buyers and sellers.

Buyers and sellers of a good or service set the price by expressing how much they are willing to buy or sell for. When a very large number of buyers and sellers interact in this way, the final price of the good or service settles to one that incorporates all this information. However, in the healthcare market, this mechanism is absent.

When you are sick and you visit a doctor, you do not get to ask up front what you will be charged for seeing the doctor! Instead, you present your health insurance card to the receptionist, thereby disclosing information about your ability to pay, and creating information assymetry. The other party (i.e. the healthcare provider) can now select a price , taking into account the information you have just disclosed, and that too, after you have made use of their services, rather than before. This is akin to getting to know the price of a haircut in a barber shop, after you have had the haircut. There is no price discovery mechanism in the healthcare industry in US. I recommend:

  1. Providers should not be allowed to see the insurance coverage a patient has. Each patient should simply be treated. Claims and payments should be handled by a third party, such as a government agency, that acts as a mediator between the hospital and the patient.
  2. Providers should be mandated to disclose a menu of charges, like a restaurant or a barber.
  3. Providers should charge the same fee, irrespective of whether a patient has insurance or not, or how much insurance coverage she has.

Ostensibly, providers check a patient's insurance card in order to make sure that the patient can pay for the services she is about to purchase. Instead, this function should be done by a 'lookup' over the internet on a server run by a government agency: Input = 'name of the medical procedure', Output = 'Covered or not'. This will prevent the provider from learning about the extent of coverages available to the patient. Not learning about this will force providers to more honestly assess fees, and compete with each other more realistically, bringing down costs borne by the patients. I think this will also prevent providers from passing on large unnecessary fees to insurance companies.

A formal mechanism design formulation of the problem would be interesting to tackle. 4 players: patients, providers, insurance companies, government.

Understanding Happiness

Happiness is as hard to define as it is to achieve. Everybody wants to be happy. Even masochists. I think it is best to use  a non-constructive definition:

Happiness is the goal that drives all human actions and desires.

If long term happiness if everybody's ultimate goal, then it is worth learning how to achieve long term happiness. In fact, if being happy is the ultimate goal (as opposed to say, being wealthy), then our education system should also be teaching us how to be happy over a life time, rather than purely technical or vocational skills. Simple GDP growth does not imply an increase in the happiness of a society -- as indicated by data from the last ~40 years in the US, comparing per capita GDP and happiness levels:

Screen Shot 2012-12-17 at 3.09.43 PM

While per capita GDP has risen more or less steadily, happiness levels have remained more or less stagnant in the last ~40 years.

Should countries develop public policy with the goal of making a society happier, rather than with the goal of increasing GDP? I think it is an idea worth exploring (Scandinavian countries seem to rank highest in in the world in happiness scores, despite high taxes). The government of Bhutan came up with the Gross National Happiness index, which measures the average life satisfaction of the citizens in a country.


This correlates well with health, access to education, and wealth (GDP per capita). At any given time, the relationship between average happiness of a country and per capita GDP seems to log-linear, meaning that happiness is roughly linear in the log of the per capita GDP.

Screen Shot 2012-12-15 at 6.07.29 PM

This is because in order to increase the happiness level of a society by 1 unit, the increase in wealth required is proportional to the current welath. For e.g., if the required amount of increase in personal wealth for a group with per capita income of $1000 is $x, then it is $10x for a group with per capita income of $10,000.

Near the end of this talk, Daniel Kahneman says that in a study done with the Gallup organization, he found that:

Below an income of … $60,000 a year, people are unhappy, and they get progressively unhappier the poorer they get. Above that, we get an absolutely flat line. … Money does not buy you experiential happiness, but lack of money certainly buys you misery.

Kahneman distinguishes between two types of happiness: that of the experiencing self and that of the reflecting self. It is possible to be happy in the experiencing self but have a poor happiness score when reflecting on a long time frame in the past, and vice-versa. For the type of happiness that measure life satisfaction in retrospect, there is no flat time -- i.e. it continues to increase with increasing wealth. I don't find this too surprising. It is the difference between short term and long term happiness. It is easy to be happy in the short term at the expense of the long term. On the other hand, tolerating displeasure during hard work in the present can have a huge payoff in long term happiness in the future.

In this TED talk, Dan Gilbert showcases his research that shows that happiness can be synthesized by individuals. So happiness is not some finite resource that needs to be distributed among people, instead one can simply choose to be happy, despite seemingly adverse conditions. This is fascinating, because it provides experimental evidence that happiness has to do not just with our external circumstances (such as GDP per capita), but also with how we process information in our minds. Several religions have the concept of expressing gratitude. The act of being grateful basically synthesizes happiness out of thin air.

Defining Intelligence

Intelligence is a slippery thing to define. The following definition recently struck me:

That which allows one to maximize happiness over long term.

I like this definition because it is short (c.f. MDL, Occam's Razor), it makes logical sense, and it carries a lot of meaning without going into details of how to be intelligent. It is logical to me because of the following argument: Suppose a person is allowed to life two versions of his life starting from some fixed point in his life. All events and circumstances in the two versions are the same except for actions taken by the person. Then he can be said to be more intelligent in that version of his life in which he achieves greater happiness over the rest of his life.

Intelligence is needed in order to understand what actions will make us happy, for how long, and whether there will be any effects of those actions on our future happiness. Making decisions to maximize cumulative happiness is certainly a non-trivial task. Sometimes one must put oneself through short-term adversity (e.g. graduate school at little on no stipend, or an athlete undergoing gruelling training for a race) to be able to do well later. Sometimes, one decides to undertake an action that provides short term happiness, but at the cost of long term happiness. It takes intelligence to learn to avoid such behaviour n the future.

Modern definitions of intelligence from the scientific and psychology community are incredibly long-winded [Wikipedia]

A very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings—"catching on," "making sense" of things, or "figuring out" what to do.


Individuals differ from one another in their ability to understand complex ideas, to adapt effectively to the environment, to learn from experience, to engage in various forms of reasoning, to overcome obstacles by taking thought. Although these individual differences can be substantial, they are never entirely consistent: a given person's intellectual performance will vary on different occasions, in different domains, as judged by different criteria. Concepts of "intelligence" are attempts to clarify and organize this complex set of phenomena. Although considerable clarity has been achieved in some areas, no such conceptualization has yet answered all the important questions, and none commands universal assent. Indeed, when two dozen prominent theorists were recently asked to define intelligence, they gave two dozen, somewhat different, definitions.

The same Wikipedia page also lists various different definitions given by researchers. The long-windedness of these definitions is somewhat excusable as an attempt to be all-inclusive and general. But in the end, the notion of intelligence is a man-made model, invented to try and explain phenomena. I think a focus on happiness as the central phenomenon to be explained goes a long way in simplifying our understanding of intelligence.

Average age of first-time Mothers

The average age of first time mothers in the developed countries of the world has been rising for the last ~40 years.

Here is another plot that shows the rate of occurrence of Down Syndrome, a chromosomal defect, as a function of the age of the mother at the time of child birth.

The curve really starts to shoot up at 30. In the UK, the average age of a first time mother is 30 years. It is well known that the fertility rate in women decreases after the age of 30 and drops rapidly after 35. Older mothers are likely to find it harder to have a baby and if they do, then they run a higher risk of chromosomal defects. Given the possibilities of all these negative consequences, the increase in the average age is a bit disturbing. It seems like there is a hidden cost to more women working and for longer.

Why is it that women are waiting longer before having their first born despite the risks? Most of my hypotheses (of which more than one, or none, may be true) have to do with women working:

  • Having invested significantly into an education, greater number of women are entering the workforce, with the desire to be financially independent.
  • There is greater financial pressure in families for women to work.
  • The policies of workplaces in these countries are not favourable to childbirth. I can see this is true in the US, but I doubt this holds for Wester European countries, which I know have policies favorable to child birth.

One source of further information is the following map, showing the absolute increase, in years, in the age of a first time mother, over the last 40 years, state by state in the US:

This number is highest in the North Eastern states of NY, NJ, MA, CT, etc. The intensity of the colors in the map above correlates well with population density, and with economic activity in general (meaning more working women).  Here are two more plots I came across in a US based study done by P&G, that suggest that at least in the US, employer policies may be responsible.

What would a mother value most from their employers?:

How much guilt to mothers feel about work-life balance?: