A paper is not scholarship

A word is not the same as the thing it represents; it is only a representation, meant to convey information about the thing. For example, the word 'castle' is not the same as a castle. This is obvious. Similarly, a map of a place is not the same as the place itself; it is only a map of it. This is also obvious. An academic paper published at a conference or in a journal is not scholarship! It is a presentation of the scholarship. The scholarship itself is the research work. Somehow, this is not so obvious to a lot of people.

Fresh stock of puzzles

The past few weeknights have been spent solving puzzles from Martin Gardner's books, which I really enjoy. I just came across a book by Peter Winkler, whose name I had heard before in talks by Muthu. After reading a few of the puzzles in Amazon's 'Look Inside' preview, I found the book too hard to resist buying. Also, here [PDF] is a beautiful set of 7 puzzles compiled by Peter Winkler on the occasion of a remembrance for Martin Gardner in 2006. Other places I have found good puzzles recently include this page maintained by Gurmeet Singh Manku. Puzzles + coffee + weekend afternoon = awesomeness.

Comparing Cybersecurity and Financial Trading

  1. Both fields benefit from rigorous quantitative modeling for tasks such as detecting patterns and anomalies.
  2. Both financial trading and security operate in an adversarial environment. In finance, the adversary is the 'market', i.e. the rest of the participants. In security, the adversary is the set of attackers.
  3. Because of the adversarial environments, simple models do not work for very long: the adversary evolves in response to actions taken by the other party, and it is hard for a model to capture the direction in which the adversary will evolve. Therefore both fields require that models be updated. Models are built using training data, so the challenge is to use the right mix of recent data and longer-term historical data. How to do this is not obvious in either field.
  4. Game theory is a useful tool for modeling problems in both fields because of the adversarial nature of the field.
  5. Modern financial instruments, apart from serving as investment vehicles, are designed for mitigating risk. The business of selling information security products or services is also one of mitigating risk -- risk of information loss. There is no such thing as perfect security. Instead, organizations allocate an amount of money they consider appropriate to meet their risk appetite.
  6. One difference is that trading in financial markets often has a 'self-fulfilling prophecy' property, i.e. it can act as a positive feedback system (if a certain fraction of participants who believe the price of a security is likely to fall sell their holdings, the rest start thinking it's probably a good idea to sell too, and the end result is that the price does fall), whereas there doesn't seem to be an equivalent in cybersecurity.

Machine learning and Programming Languages

I have been thinking quite a bit about the best language to write machine learning algorithms in. While this depends on the exact algorithm in question, there are certain constructs and tasks that show up in many algorithms. As I work through this, I'll publish my notes here on this blog as I go along. There seems to be a fairly clear tradeoff between the speed of computation and the speed of development. For example, the speed of computation is best with low-level languages like C/C++, but writing code can be cumbersome and time-consuming. On the other hand, high-level languages like R and Python offer very quick development times and have a good number of libraries/packages developed by the community, but have very poor computational performance. For this reason most professional environments seem to adopt the approach of first hacking in a high-level language like R and then writing production-quality code in C or C++. I have been hearing this quite a bit recently, both within AT&T and from friends doing machine learning work in other industries, and most recently at the NY Linux user group's meeting about R. The tradeoff between performance and development time is important for me at AT&T because I often need to deal with very large volumes of data, both in terms of sample size and dimensionality.

Programming languages are known to be good at specific things and are not necessarily generalizable to all types of tasks. I will use this post as a working document to summarize best practices in picking the right languages/features for writing and running machine learning algorithms, and why. Often, the type of computation needed by a class of machine learning algorithms (e.g. MCMC methods require repeated sampling in a sequential manner, making parallelization difficult) suggests the type of language features that would be useful. Other times, the mapping is not so obvious.

Matlab/Octave: Matlab is expensive for individual users but quite powerful; the many toolboxes that are available (for a price) make it worthwhile. Fast development time. Octave is a free, open-source alternative that is largely compatible with Matlab, but does not include all of its functionality. I don't know if special toolboxes for Octave exist.

Java: I have no real experience with writing machine learning algorithms or using other people's ML code in Java.

R: Open source, big community, large number of packages, slower than Python and Java, similar in speed to Octave. The large number of packages available makes it very attractive as the language for initial hacking on a learning problem. R also has very good built-in visualization. All these factors make R very useful for interactive work with data.

Python: Python is my personal favorite language to write code in, so I may be biased, but there are good technical reasons for that bias. Python is open source, it has a huge user community, and a number of good packages for machine learning. Matplotlib/pylab are packages that offer good, solid visualization, IMO comparable to R's visualization. The design of Python stresses code readability. It is a full-fledged language offering simple object-oriented design. The only downside to Python is its execution speed. In many cases, this can be overcome by careful optimization of the data structures. Various efforts have been made to improve the speed of Python. PyPy is a just-in-time compiler for Python that can sometimes speed up your Python code, depending on what it does. Cython allows for low-level optimization using C by first translating Python code to equivalent C code.
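To make the 'careful optimization' point concrete, here is a toy sketch of my own (assuming NumPy is installed) comparing a pure-Python loop with its vectorized equivalent; the vectorized version pushes the inner loop into compiled C code, which is often all the speedup a prototype needs.

```python
# Toy comparison: summing squared differences with a pure Python loop
# vs. the vectorized NumPy equivalent.
import time
import numpy as np

n = 10**6
a = np.random.rand(n)
b = np.random.rand(n)

start = time.time()
total = 0.0
for x, y in zip(a, b):            # interpreted loop over every element
    total += (x - y) ** 2
print("pure Python loop: %.3f s" % (time.time() - start))

start = time.time()
total_vec = np.sum((a - b) ** 2)  # single vectorized expression
print("vectorized NumPy: %.3f s" % (time.time() - start))
```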

C/C++: Best in performance. Use only if (i) other approaches prove too slow for the problem at hand, (ii) a well-tested library is already available (there are lots of half-baked, buggy libraries out there), or (iii) the code needs to go into production or a real-time system in which performance is important (e.g. finance). Some good libraries: SVMlight, liblinear. The Boost set of libraries usually proves pretty useful, especially when dealing with a learning algorithm that affords parallelism. C and C++ are often clubbed together in discussions of programming languages, but there is a lot of difference between the two. Personally, I have spent more time with C than C++. C is cleaner than C++. In terms of ease of coding and readability, C++ is an ugly language. Here is what Linus Torvalds had to say about C++.

Torch 7 is a recently (NIPS, Dec 2011) released library in C++/GPL for writing ML algorithms that interfaces with a scripting language called Lua. In their words, it provides "a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language (Lua) and a underlying C implementation." I am told that Torch 7 + Lua is a good combination for writing neural networks, but I haven't verified this myself yet. I wish they had interfaced Torch 7 with Python instead of Lua, though the authors give their own reasons for choosing Lua.

For neural networks, I have used R's nnet package (very mature, included in the base distribution), and toyed briefly with the Lisp-like language Lush (also mature, but with a steep learning curve if you haven't played with functional languages like Lisp/Scheme before), developed by Yann LeCun.

Vowpal Wabbit is another C++ library (with Boost as a dependency) focused on online learning, which I haven't tried yet but have heard about quite a bit. The authors of VW claim high speed and scalability. Will post an update here after I try it.

The dream language for machine learning: fast development, very good performance. Close to English (like Python) but compiled and optimized (like C) instead of interpreted. It sounds like these qualities have little to do with machine learning specifically; they are desirable for most tasks.

Writeups by other people on the topic of best programming languages for machine learning (I will continue to add to this list as I find them):

  1. Hoyt Koepke on why Python rocks for use in machine learning (and scientific computing in general).
  2. Two writeups by Darren Wilkinson on empirical evaluations of R, Java, Python, Scala and C for MCMC based learning, in particular for Gibbs Sampling (a type of MCMC algorithm).
  3. John Langford has a nice writeup discussing various aspects important to the choice of language for machine learning.

MLcomp

One of the problems for practitioners (as well as researchers) of machine learning is that there is no well-defined 'frontier' of machine learning algorithms that perform 'best' in a given problem domain. Even though there are some guidelines as to what is likely to work well in a given setting, this is not guaranteed, and often other approaches outperform well-known ones. In fact, a large number of papers at machine learning conferences seem to be based on this kind of surprise ('look, I tried an autoencoder on this problem and it outperforms the algorithms people have been using for decades!'). It is hard to have a universal frontier of this type because much of the performance of a machine learning algorithm depends upon the structure of the data it is run on. For this reason I was very happy to come across MLcomp.org, a website that allows people to upload algorithms (coded up in any language) and datasets, compare the performance of their algorithm on all the datasets in the system, and compare the performance of any of the algorithms in the system on their own dataset. This is immensely useful because it serves as an empirical frontier of the performance of machine learning algorithms on different types of datasets. MLcomp connects to computational resources hosted on a standard Amazon EC2 Linux instance, allowing users to test algorithms on datasets. And it is free. There are at least a couple of reasons why something like MLcomp should be adopted by more people (current usage seems low, given how useful I think it can be). First, the utility of such a platform is proportional to the number of algorithms and datasets hosted on it, i.e. proportional to the number of people using and contributing to it. I feel that if a certain threshold number of people begin using it actively, its utility as a platform will take off rapidly after that. Second, it can prove to be an invaluable resource for ensemble-based learning: upload your dataset, run a number of algorithms on it, download the code of the algorithms that seem to do well, and use them in an ensemble to improve the performance of your supervised classifier/regressor.

Debugging the process of applying a machine learning algorithm to a dataset

Getting one's hands dirty with applying machine learning theory to a real-world dataset is really quite different from simply understanding the theory behind machine learning algorithms. Often, the experience of learning a regressor or a classifier from a real-world dataset goes something like this: You have a nice big dataset. You have cleaned it up, normalized it, and are ready to apply your chosen algorithm in your chosen programming language/statistical computing framework. You find that the generalization error of the model you have learnt is really poor. What do you do next? You can choose a different learning algorithm, one that is either more complex or less complex. Or, you can try to supply more training data to the same algorithm in the hope that the generalization error improves. What you do next largely determines whether or not you are going to succeed in learning well from the dataset.

It turns out that there is a principled approach to figuring out how to navigate the set of available choices (more/less complex algorithm, more training data). Computational learning theory provides insights into what must be done next. Andrew Ng has a really nice section in his Machine Learning class in which he goes over the most problematic parts of applied machine learning research. According to Andrew, a lot of inexperienced practitioners and startups using machine learning are unable to answer this question, and go about making ad-hoc choices, wasting precious time in the process. I completely agree. I suspect that understanding this aspect of machine learning is also what separates the winners from the rest in machine learning competitions.

The plot on the left shows what happens as the number of training samples is increased with a fixed model. The error on the training set actually increases, while the generalization error, i.e. the error on an unseen test sample, decreases. The two converge to an asymptote. Naturally, models of different complexity have asymptotes at different error values. In the plot above, model 1 converges to a higher error rate than model 2 because model 1 is not complex enough to capture the structure of the dataset. However, if the amount of training data available is less than a certain threshold, then the less complex model 1 wins. This regime is important because often the amount of training data is fixed -- it is just not possible to obtain any more.

The plot on the right shows the change in training error and generalization error as the model complexity is increased while the size of the training set is held constant. The training set error decreases monotonically, but the generalization error first falls and then increases. This occurs because, as the model gets progressively more complex, at some point it begins to overfit the training data. That is, given the larger number of degrees of freedom in a more complex model, the model begins to adapt to the noise present in the dataset, which hurts generalization error. One way to get around this is to regularize the model -- i.e. somehow constrain the model parameters, which implicitly reduces the degrees of freedom. If the learning problem can be framed as an optimization problem (e.g. find the minimum mean squared error), then one way to regularize is to alter the objective function by adding a term involving the model parameters. Regularization is a flexible means of controlling the model's capacity so as to avoid overfitting or underfitting.
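As a concrete sketch of the 'add a term involving the model parameters' idea, here is ridge regression in NumPy, where the penalty weight lam shrinks the weights and thereby reduces the effective degrees of freedom. This is a toy example of my own, not tied to any particular dataset.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form.

    A larger lam shrinks the weights, reducing the effective degrees of
    freedom of the model and hence the tendency to overfit.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy usage on noisy linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```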

If application of an ML algorithm shows that training and generalization error rates have leveled out, close to their asymptotic value, then adding more training samples is not going to help! The only way to improve generalization error would be to use a different, more complex model. On the other hand, if all the available training samples have been used, and increasing the model's complexity seems to be increasing the generalization error, then this points to overfitting, and increasing the model complexity further is not going to help. Generalization error is usually measured by keeping aside part of the training data (say, 30%) and using only the remaining data for training. A related technique, termed k-fold cross-validation, is to split the data into k parts and train k times, each time keeping aside a different part as test data and averaging the resulting error estimates.
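A rough sketch of both procedures, assuming a recent version of scikit-learn (any classifier could stand in for logistic regression here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold-out: keep 30% aside purely for estimating generalization error.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("hold-out accuracy:", clf.score(X_te, y_te))

# k-fold cross-validation: each point is held out exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```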

Gradient descent

Gradient descent is used in a bewildering range of supervised machine learning settings as an optimization method to 'fit' a model to training data. Its use is obvious when the objective function being optimized (which is often some type of average error over the training set) is convex in the variables of optimization (which are usually the free parameters of the model one is trying to fit). However, gradient descent is (somewhat surprisingly) also used when the objective function surface is not convex! For example, in training a neural network using the back-propagation algorithm, the free parameters of the network are updated using stochastic gradient descent over the error surface, which is non-convex because of the non-linearities introduced by the sigmoid (or other non-linear) neurons. Surprisingly, gradient descent seems to work fine. The reason is that when employing stochastic gradient descent, the weight updates are affected by noise in individual data points (this noise tends to get averaged out in batch (conventional) gradient descent), which provides the opportunity to escape from local-minimum wells in the objective-function surface. Due to the non-linear sigmoid functions, the error surface typically contains many local minima, as in the figure below, and these minima are often very close to each other in value. Therefore, even if one ultimately gets stuck in a local minimum, it is often not too poor compared to the global minimum.
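Here is a minimal sketch of the difference between batch and stochastic updates for a simple least-squares objective. This is a toy example of my own; the objective here happens to be convex, but the noisy per-sample update shown in sgd() is exactly what supplies the helpful randomness in the non-convex neural network case.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

def batch_gd(X, y, lr=0.1, epochs=500):
    """Conventional gradient descent: one update per pass over all the data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, epochs=50):
    """Stochastic gradient descent: one noisy update per training sample."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = 2 * X[i] * (X[i] @ w - y[i])  # per-sample (noisy) gradient
            w -= lr * grad
    return w

print("batch:", np.round(batch_gd(X, y), 3))
print("sgd:  ", np.round(sgd(X, y), 3))
```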

There are a bunch of necessary steps one must take to pre-process one's data when employing gradient descent in order to prevent crazy convergence times; a small NumPy sketch follows the list.

  1. Normalize each input feature to have a mean of zero.
  2. Normalize the dataset so that there are no linear correlations between features. This is done by projecting each point onto an orthonormal basis -- this is, effectively, a rotation, which can be accomplished by multiplying the input data by a matrix containing the eigenvectors of its covariance matrix.
  3. Normalize each input feature so that it has the same variance. Doing this allows one to use the same learning rate for each free parameter in the model. If the variance of each feature were different and one chose the same learning rate for each free parameter, then convergence in some directions would be very slow, while convergence in other directions would be quick.
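Roughly, the three steps above in NumPy. This is essentially PCA whitening; a sketch of mine, not production code, and in practice you would use a library routine.

```python
import numpy as np

def preprocess_for_gd(X):
    """Zero-mean, decorrelate, and variance-equalize the features (PCA whitening)."""
    # 1. Subtract the mean of each feature.
    X = X - X.mean(axis=0)
    # 2. Rotate onto the eigenvectors of the covariance matrix, removing
    #    linear correlations between features.
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    X = X @ eigvecs
    # 3. Rescale each decorrelated feature to unit variance, so that a single
    #    learning rate behaves reasonably in every direction.
    return X / np.sqrt(eigvals + 1e-8)

# Sanity check: the covariance of the result is (approximately) the identity.
X = np.random.default_rng(0).normal(size=(1000, 4)) @ np.random.default_rng(1).normal(size=(4, 4))
print(np.round(np.cov(preprocess_for_gd(X), rowvar=False), 2))
```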

Can we de-parameterize Neural Networks?

During the last 3 months, I have been constantly surprised and impressed by the recent achievements of neural networks. In particular, convolutional neural networks (see the work of Yann LeCun) are currently the only technique that allows real-time object recognition from video input with reasonable performance. Another piece of neural network magic that I have been delighted by recently is autoencoders. In particular, a 2006 paper [PDF] in the journal Science by Ruslan Salakhutdinov and Geoffrey Hinton shows how multi-layer autoencoders (also called 'deep autoencoders') can be used to learn non-linear relationships. Yet another cool use of neural networks that I recently came across is paraphrase detection on long sentences using recursive autoencoders -- the problem here is to detect, given two sentences, whether they are the same (or close enough) in meaning -- here is the paper by Richard Socher, whose talk at NYU is where I first heard of this work. However, I have been disappointed by the fact that there is no systematic way of determining the right number of layers, and the size of each hidden layer, that one must use in a neural network. Intuitively, there seems to be a relationship between the number of layers needed and the 'amount of non-linearity' in the problem. For example, suppose one is required to find the classification boundary in N-dimensional space between points belonging to two classes. If this boundary is 'highly non-linear', meaning that it is very contorted, then a deeper network is probably a good idea, because each additional layer gives the network greater capacity for learning a non-linear function: the non-linear units in the hidden layers act as building blocks for the non-linear surface. The issue is the standard one of model complexity versus data complexity -- a more complex model (e.g. a deeper neural network) is more capable of capturing complex data if the data really is complex enough to begin with. If, however, the data isn't that complex, then too complex a model can perform poorly if there isn't sufficient data.
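To make concrete how the architecture is a hand-picked parameter, here is a minimal one-hidden-layer autoencoder in NumPy. This is a toy sketch of my own, nothing like a properly trained deep autoencoder; the n_hidden variable is exactly the kind of choice I am complaining about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D structure embedded (non-linearly) in 10 dimensions.
Z = rng.normal(size=(500, 2))
X = np.tanh(Z @ rng.normal(size=(2, 10)))

n_hidden = 2   # <-- the hand-picked architecture choice discussed above
W1 = 0.1 * rng.normal(size=(10, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=(n_hidden, 10)); b2 = np.zeros(10)
lr = 0.05

for epoch in range(2000):
    H = np.tanh(X @ W1 + b1)            # encode
    X_hat = H @ W2 + b2                 # decode (linear output layer)
    err = X_hat - X                     # reconstruction error

    # Backpropagation through the two layers.
    dW2 = H.T @ err / len(X);  db2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)    # tanh derivative
    dW1 = X.T @ dH / len(X);   db1 = dH.mean(axis=0)

    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2

print("reconstruction MSE:", np.mean((X_hat - X) ** 2))
```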

In a way, the fact that one needs to have some idea of what neural network architecture to use is a bit like the problem with parametric Bayesian methods. The recent development of Bayesian non-parametrics (see, for example, the work of Michael Jordan, Yee Whye Teh, David Blei and Zoubin Ghahramani) is an effort to fix that problem. Just as parametric Bayesian methods require the specification of a parametric family of distributions, thereby limiting the complexity that the model can capture, neural networks require the specification of an architecture, which, if not well tuned to the data at hand, can underfit or overfit the data. The parameters in the case of neural networks are the number, size and type of layers in the network.

It appears to me that a similar effort is needed in the domain of neural networks -- i.e. come up with a systematic approach to estimate how much 'non-linearity' or complexity is present in data, so as to estimate a good neural network architecture for the problem at hand. At present, practitioners seem to mostly rely on trial and error to select an appropriate architecture. I, for one, would use neural networks much more frequently in my research were there a method to estimate a good architecture.

Intelligence is overrated.

Peter Norvig is, by now, well known for having stressed in 'The Unreasonable Effectiveness of Data' that "a decent learning algorithm with tons of data can do much better than a very advanced learning algorithm and little data". Edison is famously reported to have said "Genius is 1% inspiration and 99% perspiration".

I see a kind of parallel between the two trains of thought above. In particular, I have noticed that a lot of people build their careers based on a reasonable amount of intelligence + lots of hard work, while some others build their careers based on sheer intelligence. Invariably, the ones with careers based on hard work are monetarily more successful. In contrast, the ones with careers based on intelligence tend to get lazy after a while. Examples of the intelligence category are tenured professors, and examples of the hard work category are people in sales, consulting, and general management. Notice that the latter category also requires intelligence, but it is the hard work that is directly tied to the payoff. On the other hand, publishing papers sitting in the ivory tower is much more heavy on raw intelligence than hard work.

I feel this observation explains a lot of the difference in (monetary) success rates between the two types of people. It should serve as a relief to people who overrate the value of intelligence.

The best case of course is to have  lots of intelligence *and* be very hardworking. Just like a super cool learning algorithm showered with tons of data. However, a very good alternative is to have reasonable intelligence and a ton of hard work!

Visualizing the IPv4 space using Hilbert Curves

The IPv4-heatmap is an awesome little program created by Duane Wessels. It allows the visualization of the entire 32-bit IPv4 address space in two dimensions, using the idea of a Hilbert curve, a type of space-filling curve -- fascinating constructs in and of themselves. Here is an image from Wikipedia showing an iterative process for creating higher-order Hilbert curves from lower-order ones.

Due to the properties of Hilbert curves, the IPv4 heatmap represents IP addresses that are numerically close to one another as physically proximate pixels. Think of each IP address as a 32-bit integer. Then 123.25.63.231 is just 1 larger than 123.25.63.230 in value, and the two IP addresses probably belong to the same organization. The program takes in a raw text file containing IP addresses and creates an image like the one below. The particular map below shows the intensity of peer-to-peer traffic that I measured during one five-minute segment. Since large blocks of IP addresses have well-known organizations as their holders, the map also displays this information as an overlay.
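For the curious, here is a toy sketch of the underlying mapping in Python, using the standard d2xy routine from the Wikipedia article on Hilbert curves. This is my own illustration; I have not checked ipv4-heatmap's exact pixel layout.

```python
import socket
import struct

def d2xy(order, d):
    """Map distance d along a Hilbert curve filling a 2^order x 2^order grid
    to (x, y) cell coordinates. Standard algorithm from the Wikipedia article."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def ip_to_int(ip):
    """Dotted-quad IPv4 string -> 32-bit integer."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

# Order 16 gives a 65536 x 65536 grid, one cell per IPv4 address. Numerically
# adjacent addresses land on adjacent cells; distant blocks land far away.
for ip in ("123.25.63.230", "123.25.63.231", "8.8.8.8"):
    print(ip, d2xy(16, ip_to_int(ip)))
```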

Machine learning symposium - Oct 21, 2011

I spent the day at the annual machine learning symposium organized by the NY Academy of Sciences. The keynotes were by Léon Bottou, Yoav Freund, and Stephen Boyd. In addition, there were ~30 posters and a number of short talks by some of the poster presenters. I enjoyed Stephen Boyd's keynote the most. He talked about a method for decoupling convex optimization problems in order to perform distributed machine learning on data sets that have a very large number of features and/or training samples. This is particularly relevant to the cluster/parallel computing paradigm. Although I used his textbook in grad school, it was the first time I heard him speak. His talk was lucid, entertaining and seemed effortless. The poster talks that caught my attention were:

  1. A Reliable, Effective Terascale Linear Learning System - Miroslav Dudik (Yahoo! Research). I'd heard a longer description of this work by John Langford at one of the NY ML meetups where John described a system they've developed called 'AllReduce'.
  2. Online Learning for Mixed Membership Network Models - Prem Gopalan (Princeton University). I enjoyed seeing this mainly because of my interest in Latent Dirichlet Allocation and because of a close parallel with some work I've been trying to do with network data.
  3. Planning in Reward Rich Domains via PAC Bandits - Sergiu Goschin (Rutgers University). Sergiu drove home the point of his work beautifully and most aptly by showing a demo of his idea applied to the old Super Mario Bros. Nintendo videogame. The video shows Mario navigating deftly through a level with an artificially high number of obstacles (ducks and other little creatures that Mario isn't supposed to come into contact with). This pushed me to go back and take a look at their work.
  4. Preserving Proximity Relations and Minimizing Edge-crossings in Graph Embeddings - Amina Shabbeer (Rensselaer Polytechnic Institute). Amina's work is on displaying high-dimensional graph data (each vertex of the graph is a point in a high-dimensional metric space) in a lower-dimensional space, while preserving the distance properties as far as possible (this is a standard dimension-reduction problem - PCA, LLE and MDS all seem fair game -- she used MDS) and trying to prevent edge crossings (she addresses this by creating a small SVM-style linear-separability problem for each pair of edges and doing a global optimization). This work was interesting because it is a problem I encounter often when trying to visualize large-scale data at AT&T as a graph. SFDP, an algorithm developed by Yifan Hu of AT&T Labs, addresses the same problem.

Semi-supervised feature learning contest

A research challenge on semi-supervised feature learning, posted by D. Sculley of Google Research and hosted on Kaggle, ended this past Monday. The goal of the competition was to learn a collection of 100 features from a small amount of labelled data (50K points) and a large amount of unlabeled data (1M points) in a 1M-dimensional space. The data in the original 1M-dimensional space was very sparse. The winning entry didn't use the unlabeled data at all! This was a surprise to me because I was convinced that it would be possible to do better by learning a low-dimensional feature representation from the unlabeled data. The winning strategy is described by one of its authors here. Although I didn't participate, I wanted to try building a deep autoencoder to lower the dimensionality. The Science 2006 paper by Hinton and Salakhutdinov addresses this very problem. It describes a method for better training of deep autoencoder architectures: each layer is pre-trained as an RBM, and only then is backprop run on the entire network end to end for fine-tuning. The paper has a number of neat examples of high-dimensional data with low inherent dimensionality, and the authors show how nicely their autoencoders recover the low-dimensional features, better than other popular methods such as PCA, locally linear embedding and multi-dimensional scaling.

I think one reason the deep autoencoder approach didn't feature among the top entries in the Google/Kaggle challenge is that it is not obvious how to pick the right size and number of layers in the autoencoder. In general, there seems to be no solid work on this issue; Hinton and Salakhutdinov do not address it in their 2006 Science paper. A second reason I suspect it didn't show up in the challenge results is that the challenge ran for just about 3 weeks, and it takes significant time to build and fine-tune autoencoder architectures for a specific problem.

Few short links

  1. Intel has decided to shut down its lablets in Seattle, Pittsburgh and Berkeley! This is old news apparently, but I just learnt of it.
  2. I have been doing a lot of thinking about the best model for workspaces. Some of the models are individual offices, cubicles, and clusters of desks. At WINLAB, we had low-walled cubes. Richard Hamming famously stressed the importance of keeping one's office door open if working in a research lab with individual offices. Here is a nice post on the subject from Matt Welsh, who recently moved from being a professor at Harvard to working at Google. At the AT&T security research group where I work, we use an open collaborative workspace with clusters of desks, and I can see that it is more productive than the cubicles at my erstwhile grad school lab.
  3. A set of math puzzles, courtesy Gurmeet Singh Manku.

Paper at ACM Mobisys 2011

My paper titled 'ProxiMate: Proximity-based Secure Pairing using Ambient Wireless Signals' has been accepted at the 9th Annual International Conference on Mobile Systems, Applications and Services (ACM Mobisys), to be held in Washington, D.C. The paper is about how existing wireless signals, such as FM radio and TV signals, can be used by radios in close proximity to generate a shared secret, and then use that secret to communicate securely. By building a shared key in this manner, it is hoped that the problems of identity spoofing and eavesdropping on wireless links can be addressed. The principles at work in ProxiMate are: (i) small-scale fading: the wireless channel between a distant RF source (such as an FM radio tower) and a receiver behaves as a random process, thereby serving as a source of randomness, and (ii) receivers that are within half a wavelength of each other observe highly correlated (but not identical) versions of these random processes, giving wireless devices in proximity access to a shared source of common randomness that is not available to any sufficiently distant adversary. ProxiMate is a proof-of-concept system built on software-defined radio and evaluated using ambient FM and TV signals. It is intended to be information-theoretically secure -- meaning the key generation phase does not need to assume a computationally bounded adversary -- as opposed to, say, Diffie-Hellman, which must assume a computationally bounded adversary in order to guarantee that the observations available to the adversary will not allow it to compute the key. On a side note, D.C. is one of the more boring cities in which to have a conference once you have done the museum circuit. Anyhow, looking forward to it!
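As a toy illustration of the 'shared correlated randomness gives mostly-agreeing bits' idea (this is emphatically not the actual ProxiMate protocol, which also needs information reconciliation and privacy amplification), consider:

```python
import numpy as np

rng = np.random.default_rng(42)

# A common ambient 'channel' signal seen by two nearby receivers (Alice and
# Bob) with small independent noise, and by a distant eavesdropper (Eve)
# whose channel is essentially independent.
common = rng.normal(size=1000)
alice = common + 0.1 * rng.normal(size=1000)
bob   = common + 0.1 * rng.normal(size=1000)
eve   = rng.normal(size=1000)

def to_bits(x):
    # Quantize each sample to one bit: above or below the median.
    return (x > np.median(x)).astype(int)

a, b, e = to_bits(alice), to_bits(bob), to_bits(eve)
print("Alice/Bob bit agreement:", np.mean(a == b))   # close to 1
print("Alice/Eve bit agreement:", np.mean(a == e))   # close to 0.5 (coin flips)
```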


Information asymmetry in the healthcare system

I suspect one of the problems in the health care system in America is information asymmetry. The moment a customer hands over an insurance ID to a doctor or hospital, the clinic knows a significant amount of information about the patient: the patient's health history, financial situation and type of insurance plan. The average patient knows almost nothing about the internal workings of the hospital. Try to think about how many bits of information the clinic possesses about the patient, and how many bits the patient possesses about the clinic. From a purely rational point of view, it may be in the best interest of the clinic to overcharge the patient for services he doesn't really need, knowing that his insurance will cover it. Even if the clinic gets caught doing this once in a while, the odds of getting caught may be low enough that the clinic decides to pursue such a policy in the long run.

One simple change that could be made is: do not provide insurance information about individual patients to the healthcare provider. This can be done by placing a government body as a liaison between the health care provider and the insurance companies. In this model, the hospital submits the bill to the government body, and this body follows up with the insurance company; the hospital/clinic and the insurance company never interact directly. Another way to disincentivize misbehavior on the part of clinics might be to create a public, review-supported ratings service, à la Yelp, for medical clinics and individual doctors. The problem with this is the potential for false reviews and sybil attacks, which makes me wonder why apartmentratings.com seems to have so many fake reviews while Yelp seems relatively cleaner.

Update: A large amount of medical records data has just been released for the Heritage Health Prize via Kaggle. While the intended purpose of this data is to host a competition to predict which patients will be admitted to a hospital within 1 year of the time frame of the data, I wonder whether it could be used for other interesting investigations and verification of theories, such as the one above.


Comparing the communication problem with machine learning

In the problem of communicating information from point A to point B, we are faced with the task of dealing with a noisy channel that sits between A and B and corrupts or distorts the signal fed into it. A and B can be points in time instead of points in space -- for example, recording data on some medium such as a compact disc. Our goal is to take information produced by a source and map it to a signal that can be carried by our intended channel, in such a way that when this signal is fed into the channel at point A, the distorted version produced by the channel at point B can be used to faithfully recover the original information with little or no change in meaning. In solving this problem, we first try to understand the nature of the distortion that the channel creates, and based on this knowledge, we design an appropriate mapping between the original information emitted by the information source and the signal that is fed into the channel. This process is called encoding. Using the knowledge of the type of distortion produced by the channel, we can then take the distorted signal at B and map it to our guess of the original information emitted by the source. This process is called decoding. The study of how to create efficient encoding and decoding strategies for specific types of channels is called coding theory.

In machine learning, we are placed at point B, and at point A is a source of information that is not under our control. There is no entity sitting at point A creating a mapping from the original information to a signal appropriate for the channel in order to better withstand the distortions the channel will produce. The channel between A and B is 'nature'. We are faced with the task of figuring out the original information at the information source by observing the signal emitted by the channel at B. To do this, we attempt to learn a function that can correctly map the signal output by the channel at B to the original information at A. In terms of the communication problem, we are attempting to learn the decoding rule, even though we do not have prior knowledge about the nature of the distortion produced by the channel. In supervised learning, we are given several input-output pairs of the channel, and we attempt to learn the function that correctly maps the output to the input. Once we have estimated this function, we apply it to estimate the original information for instances where only the output of the channel is known.
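A toy sketch of the analogy (my own example, using scikit-learn): the 'channel' below is unknown to the learner, who sees only input-output pairs and fits a decoder from them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def channel(bits):
    """The 'channel' (nature): maps each source bit to a 3-sample waveform
    and corrupts it with noise. The learner never sees this function."""
    signal = np.outer(2 * bits - 1, [1.0, 0.5, -0.3])
    return signal + 0.8 * rng.normal(size=(len(bits), 3))

# Supervised learning = learning the decoder from input-output pairs.
train_bits = rng.integers(0, 2, size=2000)
decoder = LogisticRegression().fit(channel(train_bits), train_bits)

# Apply the learned decoder to fresh channel outputs.
test_bits = rng.integers(0, 2, size=1000)
print("decoding accuracy:", decoder.score(channel(test_bits), test_bits))
```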

Big data at Foursquare

Attended a meeting of the NY ML meetup. The speakers presented on the BIG DATA work happening at Foursquare. I noticed that the speakers, as well as the audience, were heavily tilted towards programming rather than statistics/ML. Most questions were about Hadoop, Hive and MongoDB; hardly any were about the statistical aspects. This could be because the audience understood the math behind recommendation systems quite well. One exception was the 'cold start' problem. One thing that surprised me was that Foursquare doesn't use matrix factorization, but simpler nearest-neighbor methods! I asked. Bell, Koren and Volinsky showed in the Netflix Prize that latent models based on matrix factorization are better than nearest-neighbor methods.
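For reference, here is a toy sketch of the latent-factor idea, in the spirit of the SGD matrix factorization popularized by the Netflix Prize (my own illustration, obviously not Foursquare's system):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 3.0), (3, 2, 4.0)]
n_users, n_items, k = 4, 3, 2

P = 0.1 * rng.normal(size=(n_users, k))   # latent user factors
Q = 0.1 * rng.normal(size=(n_items, k))   # latent item factors
lr, reg = 0.05, 0.02

for epoch in range(500):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]             # prediction error for this rating
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

for u, i, r in ratings:
    print("user %d, item %d: actual %.1f, predicted %.2f" % (u, i, r, P[u] @ Q[i]))
```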

I also felt that some of the audience questions pointed to interesting ideas. For example, whether they can collect implicit feedback by noticing what a user does after being offered recommendations. I thought this was a really neat idea for tuning the recommendations for a user -- someone in the audience asked if they were doing it, and they replied that they aren't, but are thinking about it.


Voice morphing

http://mi.eng.cam.ac.uk/~hy216/VoiceMorphingPrj.html Why this is so cool (and dangerous): Imagine that I have a large enough sample of your voice (if I have access to some of your phone messages or voice chats, this is easy). I can then build a model of your voice. Using the technique in the papers above, I can then transform my voice into your voice *in real time*. This means that I may be able to defeat a speaker-recognition/speaker-verification system, thereby breaking biometric systems that rely on voice for authentication. For example, if the biometric system asks me to speak a particular random phrase or sentence, I speak into a computer that modifies my voice to sound like yours in software and plays it back to the biometric system. Biometric systems should have 2- or 3-factor authentication -- biometrics alone are broken!