Monday, 18 April 2016

Html template generation in Clojure - with Enlive


Enlive Tutorial

While working with web applications in Clojure, I had the opportunity to use the excellent HTML templating engine Enlive. I'm sharing a tutorial I wrote on Enlive, available at this Github worksheet.

P.S. To run the worksheet, click on the "view this worksheet on Github" link in the top right corner. Alternatively, you can clone the Github repo, run lein gorilla at the terminal, and open the worksheet located at src/enliven.cljw.

Tuesday, 12 April 2016

Presentation on Probabilistic Graphical Models






It's been about a year and a half since my book Building Probabilistic Graphical Models with Python was published. Since the book is fairly technical and involves programming, it is not the easiest of introductions to graphical models, especially for somebody with little knowledge of Machine Learning.

To introduce the topic and whet people's appetite for Graphical Models, I have given the following presentation a couple of times, and it was well received as an introduction for the lay person. Here it is.


Topic Modeling on Customer Experience data


Background


There exists a vast trove of Customer Experience data in the form of product reviews, forum posts, customer service/customer satisfaction surveys, and the like. This data is often in unstructured form, and companies that own it would like to summarize these (often vast) datasets.

One of the most common text mining methods is Topic Modeling. Given a large corpus of text, a topic model can assign a probabilistic score to each document-topic pair.
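
To make this concrete, here is a minimal sketch of fitting an LDA topic model with scikit-learn; the toy documents and parameters are my own for illustration, not those used in the paper.

    # Minimal LDA sketch: fit a 2-topic model on toy documents and print
    # each document's topic distribution.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the battery life of this phone is poor",
        "great camera but the battery drains fast",
        "customer service took a week to respond",
        "support was slow to answer my ticket",
    ]

    X = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)  # shape: (n_docs, n_topics)

    # each row is a probabilistic score for every document-topic pair
    print(doc_topic.round(2))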

At BAICONF'15, I presented a paper and an accompanying presentation on the effectiveness of using Topic Modeling for summarizing Customer Experience data. The paper was the result of our experiences (working with Extrack at Bridgei2i) of applying multiple methods, such as Unsupervised Topic Models, Semi-Supervised Topic Models, and others, to multiple types of Customer Experience data. (Note: due to scheduling issues, the paper/presentation is not listed in the BAICONF schedule.)

Here's the link to the paper along with the accompanying presentation.

Thursday, 4 June 2015

Clojure wrapper for Word2Vec


Word2Vec is an unsupervised algorithm that takes a word as input and returns a vector describing that word in a high-dimensional space (for example, 300-500 dimensions). Prior to Word2Vec, the most popular way to convert words to vectors was the bag-of-words model, which simply indicates the presence or absence of a word.

The weakness of the bag-of-words model was that its vectors had a poor notion of distance. For example, the word vectors for Paris and France would have no apparent similarity under bag-of-words, whereas Word2Vec generates vectors that can not only tell that these words are related, but can also derive relationships that determine the capitals of other countries, given only the country names.
[Image: word vector relationships. Courtesy Maastricht University, Dept. of Knowledge Engineering]
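
To illustrate the idea, here is a minimal sketch of similarity and analogy queries, using Python and gensim rather than the Clojure wrapper; the vectors file path is a placeholder for any pre-trained word2vec-format file.

    # Illustrative sketch of word-vector similarity and analogy queries,
    # using gensim with a pre-trained word2vec-format vectors file.
    from gensim.models import KeyedVectors

    # "vectors.bin" is a placeholder; any word2vec-format file works
    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # related words get a high cosine similarity
    print(wv.similarity("paris", "france"))

    # capital-of analogy: paris - france + germany ~ berlin
    print(wv.most_similar(positive=["paris", "germany"],
                          negative=["france"], topn=3))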


I implemented a Clojure library that wraps a Java implementation of Word2Vec. More details (and a short tutorial) are in this blog post (cross-posted from the Bridgei2i Github site).

Monday, 8 July 2013

My attempt at the Microsoft Author-Paper Identification Challenge at Kaggle


Introduction


The Microsoft Author-Paper Identification challenge is a machine learning competition hosted at Kaggle.com, and this post documents my attempts at this problem.

The Data


The data consists of a training set and a validation set, released early in the competition. The training data comprised the following comma-separated files:

  • Paper.csv, where each record described a Paper: its id, publication year, journal or conference, and so on.
  • Author.csv, where each record described an Author, their Affiliation, etc.
  • Journal.csv and Conference.csv, which described each journal or conference.
  • PaperAuthor.csv, a noisy dataset that listed Authors and the Papers ascribed to them.
The training file consisted of an AuthorId, followed by a list of PaperIds that were correctly and incorrectly assigned to that author. The submission format required each line to contain an author id, followed by a list of paper ids in decreasing order of the probability that the paper was authored by the given author. For example:
Author1, P1 P2 P3
where the probability that Author1 wrote P1 is the highest, and P3 the lowest.

Analysis


Although the evaluation metric used was Mean Average Precision, this problem is a classification problem: given an (AuthorId, PaperId) tuple, predict the probability that the Paper was authored by that Author.

Provided along with the data was the code for a simple model that generated 5 features and then used a RandomForest classifier to classify each tuple.
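
A rough sketch of what such a baseline looks like in scikit-learn follows; the file and column names are hypothetical, and this is not the actual provided code.

    # Rough sketch of the baseline: features in, RandomForest out.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("train_features.csv")  # hypothetical feature file
    feature_cols = [c for c in train.columns
                    if c not in ("AuthorId", "PaperId", "Label")]

    clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 random_state=0)
    clf.fit(train[feature_cols], train["Label"])
    print(clf.oob_score_)  # out-of-bag accuracy estimate

    # probabilities used to rank each author's candidate papers
    probs = clf.predict_proba(train[feature_cols])[:, 1]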

Data Cleaning


I wasn't too comfortable with using SQL to extract features, hence I loaded the CSV files into MongoDB using mongoimport. The data cleaning stage took a while: I changed the contents to lowercase, removed newlines within some dataset entries, and removed English stop words.
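
The cleaning itself amounted to something like the following sketch; the stop word list here is abbreviated for illustration.

    # Sketch of the cleaning steps: lowercase, strip newlines,
    # drop English stop words.
    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to"}  # abbreviated

    def clean(text):
        text = text.lower().replace("\n", " ")
        tokens = re.findall(r"[a-z0-9]+", text)
        return " ".join(t for t in tokens if t not in STOP_WORDS)

    print(clean("The Dept.\nof Computer Science"))  # dept computer science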

Approach


The initial 5 features provided were common-sense features, such as the number of co-authors who published in the same journal. Treating this as a classification problem, the out-of-bag accuracy (using RandomForests) was already around 95%.
I noticed at this point that to improve scores, I would have to start using features that depended on text fields such as AuthorName, PaperName, and Affiliation (e.g. University).

In some cases the Author name was simply marked wrong. For example, if 'Andrew B' was marked as the Author name for 99 papers and 'Kerning A' was the Author name for the 100th paper, the latter was obviously a mistake. I computed the Jaccard similarity on the following sets of features (a sketch of the computation follows the list):

  • Author names: intersection of the names in the author set with the current author name.
  • Co-author names: intersection of co-author names with the current author's co-authors.
  • Affiliation: intersection of this paper's affiliation with the current author's affiliations.
  • Title: the Jaccard similarity of title words, computed similarly.
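
For reference, a minimal sketch of the Jaccard similarity computation used for these features; the example sets are made up.

    # Jaccard similarity of two sets: |A & B| / |A | B|
    def jaccard(a, b):
        a, b = set(a), set(b)
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    # e.g. this paper's affiliation words vs. the author's known ones
    author_affiliations = {"stanford", "university"}
    paper_affiliation = {"stanford", "cs", "dept"}
    print(jaccard(author_affiliations, paper_affiliation))  # 0.25
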
Adding all of these features improved my validation set submission score to about 95%. Despite toying around, I didn't make much improvement on the score after this.

I also tried to use a Naive Bayes classifier to classify the 'bag of title words' into 2 classes: 'current author' or 'other'. Unfortunately, the projected running time was about 4 continuous days of CPU time, and with Bangalore's power failures it was unlikely the job would ever finish, so I abandoned the approach.

My last score improvement came unexpectedly: I noticed that several author-paper entries that were 'correctly marked' had multiple duplicate entries in the dataset. I was surprised at this rather obvious mistake (or goof-up, however you want to look at it), but adding this feature (the number of paper-author counts) improved the classification rate by a whole percentage point.
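
Computing that feature is a few lines with pandas; a sketch, assuming the PaperAuthor.csv columns are named AuthorId and PaperId:

    # Count how often each (AuthorId, PaperId) pair occurs in the noisy
    # PaperAuthor data; duplicates turn out to signal correct assignments.
    import pandas as pd

    paper_author = pd.read_csv("PaperAuthor.csv")
    pair_counts = (paper_author
                   .groupby(["AuthorId", "PaperId"])
                   .size()
                   .rename("PairCount")
                   .reset_index())
    # PairCount is then merged into the feature table as an extra feature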



Learning and Notes



  • This competition required creating hand-coded features from the multiple files that comprised the dataset. The ability to quickly try out a feature and see if it improved scores was critical. I spent a lot of time implementing a Naive Bayes classifier from scratch, without testing its classification accuracy on a small part of the dataset, and finally had to abandon this feature due to a lack of CPU hours.
  • This was probably one competition where Hadoop/MapReduce skills wouldn't have hurt, since the datasets were large enough that parallelization would have helped.
  • The time required for training and prediction using RandomForests was really small, probably of the order of 10 minutes or less. This was very different from other competitions (where I used deep neural nets, for example) where I would have to watch plots of training and validation errors for hours on end.
  • Observing quirks of the data is important, as the duplicate paper-author ids demonstrated. 
  • I didn't visualize or graph anything in this competition, and that's something I should improve upon, although it wasn't obvious what visualization might have been useful for in this case.


This was the first competition where I got into the top 25%, and my overall Kaggle ranking was inside the top 1%.

Wednesday, 12 June 2013

Machine learning endeavour: At the finish line



So I'm finally at a point where I can say I'm done with the Machine Learning endeavour that I started about 19 months ago. Of course, learning is a lifelong process, but getting to the finish line of what seemed a far-fetched goal some time back feels incredibly sweet.

Some quick statistics:
  • Number of courses completed: 6
  • Number of courses partially completed: 3
  • Number of programming assignments: 32
    • Natural Language Processing: 3
    • Machine Learning: 8
    • Probabilistic Graphical Models: 9
    • Neural Networks for Machine Learning: 4
    • Computing for Data Analysis: 2
    • Data Analysis: 2
    • Design and Analysis of Algorithms: 4
  • Number of Kaggle competitions entered: 4
    • Notable rankings:
      • 72/1158 in the Digit Recognition competition. 
      • 125/557 in the KDD Cup 2013 - Author-Paper Identification Challenge
      • Ranked among the top 2% of all Kagglers. 

This is the list of courses that I completed:

No  Course name                                                        Completed
1   Probabilistic Graphical Models: Advanced track with distinction    June 2012
2                                                                      Dec 2012
3                                                                      Dec 2012
4                                                                      Dec 2012
5                                                                      Mar 2013
6   Natural Language Processing with distinction                       May 2013


Thoughts on getting here? In no particular order:


  • Learning is incredibly fun. And self-motivated learning is much more fun than college. Motivation in college education is usually built around too much stick and too little carrot. 
  • Although online learning might seem very disconnected, the online student community is tight-knit and extremely helpful. I would not have completed certain assignments in PGM and NLP if not for other students who posted helpful tips. 
  • Support from family and friends matter a lot. 
  • Finishing a course is just one part of the story. Applying the learning in a real-world context is a very different challenge.
Blogs/books that I found helpful:
Other self-learners:



Sunday, 14 April 2013

Data Analysis @ Coursera - a review


Here is my review of the Data Analysis course at Coursera which I recently completed.

There are several "data X" and "big data Y" kinds of courses nowadays, and it's quite difficult to know up front whether the course you signed up for is the course you need. I'll try to outline what this particular course is, and what you can expect from it.

First off, this is a Data Analysis course in R. Knowing R is a prerequisite, and if you come to this course without any knowledge of R, expecting to pick up the basics along the way, it will be quite challenging. Completing Prof. Roger Peng's R course is the ideal way to ease into the material for this course.

This course teaches you statistics by making sense of data. Innumerable data sets are explored, and playing with data is the ideal way to understand statistical concepts in practice. The initial part of the course showcases R's strengths in graphing, data cleaning, and munging, and goes at a relaxed pace; somewhere in the middle the speed picks up, and the last few weeks become quite hectic. The second part of the course is pretty much a machine learning course, with clustering and classification algorithms explained in quick succession. If you have been through Prof. Andrew Ng's machine learning course, the difference here is that very little mathematics is involved. The classification methods used (such as random forests, which I had never used before) are explained from a "how to use it" point of view, and the mathematical basics are not covered.

Since it does not try to get into the mathematical basis of every method, it covers much more ground, such as ensemble learning and ways to do model averaging. Although the knowledge of math is certainly useful, this course showed that it is possible to do predictive modelling quite effectively simply by knowing the methods and learning how to apply them. It is therefore a practitioner's course in Data Analysis.

The difference between this course and the Machine Learning course is that this one is much more exploratory. In machine learning problems, the goal is often just to get the lowest misclassification rate on the validation/test set. Here, the emphasis is much more on interpreting and explaining the data (usually graphically) and understanding how a few features (especially if the dataset has relatively many dimensions) are responsible for most of the variance. I often struggled with this, because in a classification problem it was relatively easy to do dimensionality reduction (using Principal Component Analysis) and then use multiple classifiers such as SVMs/Neural Nets/Random Forests, while it was relatively hard to explain feature variance on data that had been processed through PCA.
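
For what it's worth, the PCA-then-classify workflow I describe is only a few lines in scikit-learn; here is a sketch on a toy dataset, not the course data.

    # PCA for dimensionality reduction followed by a classifier. Accurate,
    # but each component mixes all original features, which is what makes
    # explaining feature variance after PCA hard.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_digits

    X, y = load_digits(return_X_y=True)
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=20),
                          RandomForestClassifier(n_estimators=200,
                                                 random_state=0))
    model.fit(X, y)
    # fraction of variance retained by the 20 components
    print(model.named_steps["pca"].explained_variance_ratio_.sum())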

The assignments in this course also reflect its exploratory nature and are peer-assessed using a rubric. Both assignments require one to write a data analysis explaining the motivations, the methods used to clean the data, the methods used to classify or cluster, the statistical tests used, and so on. Sticking to what the rubric demands is quite important; straying from it, even if your write-up is excellent, leads to lower scores.

All said and done, this is an excellent course to improve your knowledge of data analysis, statistics and machine learning.

Here are my suggestions to make this course even better:

  • More assignments. The assignments are quite big, and they take a fair bit of time. I would prefer shorter assignments that give one the opportunity to play with more data sets.
  • The course probably tries to cover too much material. I was quite confused by the various tests for statistical significance, which were explained in bits and pieces in various parts of the course, and only later did I develop some understanding of what should be used in which case. With this much material, the course could certainly be split into a basic and an advanced data analysis course.
  • I would love to have this course offered in a language-agnostic way. For simple one- and two-liners, R is bearable. Writing code longer than that makes one itchy for a 'real' language (I'm working my way through data analysis in Clojure; hopefully with the knowledge gained there I can finally say goodbye to R).
  • The course should probably include some material on handling analysis of big data. R holds its data largely in memory, and datasets that push this limit will make any analysis difficult. I think Clojure/Incanter is a good combination of an excellent language married to a robust toolkit, but I'm yet to run classification on large data sets using that toolkit, so that calls for more experimentation (and a blog post too).