Monday, 18 April 2016

Html template generation in Clojure - with Enlive


Enlive Tutorial

While working with web applications in Clojure, I had the opportunity to use the excellent HTML templating engine Enlive. I'm sharing a tutorial I wrote on Enlive, available at this Github worksheet.

P.S. To run the worksheet, click on the "view this worksheet on Github" link in the top right corner. Alternatively, you can clone the Github repo, run lein gorilla at the terminal, and open the worksheet located at src/enliven.cljw.

Tuesday, 12 April 2016

Presentation on Probabilistic Graphical Models






It's been about a year and a half since my book Building Probabilistic Graphical Models with Python was published. Since the book is fairly technical and involves programming, it is not the easiest of introductions to graphical models, especially for somebody with little knowledge of Machine Learning.

To introduce the topic and whet people's appetite for Graphical Models, I have given the following presentation a couple of times, and it was well received as an introduction for the lay person. Here it is.


Topic Modeling on Customer Experience data


Background


There exists a vast trove of Customer Experience data in the form of product reviews, forum posts, customer service/customer satisfaction surveys, and the like. This data is often in unstructured form, and companies that own it would like to summarize these (often vast) datasets.

One of the most common text mining methods is Topic Modeling. Given a large corpus of text, a topic model can assign a probabilistic score to each document-topic pair.
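
To make this concrete, here is a minimal sketch of fitting an LDA topic model with scikit-learn; the toy documents and parameters are my own for illustration, not those used in the paper.

    # Minimal LDA sketch: fit a 2-topic model on toy documents and print
    # each document's topic distribution.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the battery life of this phone is poor",
        "great camera but the battery drains fast",
        "customer service took a week to respond",
        "support was slow to answer my ticket",
    ]

    X = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)  # shape: (n_docs, n_topics)

    # each row is a probabilistic score for every document-topic pair
    print(doc_topic.round(2))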

At BAICONF'15, I presented a paper and an accompanying presentation on the effectiveness of using Topic Modeling for summarizing Customer Experience data. The paper was the result of our experiences (working with Extrack at Bridgei2i) of applying multiple methods, such as Unsupervised Topic Models, Semi-Supervised Topic Models, and others, to multiple types of Customer Experience data. (Note: due to scheduling issues, the paper/presentation is not listed in the BAICONF schedule.)

Here's the link to the paper along with the accompanying presentation.

Thursday, 4 June 2015

Clojure wrapper for Word2Vec


Word2Vec is an unsupervised algorithm that takes a word as input and returns a vector describing that word in a high-dimensional space (for example, 300-500 dimensions). Prior to Word2Vec, the most popular way to convert words to vectors was the bag-of-words model, which simply indicates the presence or absence of a word.

The weakness of the bag-of-words model was that its vectors had a poor notion of distance. For example, the word vectors for Paris and France would have no apparent similarity under bag-of-words, whereas Word2Vec generates vectors that can not only tell that these words are related, but can also derive relationships that determine the capitals of other countries, given only the country names.
[Image: word vector relationships. Courtesy Maastricht University, Dept. of Knowledge Engineering]
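
To illustrate the idea, here is a minimal sketch of similarity and analogy queries, using Python and gensim rather than the Clojure wrapper; the vectors file path is a placeholder for any pre-trained word2vec-format file.

    # Illustrative sketch of word-vector similarity and analogy queries,
    # using gensim with a pre-trained word2vec-format vectors file.
    from gensim.models import KeyedVectors

    # "vectors.bin" is a placeholder; any word2vec-format file works
    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # related words get a high cosine similarity
    print(wv.similarity("paris", "france"))

    # capital-of analogy: paris - france + germany ~ berlin
    print(wv.most_similar(positive=["paris", "germany"],
                          negative=["france"], topn=3))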


I implemented a Clojure library that wraps a Java implementation of Word2Vec. More details (and a short tutorial) are in this blog post (cross-posted from the Bridgei2i Github site).

Monday, 8 July 2013

My attempt at the Microsoft Author-Paper Identification Challenge at Kaggle


Introduction


The Microsoft Author-Paper Identification challenge is a machine learning competition hosted at Kaggle.com, and this post documents my attempts at this problem.

The Data


The data consists of a training set and a validation set, released early in the competition. The training data comprised the following comma-separated files:

  • Paper.csv, where each record described a Paper: its id, publication year, journal or conference, and so on.
  • Author.csv, where each record described an Author, their Affiliation, etc.
  • Journal.csv and Conference.csv, which described each journal or conference.
  • PaperAuthor.csv, a noisy dataset that listed Authors and the Papers ascribed to them.
The training file consisted of an AuthorId, followed by a list of PaperIds that were correctly and incorrectly assigned to that author. The submission format required each line to contain an author id, followed by a list of paper ids in decreasing order of the probability that the paper was authored by the given author. For example:
Author1, P1 P2 P3
where the probability that Author1 wrote P1 is the highest, and P3 the lowest.

Analysis


Although the evaluation metric used was Mean Average Precision, this problem is a classification problem: given an (AuthorId, PaperId) tuple, predict the probability that the Paper was authored by that Author.

Provided along with the data was the code for a simple model that generated 5 features and then used a RandomForest classifier to classify each tuple.
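
A rough sketch of what such a baseline looks like in scikit-learn follows; the file and column names are hypothetical, and this is not the actual provided code.

    # Rough sketch of the baseline: features in, RandomForest out.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("train_features.csv")  # hypothetical feature file
    feature_cols = [c for c in train.columns
                    if c not in ("AuthorId", "PaperId", "Label")]

    clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 random_state=0)
    clf.fit(train[feature_cols], train["Label"])
    print(clf.oob_score_)  # out-of-bag accuracy estimate

    # probabilities used to rank each author's candidate papers
    probs = clf.predict_proba(train[feature_cols])[:, 1]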

Data Cleaning


I wasn't too comfortable with using SQL to extract features, hence I loaded the CSV files into MongoDB using mongoimport. The data cleaning stage took a while: I changed the contents to lowercase, removed newlines within some dataset entries, and removed English stop words.
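
The cleaning itself amounted to something like the following sketch; the stop word list here is abbreviated for illustration.

    # Sketch of the cleaning steps: lowercase, strip newlines,
    # drop English stop words.
    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to"}  # abbreviated

    def clean(text):
        text = text.lower().replace("\n", " ")
        tokens = re.findall(r"[a-z0-9]+", text)
        return " ".join(t for t in tokens if t not in STOP_WORDS)

    print(clean("The Dept.\nof Computer Science"))  # dept computer science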

Approach


The initial 5 features provided were common-sense features, such as the number of co-authors who published in the same journal. Treating this as a classification problem, the out-of-bag accuracy (using RandomForests) was already around 95%.
I noticed at this point that to improve scores, I would have to start using features that depended on text fields such as AuthorName, PaperName, and Affiliation (e.g. University).

In some cases the Author name was simply marked wrong. For example, if 'Andrew B' was marked as the Author name for 99 papers and 'Kerning A' was the Author name for the 100th paper, the latter was obviously a mistake. I computed the Jaccard similarity on the following sets of features (a sketch of the computation follows the list):

  • Author names: intersection of the names in the author set with the current author name.
  • Co-author names: intersection of co-author names with the current author's co-authors.
  • Affiliation: intersection of this paper's affiliation with the current author's affiliations.
  • Title: the Jaccard similarity of title words, computed similarly.
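
For reference, a minimal sketch of the Jaccard similarity computation used for these features; the example sets are made up.

    # Jaccard similarity of two sets: |A & B| / |A | B|
    def jaccard(a, b):
        a, b = set(a), set(b)
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    # e.g. this paper's affiliation words vs. the author's known ones
    author_affiliations = {"stanford", "university"}
    paper_affiliation = {"stanford", "cs", "dept"}
    print(jaccard(author_affiliations, paper_affiliation))  # 0.25
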
Adding all of these features improved my validation set submission score to about 95%. Despite toying around, I didn't make much improvement on the score after this.

I also tried to use a Naive Bayes classifier to classify the 'bag of title words' into 2 classes: 'current author' or 'other'. Unfortunately, the projected running time was about 4 continuous days of CPU time, and with Bangalore's power failures it was unlikely the job would ever finish, so I abandoned the approach.

My last score improvement came unexpectedly: I noticed that several author-paper entries that were 'correctly marked' had multiple duplicate entries in the dataset. I was surprised at this rather obvious mistake (or goof-up, however you want to look at it), but adding this feature (the number of paper-author counts) improved the classification rate by a whole percentage point.
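
Computing that feature is a few lines with pandas; a sketch, assuming the PaperAuthor.csv columns are named AuthorId and PaperId:

    # Count how often each (AuthorId, PaperId) pair occurs in the noisy
    # PaperAuthor data; duplicates turn out to signal correct assignments.
    import pandas as pd

    paper_author = pd.read_csv("PaperAuthor.csv")
    pair_counts = (paper_author
                   .groupby(["AuthorId", "PaperId"])
                   .size()
                   .rename("PairCount")
                   .reset_index())
    # PairCount is then merged into the feature table as an extra feature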



Learning and Notes



  • This competition required creating hand-coded features from the multiple files that comprised the dataset. The ability to quickly try out a feature and see if it improved scores was critical. I spent a lot of time implementing a Naive Bayes classifier from scratch, without testing its classification accuracy on a small part of the dataset, and finally had to abandon this feature due to a lack of CPU hours.
  • This was probably one competition where Hadoop/MapReduce skills wouldn't have hurt, since the datasets were large enough that parallelization would have helped.
  • The time required for training and prediction using RandomForests was really small, probably of the order of 10 minutes or less. This was very different from other competitions (where I used deep neural nets, for example) where I would have to watch plots of training and validation errors for hours on end.
  • Observing quirks of the data is important, as the duplicate paper-author ids demonstrated. 
  • I didn't visualize or graph anything in this competition, and that's something I should improve upon, although it wasn't obvious what visualization might have been useful for in this case.


This was the first competition where I got into the top 25%, and my overall Kaggle ranking was inside the top 1%.

Wednesday, 12 June 2013

Machine learning endeavour: At the finish line



So I'm finally at a point where I can say I'm done with the Machine Learning endeavour that I started about 19 months ago. Of course, learning is a lifelong process, but getting to the finish line of what seemed a far-fetched goal some time back feels incredibly sweet.

Some quick statistics:
  • Number of courses completed: 6
  • Number of courses partially completed: 3
  • Number of programming assignments: 32
    • Natural Language Processing: 3
    • Machine Learning: 8
    • Probabilistic Graphical Models: 9
    • Neural Networks for Machine Learning: 4
    • Computing for Data Analysis: 2
    • Data Analysis: 2
    • Design and Analysis of Algorithms: 4
  • Number of Kaggle competitions entered: 4
    • Notable rankings:
      • 72/1158 in the Digit Recognition competition. 
      • 125/557 in the KDD Cup 2013 - Author-Paper Identification Challenge
      • Ranked among the top 2% of all Kagglers. 

This is the list of courses that I completed:

No  Course name                                                        Completed
1   Probabilistic Graphical Models: Advanced track with distinction    June 2012
2                                                                      Dec 2012
3                                                                      Dec 2012
4                                                                      Dec 2012
5                                                                      Mar 2013
6   Natural Language Processing with distinction                       May 2013


Thoughts on getting here? In no particular order:


  • Learning is incredibly fun. And self-motivated learning is much more fun than college. Motivation in college education is usually built around too much stick and too little carrot. 
  • Although online learning might seem very disconnected, the online student community is tight-knit and extremely helpful. I would not have completed certain assignments in PGM and NLP if not for other students who posted helpful tips. 
  • Support from family and friends matter a lot. 
  • Finishing a course is just one part of the story. Applying the learning in a real-world context is a very different challenge.
Blogs/books that I found helpful:
Other self-learners:



Sunday, 14 April 2013

Data Analysis @ Coursera - a review


Here is my review of the Data Analysis course at Coursera which I recently completed.

There are several "data X" and "big data Y" kinds of courses nowadays, and it's quite difficult to know up front whether the course you signed up for is the course you need. I'll try to outline what this particular course is, and what you can expect from it.

First off, this is a Data Analysis course in R. Knowing R is a prerequisite, and if you come to this course without any knowledge of R, expecting to pick up the basics along the way, it will be quite challenging. Completing Prof. Roger Peng's R course is the ideal way to ease into the material for this course.

This course teaches you statistics by making sense of data. Innumerable data sets are explored, and playing with data is the ideal way to understand statistical concepts in practice. The initial part of the course showcases R's strengths in graphing, data cleaning, and munging, and goes at a relaxed pace; somewhere in the middle the speed picks up, and the last few weeks become quite hectic. The second part of the course is pretty much a machine learning course, with clustering and classification algorithms explained in quick succession. If you have been through Prof. Andrew Ng's machine learning course, the difference here is that very little mathematics is involved. The classification methods used (such as random forests, which I had never used before) are explained from a "how to use it" point of view, and the mathematical basics are not covered.

Since it does not try to get into the mathematical basis of every method, it covers much more ground, such as ensemble learning and ways to do model averaging. Although the knowledge of math is certainly useful, this course showed that it is possible to do predictive modelling quite effectively simply by knowing the methods and learning how to apply them. It is therefore a practitioner's course in Data Analysis.

The difference between this course and the Machine Learning course is that this one is much more exploratory. In machine learning problems, the goal is often just to get the lowest misclassification rate on the validation/test set. Here, the emphasis is much more on interpreting and explaining the data (usually graphically) and understanding how a few features (especially if the dataset has relatively many dimensions) are responsible for most of the variance. I often struggled with this, because in a classification problem it was relatively easy to do dimensionality reduction (using Principal Component Analysis) and then use multiple classifiers such as SVMs/Neural Nets/Random Forests, while it was relatively hard to explain feature variance on data that had been processed through PCA.
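
For what it's worth, the PCA-then-classify workflow I describe is only a few lines in scikit-learn; here is a sketch on a toy dataset, not the course data.

    # PCA for dimensionality reduction followed by a classifier. Accurate,
    # but each component mixes all original features, which is what makes
    # explaining feature variance after PCA hard.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_digits

    X, y = load_digits(return_X_y=True)
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=20),
                          RandomForestClassifier(n_estimators=200,
                                                 random_state=0))
    model.fit(X, y)
    # fraction of variance retained by the 20 components
    print(model.named_steps["pca"].explained_variance_ratio_.sum())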

The assignments in this course also reflect its exploratory nature and are peer-assessed using a rubric. Both assignments require one to write a data analysis explaining the motivations, the methods used to clean the data, the methods used to classify or cluster, the statistical tests used, and so on. Sticking to what the rubric demands is quite important; straying from it, even if your write-up is excellent, leads to lower scores.

All said and done, this is an excellent course to improve your knowledge of data analysis, statistics and machine learning.

Here are my suggestions to make this course even better:

  • More assignments. The assignments are quite big, and they take a fair bit of time. I would prefer shorter assignments that give one the opportunity to play with more data sets.
  • The course probably tries to cover too much material. I was quite confused by the various tests for statistical significance, which were explained in bits and pieces in various parts of the course, and only later did I develop some understanding of what should be used in which case. With this much material, the course could certainly be split into a basic and an advanced data analysis course.
  • I would love to have this course offered in a language-agnostic way. For simple one- and two-liners, R is bearable. Writing code longer than that makes one itchy for a 'real' language (I'm working my way through data analysis in Clojure; hopefully with the knowledge gained there I can finally say goodbye to R).
  • The course should probably include some material on handling analysis of big data. R holds its data largely in memory, and datasets that push this limit will make any analysis difficult. I think Clojure/Incanter is a good combination of an excellent language married to a robust toolkit, but I'm yet to run classification on large data sets using that toolkit, so that calls for more experimentation (and a blog post too).