I’ve moved my blog to http://blogs.skicelab.com/maurizio/
Bye bye, WordPress.
Many years ago, SkiceLab was just an idea and Marco and I were technology enthusiasts trying to build whatever came to our minds.
DivX was getting popular and the regular media devices lacked support for it. We wanted to build the first video player and recorder with DivX support. I guess most of the big companies were already working on that; we were just poor kids looking for fun. That machine, codenamed Nadir, was our prototype and was supposed to be called Zenith once the software was completed.
Nadir was built by recycling old components from other computers, mainly collected from our parents and friends. We didn’t have a case, so Marco’s father built one out of a piece of wood, screwing everything onto it. We installed Linux and wrote some of the low-level software to handle the devices. I had started looking at Python at that time and decided to use it for the media player. The Python binding for lirc was written back then, in order to be able to use the remote control. (Note: there are better bindings than mine, I’ve linked it just for archaeological interest.)
Zenith never saw the light. I don’t remember the exact reason; I guess we got demotivated when the first DivX players were released. I don’t think we would have had any chance to build our first business with that: we underestimated the market and the problem and over-featured the prototype (for instance the record function). Still, it was fun and a nice story to remember today.
My master’s thesis, written with my colleague and friend Salvatore Trani, tackled a nice, challenging problem: recommending citations for academic papers. We basically built a system that can suggest related papers given the general idea of some research you want to do.
The system aims to shorten and simplify the initial exploratory step required during research. The prototype is available at http://vinello.isti.cnr.it/cite-as-you-write/ but it’s quite far from being production ready (it’s kinda slow, SQLite doesn’t scale well enough for the archive we serve, and the web service used for the internal communication is Bottle, a very nice and simple Python module but probably not the best solution we could have adopted). Still, it might be interesting to some of you. We’ve tested it in a few real-life cases and it gave surprising results. Our thesis is written in Italian, but I’m very glad to share what we’ve done. This post is an introduction to the most interesting parts of the work from the algorithmic point of view. Actually, most of the job was done on the preprocessing of the data, but I’ll omit that part; maybe I’ll talk about it in another post.
First let me point out the differences from other search engines: Cite-as-you-write is not a search engine. Search engines are usually keyword based, which means that if you put a very long query into Google you’re likely to see irrelevant results. To make Cite-as-you-write answer very long queries (which can consist of more than one sentence), the index must be very tolerant and return documents that don’t contain all the words inserted by the user. We accomplished this by extracting a kind of context from the query. The summarisation is done simply by selecting the most important words from the query, in a very naive way: sort the terms by tf-idf and take the top K. The posting lists of the selected terms are then merged together; we don’t intersect the lists, as that would shrink the candidate list massively.
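Here’s a minimal sketch of this step (illustrative Python, not the prototype’s actual code; the function and parameter names are made up): select the top-K query terms by tf-idf, then take the union of their posting lists instead of the intersection.

```python
import math
from collections import Counter

def top_k_terms(query_terms, doc_freq, num_docs, k=5):
    """Pick the K most informative query terms by tf-idf."""
    tf = Counter(query_terms)
    def tfidf(term):
        # Smoothed idf so unseen terms don't divide by zero.
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        return tf[term] * idf
    return sorted(set(query_terms), key=tfidf, reverse=True)[:k]

def candidates(query_terms, index, doc_freq, num_docs, k=5):
    """Union (not intersection) of the selected terms' posting lists."""
    docs = set()
    for term in top_k_terms(query_terms, doc_freq, num_docs, k):
        docs |= set(index.get(term, []))
    return docs
```

With the union, a document matching only some of the important terms still makes it into the candidate list, which is exactly the tolerance a long query needs.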
After the candidates are selected, we rank them to order the suggestions. The ranking is done in two steps: the first step just computes the cosine similarity between the query and the abstracts/titles of the candidates; the second step uses a Random Forest model to also exploit other features like the number of citations, PageRank, publication date, etc…
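A toy version of this two-step ranking might look like the following (illustrative only; the `second_stage` callable stands in for the trained Random Forest model, which in reality consumes the extra features mentioned above):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query_vec, candidate_vecs, second_stage=None, top_n=10):
    """Step 1: order candidates by cosine similarity with the query.
    Step 2 (optional): rescore the survivors with a learned model."""
    scored = sorted(candidate_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)[:top_n]
    if second_stage is not None:
        scored = sorted(scored, key=lambda kv: second_stage(kv[0]),
                        reverse=True)
    return [doc_id for doc_id, _ in scored]
```

The key design point is the cascade: the cheap similarity pass cuts the candidate set down to `top_n`, so the more expensive model only ever scores a handful of documents.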
Some of you might argue that the cosine similarity could be too slow for the first ranking step. Indeed it is; what we can do in this case is adopt some of the early-termination methods discussed in the literature. In our case we tried another path: aggressive index pruning. What does that mean? Usually, to index a document you split the text into terms and then add the document to the posting list of each term. A light pruning strategy would remove at least the stopwords; our aggressive choice removes all but the “important terms”. Once again, the important terms are the top K terms sorted by tf-idf.
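A rough sketch of this pruned indexing, under the assumption that each document is already tokenised into a list of terms (again, illustrative code rather than the real indexer):

```python
import math
from collections import Counter, defaultdict

def build_pruned_index(docs, k=5):
    """Index each document under only its top-K tf-idf terms.

    docs: mapping of doc_id -> list of terms.
    Returns a mapping of term -> list of doc_ids (the posting lists).
    """
    n = len(docs)
    # Document frequency of each term across the collection.
    df = Counter(t for terms in docs.values() for t in set(terms))
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        def tfidf(t):
            return tf[t] * math.log((n + 1) / (df[t] + 1))
        # Keep only the K most important terms; everything else
        # (stopwords included, since their idf is ~0) is dropped.
        for term in sorted(tf, key=tfidf, reverse=True)[:k]:
            index[term].append(doc_id)
    return dict(index)
```

Note that stopwords fall out for free: a term appearing in nearly every document gets an idf close to zero, so it never survives the top-K cut. The posting lists end up much shorter, which is what makes the cosine pass affordable.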
The code of the prototype is not published yet; we’ve released a small part of it, the tool used to train the function for the second step of the ranking phase, which we’ve called mltool (it stands for Machine Learning Tool). The tool is available on Bitbucket at the following address: https://bitbucket.org/duilio/mltool. Please feel free to use it, fork it, complain and suggest improvements. You can actually install it without even looking at the source code, using pip or easy_install:
$ pip install mltool
$ easy_install mltool