sorting information retrieval has been one of the core issues, Learning to Rank (short LTR) with the idea of machine learning to solve scheduling problems (on Learning to Rank's profile, see my blog post Learning to Rank Profile ). LTR There are three main ways: PointWise, PairWise, ListWise. Ranking SVM algorithm is a method PointWise by R. Herbrich et al in 2000 raised, T. Joachims presents a user-based Clickthrough data to be sorted using the Ranking SVM method (SIGKDD, 2002).
1. Ranking SVM main idea
Ranking SVM is a Pointwise sorting algorithm, for a given query q, document d 1 > d 2 > d 3 (ie document d 1 than the document d 2 related documents d 2 than the document d 3 -related , x 1 , x 2 , x 3 are d 1 , d 2 , d 3 characteristics). Using machine learning methods to sort the sort we transformed into a classification problem. We define a new training sample, so that x 1 -x 2 , x 1 -x 3 , x < sub> 2 -x 3 is a positive sample, so that x 2 -x 1 , x 3 -x 1 , x 3 -x 2 is negative sample, and then training a two classifier (support vector machine) to these new training samples are classified as shown below:
left of each ellipse represents a query ellipse to a point on behalf of those calculations and the relevance of the query document, triangle represents a very relevant circles represent generally related to cross represents irrelevant. We left in the documents into a single document on the right in (d i , d j ), solid boxes represent the positive samples, ie d i > d j , hollow boxes represent negative samples, ie d i
2. Ranking SVM
the sorting problem into a classification problem, we can use a common machine learning methods to solve the problem. Ranking SVM using SVM to classify:
where w is the parameter vector, x is a feature of the document, y is the document to the relative correlation, ξ is slack variable.
3. use Clickthrough data as training data <p>
T. Joachims proposed a very clever way to use Clickthrough data as Ranking SVM training data.
Suppose you are given a query "Support Vector Machine", the search engine returns results
which 1, 3, 7 three results clicked by the user, the other did not. Because the returned result itself is ordered, users are more inclined to click on the top surface of the results, so the user clicks the act itself is biased (Bias) of. To click data from biased obtain information about the document, we believe that: If a user clicks on a without clicking b, but b sort results in higher position than a, then a> b.
so the above user clicks on the behavior means: 3> 2, 7> 2, 7> 4, 7> 5, 7> 6.
4. Ranking SVM open source implementation
H. Joachims homepage Ranking SVM on the open source implementation.
LIBSVM data format of the input format is similar to the first column represents the relevance of documents, greater value represents the more relevant, representative for the second column, the latter representative feature
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D
training model and the test data were sort code:
. / svm_rank_learn path / to / train path / to / model
. / svm_classify path / to / test path / to / model path / to / rank_result
References:
[1]. R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression . In Advances in Large Margin Classifiers, 2000.
[2]. T. Joachims. Optimizing Search Engines using Clickthrough Data <em>. SIGKDD, 2002.
[3]. Hang Li. A Short Introduction to Learning to Rank .
[4]. Tie-yan Liu. Learning to Rank for Information Retrieval .
没有评论:
发表评论