WebIR (dotIR)

Introduction

dotIR is a standard Persian test collection for evaluating web information retrieval algorithms on the Iranian web. Its main characteristics are:

  • It contains a large number of Persian web pages, including their text, links, and metadata, stored in XML format.
  • It was prepared to be a good representative of the Iranian web.
  • It is a good test bed for evaluating link-based information retrieval algorithms, with enough queries and relevance judgments for a valid evaluation.
  • It is small enough that it does not require heavy processing resources.

dotIR Test Collection

dotIR contains 1,000,000 web pages gathered by selectively crawling many websites in the .IR domain. In addition, 50 queries and their relevance judgments were created by 25 users using the UTIRE evaluation system. Several web retrieval algorithms were used to build the judgment pool, and the users judged 18,424 documents in total (about 368 documents per query on average).
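As a rough illustration of how queries and relevance judgments like these are typically used, the sketch below scores one ranked result list against the judged-relevant documents for a single query. The document identifiers, the in-memory data structures, and the function names are hypothetical; the actual judgment file layout of dotIR is not described here.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears,
    normalised by the total number of relevant documents for the query."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Made-up example: one query, five retrieved documents, three judged relevant.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

Averaging `average_precision` over all 50 queries would give a mean average precision score for one ranking algorithm, which is one common way to compare algorithms on a collection of this kind.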

To make it easier to compare ranking algorithms on the collection, 56 features have been calculated and added to it. These are standard features from the LETOR collection (provided by Microsoft Research Asia). They can be used for training or tuning web information retrieval algorithms.
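The LETOR datasets distribute features as text lines of the form "label qid:<id> <index>:<value> ... # comment". Whether the dotIR feature files use the same layout is an assumption here, and the sample line below is made up. Under that assumption, a minimal parser might look like this:

```python
def parse_letor_line(line):
    """Parse one LETOR-style line into (label, qid, features, comment).

    Assumes the layout 'label qid:<id> <index>:<value> ... # comment';
    this is the LETOR convention, not a confirmed dotIR format.
    """
    body, _, comment = line.partition("#")
    tokens = body.split()
    if len(tokens) < 2 or not tokens[1].startswith("qid:"):
        raise ValueError("not a LETOR-style line: %r" % line)
    label = int(tokens[0])              # relevance label
    qid = tokens[1][len("qid:"):]       # query identifier, kept as a string
    features = {}
    for token in tokens[2:]:
        index, _, value = token.partition(":")
        features[int(index)] = float(value)
    return label, qid, features, comment.strip()


# Hypothetical sample line with three of the 56 features filled in.
sample = "2 qid:10 1:0.25 2:0.5 56:1.0 # docid = example-doc"
label, qid, features, comment = parse_letor_line(sample)
```

Keeping the features in a dictionary keyed by feature index makes it easy to handle lines where some indices are missing.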

Download

You can download the whole WebIR (dotIR) collection (1.4 GB) by clicking Here. A Google Drive link is also provided.

Copyright

The dotIR collection was created by crawling the Iranian web. All rights to the collection and its tools are reserved by the Database Research Group of the University of Tehran. If you use this collection, please cite [1].

If you have any problem accessing the dataset, contact us at dbrg@ut.ac.ir.