Hamshahri Newspaper

What is Hamshahri Corpus?

Hamshahri is one of the most popular daily newspapers in Iran and has been published for more than 20 years. The Hamshahri corpus is a Persian test collection consisting of 345 MB of news text from this newspaper, covering 1996 to 2002 (the corpus size with tags is 564 MB). The corpus contains more than 160,000 news articles on a variety of subjects (82 categories such as politics, literature, art, and economy) and includes nearly 417,000 distinct words. Hamshahri articles vary between 1 KB and 140 KB in size.
The Hamshahri corpus was prepared for research in different areas of information retrieval. We created 65 queries and their relevance judgments for the top 100 retrieved documents (according to the TREC standard). We have used this corpus in different projects, which led to some tools that are included in this package (an indexer, and vector space and language modeling retrieval engines). Also, for ease of use, we added an indexed version of the Hamshahri corpus and its statistics in SQL Server 2000 database format.
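As an illustration of how relevance judgments of this kind are typically used, the following minimal Python sketch computes precision at k and average precision for a single query. It is not part of the package: the function names and the document IDs are hypothetical, and it assumes the common convention that any document without a positive judgment counts as non-relevant.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that were never retrieved contribute zero, so the
    sum is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical ranking and judgments for one query (IDs are made up).
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

Averaging `average_precision` over all 65 queries would give the mean average precision of a retrieval run.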
In addition, we recommend visiting the web site of the Bijankhan corpus, which is more suitable for natural language processing research.

Persian@CLEF2008

The current version of the Hamshahri collection has also been revised based on CLEF standards. DBRG has contributed the Hamshahri collection to CLEF 2008, and some additional queries in both English and Persian will be added to the collection. This year Persian@CLEF offers both monolingual and bilingual tasks. For more information, please see the Call for Participation.

Copyright

The Hamshahri corpus was created in the DBRG Lab at the University of Tehran, ECE department. All rights to the corpus’ news are reserved by Hamshahri newspaper. All rights to the corpus’ data and to the tools included in this package are reserved by the University of Tehran Database Research Group. This package may be used free of charge for research or other non-commercial purposes, provided that you cite the related papers below.

This Package’s Components

  • Hamshahri corpus (unicode text format)
  • Indexed version of Hamshahri collection (SQL Server 2000)
  • 65 Query Judgments (According to TREC standards)
  • Some tools that we developed during our experiments on the collection (Mostly C#)
  • 58 Query Judgments (Created previously and not based on TREC standards)
  • Persian stopwords list (796 items)
  • Some extra information about Hamshahri corpus
  • Some related publications

Downloads (Version 1.0)

Notice: the new version of Hamshahri Corpus (version 2) is now available for download from here. A Google Drive link is also provided.

Files and descriptions:

1. Hamshahri-Corpus.zip
   Tagged corpus (154 MB): This file is a compressed version of the whole corpus in Unicode text format. In this text file, each document is tagged with DID (document number), Date (publication date) and Category (subject domain), followed by the main text of the document. A sample of the corpus and the list of document categories can also be downloaded separately.

2. Hamshahri-All%20(SQL).zip
   The corpus and its statistics in SQL Server 2000 format (102 MB): For easier access by those who use SQL Server 2000, we added this file to the package. However, converting this database to other formats is easy. The tables in this database contain all the information that common IR systems need, such as:
     • Terms and their IDs
     • Documents and their IDs
     • Term frequency of each document
     • Maximum term frequency of each document
     • Average term frequency of each document
     • Number of unique terms of each document
     • Total length of each document
     • Document frequency of each term
     • Collection frequency of each term

3. Hamshahri-Query_Judgement.zip
   Relevance judgments of the 65 standard queries (42.6 KB): 65 queries were created and judged for this corpus according to TREC standards. These queries and their relevance judgments can be used to evaluate any information retrieval system that uses Hamshahri as its corpus.

4. VectorSpace_All.rar
   Vector space retrieval system (36.9 KB): This file contains a retrieval system based on the vector space model. Different weighting schemes such as Lnu.ltu, atc.atc and lnc.btc are implemented.

5. LM-Top%201000.rar
   Language modeling retrieval system (85.6 KB): This file contains a retrieval system based on language modeling. Different lambda parameters can be set easily.

6. Hamshahri-Query_Judgement_old.zip
   Extra 58 queries and their relevance judgments (27.1 KB): These queries were used in some of our previous projects. They were not created according to the TREC standard. The relevance judgment was done for the top 20 retrieved documents.

7. PersianStopWords.zip
   A list of Persian stop words (9.4 KB): This stop word list was created based on Bijankhan corpus tags. Some stop words were also added statistically, based on the Hamshahri collection.
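To make the statistics stored in the SQL database and the lambda parameter of the language modeling system more concrete, here is a minimal Python sketch. It is not the code shipped in the package: the function names, the toy documents and the exact smoothing formula are assumptions. It computes the listed per-document and per-term statistics from already tokenized documents, then scores a query with Jelinek-Mercer smoothing, one common way a lambda parameter enters a language modeling retrieval engine. Here lambda weights the collection model; some formulations use the opposite convention.

```python
import math
from collections import Counter
from typing import Dict, List


def build_statistics(docs: Dict[str, List[str]]):
    """Compute term and document statistics from tokenized documents."""
    tf: Dict[str, Counter] = {}   # term frequencies per document
    doc_stats: Dict[str, dict] = {}
    df: Counter = Counter()       # document frequency of each term
    cf: Counter = Counter()       # collection frequency of each term
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        tf[doc_id] = counts
        doc_stats[doc_id] = {
            "length": len(tokens),
            "unique_terms": len(counts),
            "max_tf": max(counts.values(), default=0),
            "avg_tf": len(tokens) / len(counts) if counts else 0.0,
        }
        cf.update(counts)          # adds the per-document counts
        df.update(counts.keys())   # adds 1 per distinct term
    return tf, doc_stats, df, cf


def jm_score(query: List[str], doc_id: str, tf, doc_stats, cf,
             lam: float = 0.5) -> float:
    """Query log-likelihood with Jelinek-Mercer smoothing (assumed form):

    P(t|d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|
    """
    collection_len = sum(cf.values())
    doc_len = doc_stats[doc_id]["length"]
    score = 0.0
    for term in query:
        p_doc = tf[doc_id][term] / doc_len if doc_len else 0.0
        p_col = cf[term] / collection_len if collection_len else 0.0
        p = (1 - lam) * p_doc + lam * p_col
        if p > 0.0:  # terms unseen in the whole collection are skipped
            score += math.log(p)
    return score


# Toy, made-up documents standing in for tokenized news articles.
docs = {
    "d1": ["tehran", "news", "tehran"],
    "d2": ["sport", "news"],
}
tf, doc_stats, df, cf = build_statistics(docs)
print(doc_stats["d1"])
print(jm_score(["tehran"], "d1", tf, doc_stats, cf, lam=0.5))
print(jm_score(["tehran"], "d2", tf, doc_stats, cf, lam=0.5))
```

With a database already holding these statistics, a scorer of this kind only needs lookups rather than a pass over the raw text, which is the practical benefit of the indexed version.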

Published Papers

References and descriptions:

[1] Darrudi E., Hejazi M.R., Oroumchian F., Assessment of a Modern Farsi Corpus. In Proceedings of the 2nd Workshop on Information Technology & its Disciplines (WITID) 2004, ITRC, Kish Island, Iran.
    This paper describes how we constructed a well-structured 345 MB tagged corpus of news, and presents some useful statistics of this corpus based on the characteristics of the Farsi language (fitness of the frequency, Zipf-Mandelbrot’s law, etc.).

[2] Hadi Amiri, Abolfazl AleAhmad, Farhad Oroumchian, Caro Lucas, Masoud Rahgozar, “Using OWA Fuzzy Operator to Merge Retrieval System Results”, The Second Workshop on Computational Approaches to Arabic Script-based Languages, LSA 2007 Linguistic Institute, Stanford University, USA, 2007.
    In this study, we investigated the performance of Persian retrieval by merging four different language modeling methods and two vector space models with Lnu.ltu and Lnc.btc weighting schemes, using a quantifier-based OWA operator.

[3] Abolfazl Aleahmad, Parsia Hakimian, Farzad Mahdikhani and Farhad Oroumchian. N-Gram and Local Context Analysis For Persian Text Retrieval. International Symposium on Signal Processing and Its Applications, Sharjah U.A.E., 2007.
    In this experimental study, we assessed term-based and N-gram-based vector space models and a query expansion method, namely Local Context Analysis, using different weighting schemes on the Hamshahri corpus.

[4] Farhad Oroumchian, Ehsan Darrudi, Fattane Taghiyareh, Neeyaz Angoshtari. Experiments with Persian text compression for web. 13th International World Wide Web conference, New York, NY, USA, 2004.
    The approach presented in this paper aims to reduce the storage and transmission time of Persian text files in web-based applications and on the Internet. A genetic algorithm is used to select the most appropriate n-grams. In the best case, we achieved a 52.26% reduction in file size.

[5] Nayyeri, A. Oroumchian, F. FuFaIR: a Fuzzy Farsi Information Retrieval System. IEEE International Conference on Computer Systems and Applications, Sharjah U.A.E., 2006.
    This paper discusses the design, implementation and testing of a fuzzy retrieval system for Persian called FuFaIR. The system also supports fuzzy quantifiers in its query language. Tests were conducted using a standard Persian test corpus called Hamshahri.

[6] Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, Farhad Oroumchian. Hamshahri: A Standard Persian Text Collection.
    This paper describes the Hamshahri collection and its characteristics. (It is a draft version and may contain some minor errors.)