Search | arXiv e-print repository

Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?

Authors: Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Abstract: Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language.
Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school; (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks; (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances; and (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.

Submitted 9 April, 2024; originally announced April 2024.

arXiv:1403.4583 [pdf, other]

doi 10.1109/TIT.2016.2518171

An Achievable rate region for the $3-$user interference channel based on coset codes

Authors: Arun Padakandla, Aria G. Sahebi, S. Sandeep Pradhan

Abstract: We consider the problem of communication over a three user discrete memoryless interference channel ($3-$IC). The currently known coding techniques for communicating over an arbitrary $3-$IC are based on message splitting, superposition coding and binning using independent and identically distributed (iid) random codebooks. In this work, we propose a new ensemble of codes, partitioned coset codes (PCC), that possess an appropriate mix of empirical and algebraic closure properties. We develop coding techniques that exploit the algebraic closure property of PCC to enable efficient communication over the $3-$IC. We analyze the performance of the proposed coding technique to derive an achievable rate region for the general discrete $3-$IC. Additive and non-additive examples are identified for which the derived achievable rate region is the capacity, and moreover, strictly larger than the largest currently known achievable rate regions based on iid random codebooks.

Submitted 12 January, 2015; v1 submitted 18 March, 2014; originally announced March 2014.

Comments: New examples for which coset codes yield strictly larger achievable rate regions in comparison to those achievable using unstructured iid codes are identified. The issue of aligning interference at multiple receiver terminals addressed through an example. Revised submission to IEEE Trans. on Information Theory

Journal ref: 10.1109/TIT.2016.2518171

arXiv:1401.7006 [pdf, other]

Polar Codes for Some Multi-terminal Communications Problems

Authors: Aria G. Sahebi, S. Sandeep Pradhan

Abstract: It is shown that polar coding schemes achieve the known achievable rate regions for several multi-terminal communications problems including lossy distributed source coding, multiple access channels and multiple descriptions coding. The results are valid for arbitrary alphabet sizes (binary or nonbinary) and arbitrary distributions (symmetric or asymmetric).

Submitted 27 April, 2014; v1 submitted 25 January, 2014; originally announced January 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1401.6482

arXiv:1401.6482 [pdf, other]

Nested Polar Codes Achieve the Shannon Rate-Distortion Function and the Shannon Capacity

Authors: Aria G. Sahebi, S. Sandeep Pradhan

Abstract: It is shown that nested polar codes achieve the Shannon rate-distortion function for arbitrary (binary or non-binary) discrete memoryless sources and the Shannon capacity of arbitrary discrete memoryless channels.

Submitted 24 January, 2014; originally announced January 2014.

arXiv:1305.1598 [pdf, ps, other]

Abelian Group Codes for Source Coding and Channel Coding

Authors: Aria G. Sahebi, S. Sandeep Pradhan

Abstract: In this paper, we study the asymptotic performance of Abelian group codes for the lossy source coding problem for arbitrary discrete (finite alphabet) memoryless sources as well as the channel coding problem for arbitrary discrete (finite alphabet) memoryless channels. For the source coding problem, we derive an achievable rate-distortion function that is characterized in a single-letter information-theoretic form using the ensemble of Abelian group codes. When the underlying group is a field, it simplifies to the symmetric rate-distortion function. Similarly, for the channel coding problem, we find an achievable rate characterized in a single-letter information-theoretic form using group codes. This simplifies to the symmetric capacity of the channel when the underlying group is a field. We compute the rate-distortion function and the achievable rate for several examples of sources and channels. Due to the non-symmetric nature of the sources and channels considered, our analysis uses a synergy of information theoretic and group-theoretic tools.

Submitted 7 May, 2013; originally announced May 2013.
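
Note on the abstract above: the symmetric capacity it mentions is, in the standard definition, the mutual information between channel input and output when the input is uniform over the alphabet. The following is a minimal Python sketch of that quantity only, not the paper's group-code analysis; the function name `symmetric_capacity` and the row-stochastic matrix layout `W[x][y] = P(y|x)` are assumptions made for illustration.

```python
import math


def symmetric_capacity(W):
    """Mutual information I(X;Y) in bits for a uniform input distribution.

    W is a list of rows, with W[x][y] = P(y | x) (hypothetical layout).
    """
    q = len(W)            # input alphabet size
    n_out = len(W[0])     # output alphabet size
    # Output distribution induced by the uniform input.
    p_y = [sum(W[x][y] for x in range(q)) / q for y in range(n_out)]
    total = 0.0
    for x in range(q):
        for y in range(n_out):
            if W[x][y] > 0:  # terms with zero probability contribute nothing
                total += (W[x][y] / q) * math.log2(W[x][y] / p_y[y])
    return total


# Binary symmetric channel with crossover probability 0.1.
bsc = [[0.9, 0.1],
       [0.1, 0.9]]
print(symmetric_capacity(bsc))  # about 0.531 bits, i.e. 1 - h(0.1)
```

For a binary symmetric channel the uniform input is also capacity-achieving, so the value coincides with the Shannon capacity; for non-symmetric channels the two generally differ, which is the setting the abstract is concerned with.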

arXiv:1202.0864 [pdf, ps, other]

Nested Lattice Codes for Arbitrary Continuous Sources and Channels

Authors: Aria G. Sahebi, S. Sandeep Pradhan

Abstract: In this paper, we show that nested lattice codes achieve the capacity of arbitrary channels with or without non-causal state information at the transmitter. We also show that nested lattice codes are optimal for source coding with or without non-causal side information at the receiver for arbitrary continuous sources.

Submitted 17 March, 2012; v1 submitted 3 February, 2012; originally announced February 2012.

arXiv:1202.0863 [pdf, other]

Asymptotically Good Codes Over Non-Abelian Groups

Authors: Aria G. Sahebi, S. Sandeep Pradhan

Abstract: It has been shown that good structured codes over non-Abelian groups do exist. Specifically, we construct codes over the smallest non-Abelian group $\mathds{D}_6$ and show that the performance of these codes is superior to the performance of Abelian group codes of the same alphabet size. This promises the possibility of using non-Abelian codes for multi-terminal settings where the structure of the code can be exploited to gain performance.

Submitted 21 February, 2012; v1 submitted 3 February, 2012; originally announced February 2012.

arXiv:1107.1535 [pdf, other]

Multilevel Polarization of Polar Codes Over Arbitrary Discrete Memoryless Channels

Authors: Aria G. Sahebi, S. Sandeep Pradhan

Abstract: It is shown that polar codes achieve the symmetric capacity of discrete memoryless channels with arbitrary input alphabet sizes. It is shown that, in general, channel polarization happens in several, rather than only two, levels so that the synthesized channels are either useless, perfect or "partially perfect". Any subset of the channel input alphabet which is closed under addition induces a coset partition of the alphabet through its shifts. For any such partition of the input alphabet, there exists a corresponding partially perfect channel whose outputs uniquely determine the coset to which the channel input belongs. By a slight modification of the encoding and decoding rules, it is shown that perfect transmission of certain information symbols over partially perfect channels is possible. Our result is general regarding both the cardinality and the algebraic structure of the channel input alphabet; i.e., we show that for any channel input alphabet size and any Abelian group structure on the alphabet, polar codes are optimal. It is also shown through an example that polar codes, when considered as group/coset codes, do not achieve the capacity achievable using coset codes over arbitrary channels.

Submitted 1 June, 2012; v1 submitted 7 July, 2011; originally announced July 2011.
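
Note on the abstract above: the coset partition it describes can be illustrated for the simplest case, where the alphabet is taken to be the cyclic group Z_n under addition mod n. This is a minimal Python sketch under that assumption; the paper covers arbitrary Abelian group structures, and the function name `coset_partition` is hypothetical.

```python
def coset_partition(n, subset):
    """Partition Z_n = {0, ..., n-1} into the shifts (cosets) of `subset`.

    `subset` must be non-empty and closed under addition mod n
    (assumption: the alphabet carries the cyclic group structure Z_n).
    """
    H = frozenset(x % n for x in subset)
    if not H:
        raise ValueError("subset must be non-empty")
    if any((a + b) % n not in H for a in H for b in H):
        raise ValueError("subset is not closed under addition mod n")
    seen, cosets = set(), []
    for shift in range(n):
        if shift in seen:
            continue  # this element already lies in an earlier coset
        coset = sorted((shift + h) % n for h in H)
        cosets.append(coset)
        seen.update(coset)
    return cosets


print(coset_partition(4, {0, 2}))  # [[0, 2], [1, 3]]
```

With a quaternary input alphabet and the subset {0, 2}, a "partially perfect" channel in the abstract's sense would be one whose output reveals whether the input lies in {0, 2} or in {1, 3}, but not which element of that coset was sent.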

arXiv:1102.3243 [pdf, ps, other]

On the Capacity of Abelian Group Codes Over Discrete Memoryless Channels

Authors: Aria Ghasemian Sahebi, S. Sandeep Pradhan

Abstract: For most discrete memoryless channels, there does not exist a linear code for the channel which uses all of the channel's input symbols. Therefore, linearity of the code for such channels is a very restrictive condition and there should be a loosening of the algebraic structure of the code to a degree that the code can admit any channel input alphabet. For any channel input alphabet size, there always exists an Abelian group structure defined on the alphabet. We investigate the capacity of Abelian group codes over discrete memoryless channels and provide lower and upper bounds on the capacity.

Submitted 16 February, 2011; originally announced February 2011.

Showing 1–9 of 9 results for author: Sahebi, A