Search | arXiv e-print repository

Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus

Authors: Jalil Nourmohammadi Khiarak, Ammar Ahmadi, Taher Ak-bari Saeed, Meysam Asgari-Chenaghlu, Toğrul Atabay, Mohammad Reza Baghban Karimi, Ismail Ceferli, Farzad Hasanvand, Seyed Mahboub Mousavi, Morteza Noshad

Abstract: This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus, designed to bridge the technological gap in language learning and machine translation (MT) for under-resourced languages. Consisting of 548,000 parallel sentences and approximately 9 million words per language, this dataset is derived from diverse sources such as news articles and holy texts, aiming to enhance… ▽ More This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus, designed to bridge the technological gap in language learning and machine translation (MT) for under-resourced languages. Consisting of 548,000 parallel sentences and approximately 9 million words per language, this dataset is derived from diverse sources such as news articles and holy texts, aiming to enhance natural language processing (NLP) applications and language education technology. This corpus marks a significant step forward in the realm of linguistic resources, particularly for Turkic languages, which have lagged in the neural machine translation (NMT) revolution. By presenting the first comprehensive case study for the English-Azerbaijani (Arabic Script) language pair, this work underscores the transformative potential of NMT in low-resource contexts. The development and utilization of this corpus not only facilitate the advancement of machine translation systems tailored for specific linguistic needs but also promote inclusive language learning through technology. The findings demonstrate the corpus's effectiveness in training deep learning MT systems and underscore its role as an essential asset for researchers and educators aiming to foster bilingual education and multilingual communication. This research covers the way for future explorations into NMT applications for languages lacking substantial digital resources, thereby enhancing global language education frameworks. The Python package of our code is available at https://pypi.org/project/chevir-kartalol/, and we also have a website accessible at https://translate.kartalol.com/. △ Less

Submitted 6 July, 2024; originally announced July 2024.

Comments: This paper is accepted and published at NeTTT 2024 Conf

arXiv:2305.17784 [pdf, other]

ConvGenVisMo: Evaluation of Conversational Generative Vision Models

Authors: Narjes Nikzad Khasmakhi, Meysam Asgari-Chenaghlu, Nabiha Asghar, Philipp Schaer, Dietlind Zühlke

Abstract: Conversational generative vision models (CGVMs) like Visual ChatGPT (Wu et al., 2023) have recently emerged from the synthesis of computer vision and natural language processing techniques. These models enable more natural and interactive communication between humans and machines, because they can understand verbal inputs from users and generate responses in natural language along with visual outp… ▽ More Conversational generative vision models (CGVMs) like Visual ChatGPT (Wu et al., 2023) have recently emerged from the synthesis of computer vision and natural language processing techniques. These models enable more natural and interactive communication between humans and machines, because they can understand verbal inputs from users and generate responses in natural language along with visual outputs. To make informed decisions about the usage and deployment of these models, it is important to analyze their performance through a suitable evaluation framework on realistic datasets. In this paper, we present ConvGenVisMo, a framework for the novel task of evaluating CGVMs. ConvGenVisMo introduces a new benchmark evaluation dataset for this task, and also provides a suite of existing and new automated evaluation metrics to evaluate the outputs. All ConvGenVisMo assets, including the dataset and the evaluation code, will be made available publicly on GitHub. △ Less

Submitted 28 May, 2023; originally announced May 2023.

arXiv:2110.01186 [pdf, other]

doi 10.1007/s42001-022-00178-4

Text-based automatic personality prediction: A bibliographic review

Authors: Ali-Reza Feizi-Derakhshi, Mohammad-Reza Feizi-Derakhshi, Majid Ramezani, Narjes Nikzad-Khasmakhi, Meysam Asgari-Chenaghlu, Taymaz Akan, Mehrdad Ranjbar-Khadivi, Elnaz Zafarni-Moattar, Zoleikha Jahanbakhsh-Naghadeh

Abstract: Personality detection is an old topic in psychology and Automatic Personality Prediction (or Perception) (APP) is the automated (computationally) forecasting of the personality on different types of human generated/exchanged contents (such as text, speech, image, video). The principal objective of this study is to offer a shallow (overall) review of natural language processing approaches on APP si… ▽ More Personality detection is an old topic in psychology and Automatic Personality Prediction (or Perception) (APP) is the automated (computationally) forecasting of the personality on different types of human generated/exchanged contents (such as text, speech, image, video). The principal objective of this study is to offer a shallow (overall) review of natural language processing approaches on APP since 2010. With the advent of deep learning and following it transfer-learning and pre-trained model in NLP, APP research area has been a hot topic, so in this review, methods are categorized into three; pre-trained independent, pre-trained model based, multimodal approaches. Also, to achieve a comprehensive comparison, reported results are informed by datasets. △ Less

Submitted 5 September, 2022; v1 submitted 4 October, 2021; originally announced October 2021.

Comments: This is a preprint of an article published in "Journal of Computational Social Science". The final authenticated version is available online at: https://doi.org/10.1007/s42001-022-00178-4

arXiv:2106.04939 [pdf, other]

Phraseformer: Multimodal Key-phrase Extraction using Transformer and Graph Embedding

Authors: Narjes Nikzad-Khasmakhi, Mohammad-Reza Feizi-Derakhshi, Meysam Asgari-Chenaghlu, Mohammad-Ali Balafar, Ali-Reza Feizi-Derakhshi, Taymaz Rahkar-Farshi, Majid Ramezani, Zoleikha Jahanbakhsh-Nagadeh, Elnaz Zafarani-Moattar, Mehrdad Ranjbar-Khadivi

Abstract: Background: Keyword extraction is a popular research topic in the field of natural language processing. Keywords are terms that describe the most relevant information in a document. The main problem that researchers are facing is how to efficiently and accurately extract the core keywords from a document. However, previous keyword extraction approaches have utilized the text and graph features, th… ▽ More Background: Keyword extraction is a popular research topic in the field of natural language processing. Keywords are terms that describe the most relevant information in a document. The main problem that researchers are facing is how to efficiently and accurately extract the core keywords from a document. However, previous keyword extraction approaches have utilized the text and graph features, there is the lack of models that can properly learn and combine these features in a best way. Methods: In this paper, we develop a multimodal Key-phrase extraction approach, namely Phraseformer, using transformer and graph embedding techniques. In Phraseformer, each keyword candidate is presented by a vector which is the concatenation of the text and structure learning representations. Phraseformer takes the advantages of recent researches such as BERT and ExEm to preserve both representations. Also, the Phraseformer treats the key-phrase extraction task as a sequence labeling problem solved using classification task. Results: We analyze the performance of Phraseformer on three datasets including Inspec, SemEval2010 and SemEval 2017 by F1-score. Also, we investigate the performance of different classifiers on Phraseformer method over Inspec dataset. Experimental results demonstrate the effectiveness of Phraseformer method over the three datasets used. Additionally, the Random Forest classifier gain the highest F1-score among all classifiers. Conclusions: Due to the fact that the combination of BERT and ExEm is more meaningful and can better represent the semantic of words. Hence, Phraseformer significantly outperforms single-modality methods. △ Less

Submitted 9 June, 2021; originally announced June 2021.

arXiv:2105.01331 [pdf, other]

BLM-17m: A Large-Scale Dataset for Black Lives Matter Topic Detection on Twitter

Authors: Hasan Kemik, Nusret Özateş, Meysam Asgari-Chenaghlu, Yang Li, Erik Cambria

Abstract: Protection of human rights is one of the most important problems of our world. In this paper, our aim is to provide a dataset which covers one of the most significant human rights contradiction in recent months affected the whole world, George Floyd incident. We propose a labeled dataset for topic detection that contains 17 million tweets. These Tweets are collected from 25 May 2020 to 21 August 2… ▽ More Protection of human rights is one of the most important problems of our world. In this paper, our aim is to provide a dataset which covers one of the most significant human rights contradiction in recent months affected the whole world, George Floyd incident. We propose a labeled dataset for topic detection that contains 17 million tweets. These Tweets are collected from 25 May 2020 to 21 August 2020 that covers 89 days from start of this incident. We labeled the dataset by monitoring most trending news topics from global and local newspapers. Apart from that, we present two baselines, TF-IDF and LDA. We evaluated the results of these two methods with three different k values for metrics of precision, recall and f1-score. The collected dataset is available at https://github.com/MeysamAsgariC/BLMT. △ Less

Submitted 17 October, 2023; v1 submitted 4 May, 2021; originally announced May 2021.

arXiv:2010.05349 [pdf, other]

doi 10.1007/s40745-022-00426-4

ComStreamClust: a communicative multi-agent approach to text clustering in streaming data

Authors: Ali Najafi, Araz Gholipour-Shilabin, Rahim Dehkharghani, Ali Mohammadpur-Fard, Meysam Asgari-Chenaghlu

Abstract: Topic detection is the task of determining and tracking hot topics in social media. Twitter is arguably the most popular platform for people to share their ideas with others about different issues. One such prevalent issue is the COVID-19 pandemic. Detecting and tracking topics on these kinds of issues would help governments and healthcare companies deal with this phenomenon. In this paper, we pro… ▽ More Topic detection is the task of determining and tracking hot topics in social media. Twitter is arguably the most popular platform for people to share their ideas with others about different issues. One such prevalent issue is the COVID-19 pandemic. Detecting and tracking topics on these kinds of issues would help governments and healthcare companies deal with this phenomenon. In this paper, we propose a novel, multi-agent, communicative clustering approach, so-called ComStreamClust for clustering sub-topics inside a broader topic, e.g., COVID-19. The proposed approach is parallelizable, and can simultaneously handle several data-point. The LaBSE sentence embedding is used to measure the semantic similarity between two tweets. ComStreamClust has been evaluated on two datasets: the COVID-19 and the FA CUP. The results obtained from ComStreamClust approve the effectiveness of the proposed approach when compared to existing methods. △ Less

Submitted 27 April, 2021; v1 submitted 11 October, 2020; originally announced October 2020.

Comments: 11 pages, 6 Figures, 4 Tables

arXiv:2009.03947 [pdf, other]

Covid-Transformer: Detecting COVID-19 Trending Topics on Twitter Using Universal Sentence Encoder

Authors: Meysam Asgari-Chenaghlu, Narjes Nikzad-Khasmakhi, Shervin Minaee

Abstract: The novel corona-virus disease (also known as COVID-19) has led to a pandemic, impacting more than 200 countries across the globe. With its global impact, COVID-19 has become a major concern of people almost everywhere, and therefore there are a large number of tweets coming out from every corner of the world, about COVID-19 related topics. In this work, we try to analyze the tweets and detect the… ▽ More The novel corona-virus disease (also known as COVID-19) has led to a pandemic, impacting more than 200 countries across the globe. With its global impact, COVID-19 has become a major concern of people almost everywhere, and therefore there are a large number of tweets coming out from every corner of the world, about COVID-19 related topics. In this work, we try to analyze the tweets and detect the trending topics and major concerns of people on Twitter, which can enable us to better understand the situation, and devise better planning. More specifically we propose a model based on the universal sentence encoder to detect the main topics of Tweets in recent months. We used universal sentence encoder in order to derive the semantic representation and the similarity of tweets. We then used the sentence similarity and their embeddings, and feed them to K-means clustering algorithm to group similar tweets (in semantic sense). After that, the cluster summary is obtained using a text summarization algorithm based on deep learning, which can uncover the underlying topics of each cluster. Through experimental results, we show that our model can detect very informative topics, by processing a large number of tweets on sentence level (which can preserve the overall meaning of the tweets). Since this framework has no restriction on specific data distribution, it can be used to detect trending topics from any other social media and any other context rather than COVID-19. Experimental results show superiority of our proposed approach to other baselines, including TF-IDF, and latent Dirichlet allocation (LDA). △ Less

Submitted 19 September, 2020; v1 submitted 8 September, 2020; originally announced September 2020.

arXiv:2008.06877 [pdf, other]

doi 10.1016/j.chaos.2021.111274

TopicBERT: A Transformer transfer learning based memory-graph approach for multimodal streaming social media topic detection

Authors: Meysam Asgari-Chenaghlu, Mohammad-Reza Feizi-Derakhshi, Leili farzinvash, Mohammad-Ali Balafar, Cina Motamed

Abstract: Real time nature of social networks with bursty short messages and their respective large data scale spread among vast variety of topics are research interest of many researchers. These properties of social networks which are known as 5'Vs of big data has led to many unique and enlightenment algorithms and techniques applied to large social networking datasets and data streams. Many of these resea… ▽ More Real time nature of social networks with bursty short messages and their respective large data scale spread among vast variety of topics are research interest of many researchers. These properties of social networks which are known as 5'Vs of big data has led to many unique and enlightenment algorithms and techniques applied to large social networking datasets and data streams. Many of these researches are based on detection and tracking of hot topics and trending social media events that help revealing many unanswered questions. These algorithms and in some cases software products mostly rely on the nature of the language itself. Although, other techniques such as unsupervised data mining methods are language independent but many requirements for a comprehensive solution are not met. Many research issues such as noisy sentences that adverse grammar and new online user invented words are challenging maintenance of a good social network topic detection and tracking methodology; The semantic relationship between words and in most cases, synonyms are also ignored by many of these researches. In this research, we use Transformers combined with an incremental community detection algorithm. Transformer in one hand, provides the semantic relation between words in different contexts. On the other hand, the proposed graph mining technique enhances the resulting topics with aid of simple structural rules. Named entity recognition from multimodal data, image and text, labels the named entities with entity type and the extracted topics are tuned using them. All operations of proposed system has been applied with big social data perspective under NoSQL technologies. In order to present a working and systematic solution, we combined MongoDB with Neo4j as two major database systems of our work. The proposed system shows higher precision and recall compared to other methods in three different datasets. △ Less

Submitted 16 August, 2020; originally announced August 2020.

arXiv:2007.05056 [pdf, other]

doi 10.1007/s40745-021-00326-z

Multimodal price prediction

Authors: Aidin Zehtab-Salmasi, Ali-Reza Feizi-Derakhshi, Narjes Nikzad-Khasmakhi, Meysam Asgari-Chenaghlu, Saeideh Nabipour

Abstract: Price prediction is one of the examples related to forecasting tasks and is a project based on data science. Price prediction analyzes data and predicts the cost of new products. The goal of this research is to achieve an arrangement to predict the price of a cellphone based on its specifications. So, five deep learning models are proposed to predict the price range of a cellphone, one unimodal an… ▽ More Price prediction is one of the examples related to forecasting tasks and is a project based on data science. Price prediction analyzes data and predicts the cost of new products. The goal of this research is to achieve an arrangement to predict the price of a cellphone based on its specifications. So, five deep learning models are proposed to predict the price range of a cellphone, one unimodal and four multimodal approaches. The multimodal methods predict the prices based on the graphical and non-graphical features of cellphones that have an important effect on their valorizations. Also, to evaluate the efficiency of the proposed methods, a cellphone dataset has been gathered from GSMArena. The experimental results show 88.3% F1-score, which confirms that multimodal learning leads to more accurate predictions than state-of-the-art techniques. △ Less

Submitted 2 April, 2021; v1 submitted 9 July, 2020; originally announced July 2020.

Comments: This is a preprint of an article published in "Annals of Data Science". The final authenticated version is available online at: https://link.springer.com/article/10.1007/s40745-021-00326-z

arXiv:2007.04571 [pdf, other]

doi 10.1007/s00521-022-07444-6

Automatic Personality Prediction; an Enhanced Method Using Ensemble Modeling

Authors: Majid Ramezani, Mohammad-Reza Feizi-Derakhshi, Mohammad-Ali Balafar, Meysam Asgari-Chenaghlu, Ali-Reza Feizi-Derakhshi, Narjes Nikzad-Khasmakhi, Mehrdad Ranjbar-Khadivi, Zoleikha Jahanbakhsh-Nagadeh, Elnaz Zafarani-Moattar, Taymaz Rahkar-Farshi

Abstract: Human personality is significantly represented by those words which he/she uses in his/her speech or writing. As a consequence of spreading the information infrastructures (specifically the Internet and social media), human communications have reformed notably from face to face communication. Generally, Automatic Personality Prediction (or Perception) (APP) is the automated forecasting of the pers… ▽ More Human personality is significantly represented by those words which he/she uses in his/her speech or writing. As a consequence of spreading the information infrastructures (specifically the Internet and social media), human communications have reformed notably from face to face communication. Generally, Automatic Personality Prediction (or Perception) (APP) is the automated forecasting of the personality on different types of human generated/exchanged contents (like text, speech, image, video, etc.). The major objective of this study is to enhance the accuracy of APP from the text. To this end, we suggest five new APP methods including term frequency vector-based, ontology-based, enriched ontology-based, latent semantic analysis (LSA)-based, and deep learning-based (BiLSTM) methods. These methods as the base ones, contribute to each other to enhance the APP accuracy through ensemble modeling (stacking) based on a hierarchical attention network (HAN) as the meta-model. The results show that ensemble modeling enhances the accuracy of APP. △ Less

Submitted 8 June, 2022; v1 submitted 9 July, 2020; originally announced July 2020.

Comments: This is a preprint of an article published in "Neural Computing and Applications"

arXiv:2002.07563 [pdf, other]

doi 10.1007/s12652-022-04034-1

A Model to Measure the Spread Power of Rumors

Authors: Zoleikha Jahanbakhsh-Nagadeh, Mohammad-Reza Feizi-Derakhshi, Majid Ramezani, Taymaz Akan, Meysam Asgari-Chenaghlu, Narjes Nikzad-Khasmakhi, Ali-Reza Feizi-Derakhshi, Mehrdad Ranjbar-Khadivi, Elnaz Zafarani-Moattar, Mohammad-Ali Balafar

Abstract: With technologies that have democratized the production and reproduction of information, a significant portion of daily interacted posts in social media has been infected by rumors. Despite the extensive research on rumor detection and verification, so far, the problem of calculating the spread power of rumors has not been considered. To address this research gap, the present study seeks a model t… ▽ More With technologies that have democratized the production and reproduction of information, a significant portion of daily interacted posts in social media has been infected by rumors. Despite the extensive research on rumor detection and verification, so far, the problem of calculating the spread power of rumors has not been considered. To address this research gap, the present study seeks a model to calculate the Spread Power of Rumor (SPR) as the function of content-based features in two categories: False Rumor (FR) and True Rumor (TR). For this purpose, the theory of Allport and Postman will be adopted, which it claims that importance and ambiguity are the key variables in rumor-mongering and the power of rumor. Totally 42 content features in two categories "importance" (28 features) and "ambiguity" (14 features) are introduced to compute SPR. The proposed model is evaluated on two datasets, Twitter and Telegram. The results showed that (i) the spread power of False Rumor documents is rarely more than True Rumors. (ii) there is a significant difference between the SPR means of two groups False Rumor and True Rumor. (iii) SPR as a criterion can have a positive impact on distinguishing False Rumors and True Rumors. △ Less

Submitted 17 June, 2022; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: This is a preprint of an article published in "Journal of Ambient Intelligence and Humanized Computing"

arXiv:2001.06888 [pdf, other]

doi 10.1007/s00521-021-06488-4

A multimodal deep learning approach for named entity recognition from social media

Authors: Meysam Asgari-Chenaghlu, M. Reza Feizi-Derakhshi, Leili Farzinvash, M. A. Balafar, Cina Motamed

Abstract: Named Entity Recognition (NER) from social media posts is a challenging task. User generated content that forms the nature of social media, is noisy and contains grammatical and linguistic errors. This noisy content makes it much harder for tasks such as named entity recognition. We propose two novel deep learning approaches utilizing multimodal deep learning and Transformers. Both of our approach… ▽ More Named Entity Recognition (NER) from social media posts is a challenging task. User generated content that forms the nature of social media, is noisy and contains grammatical and linguistic errors. This noisy content makes it much harder for tasks such as named entity recognition. We propose two novel deep learning approaches utilizing multimodal deep learning and Transformers. Both of our approaches use image features from short social media posts to provide better results on the NER task. On the first approach, we extract image features using InceptionV3 and use fusion to combine textual and image features. This presents more reliable name entity recognition when the images related to the entities are provided by the user. On the second approach, we use image features combined with text and feed it into a BERT like Transformer. The experimental results, namely, the precision, recall and F1 score metrics show the superiority of our work compared to other state-of-the-art NER solutions. △ Less

Submitted 12 July, 2020; v1 submitted 19 January, 2020; originally announced January 2020.

Showing 1–12 of 12 results for author: Asgari-Chenaghlu, M