Showing 1–21 of 21 results for author: Luu, S T

  1. arXiv:2405.13292  [pdf, other]

    cs.CL

    Metadata Integration for Spam Reviews Detection on Vietnamese E-commerce Websites

    Authors: Co Van Dinh, Son T. Luu

    Abstract: The problem of detecting spam reviews (opinions) has received significant attention in recent years, especially with the rapid development of e-commerce. Spam reviews are often classified based on comment content, but in some cases, it is insufficient for models to accurately determine the review label. In this work, we introduce the ViSpamReviews v2 dataset, which includes metadata of reviews wit…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: Accepted for publication in International Journal of Asian Language Processing (IJALP)

  2. arXiv:2404.19252  [pdf, other]

    cs.CL

    Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts

    Authors: Cuong Nhat Vo, Khanh Bao Huynh, Son T. Luu, Trong-Hop Do

    Abstract: The growth of social networks makes toxic content spread rapidly. Hate speech detection is a task to help decrease the number of harmful comments. With the diversity in the hate speech created by users, it is necessary to interpret the hate speech besides detecting it. Hence, we propose a methodology to construct a system for targeted hate speech detection from online streaming texts from social m…

    Submitted 30 April, 2024; originally announced April 2024.

  3. arXiv:2403.03435  [pdf, ps, other]

    cs.CL

    VLSP 2023 -- LTER: A Summary of the Challenge on Legal Textual Entailment Recognition

    Authors: Vu Tran, Ha-Thanh Nguyen, Trung Vo, Son T. Luu, Hoang-Anh Dang, Ngoc-Cam Le, Thi-Thuy Le, Minh-Tien Nguyen, Truong-Son Nguyen, Le-Minh Nguyen

    Abstract: In this new era of rapid AI development, especially in language processing, the demand for AI in the legal domain is increasingly critical. In the context where research in other languages such as English, Japanese, and Chinese has been well-established, we introduce the first fundamental research for the Vietnamese language in the legal domain: legal textual entailment recognition through the Vie…

    Submitted 5 March, 2024; originally announced March 2024.

  4. arXiv:2402.02655  [pdf, other]

    cs.CL

    VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

    Authors: Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or te…

    Submitted 6 April, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: To appear as the main conference paper at EACL 2024

  5. A Text-based Approach For Link Prediction on Wikipedia Articles

    Authors: Anh Hoang Tran, Tam Minh Nguyen, Son T. Luu

    Abstract: This paper present our work in the DSAA 2023 Challenge about Link Prediction for Wikipedia Articles. We use traditional machine learning models with POS tags (part-of-speech tags) features extracted from text to train the classification model for predicting whether two nodes has the link. Then, we use these tags to test on various machine learning models. We obtained the results by F1 score at 0.9…

    Submitted 6 November, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

    Comments: Accepted by DSAA 2023 Conference in the DSAA Student Competition Section

  6. arXiv:2303.18162  [pdf, other]

    cs.CL

    A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education

    Authors: Son T. Luu, Khoi Trong Hoang, Tuong Quang Pham, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension has been an interesting and challenging task in recent years, with the purpose of extracting useful information from texts. To attain the computer ability to understand the reading text and answer relevant information, we introduce ViMMRC 2.0 - an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese Textbooks which conta…

    Submitted 31 March, 2023; originally announced March 2023.

  7. Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering

    Authors: Triet Minh Thai, Son T. Luu

    Abstract: Visual Question Answering (VQA) is a task that requires computers to give correct answers for the input questions based on the images. This task can be solved by humans with ease but is a challenge for computers. The VLSP2022-EVJVQA shared task carries the Visual Question Answering task in the multilingual domain on a newly released dataset: UIT-EVJVQA, in which the questions and answers are writt…

    Submitted 3 September, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

    Comments: VLSP2022-EVJVQA

  8. Improving Sentiment Analysis By Emotion Lexicon Approach on Vietnamese Texts

    Authors: An Long Doan, Son T. Luu

    Abstract: The sentiment analysis task has various applications in practice. In the sentiment analysis task, words and phrases that represent positive and negative emotions are important. Finding out the words that represent the emotion from the text can improve the performance of the classification models for the sentiment analysis task. In this paper, we propose a methodology that combines the emotion lexi…

    Submitted 3 December, 2022; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: Published at the International Conference on Asian Language Processing (IALP 2022)

  9. arXiv:2209.06668  [pdf, other]

    cs.CL

    UIT-ViCoV19QA: A Dataset for COVID-19 Community-based Question Answering on Vietnamese Language

    Authors: Triet Minh Thai, Ngan Ha-Thao Chu, Anh Tuan Vo, Son T. Luu

    Abstract: For the last two years, from 2020 to 2021, COVID-19 has broken disease prevention measures in many countries, including Vietnam, and negatively impacted various aspects of human life and the social community. Besides, the misleading information in the community and fake news about the pandemic are also serious situations. Therefore, we present the first Vietnamese community-based question answerin…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: Accepted as poster paper at The 36th annual Meeting of Pacific Asia Conference on Language, Information and Computation (PACLIC 36). The dataset and code are available at https://github.com/minhtriet2397/UIT-ViCoV19QA

  10. Detecting Spam Reviews on Vietnamese E-commerce Websites

    Authors: Co Van Dinh, Son T. Luu, Anh Gia-Tuan Nguyen

    Abstract: The reviews of customers play an essential role in online shopping. People often refer to reviews or comments of previous customers to decide whether to buy a new product. Catching up with this behavior, some people create untruths and illegitimate reviews to hoax customers about the fake quality of products. These are called spam reviews, confusing consumers on online shopping platforms and negat…

    Submitted 8 December, 2022; v1 submitted 27 July, 2022; originally announced July 2022.

    Comments: Published at The 14th Asian Conference on Intelligent Information and Database Systems (ACIIDS 2022). The dataset is available at https://github.com/sonlam1102/vispamdetection

  11. VLSP 2021 - ViMRC Challenge: Vietnamese Machine Reading Comprehension

    Authors: Kiet Van Nguyen, Son Quoc Tran, Luan Thanh Nguyen, Tin Van Huynh, Son T. Luu, Ngan Luu-Thuy Nguyen

    Abstract: One of the emerging research trends in natural language understanding is machine reading comprehension (MRC) which is the task to find answers to human questions based on textual data. Existing Vietnamese datasets for MRC research concentrate solely on answerable questions. However, in reality, questions can be unanswerable for which the correct answer is not stated in the given textual data. To a…

    Submitted 4 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: The 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021)

  12. Predicting Job Titles from Job Descriptions with Multi-label Text Classification

    Authors: Hieu Trung Tran, Hanh Hong Phuc Vo, Son T. Luu

    Abstract: Finding a suitable job and hunting for eligible candidates are important to job seeking and human resource agencies. With the vast information about job descriptions, employees and employers need assistance to automatically detect job titles based on job description texts. In this paper, we propose the multi-label classification approach for predicting relevant job titles from job description text…

    Submitted 9 February, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

    Comments: Published in the 2021 NAFOSTED Conference on Information and Computer Science (NICS 2021)

  13. Automatically Detecting Cyberbullying Comments on Online Game Forums

    Authors: Hanh Hong-Phuc Vo, Hieu Trung Tran, Son T. Luu

    Abstract: Online game forums are popular to most of game players. They use it to communicate and discuss the strategy of the game, or even to make friends. However, game forums also contain abusive and harassment speech, disturbing and threatening players. Therefore, it is necessary to automatically detect and remove cyberbullying comments to keep the game forum clean and friendly. We use the Cyberbullying…

    Submitted 26 December, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

    Comments: Published in the 2021 RIVF International Conference on Computing and Communication Technologies (RIVF)

  14. Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts

    Authors: Son T. Luu, Mao Nguyen Bui, Loi Duc Nguyen, Khiem Vinh Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension (MRC) is a sub-field in natural language processing that aims to assist computers understand unstructured texts and then answer questions related to them. In practice, the conversation is an essential way to communicate and transfer information. To help machines understand conversation texts, we present UIT-ViCoQA, a new corpus for conversational machine reading compr…

    Submitted 30 September, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Published at The 13th International Conference on Computational Collective Intelligence (ICCCI 2021)

  15. UIT-ISE-NLP at SemEval-2021 Task 5: Toxic Spans Detection with BiLSTM-CRF and ToxicBERT Comment Classification

    Authors: Son T. Luu, Ngan Luu-Thuy Nguyen

    Abstract: We present our works on SemEval-2021 Task 5 about Toxic Spans Detection. This task aims to build a model for identifying toxic words in whole posts. We use the BiLSTM-CRF model combining with ToxicBERT Classification to train the detection model for identifying toxic words in posts. Our model achieves 62.23% by F1-score on the Toxic Spans Detection task.

    Submitted 29 July, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

    Comments: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

  16. A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts

    Authors: Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In recent years, Vietnam witnesses the mass development of social network users on different social platforms such as Facebook, Youtube, Instagram, and Tiktok. On social medias, hate speech has become a critical problem for social network users. To solve this problem, we introduce the ViHSD - a human-annotated dataset for automatically detecting hate speech on the social network. This dataset cont…

    Submitted 20 July, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

    Comments: IEA/AIE 2021: Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices, pp 415-426

  17. arXiv:2009.12319  [pdf, other]

    cs.CL

    Empirical Study of Text Augmentation on Social Media Text in Vietnamese

    Authors: Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In the text classification problem, the imbalance of labels in datasets affect the performance of the text-classification models. Practically, the data about user comments on social networking sites not altogether appeared - the administrators often only allow positive comments and hide negative comments. Thus, when collecting the data about user comments on the social network, the data is usually…

    Submitted 9 October, 2020; v1 submitted 25 September, 2020; originally announced September 2020.

    Comments: Accepted by The 34th Pacific Asia Conference on Language, Information and Computation

  18. BANANA at WNUT-2020 Task 2: Identifying COVID-19 Information on Twitter by Combining Deep Learning and Transfer Learning Models

    Authors: Tin Van Huynh, Luan Thanh Nguyen, Son T. Luu

    Abstract: The outbreak COVID-19 virus caused a significant impact on the health of people all over the world. Therefore, it is essential to have a piece of constant and accurate information about the disease with everyone. This paper describes our prediction system for WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. The dataset for this task contains size 10,000 tweets in English la…

    Submitted 1 April, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

    Comments: Submitted to 2020 The 6th Workshop on Noisy User-generated Text (W-NUT)

  19. An Experimental Study of Deep Neural Network Models for Vietnamese Multiple-Choice Reading Comprehension

    Authors: Son T. Luu, Kiet Van Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension (MRC) is a challenging task in natural language processing that makes computers understanding natural language texts and answer questions based on those texts. There are many techniques for solving this problems, and word representation is a very important technique that impact most to the accuracy of machine reading comprehension problem in the popular languages like…

    Submitted 18 February, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: Published in the 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE)

  20. Comparison Between Traditional Machine Learning Models And Neural Network Models For Vietnamese Hate Speech Detection

    Authors: Son T. Luu, Hung P. Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Hate-speech detection on social network language has become one of the main researching fields recently due to the spreading of social networks like Facebook and Twitter. In Vietnam, the threat of offensive and harassment cause bad impacts for online user. The VLSP - Shared task about Hate Speech Detection on social networks showed many proposed approaches for detecting whatever comment is clean o…

    Submitted 27 September, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

    Comments: Published in The 2020 RIVF International Conference on Computing and Communication Technologies (RIVF)

  21. Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension

    Authors: Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Although Vietnamese is the 17th most popular native-speaker language in the world, there are not many research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is because of the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset which consists of 2,783 pairs of m…

    Submitted 1 November, 2020; v1 submitted 16 January, 2020; originally announced January 2020.

    Journal ref: IEEE Access, 2020