-
Generating insights about financial asks from Reddit posts and user interactions
Authors:
Sachin Thukral,
Suyash Sangwan,
Vipul Chauhan,
Arnab Chatterjee,
Lipika Dey
Abstract:
As an increasingly large number of people turn to platforms like Reddit, YouTube, Twitter, Instagram, etc. for financial advice, generating insights about the content generated and interactions taking place within these platforms has become a key research question. This study proposes content and interaction analysis techniques for a large repository created from social media content, where people's interactions are centered around financial information exchange. We propose methods for content analysis that can generate human-interpretable insights using topic-centered clustering and multi-document abstractive summarization. We share details of insights generated from our experiments with a large repository of data gathered from a subreddit for personal finance. We have also explored the use of ChatGPT and Vicuna for generating responses to queries and compared them with human responses. The methods proposed in this work are generic and applicable to all large social media platforms.
Submitted 12 March, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Understanding how social discussion platforms like Reddit are influencing financial behavior
Authors:
Sachin Thukral,
Suyash Sangwan,
Arnab Chatterjee,
Lipika Dey,
Aaditya Agrawal,
Pramit Kumar Chandra,
Animesh Mukherjee
Abstract:
This study proposes content and interaction analysis techniques for a large repository created from social media content. Though we have presented our study for a large platform dedicated to discussions around financial topics, the proposed methods are generic and applicable to all platforms. Along with an extension of the topic extraction method using Latent Dirichlet Allocation, we propose a few measures to assess user participation, influence and topic affinities specifically. Our study also maps user-generated content to components of behavioral finance. While these types of information are usually gathered through surveys, large-scale data analysis from social media can reveal many potentially unknown or rare insights. We also study how characterizing users based on their platform behavior, using graphical analysis, can provide critical insights into how communities are formed and trust is established on these platforms.
Submitted 12 March, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
Authors:
C. M. Downey,
Shannon Drizin,
Levon Haroutunian,
Shivin Thukral
Abstract:
We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong results at small target sizes, including a zero-shot performance of 20.6 F1. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).
Submitted 14 March, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Probing Language Models for Understanding of Temporal Expressions
Authors:
Shivin Thukral,
Kunal Kukreja,
Christian Kavouras
Abstract:
We present three Natural Language Inference (NLI) challenge sets that can evaluate NLI models on their understanding of temporal expressions. More specifically, we probe these models for three temporal properties: (a) the order between points in time, (b) the duration between two points in time, (c) the relation between the magnitude of times specified in different units. We find that although large language models fine-tuned on MNLI have some basic perception of the order between points in time, at large, these models do not have a thorough understanding of the relation between temporal expressions.
Submitted 3 October, 2021;
originally announced October 2021.
-
Detection of Malicious Android Applications: Classical Machine Learning vs. Deep Neural Network Integrated with Clustering
Authors:
Hemant Rathore,
Sanjay K. Sahay,
Shivin Thukral,
Mohit Sewak
Abstract:
Today, the anti-malware community is facing challenges due to the ever-increasing sophistication and volume of malware attacks developed by adversaries. Traditional malware detection mechanisms are not able to cope with next-generation malware attacks. Therefore, in this paper, we propose effective and efficient Android malware detection models based on machine learning and deep learning integrated with clustering. We performed a comprehensive study of different feature reduction, classification and clustering algorithms over various performance metrics to construct the Android malware detection models. Our experimental results show that malware detection models developed using Random Forest eclipsed deep neural network and other classifiers on the majority of performance metrics. The baseline Random Forest model without any feature reduction achieved the highest AUC of 99.4%. Also, segregating the vector space using clustering integrated with Random Forest further boosted the AUC to 99.6% in one cluster and enabled direct detection of Android malware in another cluster, thus reducing the curse of dimensionality. Additionally, we found that feature reduction in detection models improves model efficiency (training and testing time) manyfold without much penalty on the effectiveness of the detection model.
Submitted 28 February, 2021;
originally announced March 2021.
-
Identifying pandemic-related stress factors from social-media posts -- effects on students and young-adults
Authors:
Sachin Thukral,
Suyash Sangwan,
Arnab Chatterjee,
Lipika Dey
Abstract:
The COVID-19 pandemic has thrown natural life out of gear across the globe. Strict measures are deployed to curb the spread of the virus that is causing it, and the most effective of them has been social isolation. This has led to widespread gloom and depression across society but more so among the young and the elderly. There are currently more than 200 million college students in 186 countries worldwide affected by the pandemic. The mode of education has changed suddenly, with the rapid adoption of e-learning, whereby teaching is undertaken remotely and on digital platforms. This study presents insights gathered from social media posts that were posted by students and young adults during the COVID times. Using statistical and NLP techniques, we analyzed the behavioral issues reported by users themselves in their posts in depression-related communities on Reddit. We present methodologies to systematically analyze content using linguistic techniques to find out the stress-inducing factors. Online education, losing jobs, isolation from friends, and abusive families emerge as key stress factors.
Submitted 1 December, 2020;
originally announced December 2020.
-
Chitrakar: Robotic System for Drawing Jordan Curve of Facial Portrait
Authors:
Aniruddha Singhal,
Ayush Kumar,
Shivam Thukral,
Deepak Raina,
Swagat Kumar
Abstract:
This paper presents a robotic system (Chitrakar) which autonomously converts any image of a human face to a recognizable non-self-intersecting loop (Jordan Curve) and draws it on any planar surface. The image is processed using Mask R-CNN for instance segmentation, Laplacian of Gaussian (LoG) for feature enhancement and intensity-based probabilistic stippling for the image to points conversion. These points are treated as a destination for a travelling salesman and are connected with an optimal path which is calculated heuristically by minimizing the total distance to be travelled. This path is converted to a Jordan Curve in feasible time by removing intersections using a combination of image processing, 2-opt, and Bresenham's Algorithm. The robotic system generates n instances of each image for human aesthetic judgement, out of which the most appealing instance is selected for the final drawing. The drawing is executed carefully by the robot's arm using trapezoidal velocity profiles for jerk-free and fast motion. The drawing, with a decent resolution, can be completed in less than 30 minutes which is impossible to do by hand. This work demonstrates the use of robotics to augment humans in executing difficult craft-work instead of replacing them altogether.
Submitted 28 June, 2021; v1 submitted 21 November, 2020;
originally announced November 2020.
-
Characterizing behavioral trends in a community driven discussion platform
Authors:
Sachin Thukral,
Arnab Chatterjee,
Hardik Meisheri,
Tushar Kataria,
Aman Agarwal,
Ishan Verma,
Lipika Dey
Abstract:
This article presents a systematic analysis of the patterns of behavior of individuals as well as groups observed in community-driven platforms for discussion like Reddit, where users usually exchange information and viewpoints on their topics of interest. We perform a statistical analysis of the behavior of posts and model the users' interactions around them. A platform like Reddit, which has grown exponentially from a very small community to one of the largest social networks, harbors a variety of user behavior in terms of activity, owing to its large user base and popularity. Our work provides interesting insights about a huge number of inactive posts which fail to attract attention despite their authors exhibiting Cyborg-like behavior to attract attention. We also observe that short-lived yet extremely active posts emulate a phenomenon like Mayfly Buzz. We present a method to study the activity around highly active posts to determine the presence of Limelight hogging activity. We also present a systematic analysis to study the presence of controversies in posts. We analyzed data from two periods of one-year duration, separated by a few years, to understand how social media has evolved through the years.
Submitted 7 November, 2019;
originally announced November 2019.
-
Analyzing behavioral trends in community driven discussion platforms like Reddit
Authors:
Sachin Thukral,
Hardik Meisheri,
Tushar Kataria,
Aman Agarwal,
Ishan Verma,
Arnab Chatterjee,
Lipika Dey
Abstract:
The aim of this paper is to present methods to systematically analyze individual and group behavioral patterns observed in community driven discussion platforms like Reddit where users exchange information and views on various topics of current interest. We conduct this study by analyzing the statistical behavior of posts and modeling user interactions around them. We have chosen Reddit as an example, since it has grown exponentially from a small community to one of the biggest social network platforms in recent times. Due to its large user base and popularity, a variety of behavior is present among users in terms of their activity. Our study provides interesting insights about a large number of inactive posts which fail to gather attention despite their authors exhibiting Cyborg-like behavior to draw attention. We also present interesting insights about short-lived but extremely active posts emulating a phenomenon like Mayfly Buzz. Further, we present methods to find the nature of activity around highly active posts to determine the presence of Limelight hogging activity, if any. We analyzed over 2 million posts and more than 7 million user responses to them during the whole of 2008, and over 63 million posts and over 608 million user responses to them from August 2014 to July 2015, amounting to two one-year periods, in order to understand how the social media space has evolved over the years.
Submitted 19 September, 2018;
originally announced September 2018.