-
SimFair: Physics-Guided Fairness-Aware Learning with Simulation Models
Authors:
Zhihao Wang,
Yiqun Xie,
Zhili Li,
Xiaowei Jia,
Zhe Jiang,
Aolin Jia,
Shuo Xu
Abstract:
Fairness-awareness has emerged as an essential building block for the responsible use of artificial intelligence in real applications. In many cases, inequity in performance is due to the change in distribution over different regions. While techniques have been developed to improve the transferability of fairness, a solution to the problem is not always feasible with no samples from the new regions, which is a bottleneck for pure data-driven attempts. Fortunately, physics-based mechanistic models have been studied for many problems with major social impacts. We propose SimFair, a physics-guided fairness-aware learning framework, which bridges the data limitation by integrating physical-rule-based simulation and inverse modeling into the training design. Using temperature prediction as an example, we demonstrate the effectiveness of the proposed SimFair in fairness preservation.
Submitted 5 February, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
Cross-Inlining Binary Function Similarity Detection
Authors:
Ang Jia,
Ming Fan,
Xi Xu,
Wuxia Jin,
Haijun Wang,
Ting Liu
Abstract:
Binary function similarity detection plays an important role in a wide range of security applications. Existing works usually assume that the query function and target function share equal semantics and compare their full semantics to obtain the similarity. However, we find that the function mapping is more complex, especially when function inlining happens.
In this paper, we will systematically investigate cross-inlining binary function similarity detection. We first construct a cross-inlining dataset by compiling 51 projects using 9 compilers, with 4 optimizations, to 6 architectures, with 2 inlining flags, which results in two datasets both with 216 combinations. Then we construct the cross-inlining function mappings by linking the common source functions in these two datasets. Through analysis of this dataset, we find that three cross-inlining patterns widely exist while existing work suffers when detecting cross-inlining binary function similarity. Next, we propose a pattern-based model named CI-Detector for cross-inlining matching. CI-Detector uses the attributed CFG to represent the semantics of binary functions and GNN to embed binary functions into vectors. CI-Detector respectively trains a model for these three cross-inlining patterns. Finally, the testing pairs are input to these three models and all the produced similarities are aggregated to produce the final similarity. We conduct several experiments to evaluate CI-Detector. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
Submitted 11 January, 2024;
originally announced January 2024.
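The combination count quoted in the CI-Detector abstract above can be checked with a short sketch. The compiler, optimization, architecture, and flag labels below are hypothetical placeholders (the abstract gives only the counts), and the code assumes each of the two datasets is the full cross product of the other three axes for one inlining flag.

```python
from itertools import product

# Placeholder labels: the abstract states only the counts (9, 4, 6, 2).
compilers = [f"compiler_{i}" for i in range(9)]
optimizations = [f"opt_{i}" for i in range(4)]
architectures = [f"arch_{i}" for i in range(6)]
inlining_flags = ["inlining_on", "inlining_off"]  # hypothetical flag names

# Assumption: one dataset per inlining flag, each covering every
# (compiler, optimization, architecture) combination.
datasets = {
    flag: list(product(compilers, optimizations, architectures))
    for flag in inlining_flags
}

for flag, combos in datasets.items():
    print(flag, len(combos))
```

Under that assumption, 9 x 4 x 6 gives the 216 combinations per dataset that the abstract reports.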
-
Swarm-GPT: Combining Large Language Models with Safe Motion Planning for Robot Choreography Design
Authors:
Aoran Jiao,
Tanmay P. Patel,
Sanjmi Khurana,
Anna-Mariya Korol,
Lukas Brunke,
Vivek K. Adajania,
Utku Culha,
Siqi Zhou,
Angela P. Schoellig
Abstract:
This paper presents Swarm-GPT, a system that integrates large language models (LLMs) with safe swarm motion planning - offering an automated and novel approach to deployable drone swarm choreography. Swarm-GPT enables users to automatically generate synchronized drone performances through natural language instructions. With an emphasis on safety and creativity, Swarm-GPT addresses a critical gap in the field of drone choreography by integrating the creative power of generative models with the effectiveness and safety of model-based planning algorithms. This goal is achieved by prompting the LLM to generate a unique set of waypoints based on extracted audio data. A trajectory planner processes these waypoints to guarantee collision-free and feasible motion. Results can be viewed in simulation prior to execution and modified through dynamic re-prompting. Sim-to-real transfer experiments demonstrate Swarm-GPT's ability to accurately replicate simulated drone trajectories, with a mean sim-to-real root mean square error (RMSE) of 28.7 mm. To date, Swarm-GPT has been successfully showcased at three live events, exemplifying safe real-world deployment of pre-trained models.
Submitted 2 December, 2023;
originally announced December 2023.
-
Generic Attention-model Explainability by Weighted Relevance Accumulation
Authors:
Yiming Huang,
Aozhe Jia,
Xiaodan Zhang,
Jiawei Zhang
Abstract:
Attention-based transformer models have achieved remarkable progress in multi-modal tasks, such as visual question answering. The explainability of attention-based methods has recently attracted wide interest as it can explain the inner changes of attention tokens by accumulating relevancy across attention layers. Current methods simply update relevancy by equally accumulating the token relevancy before and after the attention processes. However, the importance of token values is usually different during relevance accumulation. In this paper, we propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce distortion when equally accumulating relevance. To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks through CLIP encoder and a following mapper. CLIPmapper consists of self-attention, cross-attention, single-modality, and cross-modality attention, thus it is more suitable for evaluating our generic explainability method. Extensive perturbation tests on visual question answering and image captioning validate that our explainability method outperforms existing methods.
Submitted 20 August, 2023;
originally announced August 2023.
-
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
Authors:
Aojun Zhou,
Ke Wang,
Zimu Lu,
Weikang Shi,
Sichun Luo,
Zipeng Qin,
Shaoqing Lu,
Anya Jia,
Linqi Song,
Mingjie Zhan,
Hongsheng Li
Abstract:
Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on MATH dataset (53.9% → 84.3%).
Submitted 15 August, 2023;
originally announced August 2023.
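The CSV abstract above notes that verification states can act as confidence signals in majority voting. A minimal sketch of that idea follows; the "Uncertain" state, the weight values, and the function names are illustrative assumptions, not the authors' settings.

```python
from collections import defaultdict

# Hypothetical weights: the abstract says only that the verification state
# signals confidence. The "Uncertain" state and these values are assumptions.
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def weighted_vote(samples, weights=STATE_WEIGHTS):
    """Return the answer whose sampled solutions carry the most total weight."""
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


def plain_vote(samples):
    """Ordinary majority voting: every sample counts equally."""
    return weighted_vote(samples, {state: 1.0 for state in STATE_WEIGHTS})


# Three sampled solutions as (final answer, self-verification state) pairs.
samples = [("41", "False"), ("41", "False"), ("42", "True")]
print(plain_vote(samples), weighted_vote(samples))
```

In this toy case plain voting picks the answer that two unverified samples agree on, while the weighted vote favors the single answer that passed self-verification.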
-
A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization
Authors:
Yansong Huang,
Zherui Zhang,
Ao Jiao,
Yuxin Ma,
Ran Cheng
Abstract:
Evolutionary multi-objective optimization (EMO) algorithms have been demonstrated to be effective in solving multi-criteria decision-making problems. In real-world applications, analysts often employ several algorithms concurrently and compare their solution sets to gain insight into the characteristics of different algorithms and explore a broader range of feasible solutions. However, EMO algorithms are typically treated as black boxes, leading to difficulties in performing detailed analysis and comparisons between the internal evolutionary processes. Inspired by the successful application of visual analytics tools in explainable AI, we argue that interactive visualization can significantly enhance the comparative analysis between multiple EMO algorithms. In this paper, we present a visual analytics framework that enables the exploration and comparison of evolutionary processes in EMO algorithms. Guided by a literature review and expert interviews, the proposed framework addresses various analytical tasks and establishes a multi-faceted visualization design to support the comparative analysis of intermediate generations in the evolution as well as solution sets. We demonstrate the effectiveness of our framework through case studies on benchmarking and real-world multi-objective optimization problems to elucidate how analysts can leverage our framework to inspect and compare diverse algorithms.
Submitted 10 August, 2023;
originally announced August 2023.
-
ScrollTimes: Tracing the Provenance of Paintings as a Window into History
Authors:
Wei Zhang,
Wong Kam-Kwai,
Yitian Chen,
Ailing Jia,
Luwei Wang,
Jian-Wei Zhang,
Lechao Cheng,
Huamin Qu,
Wei Chen
Abstract:
The study of cultural artifact provenance, tracing ownership and preservation, holds significant importance in archaeology and art history. Modern technology has advanced this field, yet challenges persist, including recognizing evidence from diverse sources, integrating sociocultural context, and enhancing interactive automation for comprehensive provenance analysis. In collaboration with art historians, we examined the handscroll, a traditional Chinese painting form that provides a rich source of historical data and a unique opportunity to explore history through cultural artifacts. We present a three-tiered methodology encompassing artifact, contextual, and provenance levels, designed to create a "Biography" for handscroll. Our approach incorporates the application of image processing techniques and language models to extract, validate, and augment elements within handscroll using various cultural heritage databases. To facilitate efficient analysis of non-contiguous extracted elements, we have developed a distinctive layout. Additionally, we introduce ScrollTimes, a visual analysis system tailored to support the three-tiered analysis of handscroll, allowing art historians to interactively create biographies tailored to their interests. Validated through case studies and expert interviews, our approach offers a window into history, fostering a holistic understanding of handscroll provenance and historical significance.
Submitted 16 January, 2024; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Deep Learning for Solving and Estimating Dynamic Macro-Finance Models
Authors:
Benjamin Fan,
Edward Qiao,
Anran Jiao,
Zhouzhou Gu,
Wenhao Li,
Lu Lu
Abstract:
We develop a methodology that utilizes deep learning to simultaneously solve and estimate canonical continuous-time general equilibrium models in financial economics. We illustrate our method in two examples: (1) industrial dynamics of firms and (2) macroeconomic models with financial frictions. Through these applications, we illustrate the advantages of our method: generality, simultaneous solution and estimation, leveraging the state-of-art machine-learning techniques, and handling large state space. The method is versatile and can be applied to a vast variety of problems.
Submitted 5 May, 2023;
originally announced May 2023.
-
Reliable extrapolation of deep neural operators informed by physics or sparse observations
Authors:
Min Zhu,
Handi Zhang,
Anran Jiao,
George Em Karniadakis,
Lu Lu
Abstract:
Deep neural operators can learn nonlinear mappings between infinite-dimensional function spaces via deep neural networks. As promising surrogate solvers of partial differential equations (PDEs) for real-time prediction, deep neural operators such as deep operator networks (DeepONets) provide a new simulation paradigm in science and engineering. Pure data-driven neural operators and deep learning models, in general, are usually limited to interpolation scenarios, where new predictions utilize inputs within the support of the training set. However, in the inference stage of real-world applications, the input may lie outside the support, i.e., extrapolation is required, which may result in large errors and unavoidable failure of deep learning models. Here, we address this challenge of extrapolation for deep neural operators. First, we systematically investigate the extrapolation behavior of DeepONets by quantifying the extrapolation complexity via the 2-Wasserstein distance between two function spaces and propose a new behavior of bias-variance trade-off for extrapolation with respect to model capacity. Subsequently, we develop a complete workflow, including extrapolation determination, and we propose five reliable learning methods that guarantee a safe prediction under extrapolation by requiring additional information -- the governing PDEs of the system or sparse new observations. The proposed methods are based on either fine-tuning a pre-trained DeepONet or multifidelity learning. We demonstrate the effectiveness of the proposed framework for various types of parametric PDEs. Our systematic comparisons provide practical guidelines for selecting a proper extrapolation method depending on the available information, desired accuracy, and required inference speed.
Submitted 12 December, 2022;
originally announced December 2022.
-
The RoyalFlush System for the WMT 2022 Efficiency Task
Authors:
Bo Qin,
Aixin Jia,
Qiang Wang,
Jianning Lu,
Shuqin Pan,
Haibo Wang,
Ming Chen
Abstract:
This paper describes the submission of the RoyalFlush neural machine translation system for the WMT 2022 translation efficiency task. Unlike the commonly used autoregressive translation system, we adopted a two-stage translation paradigm called Hybrid Regression Translation (HRT) to combine the advantages of autoregressive and non-autoregressive translation. Specifically, HRT first autoregressively generates a discontinuous sequence (e.g., make a prediction every k tokens, k > 1) and then fills in all previously skipped tokens at once in a non-autoregressive manner. Thus, we can easily trade off the translation quality and speed by adjusting k. In addition, by integrating other modeling techniques (e.g., sequence-level knowledge distillation and deep-encoder-shallow-decoder layer allocation strategy) and substantial engineering effort, HRT improves inference speed by 80% and achieves translation performance equivalent to its same-capacity AT counterpart. Our fastest system reaches 6k+ words/second on the GPU latency setting, estimated to be about 3.1x faster than last year's winner.
Submitted 3 December, 2022;
originally announced December 2022.
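The two-stage decoding schedule described in the HRT abstract above can be illustrated with a small position-bookkeeping sketch. It does not run a translation model; the choice of which positions form the skeleton (every k-th token, starting at index k - 1) and the function names are assumptions made for illustration.

```python
def hrt_schedule(length, k):
    """Split target positions into an autoregressive skeleton and the rest.

    Assumption: the skeleton is every k-th position, starting at index k - 1.
    """
    if k <= 1:
        raise ValueError("k must be greater than 1")
    skeleton = list(range(k - 1, length, k))
    skeleton_set = set(skeleton)
    skipped = [i for i in range(length) if i not in skeleton_set]
    return skeleton, skipped


def decoding_passes(length, k):
    """Count decoder passes: one per skeleton token plus one parallel fill."""
    skeleton, skipped = hrt_schedule(length, k)
    return len(skeleton) + (1 if skipped else 0)


skeleton, skipped = hrt_schedule(7, 3)
print(skeleton, skipped, decoding_passes(7, 3))
```

For a 7-token target, a purely autoregressive decoder needs 7 sequential passes, while this schedule needs 3 with k = 3. Larger k means fewer sequential passes but more tokens predicted in parallel, which is the quality/speed trade-off the abstract mentions.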
-
Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining
Authors:
Ang Jia,
Ming Fan,
Xi Xu,
Wuxia Jin,
Haijun Wang,
Qiyi Tang,
Sen Nie,
Shi Wu,
Ting Liu
Abstract:
Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis (SCA). The "1-to-1" mechanism has been applied in existing binary2source matching works, in which one binary function is matched against one source function. However, we discovered that such mapping could be "1-to-n" (one query binary function maps to multiple source functions), due to the existence of function inlining.
To help conduct binary2source function matching under function inlining, we propose a method named O2NMatcher to generate Source Function Sets (SFSs) as the matching target for binary functions with inlining. We first propose a model named ECOCCJ48 for inlined call site prediction. To train this model, we leverage the compilable OSS to generate a dataset with labeled call sites (inlined or not), extract several features from the call sites, and design a compiler-opt-based multi-label classifier by inspecting the inlining correlations between different compilations. Then, we use this model to predict the labels of call sites in the uncompilable OSS projects without compilation and obtain the labeled function call graphs of these projects. Next, we regard the construction of SFSs as a sub-tree generation problem and design root node selection and edge extension rules to construct SFSs automatically. Finally, these SFSs will be added to the corpus of source functions and compared with binary functions with inlining. We conduct several experiments to evaluate the effectiveness of O2NMatcher and results show our method increases the performance of existing works by 6% and exceeds all the state-of-the-art works.
Submitted 26 October, 2022;
originally announced October 2022.
-
1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis
Authors:
Ang Jia,
Ming Fan,
Wuxia Jin,
Xi Xu,
Zhaohui Zhou,
Qiyi Tang,
Sen Nie,
Shi Wu,
Ting Liu
Abstract:
Binary similarity analysis is critical to many code-reuse-related issues and "1-to-1" mechanism is widely applied, where one function in a binary file is matched against one function in a source file or binary file. However, we discover that function mapping is a more complex problem of "1-to-n" or even "n-to-n" due to the existence of function inlining.
In this paper, we investigate the effect of function inlining on binary similarity analysis. We first construct 4 inlining-oriented datasets for four similarity analysis tasks, including code search, OSS reuse detection, vulnerability detection, and patch presence test. Then, we further study the extent of function inlining, the performance of existing works under function inlining, and the effectiveness of existing inlining-simulation strategies. Results show that the proportion of function inlining can reach nearly 70%, while most existing works neglect it and use "1-to-1" mechanism. The mismatches cause a 30% loss in performance during code search and a 40% loss during vulnerability detection. Moreover, two existing inlining-simulation strategies can only recover 60% of the inlined functions. We discover that inlining is usually cumulative when optimization increases. Conditional inlining and incremental inlining are suggested to design low-cost and high-coverage inlining-simulation strategies.
Submitted 5 May, 2022; v1 submitted 23 December, 2021;
originally announced December 2021.
-
One-shot learning for solution operators of partial differential equations
Authors:
Anran Jiao,
Haiyang He,
Rishikesh Ranade,
Jay Pathak,
Lu Lu
Abstract:
Learning and solving governing equations of a physical system, represented by partial differential equations (PDEs), from data is a central challenge in a variety of areas of science and engineering. Traditional numerical methods for solving PDEs can be computationally expensive for complex systems and require the complete PDEs of the physical system. On the other hand, current data-driven machine learning methods require a large amount of data to learn a surrogate model of the PDE solution operator, which could be impractical. Here, we propose the first solution operator learning method that only requires one PDE solution, i.e., one-shot learning. By leveraging the principle of locality of PDEs, we consider small local domains instead of the entire computational domain and define a local solution operator. The local solution operator is then trained using a neural network, and utilized to predict the solution of a new input function via mesh-based fixed-point iteration (FPI), meshfree local-solution-operator informed neural network (LOINN) or local-solution-operator informed neural network with correction (cLOINN). We test our method on diverse PDEs, including linear or nonlinear PDEs, PDEs defined on complex geometries, and PDE systems, demonstrating the effectiveness and generalization capabilities of our method across these varied scenarios.
Submitted 6 June, 2024; v1 submitted 6 April, 2021;
originally announced April 2021.
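The mesh-based fixed-point iteration mentioned in the one-shot learning abstract above can be sketched on a toy problem. This is not the authors' method: a hand-written finite-difference update stands in for the trained local solution operator, and the 1D Poisson problem, grid size, and function names are all assumptions chosen so the result can be checked against a known solution.

```python
import math


def local_operator(u_left, u_right, f_center, h):
    """Stand-in for a trained local solution operator (an assumption).

    Predicts the center value from its two neighbors for -u'' = f,
    using the standard second-order finite-difference stencil.
    """
    return 0.5 * (u_left + u_right + h * h * f_center)


def fixed_point_solve(f, n=21, tol=1e-10, max_iters=20000):
    """Sweep the local operator over the mesh until the update stalls."""
    h = 1.0 / (n - 1)
    xs = [i * h for i in range(n)]
    u = [0.0] * n  # zero initial guess; boundary values u(0) = u(1) = 0
    for _ in range(max_iters):
        new_u = u[:]
        for i in range(1, n - 1):
            new_u[i] = local_operator(u[i - 1], u[i + 1], f(xs[i]), h)
        change = max(abs(a - b) for a, b in zip(new_u, u))
        u = new_u
        if change < tol:
            break
    return xs, u


# Test problem: -u'' = pi^2 sin(pi x), whose exact solution is sin(pi x).
xs, u = fixed_point_solve(lambda x: math.pi ** 2 * math.sin(math.pi * x))
max_err = max(abs(ui - math.sin(math.pi * x)) for x, ui in zip(xs, u))
print(f"max error: {max_err:.2e}")
```

The point of the sketch is the structure: a purely local rule, applied repeatedly across the mesh, converges to a global solution. In the abstract's setting the local rule would be a neural network trained on small local domains rather than a fixed stencil.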
-
Interpretation-enabled Software Reuse Detection Based on a Multi-Level Birthmark Model
Authors:
Xi Xu,
Qinghua Zheng,
Zheng Yan,
Ming Fan,
Ang Jia,
Ting Liu
Abstract:
Software reuse, especially partial reuse, poses legal and security threats to software development. Since the source code is usually unavailable, software reuse is hard to detect in an interpretable way. On the other hand, current approaches suffer from poor detection accuracy and efficiency, far from satisfying practical demands. To tackle these problems, in this paper, we propose ISRD, an interpretation-enabled software reuse detection approach based on a multi-level birthmark model that contains function level, basic block level, and instruction level. To overcome obfuscation caused by cross-compilation, we represent function semantics with Minimum Branch Path (MBP) and perform normalization to extract core semantics of instructions. For efficiently detecting reused functions, a process for "intent search based on anchor recognition" is designed to speed up reuse detection. It uses strict instruction match and identical library call invocation check to find anchor functions (in short anchors) and then traverses neighbors of the anchors to explore potentially matched function pairs. Extensive experiments based on two real-world binary datasets reveal that ISRD is interpretable, effective, and efficient, achieving 97.2% precision and 94.8% recall. Moreover, it is resilient to cross-compilation, outperforming state-of-the-art approaches.
Submitted 18 March, 2021;
originally announced March 2021.
-
From Innovations to Prospects: What Is Hidden Behind Cryptocurrencies?
Authors:
Ang Jia,
Ming Fan,
Xi Xu,
Di Cui,
Wenying Wei,
Zijiang Yang,
Kai Ye,
Ting Liu
Abstract:
The great influence of Bitcoin has promoted the rapid development of blockchain-based digital currencies, especially the altcoins, since 2013. However, most altcoins share similar source codes, resulting in concerns about code innovations. In this paper, an empirical study on existing altcoins is carried out to offer a thorough understanding of various aspects associated with altcoin innovations. Firstly, we construct the dataset of altcoins, including source code repositories, GitHub fork relations, and market capitalizations (cap). Then, we analyze the altcoin innovations from the perspective of source code similarities. The results demonstrate that more than 85% of altcoin repositories present high code similarities. Next, a temporal clustering algorithm is proposed to mine the inheritance relationship among various altcoins. The family pedigrees of altcoin are constructed, in which the altcoin presents similar evolution features as biology, such as power-law in family size, variety in family evolution, etc. Finally, we investigate the correlation between code innovations and market capitalization. Although we fail to predict the price of altcoins based on their code similarities, the results show that altcoins with higher innovations reflect better market prospects.
Submitted 16 March, 2021;
originally announced March 2021.
-
Cultivating Online: Question Routing in a Question and Answering Community for Agriculture
Authors:
Xiaoxue Shen,
Liyang Gu,
Adele Lu Jia
Abstract:
Community-based Question and Answering (CQA) platforms are nowadays enlightening over a billion people with crowdsourced knowledge. A key design issue in CQA platforms is how to find the potential answerers and to provide the askers timely and suitable answers, i.e., the so-called question routing problem. State-of-art approaches often rely on extracting topics from the question texts. In this work, we analyze the question routing problem in a CQA system named Farm-Doctor that is exclusive for agricultural knowledge. The major challenge is that its questions contain limited textual information.
To this end, we conduct an extensive measurement and obtain the whole knowledge repository of Farm-Doctor that consists of over 690 thousand questions and over 3 million answers. To remedy the text deficiency, we model Farm-Doctor as a heterogeneous information network that incorporates rich side information and based on network representation learning models we accurately recommend for each question the users that are highly likely to answer it. With an average income of fewer than 6 dollars a day, over 300 thousand farmers in China seek agricultural advice online in Farm-Doctor. Our method helps these less eloquent farmers with their cultivation and hopefully provides a way to improve their lives.
Submitted 17 February, 2020; v1 submitted 17 April, 2019;
originally announced April 2019.
-
User Donations in a Crowdsourced Video System
Authors:
Adele Lu Jia,
Xiaoxue Shen,
Siqi Shen,
Jun Xu
Abstract:
Crowdsourced video systems like YouTube and Twitch.tv have been a major internet phenomenon and are nowadays entertaining over a billion users. In addition to video sharing and viewing, over the years they have developed new features to boost the community engagement and some managed to attract users to donate, to the community as well as to other users. User donation directly reflects and influences user engagement in the community, and has a great impact on the success of such systems. Nevertheless, user donations in crowdsourced video systems remain trade secrets for most companies and to date are still unexplored. In this work, we attempt to fill this gap, and we obtain and provide a publicly available dataset on user donations in one crowdsourced video system named BiliBili. Based on information on nearly 40 thousand donators, we examine the dynamics of user donations and their social relationships, we quantitatively reveal the factors that potentially impact user donation, and we adopt machine-learned classifiers and network representation learning models to predict, in a timely and accurate manner, the destinations of the majority and the individual donations.
Submitted 27 January, 2019;
originally announced January 2019.