
WeaverBird: Empowering Financial Decision-Making with
Large Language Model, Knowledge Base, and Search Engine

Siqiao Xue$^{\diamondsuit,*}$, Fan Zhou$^{\diamondsuit,*}$, Yi Xu$^{\diamondsuit}$, Ming **$^{\clubsuit}$, Qingsong Wen$^{\heartsuit}$, Hongyan Hao$^{\diamondsuit}$, Qingyang Dai$^{\diamondsuit}$, Caigao Jiang$^{\diamondsuit}$, Hongyu Zhao$^{\diamondsuit}$, Shuo Xie$^{\diamondsuit}$, Jianshan He$^{\diamondsuit}$, James Zhang$^{\diamondsuit}$, Hongyuan Mei$^{\spadesuit}$

$^{\diamondsuit}$Ant Group, $^{\clubsuit}$Monash University, $^{\heartsuit}$Alibaba DAMO Academy, $^{\spadesuit}$TTIC

siqiao.xsq,hanlian.zf,[email protected],[email protected], [email protected] [email protected]

(2023)

Abstract.

We present WeaverBird, an intelligent dialogue system tailored for the finance sector. Our system harnesses a large language model of GPT architecture that has been tuned using extensive corpora of financial texts. As a result, our system possesses the capability to understand complex financial queries, such as “How should I manage my investments during inflation?”, and provide informed responses. Furthermore, our system integrates a local knowledge base and search engine to retrieve relevant information. The final system responses are conditioned on the web search and include proper references to the sources, thus enjoying enhanced credibility. Comparative evaluations on a wide range of financial question-answering highlight our system’s superior performance over other models. Experience WeaverBird for yourself through our live demo at https://weaverbird.ttic.edu and view our concise 2-minute video at https://www.youtube.com/watch?v=yofgeqnlrMc. Our code has also been made publicly available at https://github.com/ant-research/fin_domain_llm.

large language models, finance, dialogue system

CCS Concepts: Social and professional topics $\rightarrow$ Computing and business; Computing methodologies $\rightarrow$ Natural language processing.

$^{*}$These authors contributed equally to this work.

1. Introduction

Everyone faces financial decisions from time to time. From questions like “What does the earnings report indicate for stock X?” to “How should I navigate my investments during inflation?”, many individuals prefer consulting financial professionals. Yet, not everyone has the means to access such expert services.

In this paper, we introduce WeaverBird\footnote{This name embodies our vision for WeaverBird, symbolizing our commitment to delivering comprehensive and intelligent support in the finance domain. It represents the intricate and skillful weaving of financial knowledge and information, empowering users with a robust and reliable resource.}, an intelligent dialogue system tailored for the financial sector. Our primary objective is to bridge the disparity in accessing financial advisory services and to democratize financial expertise. Acting as an intelligent financial assistant, WeaverBird is adept at understanding complex economic and financial queries, offering financial perspectives, and enabling users to make informed financial decisions. Through the launch of our system, we aspire to ensure that people across varied financial backgrounds gain access to trustworthy financial advice.

	WebGLM (Liu et al., 2023)	BloombergGPT (Wu et al., 2023)	FinGPT (Yang et al., 2023)	FinChat (fin, 2023)	WeaverBird
Finance-domain specific LLM	✗	✓	✓	✓	✓
Finetuned on bilingual financial data	✗	✗	✓	✗	✓
Local knowledge base querying	✗	✗	✓	✓	✓
Web search querying	✓	✗	✗	✓	✓
Efficiency-optimized search on multi-source knowledge	✗	✗	✗	N/A	✓
Temporal reasoning optimized	✗	✗	✗	N/A	✓
Open access to training data $\&$ code	✗	✗	✓	✗	✓
Open access to deployed chatbot	✗	✗	✗	✗	✓

Table 1. Comparative summary of competing approaches on various dimensions.

Another key motivation for developing a finance dialogue system is to support future research in this area. A comparison of our WeaverBird system to other competitive approaches is summarized in Table 1. Throughout the development process of our system, we have amassed extensive collections of finance-centric text documents, which have been leveraged in fine-tuning language foundation models. This resource will be available to the research community for future exploration and analysis. Notably, our collection features a substantial corpus of Chinese financial documents, addressing the gap in the predominantly English-based finance corpora and offering a valuable addition. This equips our WeaverBird system with a distinctive edge in processing Chinese queries and delving into discussions concerning Chinese firms and equities. Furthermore, we have constructed a collection of challenging financial question-answer pairs, which can serve as a benchmark for evaluating the performance of future models and systems. Each question stems from a selected document, such as an industry research report. Consequently, answering these questions correctly requires the discovery of the specific document or equivalent information from the local knowledge base or across the internet. Our WeaverBird system harnesses the latest research advancements in large language models, enhanced by our novel training and inference methods, which we will elaborate on in section 2. This system sets a robust baseline for evaluating future models and systems in the finance domain.

An additional compelling motivation is the opportunity to leverage the knowledge and assets acquired during the development of our system to create intelligent assistants in broader domains, such as education and healthcare.

2. System Design

When presented with a natural language prompt, such as “How many EVs did BYD sell in Q1 2023?”, our system employs a two-phase process. First, it transforms the prompt into queries suitable for the knowledge base and search engine. Next, it analyzes the retrieved information, identifies the most relevant pieces, and generates a comprehensive natural language response, with proper references to the original sources. The Internet’s vast resources enable us to craft responses and insights based on up-to-date information, enhancing the adaptability and reliability of our approach. The overall pipeline of WeaverBird is depicted in Figure 1. While it builds upon the general framework of LangChain (Chase, 2022; LangChain-ChatGLM, 2023), our system integrates our novel training and inference techniques, which significantly enhance its overall performance and efficiency. In this section, we delineate the design of each phase, including the model architecture as well as training and inference paradigms.

Figure 1. An illustration of WeaverBird that answers a financial query by intelligent search and generation.

2.1. Efficiency-Optimized Search and Update

The first phase retrieves information from our local knowledge base and the broader internet (i.e., the first step in Figure 1). Our knowledge base $\mathcal{K}$ is a collection of financial documents ${\bm{\mathbf{d}}}^{\text{loc}}_{1},\ldots,{\bm{\mathbf{d}}}^{\text{loc}}_{N}$, where the number of documents $N$ is large. Following Chase (2022), we split each document ${\bm{\mathbf{d}}}^{\text{loc}}_{n}$ into multiple paragraphs ${\bm{\mathbf{p}}}^{\text{loc}}_{n,1},\ldots,{\bm{\mathbf{p}}}^{\text{loc}}_{n,M_{n}}$, where $M_{n}$ denotes the number of paragraphs in the $n$-th document and $m$ (used below) indexes them, and embed each paragraph into a multidimensional embedding ${\bm{\mathbf{e}}}^{\text{loc}}_{n,m}$ through a neural encoder $f_{\text{key}}$. When a language query ${\bm{\mathbf{x}}}$ arrives, it is embedded into a vector ${\bm{\mathbf{q}}}$ through another neural encoder $f_{\text{query}}$, and we retrieve the top $K$ relevant paragraphs from the knowledge base according to their cosine similarities, i.e., ${\bm{\mathbf{q}}}^{\top}{\bm{\mathbf{e}}}/(\|{\bm{\mathbf{q}}}\|\,\|{\bm{\mathbf{e}}}\|)$, where $K$ is a hyperparameter.
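To make the retrieval rule concrete, the following is a minimal sketch of top-$K$ selection by cosine similarity. It assumes the paragraph and query embeddings are already available as plain lists of floats; the toy vectors and the function names `cosine` and `top_k` are illustrative and are not taken from the WeaverBird code base.

```python
import math

def cosine(q, e):
    """Cosine similarity q.e / (|q| |e|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, e))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in e))
    return dot / norm if norm > 0 else 0.0

def top_k(query_vec, paragraph_vecs, k):
    """Return (paragraph index, similarity) for the k best paragraphs, best first."""
    scored = [(i, cosine(query_vec, e)) for i, e in enumerate(paragraph_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-d vectors standing in for encoder outputs (made-up values).
paragraph_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [2.0, 0.1]
hits = top_k(query_vec, paragraph_vecs, k=2)
```

A production system would replace the linear scan with an approximate or batched index, but the ranking criterion stays the same.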

If the cosine similarity of the most relevant retrieval is larger than a predefined threshold $c\in(0,1)$, we skip the internet search and proceed directly to the response generation phase that we present in section 2.3. Otherwise, we start an internet search, which is slower than a local search but may yield higher-quality results. This efficiency-optimized design is not in the original LangChain pipeline and we propose it to strike a balance between performance and efficiency.\footnote{Internet search is also not in the original LangChain, but it is in the latest release, concurrent with our work.} When the local knowledge base already provides highly relevant information, we can avoid the need to search the internet. Otherwise, we invest additional effort to retrieve internet-based information for enhanced quality.

When an internet search is used,\footnote{The search engine we use is Google Search.} we select the highest-ranked $N$ documents ${\bm{\mathbf{d}}}^{\text{web}}_{1},\ldots,{\bm{\mathbf{d}}}^{\text{web}}_{N}$, split each document into paragraphs ${\bm{\mathbf{p}}}^{\text{web}}_{n,1},\ldots,{\bm{\mathbf{p}}}^{\text{web}}_{n,M_{n}}$, and embed each paragraph into a vector ${\bm{\mathbf{e}}}^{\text{web}}_{n,m}$ through the neural encoder $f_{\text{key}}$ that we have used for the knowledge base. Then we select the top $K$ relevant paragraphs according to their cosine similarities. Combined with the local retrievals, we end up with $2K$ paragraphs and proceed to the next phase.

If any web paragraph exhibits a cosine similarity greater than the threshold $c$, our system automatically adds the entire document to which it belongs to our local knowledge base. Users can also manually add any online document. This update ensures that future queries, whether identical or related, will receive more efficient responses. As updates occur, the knowledge base will expand, but the local search remains significantly more efficient than a web search.
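The control flow of this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not the deployed implementation: `local_search`, `web_search`, and `add_document` are hypothetical callables, and each hit is assumed to be a dict with a similarity `sim` and a `doc_id`.

```python
def retrieve(query_vec, local_search, web_search, add_document, k, c):
    """Search locally first; fall back to the web only when no local hit exceeds c."""
    local_hits = local_search(query_vec, k)
    if local_hits and max(h["sim"] for h in local_hits) > c:
        return local_hits  # confident local match: skip the slower web search
    web_hits = web_search(query_vec, k)
    # Cache the source document of every sufficiently relevant web paragraph.
    for doc_id in sorted({h["doc_id"] for h in web_hits if h["sim"] > c}):
        add_document(doc_id)
    return local_hits + web_hits  # up to 2K paragraphs for the next phase

# Stub backends with made-up similarities, for illustration only.
added = []
weak_local = lambda q, k: [{"sim": 0.40, "doc_id": "loc-1"}][:k]
web = lambda q, k: [{"sim": 0.90, "doc_id": "web-1"},
                    {"sim": 0.30, "doc_id": "web-2"}][:k]
merged = retrieve(None, weak_local, web, added.append, k=2, c=0.8)

strong_local = lambda q, k: [{"sim": 0.95, "doc_id": "loc-1"}][:k]
local_only = retrieve(None, strong_local, web, added.append, k=2, c=0.8)
```

In the first call the best local similarity is below $c$, so the web is queried and `web-1` is added to the knowledge base; in the second call the web search is skipped entirely.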

2.2. Learning to Embed and Search

We confidently consider a higher similarity to signify a more relevant paragraph due to the training of the encoders $f_{\text{key}}$ and $f_{\text{query}}$ . Their optimization involves maximizing a well-defined objective:

\begin{equation}
\ell={\bm{\mathbf{q}}}^{\top}{\bm{\mathbf{e}}}_{0}-\log\sum_{i=0}^{I}\exp\left({\bm{\mathbf{q}}}^{\top}{\bm{\mathbf{e}}}_{i}\right),
\tag{1}
\end{equation}

where ${\bm{\mathbf{e}}}_{0}$ is the embedding of the paragraph known to contain relevant information for the query, and the other $I$ embeddings ${\bm{\mathbf{e}}}_{1},\ldots,{\bm{\mathbf{e}}}_{I}$ belong to a set of negative paragraphs (see section 3.3 for how they are selected). Intuitively, we want the dot products ${\bm{\mathbf{q}}}^{\top}{\bm{\mathbf{e}}}$ to be relatively large for the query-paragraph pairs that are actually relevant. Our encoders use the Multilingual-E5-base model architecture (Wang et al., 2022).
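As a sanity check on the objective, the sketch below evaluates $\ell$ for a single query with one positive and $I$ negative embeddings. It is a plain-Python illustration with made-up toy vectors; the max-subtraction inside the log-sum-exp is a standard numerical-stability device and is our addition, not something stated in the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(q, e_pos, e_negs):
    """l = q.e_0 - log sum_{i=0..I} exp(q.e_i); the sum includes the positive e_0."""
    scores = [dot(q, e_pos)] + [dot(q, e) for e in e_negs]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[0] - log_sum

q = [1.0, 0.0]
good = objective(q, e_pos=[2.0, 0.0], e_negs=[[0.0, 1.0], [-1.0, 0.0]])
bad = objective(q, e_pos=[-1.0, 0.0], e_negs=[[0.0, 1.0], [2.0, 0.0]])
```

Since $\ell$ is the log-probability of the positive paragraph under a softmax over the $I+1$ candidates, it is never positive, and it is larger when the positive embedding aligns with the query (`good`) than when a negative does (`bad`).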

Once we have trained the encoders, we determine the threshold $c$ by computing the cosine similarities for a holdout set of query-paragraph pairs and analyzing the histogram.\footnote{We experimented with dot product and cosine similarity and found that the latter yields a better overall performance even though our training objective has used the former.} In practice, we choose the 1% quantile to be our threshold $c$.
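One plausible reading of this rule is sketched below: take the cosine similarities of the held-out relevant query-paragraph pairs and set $c$ to their 1% quantile, so that roughly 99% of known-relevant pairs would pass the threshold. The interpolation scheme and the synthetic similarity values are assumptions made for illustration.

```python
def quantile(values, p):
    """Linearly interpolated p-quantile (0 <= p <= 1) of a non-empty list."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

# Synthetic holdout similarities, evenly spread over [0.50, 1.00].
holdout_sims = [0.50 + 0.005 * i for i in range(101)]
c = quantile(holdout_sims, 0.01)
```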

Another question is where our query-paragraph pairs come from. We have collected a corpus of query-paragraph pairs by leveraging financial documents and question generation methods, which we will outline in section 3.

2.3. Generation by Large Language Models

In this phase, our system analyzes the search results, fills an appropriate prompt template, and generates a response. It proceeds in the following steps:

Ranker.

The system first ranks all the $2K$ search results (or $K$ results if the internet search has been skipped as discussed in section 2.1) based on their cosine similarities with the query (i.e., the second step in Figure 1).
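A minimal sketch of the ranker is given below. It assumes each search result carries its cosine similarity in a `sim` field (a hypothetical representation) and returns the top $J$ results, used for the prompt in the next step, together with the remainder.

```python
def rank_and_split(results, j):
    """Sort merged local and web results by similarity; return (top j, the rest)."""
    ranked = sorted(results, key=lambda r: r["sim"], reverse=True)
    return ranked[:j], ranked[j:]

# Made-up similarities for two local and two web paragraphs.
results = [
    {"source": "local", "sim": 0.71},
    {"source": "web", "sim": 0.88},
    {"source": "local", "sim": 0.65},
    {"source": "web", "sim": 0.93},
]
context, extra = rank_and_split(results, j=2)
```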

Prompt Formulation with Temporal Context.

After receiving the ranking results from the previous step, we plug the top $J$ (where $J<K$) results into the context part of the predefined prompt template. Since knowledge in the finance domain evolves over time, we ground each retrieved result and the question with temporal facts (e.g., publication date and question date), as shown in Listing 1. This helps the language model adapt to time-evolving knowledge, possibly making the response more meaningful and accurate (** et al., 2023b).

Having temporal metadata for questions and retrieved results allows us to ask questions with relative time specifications (e.g., “3 months ago”), which are often used in finance queries. Moreover, it enables us to ingest new knowledge periodically, in a streaming setup, and evaluate on questions asked during that period (Liška et al., 2022).

English template

Context information:

{CONTEXT_DATE_1:CONTEXT_RETRO_1}

⋮

{CONTEXT_DATE_2K:CONTEXT_RETRO_2K}

The current date is {QUESTION_DATE}. Based on the given information, please provide a concise and professional response to the user’s question. If there are multiple questions in a query, please answer all of them. If the user’s question includes keywords like ’recent’ or ’latest’ to indicate a recent time frame, pay attention to the correspondence between the current date and the date of the information. If a clear answer cannot be determined, respond with ”Unable to answer the question based on the information provided”. You MUST respond in the same language as the question!

The question is: {QUESTION}.

Listing 1: Prompt templates for LLM.
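Filling the template of Listing 1 could look roughly like the sketch below. The instruction text is abbreviated, the function name `build_prompt` is hypothetical, and the context paragraph and question are illustrative strings rather than actual system inputs.

```python
# Abbreviated version of the instruction text in Listing 1.
INSTRUCTION = (
    "Based on the given information, please provide a concise and "
    "professional response to the user's question."
)

def build_prompt(contexts, question, question_date):
    """contexts: list of (published_date, paragraph_text), already ranked."""
    lines = ["Context information:"]
    # Each retrieved paragraph is prefixed with its publication date.
    lines += [f"{date}:{text}" for date, text in contexts]
    # The question date lets the model resolve words like 'recent' or 'latest'.
    lines.append(f"The current date is {question_date}. {INSTRUCTION}")
    lines.append(f"The question is: {question}.")
    return "\n".join(lines)

prompt = build_prompt(
    contexts=[("2023-04-03", "Haidilao released its 2022 annual results ...")],
    question="What was Haidilao's net margin in 2H22?",
    question_date="2023-07-01",
)
```

Keeping the dates as explicit text in the prompt, rather than as hidden metadata, is what allows relative expressions such as "3 months ago" to be resolved by the language model itself.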

Response Generation.

The large language model organizes the information to form a fluent and informed response, corresponding to the third step in Figure 1. The response also includes accurate references to the sources of the information. As an option, our system can further provide a list of citations for the remaining $(K-J)$ paragraphs at the end of the response, to complement the main response and provide users with additional recommended reading materials.

Our large language model has been adapted to the finance domain and can therefore skillfully process the input information and generate appropriate responses. We started with a GLM (Du et al., 2022; Zeng et al., 2022) pre-trained on extensive English and Chinese corpora. Then we fine-tuned this model using our collected Chinese and English financial documents as well as expert-annotated financial query-response pairs, which we introduce in section 3.

3. Data Collection and Utilization

In this section, we introduce our corpora of financial documents and their collection process, as well as how we used them to build the local knowledge base and generate training and evaluation query-paragraph pairs.

3.1. Collection of Financial Documents

We collected a large set of financial documents from well-known and accredited sources, most of which are global financial institutions. Our collection includes 240 million Chinese documents and 25 million English documents. They were all published between 2010-Jan-01 and 2023-June-30, and cover diverse topics such as macroeconomics and fixed income. Table 2 shows a breakdown by information source, and Table 3 shows a breakdown by topic.

Chinese docs		English docs
CICC	$3.5\%$	Deutsche Bank	$18.5\%$
CITIC Sec.	$3.1\%$	CICC	$17.3\%$
Industrial Sec.	$2.9\%$	HSBC	$11.8\%$
Haitong Sec.	$2.8\%$	CITIC Sec.	$6.4\%$
Guotaijunan Sec.	$2.7\%$	Guotaijunan Sec.	$5.6\%$
Huatai Sec.	$2.1\%$	Kim Eng Sec.	$4.8\%$
Guangfa Sec.	$2.0\%$	BNP Paribas	$3.9\%$
Changjiang Sec.	$2.0\%$	Industrial Sec.	$3.7\%$
CMS	$2.0\%$	Shenwan Hongyuan	$3.7\%$
Founderfu	$1.9\%$	UBS	$3.1\%$

Table 2. Top 10 sources of our financial documents.

Chinese docs		English docs
Financial Engr.	$25.8\%$	HK Stocks	$34.9\%$
Company	$20.5\%$	Company	$21.7\%$
Industry	$20.5\%$	Industry	$11.8\%$
Strategies	$6.9\%$	Macro econ	$18.5\%$
HK Stocks	$5.5\%$	Strategies	$5.6\%$
Morning report	$4.9\%$	Morning report	$4.8\%$
Macro econ	$4.8\%$	Bond	$3.9\%$
China GEM	$4.5\%$	China GEM	$3.7\%$
Bond	$4.1\%$	Financial Engr.	$0.1\%$
Fund	$1.5\%$	Fund	$0.1\%$

Table 3. Top 10 topics of our financial documents.

For each document, we save into a text file the concatenation of its release time, title, summary, and other metadata (such as topic). By manually analyzing a sampled set of documents and consulting domain experts, we recognized that every document has a summary section that includes nearly all the important information of the document. Therefore, we did not save the full content, which would consume significantly more disk space and memory but contribute minimal additional information. After processing, the average token count is 5000 for Chinese documents and 2000 for English documents. Below are two examples of processed documents:

Example 0 (Chinese).

2023-04-03 00:00:00;海底捞(6862.HK)：
2H22净利率7.5%; 门店重启稳步推进;海底捞发布2022年度业绩…(full content omitted); 港股;个股;公司定期报告评述;股票;**香港.

Example 0 (English).

2023-02-07 00:00:00; Feb 223 ECB monetary policy meeting commentary: ECB maintains pace of rate hikes without reinforcing…(full content omitted); macro;oversea;finance;economics research;economics;europe.

The primary use of our large-scale data collection is to build a local knowledge base. As mentioned in section 2.1, we split each document ${\bm{\mathbf{d}}}^{\text{loc}}_{n}$ into smaller chunks ${\bm{\mathbf{p}}}^{\text{loc}}_{n,1},\ldots,{\bm{\mathbf{p}}}^{\text{loc}}_{n,M_{n}}$ (about 250 tokens per chunk), and embed each chunk into a vector ${\bm{\mathbf{e}}}^{\text{loc}}_{n,m}$ with our trained neural encoder $f_{\text{key}}$. We implemented this vector-based knowledge base and its retrieval method using Faiss (Johnson et al., 2019), a library highly optimized for efficient similarity search and clustering of dense vectors.
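The chunking step can be illustrated with the short sketch below. It assumes a document has already been tokenized into a list and uses fixed-size, non-overlapping chunks; the actual splitting rule (for example, whether chunks respect sentence boundaries) is not specified in the text.

```python
def chunk_tokens(tokens, chunk_size=250):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# A synthetic 600-token document yields chunks of 250, 250, and 100 tokens.
doc_tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(doc_tokens)
```

Each chunk would then be embedded with $f_{\text{key}}$ and added to the vector index.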

3.2. Collection of Query-Response Pairs

Another important use of our data collection is to create query-response pairs for tuning the language model. This step is crucial for the performance of our system: recent research has shown that language models tuned with query-response pairs have demonstrated superior performance in answering open-domain questions and engaging in conversations (Mangrulkar et al., 2022). We collected four kinds of query-response pairs, three of which were created from our collection of financial documents. Table 4 shows a breakdown of the pairs into these four categories. In each category, 60% of the pairs are in Chinese while the rest are in English.

Generation

For each document, we can construct a query-response pair that tunes the language model to perform content generation. To create the query, we fill the template “Please write a [TYPE] on [TOPIC] with the title [TITLE]” where TYPE, TOPIC, and TITLE can be found in the corresponding text file. The SUMMARY part of the file is used as the response. An example query is “Please write a review report on China economics with the title CHINA POLICY: POLITBURO MAINTAINS PRO-GROWTH STANCE” and its response is “The politburo indicates continuous pro-growth policy without big stimulus…”

Summarization

The reverse of a content generation pair can be used as a content summarization pair. Here the query is created by filling the template “Please summarize the following [TYPE]: [SUMMARY]” and the response is the TITLE. An example query is “Please summarize the following report: NIO’s 4Q22 earnings missed our prior forecast…” and its response is “NIO INC. (NIO.US): 4Q22 miss”.
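The two template-based constructions above can be sketched as follows. The dict keys `type`, `topic`, `title`, and `summary` are a hypothetical representation of the fields stored in each processed text file, and the summary string is truncated exactly as in the example quoted in the text.

```python
def generation_pair(doc):
    """Query asks for the document; the stored summary is the response."""
    query = (f"Please write a {doc['type']} on {doc['topic']} "
             f"with the title {doc['title']}")
    return {"query": query, "response": doc["summary"]}

def summarization_pair(doc):
    """The reverse direction: the summary is the input and the title the response."""
    query = f"Please summarize the following {doc['type']}: {doc['summary']}"
    return {"query": query, "response": doc["title"]}

doc = {
    "type": "review report",
    "topic": "China economics",
    "title": "CHINA POLICY: POLITBURO MAINTAINS PRO-GROWTH STANCE",
    "summary": "The politburo indicates continuous pro-growth policy without big stimulus...",
}
gen = generation_pair(doc)
summ = summarization_pair(doc)
```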

Close-book QA

Given a document, we would like the model to answer any questions that could be answered by directly identifying the relevant information from the document. This is the focus of the close-book QA research (Roberts et al., 2020). To achieve this goal, we need a specific kind of query-response pairs: query is the question and the document; response is the answer. We created such pairs using the question generation method proposed by Rajpurkar et al. (2016). Precisely, given a document, we used a pre-trained question generation model to identify some interesting information in the document and then generate a question whose answer should be that piece of information.

Open QA

We would also like the system to answer open questions: unlike close-book QA, the evidence is not available in the context. We created Chinese question-answer (i.e., query-response) pairs of this kind by scraping well-known online QA forums such as Zhihu and SinaFinance. On such forums, users post questions, tag questions, answer questions, and upvote or downvote answers. We identified 6,000 highly recognized questions with the “finance” tag; with each question being the query, we took the answer with the most upvotes to be its response. We then randomly sampled 4,000 English open QA questions from the English finance QA dataset FinQA (Bharti, 2023).

Use case	%	# Pairs	Source
Generation	38%	13000	Collected financial documents
Summarization	20%	7000	Collected financial documents
Close-book QA	12%	4100	Collected financial documents
Open QA	30%	10000	Chinese: collected from online QA forums (e.g., Zhihu and SinaFinance)
			English: sampled from FinQA (Bharti, 2023)

Table 4. Distribution of use case categories of the query-response pairs.

3.3. Data Collection for Training Encoders

Recall from section 2.2 that we need query-paragraph pairs to train the key and query encoders $f_{\text{key}}$ and $f_{\text{query}}$ . The close-book QA subset available in section 3.2 provides a direct resource to fulfill this requirement. First, we categorized all the close-book query-response pairs according to their topics. Then we manually reviewed the data of each topic and selected a total of 1,000 pairs while ensuring a comprehensive coverage of all topics. The selection of each pair was based on our anticipated likelihood that someone would ask this question during interaction with the demo. We used these 1,000 query-response pairs as the positive pairs in training the encoders. For each of them, we randomly sampled five negative responses from the entire pool of paragraphs.
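The negative sampling step can be sketched as below. The text only states that five negatives are drawn at random from the entire pool of paragraphs; excluding the positive paragraph itself and fixing a random seed are assumptions added here for a well-defined, reproducible example.

```python
import random

def sample_negatives(positive_idx, pool_size, n_neg=5, seed=0):
    """Draw n_neg distinct paragraph indices from the pool, excluding the positive."""
    rng = random.Random(seed)
    candidates = [i for i in range(pool_size) if i != positive_idx]
    return rng.sample(candidates, n_neg)

# Paragraph 3 is the positive for some query; the pool has 50 paragraphs.
negatives = sample_negatives(positive_idx=3, pool_size=50)
```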

4. Related Work

Large language models such as ChatGPT (Brown et al., 2020) and GPT-4 (OpenAI, 2023) have showcased their remarkable capabilities in engaging in human-like communication and understanding complex queries, bringing a trend of incorporating LLMs in various fields (Anil et al., 2023; Gunasekar et al., 2023; ** et al., 2023a). Recent advancements have further enhanced these models by incorporating external tools, enabling them to search for relevant online information (Nakano et al., 2021), utilize tools (Schick et al., 2023), and create more sophisticated applications (Chase, 2022). In this paper, we harness these advancements to build an intelligent dialogue system that specializes in the finance domain. The most relevant work is BloombergGPT (Wu et al., 2023), a generative Transformer trained with large corpora of finance-related text. While BloombergGPT has only been trained with English corpora, our system is also carefully tailored to handle Chinese, a language spoken by one-fifth of the global population. In addition, BloombergGPT is only trained to predict the next token given the context but not to engage in conversations like our WeaverBird system; nor is it integrated with knowledge bases or search engines.

Our work aligns with research on retrieval-based language generation, which involves enhancing language models by incorporating external datastores. This area has seen advancements in new model architectures that condition the next token generation on retrieved text (Khandelwal et al., 2020; Wu et al., 2022; Borgeaud et al., 2022), as well as developments in training and inference methods aimed at improving the quality of retrieval (Zhong et al., 2022) and the performance in downstream tasks such as question answering (Karpukhin et al., 2020; Izacard et al., 2022b; Shi et al., 2022; Menick et al., 2022) and code generation (Zhou et al., 2023). Our work aims to deliver a system that has superior performance in answering finance-related questions. It uses the standard GPT architecture, and augments the inputs (i.e., queries) with retrieved information (i.e., online documents or KB entries): this design is similar to the approaches of Shi et al. (2023); Ram et al. (2023), but our efficiency-optimized search and update (section 2.1) is novel.

5. Models and Training

5.1. Language Model

Architecture.

The system is based on the pre-trained ChatGLM2-6B (Du et al., 2022; Zeng et al., 2022), and we fine-tune it with the data and training strategy specified below.

Dataset.

We use the collected query-response pairs described in section 3.2 for training and evaluation. The whole set of pairs is divided into 17,000 training pairs, 3,000 development pairs, and 4,100 test pairs.

Training.

We apply the LoRA (Hu et al., 2021) strategy to fine-tune the base model to adapt to the finance domain. The LoRA layer details are as follows: the LoRA rank is 16, the LoRA alpha is 32, the dropout is 0.1, and other settings are default.
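For reference, LoRA (Hu et al., 2021) keeps a pre-trained weight matrix $W_{0}$ frozen and learns a low-rank update, so that the adapted layer computes the expression below. With the rank and alpha reported above, the update is scaled by $\alpha/r=2$. The symbols $d$ and $k$ denote the dimensions of the adapted matrix; which matrices of the base model receive adapters is not stated here and is left unspecified.

```latex
h = W_{0}x + \frac{\alpha}{r}\,BAx,
\qquad B\in\mathbb{R}^{d\times r},\; A\in\mathbb{R}^{r\times k},
\qquad r=16,\; \alpha=32,\; \frac{\alpha}{r}=2 .
```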

5.2. Encoders

Architecture.

The key and query encoders $f_{\text{key}}$ and $f_{\text{query}}$ are initialized from the Multilingual-E5-base (ME5) model (Wang et al., 2022).

Dataset.

Recall from section 2.2 that we need query-paragraph pairs to train the key and query encoders $f_{\text{key}}$ and $f_{\text{query}}$. We use the 1,000 pairs selected in section 3.3 for training and evaluation. They are divided into 700 training pairs, 100 development pairs, and 200 test pairs.

Training.

We pass query-response pairs to the model to obtain a scalar score for each pair, and use the cross-entropy loss to maximize the scores of the positive pairs while minimizing the scores of the negative pairs.

5.3. Implementation and Deployment Details

Knowledge Base and WebUI.

For the implementation of LangChain, we use the code from the public GitHub repository at https://github.com/imClumsyPanda/langchain-ChatGLM (LangChain-ChatGLM, 2023), which does not specify a license. For the implementation of WebUI, we use the code from the public GitHub repository at https://github.com/THUDM/WebGLM (Liu et al., 2023), released under the Apache 2.0 License.

Computation Cost.

All the experiments were conducted on a server with 128G RAM, a CPU with 64 logical cores (Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz), and 8 NVIDIA A100 80G Tensor Core GPUs for acceleration. During fine-tuning on the finance dataset, we set the per-device training batch size to $8$, the number of gradient accumulation steps to $100$, the maximum number of steps to $10{,}000$, and the learning rate to $0.0001$. Each step takes close to 72 seconds, and the total running time is close to $200$ hours.
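The reported running time is consistent with the per-step cost, as the short calculation below shows. The number of sequences per optimizer step additionally assumes that all 8 GPUs are used in a data-parallel fashion, which the text does not state explicitly.

```python
per_device_batch = 8
num_gpus = 8            # assumption: all 8 A100 GPUs train in data parallel
grad_accum_steps = 100
max_steps = 10_000
seconds_per_step = 72

# Sequences consumed per optimizer step under the data-parallel assumption.
sequences_per_step = per_device_batch * num_gpus * grad_accum_steps

# Total wall-clock time implied by the per-step cost.
total_hours = max_steps * seconds_per_step / 3600
```

Here `total_hours` evaluates to 200, matching the reported total.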

Deployment Details.

Our system is deployed on a server on Alibaba Cloud with 30G RAM, a CPU with 8 logical cores (Intel Xeon (Ice Lake) Platinum 8369B), and an NVIDIA A100 80G Tensor Core GPU. The inference speed depends mainly on the search speed: a search in the local knowledge base takes only 1 second, while a web search takes 10 seconds on average.

6. Experiments

We present the experiments designed to evaluate the performance of the WeaverBird system, including the effectiveness of the trained encoders (section 6.1) and the generation quality (section 6.2).

6.1. Paragraph Retrieval Evaluation

Now we show the results of evaluating our trained encoders against pretrained ME5 encoders and trained Contriever-based encoders (Izacard et al., 2022a).

Experimental Setup.

We use the test set specified in section 5.2 for evaluation. When evaluating a method (i.e., a specific model paired with a specific similarity metric), we take the top $K=5$ highest-ranked retrievals for each given query, and evaluate the mean average precision (MAP) and mean average recall (MAR) on the entire test set of query-paragraph pairs. Intuitively, MAP measures the fraction of the predictions that are correct, while MAR measures the fraction of the actual paragraphs that are covered by the predictions. Higher is better.
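The sketch below follows the intuitive description above: for each query it computes the fraction of the top-$K$ retrievals that are relevant and the fraction of relevant paragraphs that are retrieved, then averages both over queries. It is meant only to illustrate these two quantities; the exact MAP and MAR definitions used in the evaluation may differ (for example, by averaging precision over ranks), and the toy data are made up.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved ids against a set of relevant ids."""
    top = retrieved[:k]
    hits = sum(1 for p in top if p in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mean_scores(runs, k):
    """Average precision@k and recall@k over (retrieved, relevant) pairs."""
    pairs = [precision_recall_at_k(ret, rel, k) for ret, rel in runs]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n

# Two toy queries: the first retrieves its single relevant paragraph, the second misses.
runs = [
    (["a", "b", "c", "d", "e"], {"a"}),
    (["x", "y", "z", "w", "v"], {"q"}),
]
mean_p, mean_r = mean_scores(runs, k=5)
```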

Results.

The results are in Figure 3. As we can see, trained encoders work significantly better than pretrained encoders, indicating the effectiveness of our training method. Moreover, our ME5-based encoders significantly outperform the Contriever-based encoders, perhaps because a large portion of our test cases are in Chinese and ME5 is a multilingual model. Interestingly, cosine similarity turns out to work the best among all the similarity metrics, even though our training objective uses dot product.

6.2. Response Quality Evaluation

Experimental Setup.

We use the test set specified in section 5.1 for evaluation. Since automatic evaluation of complicated natural language QA remains an open research question, we have each response rated by both a human financial expert and GPT-4 (OpenAI, 2023), the latter chosen for its efficiency and consistency (Hackl et al., 2023): each response is given a score in $[0,100]$ by the expert and by GPT-4, and the final score is the average of the two. Higher is better. We compare WeaverBird with WebGLM, FinGPT, and FinChat. BloombergGPT (Wu et al., 2023) is not open-sourced, so it is excluded from the baselines. Due to budget constraints, we evaluate FinChat, a commercial chatbot in finance, on a subset of 300 questions sampled from the test set.

Results.

The strongest results of our WeaverBird system are summarized in Figure 3. Although WebGLM is also bilingual and equipped with a search engine, WeaverBird performs better, mainly due to its domain-finetuned base model, its encoders trained on financial documents, and its optimized retrieval over the knowledge base and search results. FinGPT scores lower than WeaverBird and WebGLM because it does not use multi-source knowledge. FinChat performs poorly because it is not bilingual and its knowledge base covers a restricted list of stocks.

6.3. More Analysis

Analysis I: Effect of knowledge base and search engine on the response quality.

First, we remove the knowledge base (‘wv-wo-kb’ in Figure 3), which causes a moderate drop in performance. Second, we remove the search engine (‘wv-wo-se’ in Figure 3), which causes a notable drop, indicating that web search, which brings in the ‘global knowledge base’ of the internet, greatly improves the accuracy of responses. In summary, the web search results contribute most to the response quality, while the local knowledge base contributes moderately.

Analysis II: Effect of the retriever on the response quality.

We randomly pick $500$ query-response pairs generated by WeaverBird and group the responses by the accuracy of the retrieved documents (annotated by the experts). Not surprisingly, Figure 4 shows that more accurate retrievals bring a significant improvement in response quality.

Analysis III: How to correct flawed cases.

We manually analyze $50$ examples of finance-related query-response pairs annotated by the experts. Of these $50$, $70\%$ arrived at reasonable responses. However, we also found some flawed outputs, most of which can be corrected with one of the following classes of modification:

• Increase $K$ to have more searched results; increase $J$ to have more input for LLM; or increase both $K$ and $J$.

• The searched results usually consist of a number of relevant paragraphs instead of the full content of the web page. If needed, by using the functionality provided by our system, the user can choose to scrape the site and save the content to the local knowledge base. Given the full text, more relevant paragraphs are retrieved and better responses are possibly generated.

Analysis IV: Qualitative examples of challenging queries.

In this section, we present two illustrative examples. The first example poses a challenging question about a company’s dividend policy. WeaverBird outperforms the competitors by delivering a reasonably accurate response, accompanied by citations to credible sources.


Query: What is Microsoft’s dividend policy and how does it compare to other companies in the technology sector?

WebGLM: Microsoft has a history of paying dividends to its shareholders. The company’s most recent dividend increase was a 10\% increase on November 17, 2021, which brought its per-share payout to \$0.68. Microsoft has also yields 0.9\% on its stock. In comparison to other technology companies, Microsoft’s dividend policy is considered to be a good one.

// citations to information sources

More details: 1. [Investor’s Business Daily], 2. [The Motley Fool], 3. [Microsoft], 4. [Forbes], 5. [Barron’s].

FinGPT: Microsoft has a history of paying dividends to its shareholders. The company’s most recent dividend increase was a 10\% increase on November 17, 2021, which brought its per-share payout to \$0.68. Microsoft has also yields 0.9\% on its stock. In comparison to other technology companies, Microsoft’s dividend policy is considered to be a good one.

FinChat: Microsoft Corporation has a dividend policy in place. As of the most recent data, Microsoft pays a dividend four times a year. The dividend amount is \$3.00 per share. The dividend yield, which is the dividend amount divided by the stock price, is 0.88\% as of October 25, 2023.

Microsoft has a consistent history of paying dividends, with the first dividend payment dating back to February 19, 2003. The most recent dividend payment was made on November 15, 2023. In terms of how Microsoft’s dividend policy compares to other companies in the technology sector, it is important to note that dividend policies can vary among companies ... (long details omitted).

WeaverBird: Microsoft’s dividend policy is to pay a quarterly dividend of \$0.68 per share. This means that for each share of Microsoft stock, its shareholders will receive \$0.68 every 3 months or 4 times per year. The company’s dividend payments are typically made in the form of a quarterly dividend, which is a payment made by the company to its shareholders on a regular basis. When comparing Microsoft’s dividend policy to other companies in the technology sector, it is important to consider the company’s payout ratio, which is the percentage of its earnings that are paid out to shareholders as dividends. Microsoft’s payout ratio is 66.1\%, which is relatively high compared to some other technology companies. For example, Apple, Microsoft’s main competitor in the technology sector, pays a dividend of only 3.4\% of its earnings out to shareholders. Google’s dividend policy is similar to Microsoft’s, but its payout ratio is lower at 52.5\%.

Overall, Microsoft’s dividend policy is relatively high compared to other technology companies, and its shareholders can expect to receive regular dividends as the company continues to perform well.

// citations to information sources

More details: 1. [Dividends Diversify], 2. [The Motley Fool], 3. [Seeking Alpha], 4. [Forbes], 5. [Barron’s].

Listing 2: About company dividend policy.
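The dollar figures quoted in Listing 2 can be sanity-checked with the definition that the FinChat response itself states: dividend yield is the dividend amount divided by the stock price. The sketch below uses only numbers that appear in the listing and takes them at face value. The function names are ours, and the implied share price is derived arithmetic, not a quoted market value.

```python
def annual_dividend(quarterly: float, payments_per_year: int = 4) -> float:
    """Annual dividend per share from a fixed per-payment amount."""
    return quarterly * payments_per_year


def dividend_yield(annual_div: float, price: float) -> float:
    """Yield as a fraction: annual dividend per share divided by share price."""
    return annual_div / price


def implied_price(annual_div: float, yield_fraction: float) -> float:
    """Share price implied by a quoted dividend and yield (inverse of the above)."""
    return annual_div / yield_fraction


if __name__ == "__main__":
    # WeaverBird's answer: $0.68 per quarter, paid four times per year.
    print(round(annual_dividend(0.68), 2))
    # FinChat's answer: a $3.00 dividend at a 0.88% yield implies this price.
    print(round(implied_price(3.00, 0.0088), 2))
```

The two responses therefore imply different annual payouts per share, which is the kind of discrepancy that cited sources let a reader resolve.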

The second example is in Chinese. As we can see, our WeaverBird system exhibits superior performance in answering Chinese finance-related questions, which neither WebGLM nor FinChat could answer well.


Query: 半夏资本李蓓的最新投资观点是什么?

<– English translation

What is the latest investment viewpoint of Li Bei at Banxia Capital?

–>

WebGLM: 半夏资本李蓓近日在半夏投资公众号发表了两篇文章，对当前全球宏观形势和投资机会进行了展望。她表示，人民币和中国股市或进入牛市。在今年的4月底，她曾经提出过“股票风险已经越来越小”的观点。半夏投资是一家极具流量的知名私募公司，李蓓经常通过官方公众号发布自己对投资、市场以及一些热点话题的看法，且观点鲜明，行文犀利。

<– English translation

Li Bei, from Banxia Capital, recently published two articles on the Banxia Investment official WeChat account, providing insights into the current global macro situation and investment opportunities. She mentioned that the Renminbi and the Chinese stock market may enter a bull market. In late April of this year, she also expressed the view that “stock market risks are becoming smaller”. Banxia Investment is a well-known private equity firm with a large following. Li Bei often shares her views on investment, markets, and hot topics through the official WeChat account, presenting distinct opinions in a sharp writing style.

–>

More Details: 1. [stcn.com], 2. [stcn.com], 3. [ce.cn], 4. [sina.cn], 5. [sina.com.cn].

FinGPT: 半夏资本投资总监李蓓在投资界活动分享中表示，A股市场投资难度越来越大，价值投资和成长投资之间的边界越来越模糊，在这样的市场环境中，投资策略的边界也变得越来越模糊。她认为，在这样的市场环境中，投资策略的边界也变得越来越模糊，在这样的市场环境中，投资者需要有更加开放的心态，能够灵活应对市场变化，寻找一些未被充分发掘的投资机会。

<– English translation

Banxia Capital’s Investment Director, Li Bei, shared during an investment conference that investing in the A-share market has become increasingly challenging, with the boundaries between value investing and growth investing becoming more blurred. In such a market environment, the boundaries of investment strategies have also become increasingly unclear. She believes that in such a market environment, investors need to have a more open mindset and be flexible in responding to market changes, seeking out investment opportunities that have not been fully explored.

–>

FinChat: I’m sorry, but I couldn’t find any information about 半夏资本李蓓’s latest investment views in the provided data sources. The data sources primarily focus on different companies and their financial information.

<– English translation

Based on the provided data, there is no information available about the latest investment views of Li Bei at Banxia Capital. The data source does not include Li Bei’s investment perspectives. If you have any other questions about Li Bei at Banxia Capital, I will do my best to answer them.

–>

WeaverBird: 根据已知信息，半夏投资创始合伙人兼基金经理李蓓在最近的采访中透露了一些最新的投资观点。她表示，过去一个月，半夏投资在持续提高股票仓位，主要集中在中下游大盘股。另外，她也提到了跨资产类别的好处，认为当你发现一个大的经济趋势的时候，并不是说投股票才是最好的选择，有些时候可能商品是更好的选择。李蓓还表示，基金规模过大并不利于宏观对冲策略的发挥。

<– English translation

Based on the known information, Li Bei, the co-founder and fund manager of Banxia Capital, revealed some of the latest investment perspectives in a recent interview. She mentioned that over the past month, Banxia Capital has been consistently increasing its stock positions, mainly focusing on large-cap stocks in midstream and downstream industries. Additionally, she emphasized the benefits of diversifying across asset classes, stating that when you identify a major economic trend, it doesn’t always mean investing in stocks is the best choice; sometimes commodities might be a better option. Li Bei also mentioned that having an excessively large fund size is not conducive to the implementation of macro hedging strategies.

–>

More Details: 1. [21jingji.com], 2. [stcn.com], 3. [ce.cn], 4. [sina.cn], 5. [sina.com.cn].

Listing 3: A Chinese example.

7. Conclusion

We presented an open-source, intelligent dialogue system tailored for the finance sector, which outperforms the best available solutions, as evidenced by its superior performance on a wide range of finance-related questions. Our systematic approach contributes to the line of research on building domain-specific large language models. In addition, our training and inference strategies may be useful for developing retrieval-based dialogue systems in other domains, unlocking broader real-world applications.

References

fin (2023) 2023. FinChat. https://finchat.io/
Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403 (2023).
Bharti (2023) Gaurang Bharti. 2023. wealth-alpaca. https://github.com/gaurangbharti1/wealth-alpaca
Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the International Conference on Machine Learning (ICML). https://proceedings.mlr.press/v162/borgeaud22a/borgeaud22a.pdf
Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Chase (2022) Harrison Chase. 2022. LangChain. https://github.com/hwchase17/langchain
Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.longhoe.net/abs/2103.10360
Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks Are All You Need. arXiv preprint arXiv:2306.11644 (2023).
Hackl et al. (2023) Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, and Maximilian Sailer. 2023. Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings. arXiv:2308.02575 [cs.CL]
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021). https://arxiv.longhoe.net/pdf/2106.09685.pdf
Izacard et al. (2022a) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022a. Unsupervised Dense Information Retrieval with Contrastive Learning. Trans. Mach. Learn. Res. 2022 (2022). https://openreview.net/forum?id=jKN1pXi7b0
Izacard et al. (2022b) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022b. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022). https://arxiv.longhoe.net/abs/2208.03299
Jin et al. (2023a) Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. 2023a. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. arXiv preprint arXiv:2310.01728 (2023).
Jin et al. (2023b) Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, et al. 2023b. Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook. arXiv preprint arXiv:2310.10196 (2023).
Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019). https://arxiv.longhoe.net/abs/1702.08734
Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.longhoe.net/abs/2004.04906
Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.longhoe.net/abs/1911.00172
LangChain-ChatGLM (2023) LangChain-ChatGLM. 2023. LangChain-ChatGLM. https://github.com/imClumsyPanda/langchain-ChatGLM
Liška et al. (2022) Adam Liška, Tomáš Kočiský, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, Cyprien de Masson d’Autume, Tim Scholtes, Manzil Zaheer, Susannah Young, Ellen Gilsenan-McMahon, Sophia Austin, Phil Blunsom, and Angeliki Lazaridou. 2022. StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models. arXiv preprint arXiv:2205.11388 (2022). https://arxiv.longhoe.net/abs/2205.11388
Liu et al. (2023) Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences. arXiv:2306.07906 [cs.CL]
Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft
Menick et al. (2022) Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147 (2022). https://arxiv.longhoe.net/abs/2203.11147
Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021). https://arxiv.longhoe.net/abs/2112.09332
OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023). https://arxiv.longhoe.net/abs/2303.08774.pdf
Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.longhoe.net/abs/1606.05250
Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics (2023). https://arxiv.longhoe.net/abs/2302.00083
Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 5418–5426. https://doi.org/10.18653/v1/2020.emnlp-main.437
Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761 (2023). https://arxiv.longhoe.net/abs/2302.04761
Shi et al. (2022) Weijia Shi, Julian Michael, Suchin Gururangan, and Luke Zettlemoyer. 2022. Nearest neighbor zero-shot inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://aclanthology.org/2022.emnlp-main.214/
Shi et al. (2023) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023). https://arxiv.longhoe.net/abs/2301.12652
Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533 (2022).
Wu et al. (2023) Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023). https://arxiv.longhoe.net/abs/2303.17564
Wu et al. (2022) Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing Transformers. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=TrjbxzRcnf-
Yang et al. (2023) Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-Source Financial Large Language Models. arXiv preprint arXiv:2306.06031 (2023).
Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022). https://arxiv.longhoe.net/abs/2210.02414
Zhong et al. (2022) Zexuan Zhong, Tao Lei, and Danqi Chen. 2022. Training language models with memory augmentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.longhoe.net/abs/2205.12674
Zhou et al. (2023) Shuyan Zhou, Uri Alon, Frank F Xu, Zhengbao Jiang, and Graham Neubig. 2023. Docprompting: Generating code by retrieving the docs. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.longhoe.net/abs/2207.05987

Appendix A Ongoing and Future Work

We are currently exploring several extensions to deal with more complex dialogue cases in the finance domain. We are particularly interested in handling:

•

Multiple-turn conversations. In such scenarios, the system has to stay engaged in the conversation and, at each turn, provide new responses that are consistent and coherent with previous statements.
•

External tools. Users may want our system to interact with other external tools, such as pulling information from commercial data vendors (e.g., Bloomberg Terminal) and placing transaction orders in online brokerage software (e.g., Charles Schwab).
•

More user-friendly presentation. Users may want our system to present answers in richer formats such as tables and diagrams.

Appendix B Software Interface

The main interface of our WeaverBird system can be seen in Figures 5 and 6. Experience it at https://weaverbird.ttic.edu.
