WeaverBird: Empowering Financial Decision-Making with
Large Language Model, Knowledge Base, and Search Engine
Abstract.
We present WeaverBird, an intelligent dialogue system tailored for the finance sector. Our system harnesses a large language model of GPT architecture that has been tuned using extensive corpora of financial texts. As a result, our system possesses the capability to understand complex financial queries, such as “How should I manage my investments during inflation?”, and provide informed responses. Furthermore, our system integrates a local knowledge base and search engine to retrieve relevant information. The final system responses are conditioned on the web search and include proper references to the sources, thus enjoying enhanced credibility. Comparative evaluations on a wide range of financial question-answering highlight our system’s superior performance over other models. Experience WeaverBird for yourself through our live demo at https://weaverbird.ttic.edu and view our concise 2-minute video at https://www.youtube.com/watch?v=yofgeqnlrMc. Our code has also been made publicly available at https://github.com/ant-research/fin_domain_llm.
1. Introduction
Everyone faces financial decisions from time to time. From questions like “What does the earnings report indicate for stock X?” to “How should I navigate my investments during inflation?”, many individuals prefer consulting financial professionals. Yet, not everyone has the means to access such expert services.
In this paper, we introduce WeaverBird, an intelligent dialogue system tailored for the financial sector. (The name embodies our vision for WeaverBird, symbolizing our commitment to delivering comprehensive and intelligent support in the finance domain. It represents the intricate and skillful weaving of financial knowledge and information, empowering users with a robust and reliable resource.) Our primary objective is to bridge the disparity in accessing financial advisory services and to democratize financial expertise. Acting as an intelligent financial assistant, WeaverBird is adept at understanding complex economic and financial queries, offering financial perspectives, and enabling users to make informed financial decisions. Through the launch of our system, we aspire to ensure that people across varied financial backgrounds gain access to trustworthy financial advice.
Table 1: Comparison of WeaverBird with other competitive approaches.
Feature | WebGLM (Liu et al., 2023) | BloombergGPT (Wu et al., 2023) | FinGPT (Yang et al., 2023) | FinChat (fin, 2023) | WeaverBird |
Finance-domain specific LLM | ✗ | ✓ | ✓ | ✓ | ✓ |
Finetuned on bilingual financial data | ✗ | ✗ | ✓ | ✗ | ✓ |
Local knowledge base querying | ✗ | ✗ | ✓ | ✓ | ✓ |
Web search querying | ✓ | ✗ | ✗ | ✓ | ✓ |
Efficiency-optimized search on multi-source knowledge | ✗ | ✗ | ✗ | N/A | ✓ |
Temporal reasoning optimized | ✗ | ✗ | ✗ | N/A | ✓ |
Open access to training data code | ✗ | ✗ | ✓ | ✗ | ✓ |
Open access to deployed chatbot | ✗ | ✗ | ✗ | ✗ | ✓ |
Another key motivation for developing a finance dialogue system is to support future research in this area. A comparison of our WeaverBird system to other competitive approaches is summarized in Table 1. Throughout the development process of our system, we have amassed extensive collections of finance-centric text documents, which have been leveraged in fine-tuning language foundation models. This resource will be available to the research community for future exploration and analysis. Notably, our collection features a substantial corpus of Chinese financial documents, addressing the gap in the predominantly English-based finance corpora and offering a valuable addition. This equips our WeaverBird system with a distinctive edge in processing Chinese queries and delving into discussions concerning Chinese firms and equities. Furthermore, we have constructed a collection of challenging financial question-answer pairs, which can serve as a benchmark for evaluating the performance of future models and systems. Each question stems from a selected document, such as an industry research report. Consequently, answering these questions correctly requires the discovery of the specific document or equivalent information from the local knowledge base or across the internet. Our WeaverBird system harnesses the latest research advancements in large language models, enhanced by our novel training and inference methods, which we will elaborate on in section 2. This system sets a robust baseline for evaluating future models and systems in the finance domain.
An additional compelling motivation is the opportunity to leverage the knowledge and assets acquired during the development of our system to create intelligent assistants in broader domains, such as education and healthcare.
2. System Design
When presented with a natural language prompt, such as “How many EVs did BYD sell in Q1 2023?”, our system employs a two-phase process. First, it transforms the prompt into queries suitable for the knowledge base and search engine. Next, it analyzes the retrieved information, identifies the most relevant pieces, and generates a comprehensive natural language response, with proper references to the original sources. The Internet’s vast resources enable us to craft responses and insights based on up-to-date information, enhancing the adaptability and reliability of our approach. The overall pipeline of WeaverBird is depicted in Figure 1. While it builds upon the general framework of LangChain (Chase, 2022; LangChain-ChatGLM, 2023), our system integrates our novel training and inference techniques, which significantly enhance its overall performance and efficiency. In this section, we delineate the design of each phase, including the model architecture as well as training and inference paradigms.
2.1. Efficiency-Optimized Search and Update
The first phase is to retrieve information from our local knowledge base and the broader internet (i.e., the first step in Figure 1). Our knowledge base is a collection of financial documents $d_1, \ldots, d_N$, where the number of documents $N$ is large. Following Chase (2022), we split each document $d_n$ into multiple paragraphs $p_{n,1}, p_{n,2}, \ldots$, where $m$ (and below) denotes the index of the paragraph $p_{n,m}$ for the $n$-th document, and embed each paragraph into a multidimensional embedding $\mathbf{k}_{n,m}$ through a neural encoder $f_{\text{key}}$. When a language query comes, it is embedded into a vector $\mathbf{q}$ through another neural encoder $f_{\text{que}}$, and we retrieve the top $K$ relevant paragraphs from the knowledge base according to their cosine similarities, i.e., $\cos(\mathbf{q}, \mathbf{k}_{n,m})$, where $K$ is a hyperparameter.
If the cosine similarity of the most relevant retrieval is larger than a predefined threshold $\tau$, we skip the internet search and proceed directly to the response generation phase that we present in section 2.3. Otherwise, we start an internet search, which is slower than a local search but may yield higher-quality results. This efficiency-optimized design is not in the original LangChain pipeline and we propose it to strike a balance between performance and efficiency. (Internet search is also not in the original LangChain, but it is in the latest release, concurrent with our work.) When the local knowledge base already provides highly relevant information, we can avoid the need to search the internet. Otherwise, we invest additional effort to retrieve internet-based information for enhanced quality.
When an internet search is used (the search engine we use is Google Search), we select the highest-ranked documents, split each document into paragraphs, and embed each paragraph into a vector through the neural encoder $f_{\text{key}}$ that we have used for the knowledge base. Then we select the top $K$ relevant paragraphs according to their cosine similarities. Combined with the local retrievals, we end up with $2K$ paragraphs and proceed to the next phase.
If any web paragraph exhibits a cosine similarity greater than the threshold $\tau$, our system will automatically add the entire document to which it belongs into our local knowledge base. Users can also manually add any online document. This update ensures that future queries, whether identical or related, will receive more efficient responses. As updates occur, the knowledge base will expand, but the local search is still significantly more efficient than a web search.
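The search-and-update logic of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the names `retrieve`, `top_k`, and `web_search`, and the paragraph records with `"doc"` and `"vec"` fields, are hypothetical, and embeddings are plain Python lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, paragraphs, k):
    """Return the k (score, paragraph) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, p["vec"]), p) for p in paragraphs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def retrieve(query_vec, knowledge_base, web_search, k, tau):
    """Search locally first; query the web only if the best local hit is weak."""
    local = top_k(query_vec, knowledge_base, k)
    if local and local[0][0] > tau:
        return local  # confident local hit: skip the slower web search
    web_paragraphs = web_search()  # hypothetical callable returning embedded paragraphs
    web = top_k(query_vec, web_paragraphs, k)
    # Add every paragraph of any web document that produced a strong hit,
    # so that related future queries can be answered from the local store.
    strong_docs = {p["doc"] for score, p in web if score > tau}
    knowledge_base.extend(p for p in web_paragraphs if p["doc"] in strong_docs)
    return local + web
```

With this structure, a query that triggered a web search once is served locally the next time, provided the web result cleared the threshold.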
2.2. Learning to Embed and Search
We confidently consider a higher similarity to signify a more relevant paragraph due to the training of the key encoder $f_{\text{key}}$ and the query encoder $f_{\text{que}}$. Their optimization involves maximizing a well-defined objective:
$$\log \frac{\exp(\mathbf{q}^{\top} \mathbf{k}^{+})}{\exp(\mathbf{q}^{\top} \mathbf{k}^{+}) + \sum_{\mathbf{k}^{-} \in \mathcal{N}} \exp(\mathbf{q}^{\top} \mathbf{k}^{-})} \tag{1}$$
where $\mathbf{k}^{+}$ is the embedding of the paragraph known to contain relevant information for the query $\mathbf{q}$, and the other embeddings $\mathbf{k}^{-}$ belong to a set $\mathcal{N}$ of negative paragraphs (see section 3.3 for how they are selected). Intuitively, we want the dot products to be relatively large for the query-paragraph pairs that are actually relevant. Our encoders use the Multilingual-E5-base model architecture (Wang et al., 2022).
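To make the objective concrete, the sketch below evaluates it for a single query, assuming it takes the usual softmax (cross-entropy) form over dot products that the description of positive and negative paragraphs suggests. The function name and the list-based vectors are illustrative assumptions.

```python
import math

def contrastive_objective(query, positive, negatives):
    """Log-probability of the positive paragraph under a softmax over dot products."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    pos_score = dot(query, positive)
    scores = [pos_score] + [dot(query, neg) for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return pos_score - log_norm
```

The value is at most zero and approaches zero as the positive dot product dominates the negative ones, which is the behavior training pushes towards.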
Once we have trained the encoders, we determine the threshold $\tau$ by computing the cosine similarities for a holdout set of query-paragraph pairs and analyzing the histogram. (We experimented with dot product and cosine similarity and found that the latter yields a better overall performance even though our training objective has used the former.) In practice, we choose the 1% quantile to be our threshold $\tau$.
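The quantile rule can be sketched as below. The function name is hypothetical, and the nearest-rank convention used here is an assumption, since the exact quantile estimator is not specified.

```python
def quantile_threshold(similarities, q=0.01):
    """Pick the similarity value at (roughly) the q-quantile of a holdout set."""
    ordered = sorted(similarities)
    index = int(q * (len(ordered) - 1))  # lower nearest-rank position
    return ordered[index]
```

Choosing a low quantile of the similarities of known-relevant pairs means that almost all truly relevant local paragraphs clear the threshold, so the web search is triggered mainly when the local store has nothing comparable.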
Another question is where our query-paragraph pairs come from. We have collected a corpus of query-paragraph pairs by leveraging financial documents and question generation methods, which we will outline in section 3.
2.3. Generation by Large Language Models
In this phase, our system analyzes the search results, fills an appropriate prompt template, and generates a response. The strategies we adopt are as follows:
Ranker.
The system first ranks all $2K$ search results (or $K$ results if the internet search has been skipped as discussed in section 2.1) based on their cosine similarities with the query (i.e., the second step in Figure 1).
Prompt Formulation with Temporal Context.
After receiving the ranking results from the previous step, we plug the top $J$ (where $J \le 2K$) results into the context part of the predefined prompt template. Since knowledge in the finance domain evolves dynamically, the prompt template grounds each retrieved result and question with temporal facts (e.g., published date and question date), which helps the language model adapt to time-evolving knowledge and possibly makes the response more meaningful and accurate (Jin et al., 2023b).
Having temporal metadata for questions and retrieved results allows us to ask questions with relative time specifications (e.g., “3 months ago”), which are often used in finance queries. Moreover, it enables us to ingest new knowledge periodically, in a streaming setup, and evaluate on questions asked during that period (Liška et al., 2022).
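A prompt with temporal context can be assembled along the following lines. The template wording, the `build_prompt` helper, and the `"published"` and `"text"` fields are hypothetical and only illustrate the idea of attaching a date to the question and to each retrieved result; they are not the template used in the system.

```python
from datetime import date

# Hypothetical template: the question date and each result's publication date
# are stated explicitly so relative expressions can be resolved.
TEMPLATE = (
    "Today is {question_date}.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer using only the context above and cite sources as [n]."
)

def build_prompt(question, question_date, results, top_n):
    """Fill the template with the top_n ranked results, each tagged with its date."""
    lines = [
        f"[{i}] (published {r['published'].isoformat()}) {r['text']}"
        for i, r in enumerate(results[:top_n], start=1)
    ]
    return TEMPLATE.format(
        question_date=question_date.isoformat(),
        context="\n".join(lines),
        question=question,
    )
```

Because both dates appear in the prompt, a phrase such as “3 months ago” in the question can be interpreted against the stated question date rather than against whatever period the model saw during pre-training.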
Response Generation.
The large language model organizes the information to form a fluent and informed response, corresponding to the third step in Figure 1. The response also includes accurate references to the sources of the information. As an option, our system can further provide a list of citations for the remaining paragraphs at the end of the response, to complement the main response and provide users with additional recommended reading materials.
Our large language model has been adapted to the finance domain and is thus able to process the input information skillfully and generate appropriate responses. We started with a GLM (Du et al., 2022; Zeng et al., 2022) pre-trained on extensive English and Chinese corpora. Then we fine-tuned this model using our collected Chinese and English financial documents as well as expert-annotated financial query-response pairs, which we introduce in section 3.
3. Data Collection and Utilization
In this section, we introduce our corpora of financial documents and their collection process, as well as how we used them to build the local knowledge base and generate training and evaluation query-paragraph pairs.
3.1. Collection of Financial Documents
We collected a large set of financial documents from well-known and accredited sources, most of which are global financial institutes. Our collection includes 240 million Chinese documents and 25 million English documents. They were all published between 2010-Jan-01 and 2023-June-30, and cover diverse topics such as macroeconomics and fixed income. Table 2 shows a break-down into the information sources, and Table 3 shows a break-down into the topics.
Table 2: Information sources of the collected documents.
Chinese docs | English docs |
CICC | Deutsche Bank |
CITIC Sec. | CICC |
Industrial Sec. | HSBC |
Haitong Sec. | CITIC Sec. |
Guotaijunan Sec. | Guotaijunan Sec. |
Huatai Sec. | Kim Eng Sec. |
Guangfa Sec. | BNP Paribas |
Changjiang Sec. | Industrial Sec. |
CMS | Shenwan Hongyuan |
Founderfu | UBS |
Table 3: Topics of the collected documents.
Chinese docs | English docs |
Financial Engr. | HK Stocks |
Company | Company |
Industry | Industry |
Strategies | Macro econ |
HK Stocks | Strategies |
Morning report | Morning report |
Macro econ | Bond |
China GEM | China GEM |
Bond | Financial Engr. |
Fund | Fund |
For each document, we save into a text file the concatenation of its release time, title, summary, and other metadata (such as topic). By manually analyzing a sampled set of documents and consulting domain experts, we recognized that every document has a summary section that includes nearly all the important information of the document. Therefore, we did not save the full content, which would consume a significantly larger amount of disk space and memory but contribute minimal additional information. After processing, the average token count for Chinese documents is 5000 and that for English documents is 2000. Below are two examples of processed documents:
Example 0 (Chinese).
2023-04-03 00:00:00;海底捞(6862.HK):2H22净利率7.5%;门店重启稳步推进;海底捞发布2022年度业绩…(full content omitted);港股;个股;公司定期报告评述;股票;**香港.
Example 0 (English).
2023-02-07 00:00:00; Feb 223 ECB monetary policy meeting commentary: ECB maintains pace of rate hikes without reinforcing…(full content omitted); macro;oversea;finance;economics research;economics;europe.
The primary use of our large-scale data collection is to build a local knowledge base. As mentioned in section 2.1, we split each document into smaller chunks (about 250 tokens per chunk), and embed each chunk into a vector with our trained neural encoder $f_{\text{key}}$. We implemented this vector-based knowledge base and its retrieval method using Faiss (Johnson et al., 2019), a library highly optimized for efficient similarity search and clustering of dense vectors.
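The chunk-and-embed step can be sketched as follows. This is an illustrative stand-in rather than the Faiss-based implementation: whitespace splitting replaces the real tokenizer, `embed` is a hypothetical callable playing the role of the trained key encoder, and the index is a plain list instead of a Faiss index.

```python
def split_into_chunks(tokens, chunk_size=250):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def build_index(documents, embed, chunk_size=250):
    """Return one (doc_id, chunk_no, vector) entry per chunk of every document."""
    index = []
    for doc_id, text in documents.items():
        tokens = text.split()  # assumption: whitespace tokens stand in for model tokens
        for chunk_no, chunk in enumerate(split_into_chunks(tokens, chunk_size)):
            index.append((doc_id, chunk_no, embed(" ".join(chunk))))
    return index
```

Keeping the document id next to every vector is what later allows a retrieved chunk to be cited back to its source document.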
3.2. Collection of Query-Response Pairs
Another important use of our data collection is to create query-response pairs for tuning the language model. This step is crucial for the performance of our system: recent research has shown that language models tuned with query-response pairs have demonstrated superior performance in answering open-domain questions and engaging in conversations (Mangrulkar et al., 2022). We collected four kinds of query-response pairs, three of which were created from our collection of financial documents. Table 4 shows a breakdown of the pairs into these four categories. In each category, 60% of the pairs are in Chinese while the rest are in English.
Generation
For each document, we can construct a query-response pair that tunes the language model to perform content generation. To create the query, we fill the template “Please write a [TYPE] on [TOPIC] with the title [TITLE]” where TYPE, TOPIC, and TITLE can be found in the corresponding text file. The SUMMARY part of the file is used as the response. An example of query is “Please write a review report on China economics with the title CHINA POLICY: POLITBURO MAINTAINS PRO-GROWTH STANCE” and its response is “The politburo indicates continuous pro-growth policy without big stimulus…”
Summarization
The reverse of a content generation pair can be used as a content summarization pair. Now the query is created by filling the template “Please summarize the following [TYPE]: [SUMMARY]” and the response is the TITLE. An example query is “Please summarize the following report: NIO’s 4Q22 earnings missed our prior forecast…” and its response is “NIO INC. (NIO.US): 4Q22 miss”.
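Both templates can be filled mechanically from a processed document. The sketch below assumes each document has already been parsed into a record with `type`, `topic`, `title`, and `summary` fields; the record layout and function names are hypothetical.

```python
def generation_pair(record):
    """Query asks for a document of the given type, topic, and title; response is its summary."""
    query = (
        f"Please write a {record['type']} on {record['topic']} "
        f"with the title {record['title']}"
    )
    return {"query": query, "response": record["summary"]}

def summarization_pair(record):
    """Reverse direction: query carries the summary text; response is the title."""
    query = f"Please summarize the following {record['type']}: {record['summary']}"
    return {"query": query, "response": record["title"]}
```

Since the two functions read the same record, every document yields one pair of each kind at no extra annotation cost.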
Close-book QA
Given a document, we would like the model to answer any questions that could be answered by directly identifying the relevant information from the document. This is the focus of the close-book QA research (Roberts et al., 2020). To achieve this goal, we need a specific kind of query-response pairs: query is the question and the document; response is the answer. We created such pairs using the question generation method proposed by Rajpurkar et al. (2016). Precisely, given a document, we used a pre-trained question generation model to identify some interesting information in the document and then generate a question whose answer should be that piece of information.
Open QA
We would also like the system to answer open questions: unlike close-book QA, the evidence is not available in the context. We created Chinese question-answer (i.e., query-response) pairs of this kind by scraping well-known online QA forums such as Zhihu and SinaFinance. On such forums, users post questions, tag questions, answer questions, and upvote or downvote answers. We identified 6,000 highly recognized questions with the “finance” tag; with each question being the query, we took the answer with the most upvotes to be its response. We then randomly sampled 4,000 English open QA questions from the English finance QA dataset FinQA (Bharti, 2023).
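The selection rule for forum data (keep finance-tagged questions and pair each with its most-upvoted answer) can be sketched as below. The thread structure with `question`, `tags`, and `answers` fields is a hypothetical representation of scraped forum data.

```python
def open_qa_pairs(threads, tag="finance"):
    """Pair each tagged question with its most-upvoted answer."""
    pairs = []
    for thread in threads:
        if tag in thread["tags"] and thread["answers"]:
            best = max(thread["answers"], key=lambda a: a["upvotes"])
            pairs.append({"query": thread["question"], "response": best["text"]})
    return pairs
```

Questions without any answer are skipped, since they cannot provide a response.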
Table 4: Breakdown of the collected query-response pairs.
Use case | % | # Pairs | Source |
Generation | 38% | 13000 | Collected financial documents |
Summarization | 20% | 7000 | Collected financial documents |
Close-book QA | 12% | 4100 | Collected financial documents |
Open QA | 30% | 10000 | Chinese: collected from online QA forums (e.g., Zhihu and SinaFinance); English: sampled from FinQA (Bharti, 2023) |
3.3. Data Collection for Training Encoders
Recall from section 2.2 that we need query-paragraph pairs to train the key and query encoders $f_{\text{key}}$ and $f_{\text{que}}$. The close-book QA subset available in section 3.2 provides a direct resource to fulfill this requirement. First, we categorized all the close-book query-response pairs according to their topics. Then we manually reviewed the data of each topic and selected a total of 1,000 pairs while ensuring a comprehensive coverage of all topics. The selection of each pair was based on our anticipated likelihood that someone would ask this question during interaction with the demo. We used these 1,000 query-response pairs as the positive pairs in training the encoders. For each of them, we randomly sampled five negative responses from the entire pool of paragraphs.
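Negative sampling for one positive pair can be sketched as follows. Excluding the positive paragraph itself from the candidates is an assumption made here for the illustration, and the function and argument names are hypothetical.

```python
import random

def sample_negatives(positive_id, paragraph_ids, n_neg=5, rng=random):
    """Draw n_neg distinct paragraph ids uniformly at random, excluding the positive."""
    candidates = [pid for pid in paragraph_ids if pid != positive_id]
    return rng.sample(candidates, n_neg)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled training set reproducible.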
4. Related Work
Large language models such as ChatGPT (Brown et al., 2020) and GPT-4 (OpenAI, 2023) have showcased their remarkable capabilities in engaging in human-like communication and understanding complex queries, bringing a trend of incorporating LLMs in various fields (Anil et al., 2023; Gunasekar et al., 2023; Jin et al., 2023a). Recent advancements have further enhanced these models by incorporating external tools, enabling them to search for relevant online information (Nakano et al., 2021), utilize tools (Schick et al., 2023), and create more sophisticated applications (Chase, 2022). In this paper, we harness these advancements to build an intelligent dialogue system that specializes in the finance domain. The most relevant work is BloombergGPT (Wu et al., 2023), a generative Transformer trained with large corpora of finance-related text. While BloombergGPT has only been trained with English corpora, our system is also carefully tailored to handle Chinese, a language spoken by one-fifth of the global population. In addition, BloombergGPT is only trained to predict the next token given the context but not to engage in conversations like our WeaverBird system; nor is it integrated with knowledge bases or search engines.
Our work aligns with research on retrieval-based language generation, which involves enhancing language models by incorporating external datastores. This area has seen advancements in new model architectures that condition the next token generation on retrieved text (Khandelwal et al., 2020; Wu et al., 2022; Borgeaud et al., 2022), as well as developments in training and inference methods aimed at improving the quality of retrieval (Zhong et al., 2022) and the performance in downstream tasks such as question answering (Karpukhin et al., 2020; Izacard et al., 2022b; Shi et al., 2022; Menick et al., 2022) and code generation (Zhou et al., 2023). Our work aims to deliver a system that has superior performance in answering finance-related questions. It uses the standard GPT architecture, and augments the inputs (i.e., queries) with retrieved information (i.e., online documents or KB entries): this design is similar to the approaches of Shi et al. (2023); Ram et al. (2023), but our efficiency-optimized search and update (section 2.1) is novel.
5. Models and Training
5.1. Language Model
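Architecture.
Our base model is the pre-trained GLM (Du et al., 2022; Zeng et al., 2022) introduced in section 2.3.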
Dataset.
We use the collected query-response pairs described in section 3.2 for training and evaluation. The whole set of pairs is divided into 17,000 training pairs, 3,000 development pairs, and 4,100 test pairs.
Training.
We apply the LoRA (Hu et al., 2021) strategy to fine-tune the base model to adapt to the finance domain. The LoRA layer details are as follows: the LoRA rank is 16, the LoRA alpha is 32, the dropout is 0.1, and other settings are default.
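As a reminder of what these settings control, LoRA (Hu et al., 2021) keeps the pre-trained weight $W_0 \in \mathbb{R}^{d \times k}$ frozen and learns a low-rank update whose contribution is scaled by $\alpha / r$. With the rank and alpha above, the adapted layer computes:

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},
\qquad r = 16,\ \alpha = 32 \ \Rightarrow\ \frac{\alpha}{r} = 2 .
```

Only $A$ and $B$ are trained, which adds $r(d + k)$ parameters per adapted matrix instead of $dk$. In common implementations the dropout rate (0.1 here) is applied to the input of the low-rank branch only; whether our code path does the same depends on the library defaults.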
5.2. Encoders
Architecture.
The key and query encoders $f_{\text{key}}$ and $f_{\text{que}}$ are initialized from the Multilingual-E5-base (ME5) model (Wang et al., 2022).
Dataset.
Recall from section 2.2 that we need query-paragraph pairs to train the key and query encoders $f_{\text{key}}$ and $f_{\text{que}}$. We use the 1,000 query-response pairs selected in section 3.3 for training and evaluation. The chosen pairs were divided into sets of 700 training pairs, 100 development pairs, and 200 test pairs.
Training.
We pass query-response pairs to the model to yield a scalar score for each pair, and we maximize the scores for the positive pairs while minimizing the scores for the negative pairs with the cross-entropy loss.
5.3. Implementation and Deployment Details
Knowledge Base and WebUI.
For the implementation of LangChain, we use the code from the public GitHub repository at https://github.com/imClumsyPanda/langchain-ChatGLM (LangChain-ChatGLM, 2023), which does not specify a license. For the implementation of WebUI, we use the code from the public GitHub repository at https://github.com/THUDM/WebGLM (Liu et al., 2023), released under the Apache 2.0 License.
Computation Cost.
All the experiments were conducted on a server with 128G RAM, a CPU with 64 logical cores (Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz), and 8 NVIDIA A100 80G Tensor Core GPUs for acceleration. During the fine-tuning process on the finance dataset, we set the train batch size on each device equal to , the gradient accumulation step equal to 100, the max step equal to , and the learning rate equal to 0.0001. The running time of each step is close to 72 seconds, and the total running time is close to hours.
Deployment Details.
Our system is deployed on a server on Alibaba Cloud with 30G RAM, a CPU with 8 logical cores (Intel Xeon(Ice Lake) Platinum 8369B), and an NVIDIA A100 80G Tensor Core GPU. The inference speed depends mainly on the speed of the web search: a search in the local knowledge base takes only about 1 second, while a web search takes 10 seconds on average.
6. Experiments
We present the experiments designed to evaluate the performance of the WeaverBird system, including the effectiveness of the trained encoders (section 6.1) and the generation quality (section 6.2).
6.1. Paragraph Retrieval Evaluation
Now we show the results of evaluating the trained encoders against pretrained ME5 and trained Contriever-based encoders (Izacard et al., 2022a).
Experimental Setup.
We use the test set specified in section 5.2 for evaluation. When evaluating a method (i.e., a specific model paired with a specific similarity metric), we take the top $K$ highest-ranked retrievals for each given query, and evaluate the mean average precision (MAP) and mean average recall (MAR) on the entire test set of query-paragraph pairs. Intuitively, MAP measures the fraction of the predictions that are correct, while MAR measures the fraction of the actual paragraphs that are covered by the predictions. Higher is better.
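Following the intuitive reading given above (fraction of predictions that are correct, fraction of actual paragraphs that are covered), the two metrics can be sketched as precision and recall at a cutoff, averaged over queries. This is a simplified illustration with hypothetical names; it does not include any rank-aware weighting that the evaluation scripts may apply.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved ids against the relevant id set."""
    top = retrieved[:k]
    hits = sum(1 for pid in top if pid in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mean_precision_recall(runs, k):
    """Average precision@k and recall@k over (retrieved, relevant) pairs, one per query."""
    scores = [precision_recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n
```

For example, `runs` would hold one entry per test query: the ranked list of retrieved paragraph ids and the set of ids annotated as relevant.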
Results.
The results are in Figure 3. As we can see, trained encoders work significantly better than pretrained encoders, indicating the effectiveness of our training method. Moreover, our ME5-based encoders significantly outperform the Contriever-based encoders, perhaps because a large portion of our test cases are in Chinese and ME5 is a bilingual model. Interestingly, cosine similarity turns out to work the best among all the similarity metrics, even though our training objective uses dot product.
6.2. Response Quality Evaluation
Experimental Setup.
We use the test set specified in section 5.1 for evaluation. Since automatic evaluation of complicated natural language QA remains an open research question, each response is rated by both a human financial expert and GPT-4 (OpenAI, 2023), the latter chosen for its efficiency and consistency (Hackl et al., 2023). The final score is the average of the two ratings. Higher is better. We compare WeaverBird with WebGLM, FinGPT and FinChat. BloombergGPT (Wu et al., 2023) is not open-sourced, so it is excluded from the baselines. Due to the budget constraint, we evaluate FinChat, a commercial chatbot in finance, on a subset of 300 questions sampled from the test set.
Results.
The results of our WeaverBird system are summarized in Figure 3. Although WebGLM is also bilingual and empowered with a search engine, WeaverBird has superior performance, mainly due to its domain-finetuned base model, its encoders trained on financial documents, and its optimized retrieval over the knowledge base and searched results. FinGPT has a lower score than WeaverBird and WebGLM because it does not use multi-source knowledge. FinChat performs poorly because it is not bilingual and its knowledge base covers a restricted list of stocks.
6.3. More Analysis
Analysis I: Effect of knowledge base and search engine on the response quality.
First, we remove the knowledge base (‘wv-wo-kb’ in Figure 3), which causes a moderate drop in performance. Second, we remove the search engine (‘wv-wo-se’ in Figure 3), which causes a notable drop, indicating that the web search ability, which introduces the ‘global knowledge base’ on the internet, greatly improves the accuracy of responses. To conclude, the searched results contribute most to the response quality, while the local knowledge base contributes moderately.
Analysis II: Effect of the retriever on the response quality.
We randomly pick examples of query-response pairs generated by WeaverBird and group the responses by the accuracy of the retrieved documents (annotated by the experts). Not surprisingly, Figure 4 shows that more accurate retrievals bring a significant improvement in response quality.
Analysis III: How to correct flawed cases.
We manually analyze examples of finance-related query-response pairs annotated by the experts. Of these , cases arrived at reasonable responses. However, we also found some flawed output, most of which can be corrected with one of the following classes of modification:
• Increase $K$, the number of paragraphs retrieved per source, to have more searched results; increase the number of top-ranked results plugged into the prompt to have more input for the LLM; or increase both.
• The searched results usually consist of a number of relevant paragraphs instead of the full content of the web page. If needed, by using the functionality provided by our system, the user can choose to scrape the site and save the content to the local knowledge base. Given the full text, more relevant paragraphs can be retrieved and better responses may be generated.
Analysis IV: Qualitative examples of challenging queries.
In this section, we present two illustrative examples.
The first example poses a challenging question regarding the holdings of a renowned fund manager.
Obviously, WeaverBird outperforms the competitors by delivering a reasonably accurate response, accompanied by citations to credible sources.
The second example is in Chinese.
As we can see, our WeaverBird system exhibits superior performance in answering Chinese finance-related questions, which neither WebGLM nor FinChat could answer well.
7. Conclusion
We presented an open-source, intelligent dialogue system tailored for the finance sector, which outperforms the best available solutions as evidenced by its superior capabilities in solving a wide range of finance-related questions. Our systematic approach contributes to the line of research on building domain-specific large language models. In addition, our training and inference strategies may be useful for developing retrieval-based dialogue systems in other domains, allowing us to unlock broader real-world applications.
References
- fin (2023) 2023. FinChat. https://finchat.io/
- Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403 (2023).
- Bharti (2023) Gaurang Bharti. 2023. wealth-alpaca. https://github.com/gaurangbharti1/wealth-alpaca
- Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the International Conference on Machine Learning (ICML). https://proceedings.mlr.press/v162/borgeaud22a/borgeaud22a.pdf
- Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
- Chase (2022) Harrison Chase. 2022. LangChain. https://github.com/hwchase17/langchain
- Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.longhoe.net/abs/2103.10360
- Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks Are All You Need. arXiv preprint arXiv:2306.11644 (2023).
- Hackl et al. (2023) Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, and Maximilian Sailer. 2023. Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings. arXiv:2308.02575 [cs.CL]
- Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021). https://arxiv.longhoe.net/pdf/2106.09685.pdf
- Izacard et al. (2022a) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022a. Unsupervised Dense Information Retrieval with Contrastive Learning. Trans. Mach. Learn. Res. 2022 (2022). https://openreview.net/forum?id=jKN1pXi7b0
- Izacard et al. (2022b) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022b. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022). https://arxiv.longhoe.net/abs/2208.03299
- Jin et al. (2023a) Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. 2023a. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. arXiv preprint arXiv:2310.01728 (2023).
- Jin et al. (2023b) Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, et al. 2023b. Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook. arXiv preprint arXiv:2310.10196 (2023).
- Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019). https://arxiv.longhoe.net/abs/1702.08734
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.longhoe.net/abs/2004.04906
- Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.longhoe.net/abs/1911.00172
- LangChain-ChatGLM (2023) LangChain-ChatGLM. 2023. LangChain-ChatGLM. https://github.com/imClumsyPanda/langchain-ChatGLM
- Liška et al. (2022) Adam Liška, Tomáš Kočiský, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, Cyprien de Masson d’Autume, Tim Scholtes, Manzil Zaheer, Susannah Young, Ellen Gilsenan-McMahon, Sophia Austin, Phil Blunsom, and Angeliki Lazaridou. 2022. StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models. arXiv preprint arXiv:2205.11388 (2022). https://arxiv.longhoe.net/abs/2205.11388
- Liu et al. (2023) Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences. arXiv:2306.07906 [cs.CL]
- Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft
- Menick et al. (2022) Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147 (2022). https://arxiv.longhoe.net/abs/2203.11147
- Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021). https://arxiv.longhoe.net/abs/2112.09332
- OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023). https://arxiv.longhoe.net/abs/2303.08774.pdf
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.longhoe.net/abs/1606.05250
- Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics (2023). https://arxiv.longhoe.net/abs/2302.00083
- Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 5418–5426. https://doi.org/10.18653/v1/2020.emnlp-main.437
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761 (2023). https://arxiv.longhoe.net/abs/2302.04761
- Shi et al. (2022) Weijia Shi, Julian Michael, Suchin Gururangan, and Luke Zettlemoyer. 2022. Nearest neighbor zero-shot inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://aclanthology.org/2022.emnlp-main.214/
- Shi et al. (2023) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023). https://arxiv.longhoe.net/abs/2301.12652
- Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533 (2022).
- Wu et al. (2023) Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023). https://arxiv.longhoe.net/abs/2303.17564
- Wu et al. (2022) Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing Transformers. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=TrjbxzRcnf-
- Yang et al. (2023) Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-Source Financial Large Language Models. arXiv preprint arXiv:2306.06031 (2023).
- Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022). https://arxiv.longhoe.net/abs/2210.02414
- Zhong et al. (2022) Zexuan Zhong, Tao Lei, and Danqi Chen. 2022. Training language models with memory augmentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.longhoe.net/abs/2205.12674
- Zhou et al. (2023) Shuyan Zhou, Uri Alon, Frank F Xu, Zhengbao Jiang, and Graham Neubig. 2023. Docprompting: Generating code by retrieving the docs. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.longhoe.net/abs/2207.05987
Appendix A Ongoing and Future Work
We are currently exploring several extensions to deal with more complex dialogue cases in the finance domain. We are particularly interested in handling:
- Multiple-turn conversations. In such scenarios, the system must stay engaged in the dialogue and, at each turn, provide new responses that are consistent and coherent with its previous statements.
- External tools. Users may want our system to interact with external tools, such as pulling information from commercial data vendors (e.g., Bloomberg Terminal) and placing transaction orders in online brokerage software (e.g., Charles Schwab).
- More user-friendly presentation. Users may want our system to present answers in richer formats such as tables and diagrams.
Appendix B Software Interface
The main interface of our WeaverBird system can be seen in Figures 5 and 6. Experience it at https://weaverbird.ttic.edu.