-
Startup success prediction and VC portfolio simulation using CrunchBase data
Authors:
Mark Potanin,
Andrey Chertok,
Konstantin Zorin,
Cyril Shtabtsovsky
Abstract:
Predicting startup success presents a formidable challenge due to the inherently volatile landscape of the entrepreneurial ecosystem. The advent of extensive databases like Crunchbase, together with available open data, enables the application of machine learning and artificial intelligence for more accurate predictive analytics. This paper focuses on startups at their Series B and Series C investment stages, aiming to predict key success milestones such as achieving an Initial Public Offering (IPO), attaining unicorn status, or executing a successful Merger and Acquisition (M&A). We introduce a novel deep learning model for predicting startup success, integrating a variety of factors such as funding metrics, founder features, and industry category. A distinctive feature of our research is the use of a comprehensive backtesting algorithm designed to simulate the venture capital investment process. This simulation allows for a robust evaluation of our model's performance against historical data, providing actionable insights into its practical utility in real-world investment contexts. Evaluating our model on Crunchbase data, we achieved 14-fold capital growth and successfully identified high-potential startups at the Series B round, including Revolut, DigitalOcean, Klarna, GitHub, and others. Our empirical findings illuminate the importance of incorporating diverse feature sets in enhancing the model's predictive accuracy. In summary, our work demonstrates the considerable promise of deep learning models and alternative unstructured data in predicting startup success and sets the stage for future advancements in this research area.
Submitted 27 September, 2023;
originally announced September 2023.
-
DetIE: Multilingual Open Information Extraction Inspired by Object Detection
Authors:
Michael Vasilkovsky,
Anton Alekseev,
Valentin Malykh,
Ilya Shenbin,
Elena Tutubalina,
Dmitriy Salikhov,
Mikhail Stepnov,
Andrey Chertok,
Sergey Nikolenko
Abstract:
State-of-the-art neural methods for open information extraction (OpenIE) usually extract triplets (or tuples) iteratively in an autoregressive or predicate-based manner in order not to produce duplicates. In this work, we propose a different approach to the problem that can be equally or more successful. Namely, we present a novel single-pass method for OpenIE inspired by object detection algorithms from computer vision. We use an order-agnostic loss based on bipartite matching that forces unique predictions and a Transformer-based encoder-only architecture for sequence labeling. The proposed approach is faster and shows superior or similar performance in comparison with state-of-the-art models on standard benchmarks in terms of both quality metrics and inference time. Our model sets a new state-of-the-art performance of 67.7% F1 on CaRB evaluated as OIE2016 while being 3.35x faster at inference than the previous state of the art. We also evaluate the multilingual version of our model in the zero-shot setting for two languages and introduce a strategy for generating synthetic multilingual data to fine-tune the model for each specific language. In this setting, we show a performance improvement of 15% on multilingual Re-OIE2016, reaching 75% F1 for both Portuguese and Spanish. Code and models are available at https://github.com/sberbank-ai/DetIE.
Submitted 24 June, 2022;
originally announced June 2022.
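The DetIE abstract above mentions an order-agnostic loss based on bipartite matching. The sketch below is only an illustration of that general idea, not the authors' implementation: the label names, the Hamming-distance cost, and the brute-force search over permutations are all assumptions chosen to keep the example self-contained. A practical system would typically use a differentiable cost and a polynomial-time assignment solver.

```python
from itertools import permutations


def match_cost(pred, gold):
    """Hypothetical cost: number of token positions where labels differ."""
    return sum(p != g for p, g in zip(pred, gold))


def order_agnostic_loss(preds, golds):
    """Return (minimum total cost, assignment) over all one-to-one matchings.

    assignment[i] is the index of the gold sequence matched to preds[i].
    Brute force over permutations, so only suitable for a few slots.
    """
    if len(preds) != len(golds):
        raise ValueError("this sketch assumes equally many predictions and golds")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(golds))):
        cost = sum(match_cost(preds[i], golds[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Two extraction slots over a four-token sentence; "-" marks tokens outside the triplet.
preds = [["SUBJ", "REL", "-", "-"], ["-", "-", "SUBJ", "REL"]]
golds = [["-", "-", "SUBJ", "REL"], ["SUBJ", "REL", "-", "-"]]
loss, assignment = order_agnostic_loss(preds, golds)
```

Because the loss is taken over the best matching, the predictions here are not penalized for appearing in a different order than the gold extractions.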
-
RuCLIP -- new models and experiments: a technical report
Authors:
Alex Shonenkov,
Andrey Kuznetsov,
Denis Dimitrov,
Tatyana Shavrina,
Daniil Chesakov,
Anastasia Maltseva,
Alena Fenogenova,
Igor Pavlov,
Anton Emelyanov,
Sergey Markov,
Daria Bakshandaeva,
Vera Shybaeva,
Andrey Chertok
Abstract:
In this report we propose six new implementations of the ruCLIP model trained on our 240M pairs. The accuracy results are compared with the original CLIP model with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform the CLIP + OPUS-MT solution on most of the datasets in few-shot and zero-shot tasks. We briefly describe the implementations and concentrate on the conducted experiments. A comparison of inference execution times is also presented.
Submitted 22 February, 2022;
originally announced February 2022.
-
Handwritten text generation and strikethrough characters augmentation
Authors:
Alex Shonenkov,
Denis Karachev,
Max Novopoltsev,
Mark Potanin,
Denis Dimitrov,
Andrey Chertok
Abstract:
We introduce two data augmentation techniques, which, used with a ResNet-BiLSTM-CTC network, significantly reduce Word Error Rate (WER) and Character Error Rate (CER) beyond best-reported results on handwriting text recognition (HTR) tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix), which proved to be very effective in HTR tasks. StackMix uses a weakly supervised framework to get character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and approaches to HTR. Extensive experiments on ten handwritten text datasets show that HandWritten Blots augmentation and StackMix significantly improve the quality of HTR models.
Submitted 14 December, 2021;
originally announced December 2021.
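The abstract above reports results in terms of Word Error Rate (WER) and Character Error Rate (CER). As a reminder of how these metrics are commonly computed, here is a minimal sketch based on Levenshtein edit distance. The function names are hypothetical, whitespace tokenization is an assumption, and the paper's exact evaluation protocol (normalization, averaging across lines) is not specified in the listing and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference string."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For instance, a hypothesis that gets one of three reference words wrong has a WER of 1/3, while its CER depends on how many characters inside that word differ.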
-
Hybrid Graph Embedding Techniques in Estimated Time of Arrival Task
Authors:
Vadim Porvatov,
Natalia Semenova,
Andrey Chertok
Abstract:
Recently, deep learning has achieved promising results in the calculation of Estimated Time of Arrival (ETA), which is defined as predicting the travel time from the start point to a certain place along a given path. ETA plays an essential role in intelligent taxi services and automotive navigation systems. A common practice is to use embedding vectors to represent the elements of a road network, such as road segments and crossroads. Road elements have their own attributes, such as length, presence of crosswalks, and number of lanes. However, many links in the road network are traversed by too few floating cars, even on large ride-hailing platforms, and are affected by a wide range of temporal events. As the primary goal of the research, we explore the generalization ability of different spatial embedding strategies and propose a two-stage approach to deal with such problems.
Submitted 8 October, 2021;
originally announced October 2021.
-
RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
Authors:
Tatiana Shavrina,
Alena Fenogenova,
Anton Emelyanov,
Denis Shevelev,
Ekaterina Artemova,
Valentin Malykh,
Vladislav Mikhailov,
Maria Tikhonova,
Andrey Chertok,
Andrey Evlampiev
Abstract:
In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, human-level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the adapted diagnostic test set and offer the first steps toward further expanding or assessing state-of-the-art models independently of language.
Submitted 2 November, 2020; v1 submitted 29 October, 2020;
originally announced October 2020.
-
SberQuAD -- Russian Reading Comprehension Dataset: Description and Analysis
Authors:
Pavel Efimov,
Andrey Chertok,
Leonid Boytsov,
Pavel Braslavski
Abstract:
SberQuAD, a large-scale analog of Stanford SQuAD in the Russian language, is a valuable resource that has not been properly presented to the scientific community. We fill this gap by providing a description, a thorough analysis, and baseline experimental results.
Submitted 2 May, 2020; v1 submitted 20 December, 2019;
originally announced December 2019.
-
A note on functional limit theorems for compound Cox processes
Authors:
V. Yu. Korolev,
A. V. Chertok,
A. Yu. Korchagin,
E. V. Kossova,
A. I. Zeifman
Abstract:
An improved version of the functional limit theorem is proved establishing weak convergence of random walks generated by compound doubly stochastic Poisson processes (compound Cox processes) to Lévy processes in the Skorokhod space under more realistic moment conditions. As corollaries, theorems are proved on convergence of random walks with jumps having finite variances to Lévy processes with variance-mean mixed normal distributions, in particular, to stable Lévy processes, generalized hyperbolic and generalized variance-gamma Lévy processes.
Submitted 9 July, 2015;
originally announced July 2015.
-
Modeling high-frequency order flow imbalance by functional limit theorems for two-sided risk processes
Authors:
V. Yu. Korolev,
A. V. Chertok,
A. Yu. Korchagin,
A. I. Zeifman
Abstract:
A micro-scale model is proposed for the evolution of the limit order book. Within this model, the flows of orders (claims) are described by doubly stochastic Poisson processes, taking account of the stochastic character of the intensities of bid and ask orders that determine the price discovery mechanism in financial markets. The order flow imbalance (OFI) process is studied. This process is a sensitive indicator of the current state of the limit order book, since time intervals between events in a limit order book are usually so short that price changes are relatively infrequent events. Therefore, price changes provide a very coarse and limited description of market dynamics at time micro-scales. The OFI process tracks best bid and ask queues and changes much faster than prices. It incorporates information about build-ups and depletions of order queues, so it can be used to interpolate market dynamics between price changes and to track the toxicity of order flows. Two-sided risk processes are suggested as mathematical models of the OFI process.
Submitted 8 December, 2014; v1 submitted 6 October, 2014;
originally announced October 2014.
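For orientation, a two-sided risk process of the kind named in the abstract above can be sketched as the difference of two compound Cox processes. The notation below is an assumption introduced here for illustration (the listing does not give the paper's symbols): X_i^+ and X_j^- denote the sizes of orders on the two sides, N_1^+ and N_1^- are standard Poisson processes, and Lambda^+ and Lambda^- are random cumulative intensities independent of them.

```latex
% Assumed notation, not taken from the listing:
%   X_i^{+}, X_j^{-}   order sizes on the two sides of the book
%   N^{\pm}(t)         doubly stochastic Poisson (Cox) counting processes
%   \Lambda^{\pm}(t)   random cumulative intensities, independent of N_1^{\pm}
\[
  Q(t) \;=\; \sum_{i=1}^{N^{+}(t)} X_i^{+} \;-\; \sum_{j=1}^{N^{-}(t)} X_j^{-},
  \qquad
  N^{\pm}(t) \;=\; N_1^{\pm}\bigl(\Lambda^{\pm}(t)\bigr).
\]
```

Under this reading, the randomness of the cumulative intensities is what makes each counting process doubly stochastic rather than an ordinary Poisson process.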