Search | arXiv e-print repository

Startup success prediction and VC portfolio simulation using CrunchBase data

Authors: Mark Potanin, Andrey Chertok, Konstantin Zorin, Cyril Shtabtsovsky

Abstract: Predicting startup success presents a formidable challenge due to the inherently volatile landscape of the entrepreneurial ecosystem. The advent of extensive databases like Crunchbase jointly with available open data enables the application of machine learning and artificial intelligence for more accurate predictive analytics. This paper focuses on startups at their Series B and Series C investment stages, aiming to predict key success milestones such as achieving an Initial Public Offering (IPO), attaining unicorn status, or executing a successful Merger and Acquisition (M&A). We introduce a novel deep learning model for predicting startup success, integrating a variety of factors such as funding metrics, founder features, and industry category. A distinctive feature of our research is the use of a comprehensive backtesting algorithm designed to simulate the venture capital investment process. This simulation allows for a robust evaluation of our model's performance against historical data, providing actionable insights into its practical utility in real-world investment contexts. Evaluating our model on Crunchbase data, we achieved 14-fold capital growth and successfully identified high-potential startups at the B round, including Revolut, DigitalOcean, Klarna, GitHub and others. Our empirical findings illuminate the importance of incorporating diverse feature sets in enhancing the model's predictive accuracy.
In summary, our work demonstrates the considerable promise of deep learning models and alternative unstructured data in predicting startup success and sets the stage for future advancements in this research area.

Submitted 27 September, 2023; originally announced September 2023.

Comments: 13 pages, preprint

ACM Class: I.2.1; J.4

arXiv:2206.12514 [pdf, other]

DetIE: Multilingual Open Information Extraction Inspired by Object Detection

Authors: Michael Vasilkovsky, Anton Alekseev, Valentin Malykh, Ilya Shenbin, Elena Tutubalina, Dmitriy Salikhov, Mikhail Stepnov, Andrey Chertok, Sergey Nikolenko

Abstract: State of the art neural methods for open information extraction (OpenIE) usually extract triplets (or tuples) iteratively in an autoregressive or predicate-based manner in order not to produce duplicates. In this work, we propose a different approach to the problem that can be equally or more successful. Namely, we present a novel single-pass method for OpenIE inspired by object detection algorithms from computer vision. We use an order-agnostic loss based on bipartite matching that forces unique predictions and a Transformer-based encoder-only architecture for sequence labeling. The proposed approach is faster and shows superior or similar performance in comparison with state of the art models on standard benchmarks in terms of both quality metrics and inference time. Our model sets the new state of the art performance of 67.7% F1 on CaRB evaluated as OIE2016 while being 3.35x faster at inference than the previous state of the art. We also evaluate the multilingual version of our model in the zero-shot setting for two languages and introduce a strategy for generating synthetic multilingual data to fine-tune the model for each specific language. In this setting, we show a performance improvement of 15% on multilingual Re-OIE2016, reaching 75% F1 for both Portuguese and Spanish. Code and models are available at https://github.com/sberbank-ai/DetIE.

Submitted 24 June, 2022; originally announced June 2022.

Comments: Accepted to the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

arXiv:2202.10784 [pdf, other]

RuCLIP -- new models and experiments: a technical report

Authors: Alex Shonenkov, Andrey Kuznetsov, Denis Dimitrov, Tatyana Shavrina, Daniil Chesakov, Anastasia Maltseva, Alena Fenogenova, Igor Pavlov, Anton Emelyanov, Sergey Markov, Daria Bakshandaeva, Vera Shybaeva, Andrey Chertok

Abstract: In the report we propose six new implementations of the ruCLIP model trained on our 240M pairs. The accuracy results are compared with the original CLIP model with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform the CLIP + OPUS-MT solution on most of the datasets in few-shot and zero-shot tasks. In the report we briefly describe the implementations and concentrate on the conducted experiments. Inference execution time comparison is also presented in the report.

Submitted 22 February, 2022; originally announced February 2022.

arXiv:2112.07395 [pdf, other]

Handwritten text generation and strikethrough characters augmentation

Authors: Alex Shonenkov, Denis Karachev, Max Novopoltsev, Mark Potanin, Denis Dimitrov, Andrey Chertok

Abstract: We introduce two data augmentation techniques, which, used with a Resnet-BiLSTM-CTC network, significantly reduce Word Error Rate (WER) and Character Error Rate (CER) beyond best-reported results on handwriting text recognition (HTR) tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix), which proved to be very effective in HTR tasks. StackMix uses a weakly-supervised framework to get character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and approaches to HTR. Extensive experiments on ten handwritten text datasets show that HandWritten Blots augmentation and StackMix significantly improve the quality of HTR models.

Submitted 14 December, 2021; originally announced December 2021.

Comments: 16 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2108.11667

MSC Class: 68-04 ACM Class: I.7.5; I.4.6

arXiv:2110.04228 [pdf, ps, other]

Hybrid Graph Embedding Techniques in Estimated Time of Arrival Task

Authors: Vadim Porvatov, Natalia Semenova, Andrey Chertok

Abstract: Recently, deep learning has achieved promising results in the calculation of Estimated Time of Arrival (ETA), which is considered as predicting the travel time from the start point to a certain place along a given path. ETA plays an essential role in intelligent taxi services or automotive navigation systems. A common practice is to use embedding vectors to represent the elements of a road network, such as road segments and crossroads. Road elements have their own attributes like length, presence of crosswalks, number of lanes, etc. However, many links in the road network are traversed by too few floating cars even in large ride-hailing platforms and are affected by a wide range of temporal events. As the primary goal of the research, we explore the generalization ability of different spatial embedding strategies and propose a two-stage approach to deal with such problems.

Submitted 8 October, 2021; originally announced October 2021.

Comments: Accepted in ICCNA 2021

arXiv:2010.15925 [pdf, other]

doi 10.18653/v1/2020.emnlp-main.381

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

Authors: Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, Andrey Evlampiev

Abstract: In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills -- detection of natural language inference, commonsense reasoning, ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, human level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. Besides, we present the first results of comparing multilingual models in the adapted diagnostic test set and offer the first steps to further expanding or assessing state-of-the-art models independently of language.

Submitted 2 November, 2020; v1 submitted 29 October, 2020; originally announced October 2020.

Comments: to appear in EMNLP 2020

arXiv:1912.09723 [pdf, other]

doi 10.1007/978-3-030-58219-7_1

SberQuAD -- Russian Reading Comprehension Dataset: Description and Analysis

Authors: Pavel Efimov, Andrey Chertok, Leonid Boytsov, Pavel Braslavski

Abstract: SberQuAD -- a large scale analog of Stanford SQuAD in the Russian language -- is a valuable resource that has not been properly presented to the scientific community. We fill this gap by providing a description, a thorough analysis, and baseline experimental results.

Submitted 2 May, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

arXiv:1507.02534 [pdf, ps, other]

doi 10.1063/1.4952004

A note on functional limit theorems for compound Cox processes

Authors: V. Yu. Korolev, A. V. Chertok, A. Yu. Korchagin, E. V. Kossova, A. I. Zeifman

Abstract: An improved version of the functional limit theorem is proved establishing weak convergence of random walks generated by compound doubly stochastic Poisson processes (compound Cox processes) to Lévy processes in the Skorokhod space under more realistic moment conditions. As corollaries, theorems are proved on convergence of random walks with jumps having finite variances to Lévy processes with variance-mean mixed normal distributions, in particular, to stable Lévy processes, generalized hyperbolic and generalized variance-gamma Lévy processes.

Submitted 9 July, 2015; originally announced July 2015.

Comments: arXiv admin note: substantial text overlap with arXiv:1410.1900

arXiv:1410.1900 [pdf, ps, other]

Modeling high-frequency order flow imbalance by functional limit theorems for two-sided risk processes

Authors: V. Yu. Korolev, A. V. Chertok, A. Yu. Korchagin, A. I. Zeifman

Abstract: A micro-scale model is proposed for the evolution of the limit order book. Within this model, the flows of orders (claims) are described by doubly stochastic Poisson processes taking account of the stochastic character of intensities of bid and ask orders that determine the price discovery mechanism in financial markets. The process of order flow imbalance (OFI) is studied. This process is a sensitive indicator of the current state of the limit order book since time intervals between events in a limit order book are usually so short that price changes are relatively infrequent events. Therefore price changes provide a very coarse and limited description of market dynamics at time micro-scales. The OFI process tracks best bid and ask queues and changes much faster than prices. It incorporates information about build-ups and depletions of order queues so that it can be used to interpolate market dynamics between price changes and to track the toxicity of order flows. Two-sided risk processes are suggested as mathematical models of the OFI process.

Submitted 8 December, 2014; v1 submitted 6 October, 2014; originally announced October 2014.

Showing 1–9 of 9 results for author: Chertok, A