-
Transformer-based classification of user queries for medical consultancy with respect to expert specialization
Authors:
Dmitry Lyutkin,
Andrey Soloviev,
Dmitry Zhukov,
Denis Pozdnyakov,
Muhammad Shahid Iqbal Malik,
Dmitry I. Ignatov
Abstract:
The need for skilled medical support is growing in the era of digital healthcare. This research presents an innovative strategy, utilizing the RuBERT model, for categorizing user inquiries in the field of medical consultation with a focus on expert specialization. By harnessing the capabilities of transformers, we fine-tuned the pre-trained RuBERT model on a varied dataset, which facilitates precise correspondence between queries and particular medical specialisms. Using a comprehensive dataset, we have demonstrated our approach's superior performance with an F1-score of over 92%, calculated through both cross-validation and the traditional split of test and train datasets. Our approach has shown excellent generalization across medical domains such as cardiology, neurology and dermatology. This methodology provides practical benefits by directing users to appropriate specialists for prompt and targeted medical advice. It also enhances healthcare system efficiency, reduces practitioner burden, and improves patient care quality. In summary, our suggested strategy facilitates the attainment of specific medical knowledge, offering prompt and precise advice within the digital healthcare field.
Submitted 2 October, 2023; v1 submitted 26 September, 2023;
originally announced September 2023.
-
Reconstructing and grounding narrated instructional videos in 3D
Authors:
Dimitri Zhukov,
Ignacio Rocco,
Ivan Laptev,
Josef Sivic,
Johannes L. Schönberger,
Bugra Tekin,
Marc Pollefeys
Abstract:
Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product. Narrations may also have large variation in natural language expressions. We address these challenges by three contributions. First, we propose an approach for correspondence estimation combining learnt local features and dense flow. Second, we design a two-step divide and conquer reconstruction approach where the initial 3D reconstructions of individual videos are combined into a 3D alignment graph. Finally, we propose an unsupervised approach to ground natural language in obtained 3D reconstructions. We demonstrate the effectiveness of our approach for the domain of car maintenance. Given raw instructional videos and no manual supervision, our method successfully reconstructs engines of different car models and associates textual descriptions with corresponding objects in 3D.
Submitted 10 September, 2021; v1 submitted 9 September, 2021;
originally announced September 2021.
-
Training Deep SLAM on Single Frames
Authors:
Igor Slinko,
Anna Vorontsova,
Dmitry Zhukov,
Olga Barinova,
Anton Konushin
Abstract:
Learning-based visual odometry and SLAM methods have improved steadily over the past years. However, collecting ground truth poses to train these methods is difficult and expensive. This could be resolved by training in an unsupervised mode, but there is still a large gap between the performance of unsupervised and supervised methods. In this work, we focus on generating synthetic data for deep learning-based visual odometry and SLAM methods that take optical flow as an input. We produce training data in the form of optical flow that corresponds to arbitrary camera movement between a real frame and a virtual frame. For synthesizing data we use depth maps either produced by a depth sensor or estimated from a stereo pair. We train a visual odometry model on synthetic data and do not use ground truth poses, hence this model can be considered unsupervised. It can also be classified as monocular, as we do not use depth maps at inference. We also propose a simple way to convert any visual odometry model into a SLAM method based on frame matching and graph optimization. We demonstrate that both the synthetically trained visual odometry model and the proposed SLAM method built upon this model yield state-of-the-art results among unsupervised methods on the KITTI dataset and show promising results on the challenging EuRoC dataset.
Submitted 11 December, 2019;
originally announced December 2019.
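The flow-synthesis step described in the abstract above (optical flow between a real frame and a virtual frame, computed from a depth map and an arbitrary camera movement) can be illustrated with a minimal geometric sketch. This is not the authors' implementation: it assumes a pinhole camera, and the function name `synth_flow`, the intrinsics `fx, fy, cx, cy`, and the convention that `R` and `t` map points from the real camera frame into the virtual one are all hypothetical choices made for illustration.

```python
# Sketch only: flow induced at one pixel by a virtual camera motion.
# Assumed (not from the paper): pinhole model, X_virtual = R * X_real + t.

IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def synth_flow(u, v, depth, fx, fy, cx, cy, R, t):
    """Return (du, dv) for pixel (u, v) with known depth, or None if the
    point ends up behind the virtual camera."""
    # Back-project the pixel to a 3D point in the real camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Express the point in the virtual camera frame.
    xv = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    yv = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zv = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    if zv <= 0.0:
        return None
    # Project into the virtual image and take the pixel displacement.
    u2 = fx * xv / zv + cx
    v2 = fy * yv / zv + cy
    return (u2 - u, v2 - v)


# No motion gives zero flow; a sideways shift gives purely horizontal flow.
still = synth_flow(320, 240, 2.0, 500, 500, 320, 240, IDENTITY, (0.0, 0.0, 0.0))
shifted = synth_flow(320, 240, 2.0, 500, 500, 320, 240, IDENTITY, (0.1, 0.0, 0.0))
```

Applying the function to every pixel of a depth map yields a dense flow field paired with a known relative pose, which is the kind of training sample the abstract describes.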
-
Measuring robustness of Visual SLAM
Authors:
David Prokhorov,
Dmitry Zhukov,
Olga Barinova,
Anna Vorontsova,
Anton Konushin
Abstract:
Simultaneous localization and mapping (SLAM) is an essential component of robotic systems. In this work we perform a feasibility study of RGB-D SLAM for the task of indoor robot navigation. Recent visual SLAM methods, e.g. ORBSLAM2 \cite{mur2017orb}, demonstrate impressive accuracy, but the experiments in the papers are usually conducted on just a few sequences, which makes it difficult to reason about the robustness of the methods. Another problem is that all available RGB-D datasets contain trajectories with very complex camera motions. In this work we extensively evaluate ORBSLAM2 to better understand the state of the art. First, we conduct experiments on the popular publicly available datasets for RGB-D SLAM across the conventional metrics. We perform statistical analysis of the results and find correlations between the metrics and the attributes of the trajectories. Then, we introduce a new large and diverse HomeRobot dataset where we model the motions of a simple home robot. Our dataset is created using physically-based rendering with realistic lighting and contains scenes composed by human designers. It includes thousands of sequences, which is two orders of magnitude more than in previous works. We find that while in many cases the accuracy of SLAM is very good, the robustness is still an issue.
Submitted 10 October, 2019;
originally announced October 2019.
-
DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation
Authors:
Pavel Kirsanov,
Airat Gaskarov,
Filipp Konokhov,
Konstantin Sofiiuk,
Anna Vorontsova,
Igor Slinko,
Dmitry Zhukov,
Sergey Bykov,
Olga Barinova,
Anton Konushin
Abstract:
We present a novel dataset for training and benchmarking semantic SLAM methods. The dataset consists of 200 long sequences, each one containing 3000-5000 data frames. We generate the sequences using realistic home layouts. For that we sample trajectories that simulate motions of a simple home robot, and then render the frames along the trajectories. Each data frame contains a) RGB images generated using physically-based rendering, b) simulated depth measurements, c) simulated IMU readings and d) ground truth occupancy grid of a house. Our dataset serves a wider range of purposes compared to existing datasets and is the first large-scale benchmark focused on the mapping component of SLAM. The dataset is split into train/validation/test parts sampled from different sets of virtual houses. We present benchmarking results for both classical geometry-based and recent learning-based SLAM algorithms, a baseline mapping method, semantic segmentation and panoptic segmentation.
Submitted 26 September, 2019;
originally announced September 2019.
-
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Authors:
Antoine Miech,
Dimitri Zhukov,
Jean-Baptiste Alayrac,
Makarand Tapaswi,
Ivan Laptev,
Josef Sivic
Abstract:
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
Submitted 31 July, 2019; v1 submitted 7 June, 2019;
originally announced June 2019.
-
Cross-task weakly supervised learning from instructional videos
Authors:
Dimitri Zhukov,
Jean-Baptiste Alayrac,
Ramazan Gokberk Cinbis,
David Fouhey,
Ivan Laptev,
Josef Sivic
Abstract:
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: 'pour egg' should be trained jointly with other tasks involving 'pour' and 'egg'. We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Past data does not permit systematic studying of sharing and so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level and that our component model can parse previously unseen tasks by virtue of its compositionality.
Submitted 29 April, 2019; v1 submitted 19 March, 2019;
originally announced March 2019.
-
Full orbital solution for the binary system in the northern Galactic disc microlensing event Gaia16aye
Authors:
Łukasz Wyrzykowski,
P. Mróz,
K. A. Rybicki,
M. Gromadzki,
Z. Kołaczkowski,
M. Zieliński,
P. Zieliński,
N. Britavskiy,
A. Gomboc,
K. Sokolovsky,
S. T. Hodgkin,
L. Abe,
G. F. Aldi,
A. AlMannaei,
G. Altavilla,
A. Al Qasim,
G. C. Anupama,
S. Awiphan,
E. Bachelet,
V. Bakıs,
S. Baker,
S. Bartlett,
P. Bendjoya,
K. Benson,
I. F. Bikmaev
, et al. (160 additional authors not shown)
Abstract:
Gaia16aye was a binary microlensing event discovered in the direction towards the northern Galactic disc and was one of the first microlensing events detected and alerted to by the Gaia space mission. Its light curve exhibited five distinct brightening episodes, reaching up to I=12 mag, and it was covered in great detail with almost 25,000 data points gathered by a network of telescopes. We present the photometric and spectroscopic follow-up covering 500 days of the event evolution. We employed a full Keplerian binary orbit microlensing model combined with the motion of Earth and Gaia around the Sun to reproduce the complex light curve. The photometric data allowed us to solve the microlensing event entirely and to derive the complete and unique set of orbital parameters of the binary lensing system. We also report on the detection of the first-ever microlensing space-parallax between the Earth and Gaia located at L2. The properties of the binary system were derived from microlensing parameters, and we found that the system is composed of two main-sequence stars with masses 0.57$\pm$0.05 $M_\odot$ and 0.36$\pm$0.03 $M_\odot$ at 780 pc, with an orbital period of 2.88 years and an eccentricity of 0.30. We also predict the astrometric microlensing signal for this binary lens as it will be seen by Gaia as well as the radial velocity curve for the binary system. Events such as Gaia16aye indicate the potential for the microlensing method of probing the mass function of dark objects, including black holes, in directions other than that of the Galactic bulge. This case also emphasises the importance of long-term time-domain coordinated observations that can be made with a network of heterogeneous telescopes.
Submitted 28 October, 2019; v1 submitted 22 January, 2019;
originally announced January 2019.
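As a rough consistency check on the orbital parameters quoted in the Gaia16aye abstract, Kepler's third law gives the size of the binary's relative orbit from the two masses and the period. The semi-major axis is not stated in the abstract; the value below is derived here for illustration only, and the function name `semi_major_axis_au` is hypothetical.

```python
# Sketch only: Kepler's third law in solar units, a^3 = (M1 + M2) * P^2,
# with a in AU, masses in solar masses and P in years.

def semi_major_axis_au(m1_msun, m2_msun, period_yr):
    """Semi-major axis of the relative orbit of a two-body system, in AU."""
    return ((m1_msun + m2_msun) * period_yr ** 2) ** (1.0 / 3.0)


# Central values quoted in the abstract: 0.57 and 0.36 solar masses, 2.88 years.
a = semi_major_axis_au(0.57, 0.36, 2.88)
print(f"semi-major axis: {a:.2f} AU")
```

With the central values this comes out at roughly 2 AU; the quoted mass uncertainties are ignored in this sketch.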