-
Multimodal Reaching-Position Prediction for ADL Support Using Neural Networks
Authors:
Yutaka Takase,
Kimitoshi Yamazaki
Abstract:
This study aimed to develop daily living support robots for patients with hemiplegia and the elderly. To support the daily living activities using robots in ordinary households without imposing physical and mental burdens on users, the system must detect the actions of the user and move appropriately according to their motions.
We propose a reaching-position prediction scheme that targets the motion of lifting the upper arm, which is burdensome for patients with hemiplegia and the elderly in daily living activities.
For this motion, it is difficult to obtain effective features to create a prediction model in environments where large-scale sensor system installation is not feasible and the motion time is short.
We performed motion-collection experiments, revealed the features of the target motion and built a prediction model using the multimodal motion features and deep learning.
The proposed model achieved a macro-average accuracy of 93\% and an F1-score of 0.69 for 9-class classification at 35\% of motion completion.
Submitted 26 June, 2024;
originally announced June 2024.
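The abstract above reports a high macro-average accuracy alongside a much lower F1-score for a 9-class problem. One reading that would explain the gap (an assumption, not something the abstract confirms) is that accuracy is computed one-vs-rest per class and then averaged, which counts the many true negatives of each class. The following minimal sketch computes both metrics from a confusion matrix; the function names and the toy 3-class matrix in the usage note are illustrative only and are not taken from the paper.

```python
from typing import List


def macro_f1(confusion: List[List[int]]) -> float:
    """Macro-averaged F1 from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        fp = sum(confusion[r][k] for r in range(n)) - tp  # predicted k, true other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def macro_ovr_accuracy(confusion: List[List[int]]) -> float:
    """Macro average of per-class one-vs-rest accuracy, (TP + TN) / total.

    Assumed reading of "macro-average accuracy"; true negatives inflate it
    as the number of classes grows.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        return 0.0
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[r][k] for r in range(n)) - tp
        tn = total - tp - fn - fp
        scores.append((tp + tn) / total)
    return sum(scores) / n
```

For a toy matrix such as `[[2, 1, 0], [0, 2, 1], [1, 0, 2]]`, the one-vs-rest accuracy is higher than the macro F1, and the gap widens with more classes because each class has proportionally more true negatives.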
-
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model
Authors:
Khoa Vo,
Thinh Phan,
Kashu Yamazaki,
Minh Tran,
Ngan Le
Abstract:
Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning departs from the way humans naturally perceive scenes from a first-person perspective, leading to a lack of interpretable reasoning; and (2) learning is limited in capturing inherent fine-grained relationships between the two modalities.
In this paper, we take inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationships for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understanding. This comprises three alignment types: video-narration, noun-entity, and verb-entity alignment.
Our method demonstrates strong interpretability in both quantitative and qualitative experiments while maintaining competitive performance on five downstream tasks via zero-shot transfer or as a video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query.
Submitted 6 June, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
Investigating Self-Supervised Image Denoising with Denaturation
Authors:
Hiroki Waida,
Kimihiro Yamazaki,
Atsushi Tokuhisa,
Mutsuyo Wada,
Yuichiro Wada
Abstract:
Self-supervised learning for image denoising problems in the presence of denaturation for noisy data is a crucial approach in machine learning. However, theoretical understanding of the performance of the approach that uses denatured data is lacking. To provide better understanding of the approach, in this paper, we analyze a self-supervised denoising algorithm that uses denatured data in depth through theoretical analysis and numerical experiments. Through the theoretical analysis, we discuss that the algorithm finds desired solutions to the optimization problem with the population risk, while the guarantee for the empirical risk depends on the hardness of the denoising task in terms of denaturation levels. We also conduct several experiments to investigate the performance of an extended algorithm in practice. The results indicate that the algorithm training with denatured images works, and the empirical performance aligns with the theoretical results. These results suggest several insights for further improvement of self-supervised image denoising that uses denatured data in future directions.
Submitted 8 May, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
Breaching the Bottleneck: Evolutionary Transition from Reward-Driven Learning to Reward-Agnostic Domain-Adapted Learning in Neuromodulated Neural Nets
Authors:
Solvi Arnold,
Reiji Suzuki,
Takaya Arita,
Kimitoshi Yamazaki
Abstract:
Advanced biological intelligence learns efficiently from an information-rich stream of stimulus information, even when feedback on behaviour quality is sparse or absent. Such learning exploits implicit assumptions about task domains. We refer to such learning as Domain-Adapted Learning (DAL). In contrast, AI learning algorithms rely on explicit externally provided measures of behaviour quality to acquire fit behaviour. This imposes an information bottleneck that precludes learning from diverse non-reward stimulus information, limiting learning efficiency. We consider the question of how biological evolution circumvents this bottleneck to produce DAL. We propose that species first evolve the ability to learn from reward signals, providing inefficient (bottlenecked) but broad adaptivity. From there, integration of non-reward information into the learning process can proceed via gradual accumulation of biases induced by such information on specific task domains. This scenario provides a biologically plausible pathway towards bottleneck-free, domain-adapted learning. Focusing on the second phase of this scenario, we set up a population of NNs with reward-driven learning modelled as Reinforcement Learning (A2C), and allow evolution to improve learning efficiency by integrating non-reward information into the learning process using a neuromodulatory update mechanism. On a navigation task in continuous 2D space, evolved DAL agents show a 300-fold increase in learning speed compared to pure RL agents. Evolution is found to eliminate reliance on reward information altogether, allowing DAL agents to learn from non-reward information exclusively, using local neuromodulation-based connection weight updates only.
Submitted 19 April, 2024;
originally announced April 2024.
-
Fixed Confidence Best Arm Identification in the Bayesian Setting
Authors:
Kyoungseok Jang,
Junpei Komiyama,
Kazutoshi Yamazaki
Abstract:
We consider the fixed-confidence best arm identification (FC-BAI) problem in the Bayesian setting. This problem aims to find the arm with the largest mean at a fixed confidence level when the bandit model has been sampled from a known prior. Most studies on the FC-BAI problem have been conducted in the frequentist setting, where the bandit model is predetermined before the game starts. We show that the traditional FC-BAI algorithms studied in the frequentist setting, such as track-and-stop and top-two algorithms, result in arbitrarily suboptimal performance in the Bayesian setting. We also obtain a lower bound on the expected number of samples in the Bayesian setting and introduce a variant of successive elimination that matches the lower bound up to a logarithmic factor. Simulations verify the theoretical results.
Submitted 22 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
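The abstract above mentions a variant of successive elimination. As background only, here is a minimal sketch of the textbook (frequentist) successive elimination procedure for rewards in [0, 1]; it is not the Bayesian variant proposed in the paper. The `pull` callback, the Hoeffding-style confidence radius, and the `max_rounds` safeguard are assumptions made for illustration.

```python
import math
from typing import Callable, List


def successive_elimination(
    pull: Callable[[int], float],
    n_arms: int,
    delta: float,
    max_rounds: int = 100_000,
) -> int:
    """Textbook successive elimination for rewards bounded in [0, 1].

    pull(a) returns one reward sample from arm a (hypothetical callback).
    Returns the index of the arm believed to have the largest mean.
    """
    active: List[int] = list(range(n_arms))
    sums = [0.0] * n_arms
    t = 0
    while len(active) > 1 and t < max_rounds:
        t += 1
        # Pull every surviving arm once, so each has exactly t samples.
        for a in active:
            sums[a] += pull(a)
        # Hoeffding radius with a union bound over arms and rounds.
        radius = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        means = {a: sums[a] / t for a in active}
        best = max(means.values())
        # Keep an arm only if its upper bound reaches the leader's lower bound.
        active = [a for a in active if means[a] + radius >= best - radius]
    # If max_rounds is hit first, fall back to the best empirical mean.
    return max(active, key=lambda a: sums[a])
```

A caller would supply `pull` as a sampler for each arm, for example a Bernoulli draw per arm, and choose `delta` as the tolerated error probability. The number of rounds grows as the gap between the best and second-best means shrinks.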
-
Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation
Authors:
Kashu Yamazaki,
Taisei Hanyu,
Khoa Vo,
Thang Pham,
Minh Tran,
Gianfranco Doretto,
Anh Nguyen,
Ngan Le
Abstract:
Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion
Submitted 5 October, 2023;
originally announced October 2023.
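The Open-Fusion abstract above relies on a Truncated Signed Distance Function (TSDF) for reconstruction. As general background, the sketch below shows the standard per-voxel TSDF update, a weighted running average of truncated distance observations. It is not Open-Fusion's implementation; the function names, the normalization to [-1, 1], and the default observation weight are assumptions for illustration.

```python
from typing import Tuple


def truncate_sdf(sdf: float, trunc: float) -> float:
    """Normalize a signed distance by the truncation band and clamp to [-1, 1]."""
    return max(-1.0, min(1.0, sdf / trunc))


def fuse_voxel(
    tsdf: float,
    weight: float,
    sdf_obs: float,
    trunc: float,
    obs_weight: float = 1.0,
) -> Tuple[float, float]:
    """Fuse one new signed-distance observation into a voxel.

    tsdf, weight: current voxel value and accumulated weight.
    sdf_obs: signed distance to the observed surface, in the same units as trunc.
    Returns the updated (tsdf, weight) pair.
    """
    d = truncate_sdf(sdf_obs, trunc)
    new_weight = weight + obs_weight
    new_tsdf = (tsdf * weight + d * obs_weight) / new_weight
    return new_tsdf, new_weight
```

Starting from an empty voxel `(0.0, 0.0)`, repeated calls average the truncated observations, so noisy depth readings from several frames are smoothed into a single surface estimate at the zero crossing.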
-
AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation
Authors:
Kashu Yamazaki,
Taisei Hanyu,
Minh Tran,
Adrian de Luis,
Roy McCann,
Haitao Liao,
Chase Rainwater,
Meredith Adkins,
Jackson Cothren,
Ngan Le
Abstract:
Aerial Image Segmentation is a top-down perspective semantic segmentation and has several challenging characteristics such as strong imbalance in the foreground-background distribution, complex background, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle these problems, we inherit the advantages of Transformers and propose AerialFormer, which unifies Transformers at the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) at the expanding path. Our AerialFormer is designed as a hierarchical structure, in which Transformer encoder outputs multi-scale features and MD-CNNs decoder aggregates information from the multi-scales. Thus, it takes both local and global contexts into consideration to render powerful representations and high-resolution segmentation. We have benchmarked AerialFormer on three common datasets including iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that our proposed AerialFormer outperforms previous state-of-the-art methods with remarkable performance. Our source code will be publicly available upon acceptance.
Submitted 1 October, 2023; v1 submitted 11 June, 2023;
originally announced June 2023.
-
User-Centric Clustering Under Fairness Scheduling in Cell-Free Massive MIMO
Authors:
Fabian Göttsch,
Noboru Osawa,
Yoshiaki Amano,
Issei Kanno,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
We consider fairness scheduling in a user-centric cell-free massive MIMO network, where $L$ remote radio units, each with $M$ antennas, serve $K_{\rm tot} \approx LM$ user equipments (UEs). Recent results show that the maximum network sum throughput is achieved when $K_{\rm act} \approx \frac{LM}{2}$ UEs are simultaneously active in any given time-frequency slot. However, the number of users $K_{\rm tot}$ in the network is usually much larger. This requires that users are scheduled over the time-frequency resource and achieve a certain throughput rate as an average over the slots. We impose throughput fairness among UEs with a scheduling approach aiming to maximize a concave component-wise non-decreasing network utility function of the per-user throughput rates. In cell-free user-centric networks, the pilot and cluster assignment is usually done for a given set of active users. Combined with fairness scheduling, this requires pilot and cluster reassignment at each scheduling slot, involving an enormous overhead of control signaling exchange between network entities. We propose a fixed pilot and cluster assignment scheme (independent of the scheduling decisions), which outperforms the baseline method in terms of UE throughput, while requiring much less control information exchange between network entities.
Submitted 15 May, 2023;
originally announced May 2023.
-
Contextual Explainable Video Representation: Human Perception-based Understanding
Authors:
Khoa Vo,
Kashu Yamazaki,
Phong X. Nguyen,
Phat Nguyen,
Khoa Luu,
Ngan Le
Abstract:
Video understanding is a growing field and a subject of intense research, which includes many interesting tasks for understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, due to the long and complicated temporal structure of unconstrained videos. Different from existing approaches, which apply a pre-trained backbone network as a black box to extract visual representation, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore, it is crucial to design a contextual explainable video representation extraction that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.
Submitted 17 December, 2022; v1 submitted 12 December, 2022;
originally announced December 2022.
-
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection
Authors:
Hyekang Kevin Joo,
Khoa Vo,
Kashu Yamazaki,
Ngan Le
Abstract:
Video anomaly detection (VAD) -- commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to its labor-intensive nature -- is a challenging problem in video surveillance where the anomalous frames need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly-used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
Submitted 3 July, 2023; v1 submitted 9 December, 2022;
originally announced December 2022.
-
Fairness Scheduling in Dense User-Centric Cell-Free Massive MIMO Networks
Authors:
Fabian Göttsch,
Noboru Osawa,
Takeo Ohseki,
Yoshiaki Amano,
Issei Kanno,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
We consider a user-centric scalable cell-free massive MIMO network with a total of $LM$ distributed remote radio unit antennas serving $K$ user equipments (UEs). Many works in the current literature assume $LM\gg K$, enabling high UE data rates but also leading to a system not operating at its maximum performance in terms of sum throughput. We provide a new perspective on cell-free massive MIMO networks, investigating rate allocation and the UE density regime in which the network makes use of its full capability. The UE density $K$ approximately equal to $\frac{LM}{2}$ is the range in which the system reaches the largest sum throughput. In addition, there is a significant fraction of UEs with relatively low throughput, when serving $K>\frac{LM}{2}$ UEs simultaneously. We propose to reduce the number of active UEs per time slot, such that the system does not operate at ``full load'', and impose throughput fairness among all users via a scheduler designed to maximize a suitably defined concave componentwise non-decreasing network utility function. Our numerical simulations show that we can tune the system such that a desired distribution of the UE throughput, depending on the utility function, is achieved.
Submitted 28 November, 2022;
originally announced November 2022.
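The abstract above describes a scheduler that limits the number of active UEs per slot and maximizes a concave, componentwise non-decreasing utility of the per-user throughput. A well-known special case is proportional fairness, where the utility is the sum of log throughputs. The sketch below implements that classic gradient-style rule as an illustration only; it is not the authors' scheduler, and the exponential-averaging factor `beta`, the `eps` initial value, and the input layout are assumptions.

```python
from typing import List


def proportional_fair_schedule(
    inst_rates: List[List[float]],
    n_active: int,
    beta: float = 0.1,
    eps: float = 1e-6,
) -> List[float]:
    """Run a proportional-fair scheduler over a sequence of slots.

    inst_rates[t][k] is the rate user k would get if served in slot t
    (hypothetical input; a real system would derive it from channel state).
    In each slot, the n_active users with the largest rate / average-throughput
    ratio are served, which follows the gradient of sum(log(throughput)).
    Returns the exponentially averaged throughput of every user.
    """
    n_users = len(inst_rates[0])
    avg = [eps] * n_users  # small positive start avoids division by zero
    for rates in inst_rates:
        order = sorted(range(n_users), key=lambda k: rates[k] / avg[k], reverse=True)
        chosen = set(order[:n_active])
        for k in range(n_users):
            served = rates[k] if k in chosen else 0.0
            avg[k] = (1.0 - beta) * avg[k] + beta * served
    return avg
```

With identical users and half of them active per slot, the rule alternates between groups so that every user ends up with a similar average throughput. Other concave utilities change only the weight applied to each user's instantaneous rate.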
-
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Authors:
Kashu Yamazaki,
Khoa Vo,
Sang Truong,
Bhiksha Raj,
Ngan Le
Abstract:
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learnt embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods on accuracy and diversity. Source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
Submitted 15 February, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
AISFormer: Amodal Instance Segmentation with Transformer
Authors:
Minh Tran,
Khoa Vo,
Kashu Yamazaki,
Arthur Fernandes,
Michael Kidd,
Ngan Le
Abstract:
Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possibly occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level feature coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present AISFormer, an AIS framework with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features; (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder; (iii) invisible mask embedding: model the coherence between the amodal and visible masks; and (iv) mask predicting: estimate output masks including occluder, visible, amodal, and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks, i.e., KINS, D2SA, and COCOA-cls, to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer
Submitted 17 March, 2024; v1 submitted 12 October, 2022;
originally announced October 2022.
-
AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation
Authors:
Khoa Vo,
Sang Truong,
Kashu Yamazaki,
Bhiksha Raj,
Minh-Triet Tran,
Ngan Le
Abstract:
Temporal action proposal generation (TAPG) is a challenging task, which requires localizing action intervals in an untrimmed video. Intuitively, we, as humans, perceive an action through the interactions between actors, relevant objects, and the surrounding environment. Despite the significant progress of TAPG, a vast majority of existing methods ignore the aforementioned principle of the human perceiving process by applying a backbone network to a given video as a black box. In this paper, we propose to model these interactions with a multi-modal representation network, namely, Actors-Objects-Environment Interaction Network (AOE-Net). Our AOE-Net consists of two modules, i.e., perception-based multi-modal representation (PMR) and boundary-matching module (BMM). Additionally, we introduce an adaptive attention mechanism (AAM) in PMR to focus only on main actors (or relevant objects) and model the relationships among them. The PMR module represents each video snippet by a visual-linguistic feature, in which main actors and the surrounding environment are represented by visual information, whereas relevant objects are depicted by linguistic features through an image-text model. The BMM module processes the sequence of visual-linguistic features as its input and generates action proposals. Comprehensive experiments and extensive ablation studies on the ActivityNet-1.3 and THUMOS-14 datasets show that our proposed AOE-Net outperforms previous state-of-the-art methods with remarkable performance and generalization for both TAPG and temporal action detection. To prove the robustness and effectiveness of AOE-Net, we further conduct an ablation study on egocentric videos, i.e., the EPIC-KITCHENS 100 dataset. Source code will be made available upon acceptance.
Submitted 5 October, 2022;
originally announced October 2022.
-
Overloaded Pilot Assignment with Pilot Decontamination for Cell-Free Systems
Authors:
Noboru Osawa,
Fabian Göttsch,
Issei Kanno,
Takeo Ohseki,
Yoshiaki Amano,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
Pilot contamination in cell-free massive multiple-input-multiple-output (CF-mMIMO) must be addressed to accommodate a large number of users. In previous works, we have investigated a decontamination method called subspace projection (SP). SP separates interference from co-pilot users by using the orthogonality of the principal components of the users' channel subspaces. Non-overloaded pilot assignment (PA), where each radio unit (RU) does not assign the same pilot to different users, limits the spectral efficiency (SE) of the system, since SP channel estimation is able to deal with co-pilot users that have nearly orthogonal subspaces. Motivated by this limitation, this paper introduces overloaded PA methods adjusted for the decontamination in order to improve the sum SE of CF systems. Numerical simulations show that the overloaded PA methods give higher SE than non-overloaded PA in a high-user-density scenario.
Submitted 10 October, 2022; v1 submitted 23 July, 2022;
originally announced July 2022.
-
VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
Authors:
Kashu Yamazaki,
Sang Truong,
Khoa Vo,
Michael Kidd,
Chase Rainwater,
Khoa Luu,
Ngan Le
Abstract:
In this paper, we leverage the human perceiving process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities, i.e., (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract descriptions of scene elements, covering both human and non-human objects (e.g. animals, vehicles, etc.) and visual and non-visual elements (e.g. relations, activities, etc.). Furthermore, we propose to train VLCap under a contrastive learning VL loss. The experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that our VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.
Submitted 6 August, 2022; v1 submitted 26 June, 2022;
originally announced June 2022.
-
Recognising Affordances in Predicted Futures to Plan with Consideration of Non-canonical Affordance Effects
Authors:
Solvi Arnold,
Mami Kuroishi,
Tadashi Adachi,
Kimitoshi Yamazaki
Abstract:
We propose a novel system for action sequence planning based on a combination of affordance recognition and a neural forward model predicting the effects of affordance execution. By performing affordance recognition on predicted futures, we avoid reliance on explicit affordance effect definitions for multi-step planning. Because the system learns affordance effects from experience data, the system can foresee not just the canonical effects of an affordance, but also situation-specific side-effects. This allows the system to avoid planning failures due to such non-canonical effects, and makes it possible to exploit non-canonical effects for realising a given goal. We evaluate the system in simulation, on a set of test tasks that require consideration of canonical and non-canonical affordance effects.
Submitted 22 June, 2022;
originally announced June 2022.
-
Robust PCA for Subspace Estimation in User-Centric Cell-Free Wireless Networks
Authors:
Fabian Göttsch,
Noboru Osawa,
Takeo Ohseki,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
We consider a scalable user-centric cell-free massive MIMO network with distributed remote radio units (RUs), enabling macrodiversity and joint processing. Due to the limited uplink (UL) pilot dimension, multiuser interference in the UL pilot transmission phase makes channel estimation a non-trivial problem. We make use of two types of UL pilot signals, sounding reference signal (SRS) and demodulation reference signal (DMRS) pilots, for the estimation of the channel subspace and its instantaneous realization, respectively. The SRS pilots are transmitted over multiple time slots and resource blocks according to a Latin-squares-based hopping scheme, which aims at averaging out the interference of different SRS co-pilot users. We propose a robust principal component analysis approach for channel subspace estimation from the SRS signal samples, employed at the RUs for each associated user. The estimated subspace is further used at the RUs for DMRS pilot decontamination and instantaneous channel estimation. We provide numerical simulations to compare the system performance using our subspace and channel estimation scheme with the cases of ideal partial subspace/channel knowledge and pilot matching channel estimation. The results show that a system with a properly designed SRS pilot hopping scheme can closely approximate the performance of a genie-aided system.
Submitted 8 June, 2022;
originally announced June 2022.
-
Optimal User Load and Energy Efficiency in User-Centric Cell-Free Wireless Networks
Authors:
Fabian Göttsch,
Noboru Osawa,
Takeo Ohseki,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
Cell-free massive MIMO is a variant of multiuser MIMO and massive MIMO, in which the total number of antennas $LM$ is distributed among the $L$ remote radio units (RUs) in the system, enabling macrodiversity and joint processing. Due to pilot contamination and system scalability, each RU can only serve a limited number of users. Obtaining the optimal number of users simultaneously served on one resource block (RB) by the $L$ RUs regarding the sum spectral efficiency (SE) is not a simple challenge though, as many of the system parameters are intertwined. For example, the dimension $τ_p$ of orthogonal Demodulation Reference Signal (DMRS) pilots limits the number of users that an RU can serve. Thus, depending on $τ_p$, the optimal user load yielding the maximum sum SE will vary. Another key parameter is the users' uplink transmit power $P^{\rm ue}_{\rm tx}$, where a trade-off between users in outage, interference and energy inefficiency exists. We study the effect of multiple parameters in cell-free massive MIMO on the sum SE and user outage, as well as the performance of different levels of RU antenna distribution. We provide extensive numerical investigations to illuminate the behavior of the system SE with respect to the various parameters, including the effect of the system load, i.e., the number of active users to be served on any RB. The results show that in general a system with many RUs and few RU antennas yields the largest sum SE, where the benefits of distributed antennas reduce in very dense networks.
Submitted 8 June, 2022;
originally announced June 2022.
-
Meta-Learning of NAS for Few-shot Learning in Medical Image Applications
Authors:
Viet-Khoa Vo-Ho,
Kashu Yamazaki,
Hieu Hoang,
Minh-Triet Tran,
Ngan Le
Abstract:
Deep learning methods have been successful in solving tasks in machine learning and have made breakthroughs in many sectors owing to their ability to automatically extract features from unstructured data. However, their performance relies on manual trial-and-error processes for selecting an appropriate network architecture, hyperparameters for training, and pre-/post-procedures. Even though it has been shown that network architecture plays a critical role in learning feature representations from data and in the final performance, searching for the best network architecture is computationally intensive and relies heavily on researchers' experience. Automated machine learning (AutoML) and its advanced techniques, i.e., Neural Architecture Search (NAS), have been promoted to address those limitations. Beyond general computer vision tasks, NAS has also motivated various applications in multiple areas, including medical imaging. In medical imaging, NAS has made significant progress in improving the accuracy of image classification, segmentation, reconstruction, and more. However, NAS requires the availability of large annotated data, considerable computation resources, and pre-defined tasks. To address such limitations, meta-learning has been adopted in the scenarios of few-shot learning and multiple tasks. In this book chapter, we first present a brief review of NAS by discussing well-known approaches in search space, search strategy, and evaluation strategy. We then introduce various NAS approaches in medical imaging with different applications such as classification, segmentation, detection, reconstruction, etc. Meta-learning in NAS for few-shot learning and multiple tasks is then explained. Finally, we describe several open problems in NAS.
Submitted 16 March, 2022;
originally announced March 2022.
-
ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation
Authors:
Khoa Vo,
Kashu Yamazaki,
Sang Truong,
Minh-Triet Tran,
Akihiro Sugimoto,
Ngan Le
Abstract:
Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos, which is a challenging yet plays an important role in many tasks of video analysis and understanding. Despite the great achievement in TAPG, most existing works ignore the human perception of interaction between agents and the surrounding environment by applying a deep learning model as a…
▽ More
Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos, which is challenging yet plays an important role in many tasks of video analysis and understanding. Despite the great achievements in TAPG, most existing works ignore the human perception of interaction between agents and the surrounding environment by applying a deep learning model as a black box to the untrimmed videos to extract a visual representation of the video. Therefore, capturing these interactions between agents and the environment would be beneficial and could potentially improve the performance of TAPG. In this paper, we propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks: (i) an Agent-Aware Representation Network to obtain both agent-agent and agents-environment relationships in the video representation, and (ii) a Boundary Generation Network to estimate the confidence score of temporal intervals. In the Agent-Aware Representation Network, the interactions between agents are expressed through a local pathway, which operates at a local level to focus on the motions of agents, whereas the overall perception of the surroundings is expressed through a global pathway, which operates at a global level to perceive the effects of agents-environment interactions. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks (i.e., C3D, SlowFast and Two-Stream) show that our proposed ABN robustly outperforms state-of-the-art methods on TAPG regardless of the employed backbone network. We further examine the proposal quality by applying proposals generated by our method to temporal action detection (TAD) frameworks and evaluating their detection performance. The source code can be found at https://github.com/vhvkhoa/TAPG-AgentEnvNetwork.git.
Submitted 16 March, 2022;
originally announced March 2022.
-
Subspace-Based Pilot Decontamination in User-Centric Scalable Cell-Free Wireless Networks
Authors:
Fabian Göttsch,
Noboru Osawa,
Takeo Ohseki,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
We consider a cell-free wireless system operated in Time Division Duplex (TDD) mode with user-centric clusters of remote radio units (RUs). Since the number of uplink pilot dimensions per channel coherence slot is limited, co-pilot users might incur mutual pilot contamination. In the current literature, it is assumed that the long-term statistical knowledge of all user channels is available. This enables Minimum Mean-Square Error channel estimation or simplified dominant subspace projection, which achieves significant pilot decontamination under certain assumptions on the channel covariance matrices. However, estimating the channel covariance matrix or even just its dominant subspace at all RUs forming a user cluster is not an easy task. In fact, if not properly designed, a piloting scheme for such long-term statistics estimation will also be subject to the contamination problem. In this paper, we propose a new channel subspace estimation scheme explicitly designed for cell-free wireless networks. Our scheme is based on 1) a sounding reference signal (SRS) using Latin squares wideband frequency hopping, and 2) a subspace estimation method based on robust Principal Component Analysis (R-PCA). The SRS hopping scheme ensures that for any user and any RU participating in its cluster, only a few pilot measurements will contain strong co-pilot interference. These few heavily contaminated measurements are (implicitly) eliminated by R-PCA, which is designed to regularize the estimation and discount the "outlier" measurements. Our simulation results show that the proposed scheme achieves almost perfect subspace knowledge, which in turn yields system performance very close to that with ideal channel state information, thus essentially solving the problem of pilot contamination in cell-free user-centric TDD wireless networks.
Submitted 17 November, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
Uplink-Downlink Duality and Precoding Strategies with Partial CSI in Cell-Free Wireless Networks
Authors:
Fabian Göttsch,
Noboru Osawa,
Takeo Ohseki,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
We consider a scalable user-centric wireless network with dynamic cluster formation as defined by Björnsson and Sanguinetti. After having shown the importance of dominant channel subspace information for uplink (UL) pilot decontamination and having examined different UL combining schemes in our previous work, here we investigate precoding strategies for the downlink (DL). Distributed scalable DL precoding and power allocation methods are evaluated for different antenna distributions, user densities and UL pilot dimensions. We compare distributed power allocation methods to a scheme based on a particular form of UL-DL duality which is computable by a central processor based on the available partial channel state information. The new duality method achieves almost symmetric "optimistic ergodic rates" for UL and DL while saving considerable computational complexity since the UL combining vectors are reused as DL precoders.
Submitted 17 January, 2022; v1 submitted 13 January, 2022;
originally announced January 2022.
-
Energy Efficiency of Uplink Cell-Free Massive MIMO With Transmit Power Control in Measured Propagation Channel
Authors:
Thomas Choi,
Masaaki Ito,
Issei Kanno,
Jorge Gomez-Ponce,
Colton Bullard,
Takeo Ohseki,
Kosuke Yamazaki,
Andreas F. Molisch
Abstract:
Cell-free massive MIMO (CF-mMIMO) provides wireless connectivity for a large number of user equipments (UEs) using access points (APs) distributed across a wide area with high spectral efficiency (SE). The energy efficiency (EE) of the uplink is determined by (i) the transmit power control (TPC) algorithms, (ii) the numbers, configurations, and locations of the APs and the UEs, and (iii) the propagation channels between the APs and the UEs. This paper investigates all three aspects, based on extensive (~30,000 possible AP locations and 128 possible UE locations) channel measurement data at 3.5 GHz. We compare three different TPC algorithms, namely maximization of transmit power (max-power), maximization of minimum SE (max-min SE), and maximization of minimum EE (max-min EE) while guaranteeing a target SE. We also compare various antenna arrangements including fully-distributed and semi-distributed systems, where APs can be located on a regular grid or randomly, and the UEs can be placed in clusters or far apart. Overall, we show that the max-min EE TPC is highly effective in improving the uplink EE, especially when no UE within a set of served UEs is in a bad channel condition and when the BS antennas are fully-distributed.
Submitted 3 November, 2021;
originally announced November 2021.
-
AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation
Authors:
Khoa Vo,
Hyekang Joo,
Kashu Yamazaki,
Sang Truong,
Kris Kitani,
Minh-Triet Tran,
Ngan Le
Abstract:
Humans typically perceive the establishment of an action in a video through the interaction between an actor and the surrounding environment. An action only starts when the main actor in the video begins to interact with the environment, and it ends when the main actor stops the interaction. Despite the great progress in temporal action proposal generation, most existing works ignore this fact and leave their model to learn to propose actions as a black box. In this paper, we attempt to simulate this human ability by proposing an Actor Environment Interaction (AEI) network to improve the video representation for temporal action proposal generation. AEI contains two modules, i.e., a perception-based visual representation (PVR) module and a boundary-matching module (BMM). PVR represents each video snippet by taking human-human relations and humans-environment relations into consideration using the proposed adaptive attention mechanism. Then, the video representation is taken by BMM to generate action proposals. AEI is comprehensively evaluated on the ActivityNet-1.3 and THUMOS-14 datasets, on temporal action proposal and detection tasks, with two boundary-matching architectures (i.e., CNN-based and GCN-based) and two classifiers (i.e., Unet and P-GCN). Our AEI robustly outperforms the state-of-the-art methods with remarkable performance and generalization for both temporal action proposal generation and temporal action detection.
Submitted 24 October, 2021; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Deep Reinforcement Learning in Computer Vision: A Comprehensive Survey
Authors:
Ngan Le,
Vidhiwar Singh Rathour,
Kashu Yamazaki,
Khoa Luu,
Marios Savvides
Abstract:
Deep reinforcement learning augments the reinforcement learning framework and utilizes the powerful representation of deep neural networks. Recent works have demonstrated the remarkable successes of deep reinforcement learning in various domains including finance, medicine, healthcare, video games, robotics, and computer vision. In this work, we provide a detailed review of recent and state-of-the-art research advances of deep reinforcement learning in computer vision. We start by reviewing the theories of deep learning, reinforcement learning, and deep reinforcement learning. We then propose a categorization of deep reinforcement learning methodologies and discuss their advantages and limitations. In particular, we divide deep reinforcement learning into seven main categories according to their applications in computer vision, i.e., (i) landmark localization; (ii) object detection; (iii) object tracking; (iv) registration on both 2D image and 3D volumetric image data; (v) image segmentation; (vi) video analysis; and (vii) other applications. Each of these categories is further analyzed with reinforcement learning techniques, network design, and performance. Moreover, we provide a comprehensive analysis of the existing publicly available datasets and examine source code availability. Finally, we present some open issues and discuss future research directions on deep reinforcement learning in computer vision.
Submitted 25 August, 2021;
originally announced August 2021.
-
Calibration Method of the Monocular Omnidirectional Stereo Camera
Authors:
Ryota Kawamata,
Keiichi Betsui,
Kazuyoshi Yamazaki,
Rei Sakakibara,
Takeshi Shimano
Abstract:
Compact and low-cost devices are needed for autonomous driving to image and measure distances to objects across the full 360-degree surroundings. We have been developing an omnidirectional stereo camera exploiting two hyperbolic mirrors and a single lens and sensor set, which makes this camera compact and cost efficient. We establish a new calibration method for this camera considering higher-order radial distortion, detailed tangential distortion, an image sensor tilt, and a lens-mirror offset. Our method reduces the calibration error by factors of 6.0 and 4.3 for the upper- and lower-view images, respectively. The random error of the distance measurement is 4.9% and the systematic error is 5.7% for objects up to 14 meters away, an almost ninefold improvement over the conventional method. The remaining distance error is due to the degraded optical resolution of the prototype, which we plan to improve in future work.
Submitted 17 August, 2021;
originally announced August 2021.
-
The Impact of Subspace-Based Pilot Decontamination in User-Centric Scalable Cell-Free Wireless Networks
Authors:
Fabian Göttsch,
Noboru Osawa,
Takeo Ohseki,
Kosuke Yamazaki,
Giuseppe Caire
Abstract:
We consider a scalable user-centric wireless network with dynamic cluster formation as defined by Björnsson and Sanguinetti. Several options for scalable uplink (UL) processing are examined including: i) cluster size and SNR threshold criterion for cluster formation; ii) UL pilot dimension; iii) local detection and global (per cluster) combining. We use a simple model for the channel vector spatial correlation, which captures the fact that the propagation between UEs and RRHs is not isotropic. In particular, we define the ideal performance based on ideal but partial CSI, i.e., the CSI that can be estimated based on the cluster connectivity between users and antenna heads. In practice, CSI is estimated from UL pilots, and therefore it is affected by noise and pilot contamination. We show that a very simple subspace projection scheme attains essentially the same performance as ideal but partial CSI. This indicates that the essential information needed for pilot decontamination effectively reduces to the dominant channel subspaces.
Submitted 10 August, 2021;
originally announced August 2021.
-
Agent-Environment Network for Temporal Action Proposal Generation
Authors:
Viet-Khoa Vo-Ho,
Ngan Le,
Kashu Yamazaki,
Akihiro Sugimoto,
Minh-Triet Tran
Abstract:
Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding the video context because they lack an attention mechanism to express the concept of an action, an agent who performs the action, or the interaction between the agent and the environment. Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network (AEN). Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to tell which humans/agents are acting, and (ii) an environment pathway, operating at a global level to tell how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method robustly outperforms state-of-the-art methods regardless of the employed backbone network.
Submitted 16 March, 2022; v1 submitted 17 July, 2021;
originally announced July 2021.
-
Invertible Residual Network with Regularization for Effective Medical Image Segmentation
Authors:
Kashu Yamazaki,
Vidhiwar Singh Rathour,
T. Hoang Ngan Le
Abstract:
Deep Convolutional Neural Networks (CNNs), e.g., Residual Networks (ResNets), have been used successfully for many computer vision tasks, but are difficult to scale to 3D volumetric medical data. Memory is increasingly often the bottleneck when training 3D CNNs. Recently, invertible neural networks have been applied to significantly reduce the activation memory footprint when training neural networks with backpropagation, thanks to invertible functions that allow the input to be retrieved from the output without storing intermediate activations in memory to perform the backpropagation.
Among many successful network architectures, 3D Unet has been established as a standard architecture for volumetric medical segmentation. Thus, we choose 3D Unet as a baseline for a non-invertible network and then extend it with the invertible residual network. In this paper, we propose two versions of the invertible Residual Network, namely Partially Invertible Residual Network (Partially-InvRes) and Fully Invertible Residual Network (Fully-InvRes). In Partially-InvRes, the invertible residual layer is defined by a technique called additive coupling, whereas in Fully-InvRes, both invertible upsampling and downsampling operations are learned based on squeezing (known as pixel shuffle). Furthermore, to avoid overfitting caused by limited training data, a variational auto-encoder (VAE) branch is added to reconstruct the input volumetric data itself. Our results indicate that by using partially/fully invertible networks as the central workhorse in volumetric segmentation, we not only reduce memory overhead but also achieve segmentation performance comparable to that of the non-invertible 3D Unet. We have demonstrated the proposed networks on various volumetric datasets such as iSeg 2019 and BraTS 2020.
Submitted 16 March, 2021;
originally announced March 2021.
-
Cloth Manipulation Planning on Basis of Mesh Representations with Incomplete Domain Knowledge and Voxel-to-Mesh Estimation
Authors:
Solvi Arnold,
Daisuke Tanaka,
Kimitoshi Yamazaki
Abstract:
We consider the problem of open-goal planning for robotic cloth manipulation. The core of our system is a neural network trained as a forward model of cloth behaviour under manipulation, with planning performed through backpropagation. We introduce a neural network-based routine for estimating mesh representations from voxel input, and perform planning in mesh format internally. We address the problem of planning with incomplete domain knowledge by means of an explicit epistemic uncertainty signal. This signal is calculated from prediction divergence between two instances of the forward model network and used to avoid epistemic uncertainty during planning. Finally, we introduce logic for handling restriction of grasp points to a discrete set of candidates, in order to accommodate graspability constraints imposed by robotic hardware. We evaluate the system's mesh estimation, prediction, and planning ability on simulated cloth for sequences of one to three manipulations. Comparative experiments confirm that planning on the basis of estimated meshes improves accuracy compared to voxel-based planning, and that epistemic uncertainty avoidance improves performance under conditions of incomplete domain knowledge. Planning time cost is a few seconds. We additionally present qualitative results on robot hardware.
Submitted 12 November, 2021; v1 submitted 15 March, 2021;
originally announced March 2021.
-
Offset Curves Loss for Imbalanced Problem in Medical Segmentation
Authors:
Ngan Le,
Trung Le,
Kashu Yamazaki,
Toan Duc Bui,
Khoa Luu,
Marios Savides
Abstract:
Medical image segmentation has played an important role in medical analysis and has been widely developed for many clinical applications. Deep learning-based approaches have achieved high performance in semantic segmentation, but they are limited to the pixel-wise setting and suffer from the imbalanced-class data problem. In this paper, we tackle those limitations by developing a new deep learning-based model which takes into account the higher feature level, i.e., the region inside the contour; the intermediate feature level, i.e., offset curves around the contour; and the lower feature level, i.e., the contour. Our proposed Offset Curves (OsC) loss consists of three main fitting terms. The first fitting term focuses on pixel-wise level segmentation, whereas the second fitting term acts as an attention model which pays attention to the area around the boundaries (offset curves). The third term acts as a regularization term which takes the length of boundaries into account. We evaluate our proposed OsC loss on both 2D and 3D networks. Two common medical datasets, i.e., the retina DRIVE and brain tumor BRATS 2018 datasets, are used to benchmark the performance of our proposed loss. The experiments show that our proposed OsC loss function outperforms other mainstream loss functions such as Cross-Entropy, Dice, and Focal on the most common segmentation networks, Unet and FCN.
Submitted 4 December, 2020;
originally announced December 2020.
-
A Multi-task Contextual Atrous Residual Network for Brain Tumor Detection & Segmentation
Authors:
Ngan Le,
Kashu Yamazaki,
Dat Truong,
Kha Gia Quach,
Marios Savvides
Abstract:
In recent years, deep neural networks have achieved state-of-the-art performance in a variety of recognition and segmentation tasks in medical imaging, including brain tumor segmentation. We observe that brain tumor segmentation faces an imbalanced data problem, where the number of pixels belonging to the background class (non-tumor pixels) is much larger than the number of pixels belonging to the foreground class (tumor pixels). To address this problem, we propose a multi-task network formed as a cascaded structure. Our model has two objectives, i.e., (i) effectively differentiate the brain tumor regions and (ii) estimate the brain tumor mask. The first objective is performed by our proposed contextual brain tumor detection network, which plays the role of an attention gate and focuses only on the region around the brain tumor while ignoring the distant background, which is less correlated with the tumor. The second objective is built upon a 3D atrous residual network within an encoder-decoder network in order to effectively segment both large and small objects (brain tumors). Our 3D atrous residual network is designed with a skip connection to enable the gradient from the deep layers to be directly propagated to shallow layers; thus, features of different depths are preserved and used for refining each other. In order to incorporate larger contextual information from volume MRI data, our network utilizes 3D atrous convolution with various kernel sizes, which enlarges the receptive field of filters. Our proposed network has been evaluated on various datasets including the BRATS2015, BRATS2017 and BRATS2018 datasets, with both validation and testing sets. Our performance has been benchmarked by both region-based metrics and surface-based metrics. We have also conducted comparisons against state-of-the-art approaches.
Submitted 3 December, 2020;
originally announced December 2020.
-
Channel Estimation and Equalization for CP-OFDM-based OTFS in Fractional Doppler Channels
Authors:
Noriyuki Hashimoto,
Noboru Osawa,
Kosuke Yamazaki,
Shinsuke Ibi
Abstract:
Orthogonal time frequency and space (OTFS) modulation is a promising technology that satisfies high Doppler requirements for future mobile systems. OTFS modulation encodes information symbols and pilot symbols into the two-dimensional (2D) delay-Doppler (DD) domain. The received symbols suffer from inter-Doppler interference (IDI) in the fading channels with fractional Doppler shifts that are sampled at noninteger indices in the DD domain. IDI has been treated as an unavoidable effect because the fractional Doppler shifts cannot be obtained directly from the received pilot symbols. In this paper, we provide a solution to channel estimation for fractional Doppler channels. The proposed estimation provides new insight into the OTFS input-output relation in the DD domain as a 2D circular convolution with a small approximation. According to the input-output relation, we also provide a low-complexity channel equalization method using the estimated channel information. We demonstrate the error performance of the proposed channel estimation and equalization in several channels by simulations. The simulation results show that in high-mobility environments, the total system utilizing the proposed methods outperforms orthogonal frequency division multiplexing (OFDM) with ideal channel estimation and a conventional channel estimation method using a pseudo sequence.
Submitted 21 January, 2021; v1 submitted 29 October, 2020;
originally announced October 2020.
-
Model Bridging: Connection between Simulation Model and Neural Network
Authors:
Keiichi Kisamori,
Keisuke Yamazaki,
Yuto Komori,
Hiroshi Tokieda
Abstract:
The interpretability of machine learning, particularly for deep neural networks, is crucial for decision making in real-world applications. One approach is replacing the un-interpretable machine learning model with a surrogate model, which has a simple structure for interpretation. Another approach is understanding the target system by using a simulation modeled by human knowledge with interpretable simulation parameters. Recently, simulator calibration has been developed based on kernel mean embedding to estimate the simulation parameters as posterior distributions. Our idea is to use a simulation model as an interpretable surrogate model. However, the computational cost of simulator calibration is high owing to the complexity of the simulation model. Thus, we propose a "model-bridging" framework to bridge machine learning models with simulation models by a series of kernel mean embeddings to address these difficulties. The proposed framework enables us to obtain predictions and interpretable simulation parameters simultaneously without the computationally expensive calculations of the simulations. In this study, we apply the proposed framework to essential simulations in the manufacturing industry, such as production simulation and fluid dynamics simulation.
Submitted 21 July, 2020; v1 submitted 22 June, 2019;
originally announced June 2019.
-
Simulator Calibration under Covariate Shift with Kernels
Authors:
Keiichi Kisamori,
Motonobu Kanagawa,
Keisuke Yamazaki
Abstract:
We propose a novel calibration method for computer simulators, dealing with the problem of covariate shift. Covariate shift is the situation where the input distributions for training and testing differ, and it is ubiquitous in applications of simulations. Our approach is based on Bayesian inference with kernel mean embedding of distributions, and on the use of an importance-weighted reproducing kernel for covariate shift adaptation. We provide a theoretical analysis for the proposed method, including a novel theoretical result for conditional mean embedding, as well as empirical investigations suggesting its effectiveness in practice. The experiments include calibration of a widely used simulator for industrial manufacturing processes, where we also demonstrate how the proposed method may be useful for sensitivity analysis of model parameters.
Submitted 18 March, 2020; v1 submitted 21 September, 2018;
originally announced September 2018.
-
Asymptotic Accuracy of Bayesian Estimation for a Single Latent Variable
Authors:
Keisuke Yamazaki
Abstract:
In data science and machine learning, hierarchical parametric models, such as mixture models, are often used. They contain two kinds of variables: observable variables, which represent the parts of the data that can be directly measured, and latent variables, which represent the underlying processes that generate the data. Although there has been an increase in research on the estimation accuracy for observable variables, the theoretical analysis of estimating latent variables has not been thoroughly investigated. In a previous study, we determined the accuracy of a Bayes estimation for the joint probability of the latent variables in a dataset, and we proved that the Bayes method is asymptotically more accurate than the maximum-likelihood method. However, the accuracy of the Bayes estimation for a single latent variable remains unknown. In the present paper, we derive the asymptotic expansions of the error functions, which are defined by the Kullback-Leibler divergence, for two types of single-variable estimations when the statistical regularity is satisfied. Our results indicate that the accuracies of the Bayes and maximum-likelihood methods are asymptotically equivalent and clarify that the Bayes method is only advantageous for multivariable estimations.
Submitted 17 April, 2015; v1 submitted 25 August, 2014;
originally announced August 2014.
-
Stochastic complexity of Bayesian networks
Authors:
Keisuke Yamazaki,
Sumio Watanabe
Abstract:
Bayesian networks are now used in many fields, for example, system diagnosis, data mining, and clustering. In spite of their wide range of applications, their statistical properties have not yet been clarified, because the models are nonidentifiable and non-regular. In a Bayesian network, the set of parameters for a smaller model is an analytic set with singularities in the space of larger ones. Because of these singularities, the Fisher information matrices are not positive definite. In other words, the mathematical foundation for learning has not been constructed. In recent years, however, we have developed a method to analyze non-regular models using algebraic geometry. This method revealed the relation between a model's singularities and its statistical properties. In this paper, applying this method to Bayesian networks with latent variables, we clarify the order of the stochastic complexities. Our result shows that their upper bound is smaller than the dimension of the parameter space. This means that the Bayesian generalization error is also far smaller than that of a regular model, and that Schwarz's model selection criterion BIC needs to be improved for Bayesian networks.
Submitted 19 October, 2012;
originally announced December 2012.
-
Performance Evaluation of Treecode Algorithm for N-Body Simulation Using GridRPC System
Authors:
Truong Vinh Truong Duy,
Katsuhiro Yamazaki,
Shigeru Oyanagi
Abstract:
This paper aims to improve the performance of the treecode algorithm for N-Body simulation by employing the NetSolve GridRPC programming model to exploit multiple clusters. N-Body is a classical problem that appears in many areas of science and engineering, including astrophysics, molecular dynamics, and graphics. In N-Body simulation, the routine for calculating the forces on the bodies, which accounts for upwards of 90% of the cycles in typical computations, is eminently suitable for obtaining parallelism with GridRPC calls. It is divided among the compute nodes by issuing multiple GridRPC requests to them simultaneously. The performance of the GridRPC implementation is then compared to that of the MPI version and the hybrid MPI-OpenMP version of the treecode algorithm on individual clusters.
Submitted 10 November, 2012;
originally announced November 2012.
-
Hybrid MPI-OpenMP Paradigm on SMP Clusters: MPEG-2 Encoder and N-Body Simulation
Authors:
Truong Vinh Truong Duy,
Katsuhiro Yamazaki,
Kosai Ikegami,
Shigeru Oyanagi
Abstract:
Clusters of SMP nodes support a wide diversity of parallel programming paradigms. Combining shared-memory and message-passing parallelization within the same application, the hybrid MPI-OpenMP paradigm is an emerging trend in parallel programming for fully exploiting distributed shared-memory architectures. In this paper, we improve the performance of an MPEG-2 encoder and an n-body simulation by employing the hybrid MPI-OpenMP programming paradigm on SMP clusters. The hierarchical image data structure of the MPEG bit-stream is eminently suitable for the hybrid model to achieve multiple levels of parallelism: MPI for parallelism at the group-of-pictures level across SMP nodes, and OpenMP for parallelism within pictures at the slice level within each SMP node. Similarly, in the n-body simulation, the workload of the force calculation, which accounts for upwards of 90% of the cycles in typical computations, is shared among OpenMP threads after ORB domain decomposition among MPI processes. In addition, loop scheduling of OpenMP threads is adopted with an appropriate chunk size to provide better load balance, leading to enhanced performance. For the n-body simulation, experimental results demonstrate that the hybrid MPI-OpenMP program outperforms the corresponding pure MPI program by average factors of 1.52 on a 4-way cluster and 1.21 on a 2-way cluster. Likewise, the hybrid model offers a performance improvement of 18% compared to the MPI model for the MPEG-2 encoder.
Submitted 10 November, 2012;
originally announced November 2012.
-
Asymptotic Accuracy of Distribution-Based Estimation for Latent Variables
Authors:
Keisuke Yamazaki
Abstract:
Hierarchical statistical models are widely employed in information science and data engineering. The models consist of two types of variables: observable variables that represent the given data and latent variables for the unobservable labels. An asymptotic analysis of the models plays an important role in evaluating the learning process; the result of the analysis is applied not only to theoretical but also to practical situations, such as optimal model selection and active learning. There are many studies of generalization errors, which measure the prediction accuracy of the observable variables. However, the accuracy of estimating the latent variables has not yet been elucidated. For a quantitative evaluation of this, the present paper formulates distribution-based functions for the errors in the estimation of the latent variables. The asymptotic behavior is analyzed for both the maximum likelihood and the Bayes methods.
Submitted 19 February, 2014; v1 submitted 10 April, 2012;
originally announced April 2012.