Multimodal Reaching-Position Prediction for ADL Support Using Neural Networks

Authors: Yutaka Takase, Kimitoshi Yamazaki

Abstract: This study aimed to develop daily living support robots for patients with hemiplegia and the elderly. To support the daily living activities using robots in ordinary households without imposing physical and mental burdens on users, the system must detect the actions of the user and move appropriately according to their motions. We propose a reaching-position prediction scheme that targets the motion of lifting the upper arm, which is burdensome for patients with hemiplegia and the elderly in daily living activities. For this motion, it is difficult to obtain effective features to create a prediction model in environments where large-scale sensor system installation is not feasible and the motion time is short. We performed motion-collection experiments, revealed the features of the target motion and built a prediction model using the multimodal motion features and deep learning. The proposed model achieved an accuracy of 93 \% macro average and F1-score of 0.69 for a 9-class classification prediction at 35\% of the motion completion.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.00307 [pdf, other]

HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model

Authors: Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le

Abstract: Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans do in first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between two modalities. In this paper, we take an inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationship for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understandings. This comprises three alignment types: video-narration, noun-entity, verb-entities alignments. Our method demonstrates strong interpretability in both quantitative and qualitative experiments; while maintaining competitive performances on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query.

Submitted 6 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

Comments: under submission

arXiv:2405.01124 [pdf, other]

Investigating Self-Supervised Image Denoising with Denaturation

Authors: Hiroki Waida, Kimihiro Yamazaki, Atsushi Tokuhisa, Mutsuyo Wada, Yuichiro Wada

Abstract: Self-supervised learning for image denoising problems in the presence of denaturation for noisy data is a crucial approach in machine learning. However, theoretical understanding of the performance of the approach that uses denatured data is lacking. To provide better understanding of the approach, in this paper, we analyze a self-supervised denoising algorithm that uses denatured data in depth through theoretical analysis and numerical experiments. Through the theoretical analysis, we discuss that the algorithm finds desired solutions to the optimization problem with the population risk, while the guarantee for the empirical risk depends on the hardness of the denoising task in terms of denaturation levels. We also conduct several experiments to investigate the performance of an extended algorithm in practice. The results indicate that the algorithm training with denatured images works, and the empirical performance aligns with the theoretical results. These results suggest several insights for further improvement of self-supervised image denoising that uses denatured data in future directions.

Submitted 8 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.12631 [pdf]

Breaching the Bottleneck: Evolutionary Transition from Reward-Driven Learning to Reward-Agnostic Domain-Adapted Learning in Neuromodulated Neural Nets

Authors: Solvi Arnold, Reiji Suzuki, Takaya Arita, Kimitoshi Yamazaki

Abstract: Advanced biological intelligence learns efficiently from an information-rich stream of stimulus information, even when feedback on behaviour quality is sparse or absent. Such learning exploits implicit assumptions about task domains. We refer to such learning as Domain-Adapted Learning (DAL). In contrast, AI learning algorithms rely on explicit externally provided measures of behaviour quality to acquire fit behaviour. This imposes an information bottleneck that precludes learning from diverse non-reward stimulus information, limiting learning efficiency. We consider the question of how biological evolution circumvents this bottleneck to produce DAL. We propose that species first evolve the ability to learn from reward signals, providing inefficient (bottlenecked) but broad adaptivity. From there, integration of non-reward information into the learning process can proceed via gradual accumulation of biases induced by such information on specific task domains. This scenario provides a biologically plausible pathway towards bottleneck-free, domain-adapted learning. Focusing on the second phase of this scenario, we set up a population of NNs with reward-driven learning modelled as Reinforcement Learning (A2C), and allow evolution to improve learning efficiency by integrating non-reward information into the learning process using a neuromodulatory update mechanism. On a navigation task in continuous 2D space, evolved DAL agents show a 300-fold increase in learning speed compared to pure RL agents. Evolution is found to eliminate reliance on reward information altogether, allowing DAL agents to learn from non-reward information exclusively, using local neuromodulation-based connection weight updates only.

Submitted 19 April, 2024; originally announced April 2024.

Comments: 9 pages, 5 figures

ACM Class: I.2.6

arXiv:2402.10429 [pdf, ps, other]

Fixed Confidence Best Arm Identification in the Bayesian Setting

Authors: Kyoungseok Jang, Junpei Komiyama, Kazutoshi Yamazaki

Abstract: We consider the fixed-confidence best arm identification (FC-BAI) problem in the Bayesian setting. This problem aims to find the arm of the largest mean with a fixed confidence level when the bandit model has been sampled from the known prior. Most studies on the FC-BAI problem have been conducted in the frequentist setting, where the bandit model is predetermined before the game starts. We show that the traditional FC-BAI algorithms studied in the frequentist setting, such as track-and-stop and top-two algorithms, result in arbitrarily suboptimal performances in the Bayesian setting. We also obtain a lower bound of the expected number of samples in the Bayesian setting and introduce a variant of successive elimination that has a matching performance with the lower bound up to a logarithmic factor. Simulations verify the theoretical results.

Submitted 22 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2310.03923 [pdf, other]

Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

Authors: Kashu Yamazaki, Taisei Hanyu, Khoa Vo, Thang Pham, Minh Tran, Gianfranco Doretto, Anh Nguyen, Ngan Le

Abstract: Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2306.06842 [pdf, other]

AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation

Authors: Kashu Yamazaki, Taisei Hanyu, Minh Tran, Adrian de Luis, Roy McCann, Haitao Liao, Chase Rainwater, Meredith Adkins, Jackson Cothren, Ngan Le

Abstract: Aerial Image Segmentation is a top-down perspective semantic segmentation and has several challenging characteristics such as strong imbalance in the foreground-background distribution, complex background, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle these problems, we inherit the advantages of Transformers and propose AerialFormer, which unifies Transformers at the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) at the expanding path. Our AerialFormer is designed as a hierarchical structure, in which Transformer encoder outputs multi-scale features and MD-CNNs decoder aggregates information from the multi-scales. Thus, it takes both local and global contexts into consideration to render powerful representations and high-resolution segmentation. We have benchmarked AerialFormer on three common datasets including iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that our proposed AerialFormer outperforms previous state-of-the-art methods with remarkable performance. Our source code will be publicly available upon acceptance.

Submitted 1 October, 2023; v1 submitted 11 June, 2023; originally announced June 2023.

Comments: under review

arXiv:2305.08363 [pdf, other]

User-Centric Clustering Under Fairness Scheduling in Cell-Free Massive MIMO

Authors: Fabian Göttsch, Noboru Osawa, Yoshiaki Amano, Issei Kanno, Kosuke Yamazaki, Giuseppe Caire

Abstract: We consider fairness scheduling in a user-centric cell-free massive MIMO network, where $L$ remote radio units, each with $M$ antennas, serve $K_{\rm tot} \approx LM$ user equipments (UEs). Recent results show that the maximum network sum throughput is achieved where $K_{\rm act} \approx \frac{LM}{2}$ UEs are simultaneously active in any given time-frequency slots. However, the number of users $K_{\rm tot}$ in the network is usually much larger. This requires that users are scheduled over the time-frequency resource and achieve a certain throughput rate as an average over the slots. We impose throughput fairness among UEs with a scheduling approach aiming to maximize a concave component-wise non-decreasing network utility function of the per-user throughput rates. In cell-free user-centric networks, the pilot and cluster assignment is usually done for a given set of active users. Combined with fairness scheduling, this requires pilot and cluster reassignment at each scheduling slot, involving an enormous overhead of control signaling exchange between network entities. We propose a fixed pilot and cluster assignment scheme (independent of the scheduling decisions), which outperforms the baseline method in terms of UE throughput, while requiring much less control information exchange between network entities.

Submitted 15 May, 2023; originally announced May 2023.

Comments: arXiv admin note: text overlap with arXiv:2211.15294

arXiv:2212.06206 [pdf, other]

Contextual Explainable Video Representation: Human Perception-based Understanding

Authors: Khoa Vo, Kashu Yamazaki, Phong X. Nguyen, Phat Nguyen, Khoa Luu, Ngan Le

Abstract: Video understanding is a growing field and a subject of intense research, which includes many interesting tasks to understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, video retrieval. One of the most challenging problems in video understanding is dealing with feature extraction, i.e. extract contextual visual representation from given untrimmed video due to the long and complicated temporal structure of unconstrained videos. Different from existing approaches, which apply a pre-trained backbone network as a black-box to extract visual representation, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore, it is very crucial to design a contextual explainable video representation extraction that can capture each of such factors and model the relationships between them. In this paper, we discuss approaches, that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception based-contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.

Submitted 17 December, 2022; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: Accepted in Asilomar Conference 2022

arXiv:2212.05136 [pdf, other]

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Authors: Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le

Abstract: Video anomaly detection (VAD) -- commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to its labor-intensive nature -- is a challenging problem in video surveillance where the frames of anomaly need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations in the novel technique. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and ViT feature. The extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly-used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.

Submitted 3 July, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

Comments: Published at the 30th IEEE International Conference on Image Processing (IEEE ICIP 2023)

arXiv:2211.15294 [pdf, other]

Fairness Scheduling in Dense User-Centric Cell-Free Massive MIMO Networks

Authors: Fabian Göttsch, Noboru Osawa, Takeo Ohseki, Yoshiaki Amano, Issei Kanno, Kosuke Yamazaki, Giuseppe Caire

Abstract: We consider a user-centric scalable cell-free massive MIMO network with a total of $LM$ distributed remote radio unit antennas serving $K$ user equipments (UEs). Many works in the current literature assume $LM\gg K$, enabling high UE data rates but also leading to a system not operating at its maximum performance in terms of sum throughput. We provide a new perspective on cell-free massive MIMO networks, investigating rate allocation and the UE density regime in which the network makes use of its full capability. The UE density $K$ approximately equal to $\frac{LM}{2}$ is the range in which the system reaches the largest sum throughput. In addition, there is a significant fraction of UEs with relatively low throughput, when serving $K>\frac{LM}{2}$ UEs simultaneously. We propose to reduce the number of active UEs per time slot, such that the system does not operate at ``full load'', and impose throughput fairness among all users via a scheduler designed to maximize a suitably defined concave componentwise non-decreasing network utility function. Our numerical simulations show that we can tune the system such that a desired distribution of the UE throughput, depending on the utility function, is achieved.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.15103 [pdf, other]

VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Authors: Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le

Abstract: Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities including (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee learnt embedding features are matched with the captions semantics. Comprehensive experiments and extensive ablation studies on ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods on accuracy and diversity. Source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.

Submitted 15 February, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: Accepted to AAAI 2023 Oral

arXiv:2210.06323 [pdf, other]

AISFormer: Amodal Instance Segmentation with Transformer

Authors: Minh Tran, Khoa Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, Ngan Le

Abstract: Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possible occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level features coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolution Neural Networks (CNN). In this work, we present AISFormer, an AIS framework, with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features. (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder (iii) invisible mask embedding: model the coherence between the amodal and visible masks, and (iv) mask predicting: estimate output masks including occluder, visible, amodal and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks i.e. KINS, D2SA, and COCOA-cls to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer

Submitted 17 March, 2024; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to BMVC2022

arXiv:2210.02578 [pdf, other]

AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

Authors: Khoa Vo, Sang Truong, Kashu Yamazaki, Bhiksha Raj, Minh-Triet Tran, Ngan Le

Abstract: Temporal action proposal generation (TAPG) is a challenging task, which requires localizing action intervals in an untrimmed video. Intuitively, we as humans, perceive an action through the interactions between actors, relevant objects, and the surrounding environment. Despite the significant progress of TAPG, a vast majority of existing methods ignore the aforementioned principle of the human perceiving process by applying a backbone network into a given video as a black-box. In this paper, we propose to model these interactions with a multi-modal representation network, namely, Actors-Objects-Environment Interaction Network (AOE-Net). Our AOE-Net consists of two modules, i.e., perception-based multi-modal representation (PMR) and boundary-matching module (BMM). Additionally, we introduce adaptive attention mechanism (AAM) in PMR to focus only on main actors (or relevant objects) and model the relationships among them. PMR module represents each video snippet by a visual-linguistic feature, in which main actors and surrounding environment are represented by visual information, whereas relevant objects are depicted by linguistic features through an image-text model. BMM module processes the sequence of visual-linguistic features as its input and generates action proposals. Comprehensive experiments and extensive ablation studies on ActivityNet-1.3 and THUMOS-14 datasets show that our proposed AOE-Net outperforms previous state-of-the-art methods with remarkable performance and generalization for both TAPG and temporal action detection. To prove the robustness and effectiveness of AOE-Net, we further conduct an ablation study on egocentric videos, i.e. EPIC-KITCHENS 100 dataset. Source code is available upon acceptance.

Submitted 5 October, 2022; originally announced October 2022.

Comments: Accepted for publication in International Journal of Computer Vision

arXiv:2207.11478 [pdf, other]

Overloaded Pilot Assignment with Pilot Decontamination for Cell-Free Systems

Authors: Noboru Osawa, Fabian Göttsch, Issei Kanno, Takeo Ohseki, Yoshiaki Amano, Kosuke Yamazaki, Giuseppe Caire

Abstract: The pilot contamination in cell-free massive multiple-input-multiple-output (CF-mMIMO) must be addressed for accommodating a large number of users. In previous works, we have investigated a decontamination method called subspace projection (SP). The SP separates interference from co-pilot users by using the orthogonality of the principal components of the users' channel subspaces. Non-overloaded pilot assignment (PA), where each radio unit (RU) does not assign the same pilot to different users, limits the spectral efficiency (SE) of the system, since SP channel estimation is able to deal with co-pilot users that have nearly orthogonal subspaces. Motivated by this limitation, this paper introduces overloaded PA methods adjusted for the decontamination in order to improve the sum SE of CF systems. Numerical simulations show that the overloaded PA methods give higher SE than that of non-overloaded PA at a high user density scenario.

Submitted 10 October, 2022; v1 submitted 23 July, 2022; originally announced July 2022.

Comments: 7 pages, 2 figures, this paper was submitted to IEEE WCNC 2023

arXiv:2206.12972 [pdf, other]

VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

Authors: Kashu Yamazaki, Sang Truong, Khoa Vo, Michael Kidd, Chase Rainwater, Khoa Luu, Ngan Le

Abstract: In this paper, we leverage the human perceiving process, that involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities, i.e., (i) vision modality to capture global visual content of the entire scene and (ii) language modality to extract scene elements description of both human and non-human objects (e.g. animals, vehicles, etc), visual and non-visual elements (e.g. relations, activities, etc). Furthermore, we propose to train our proposed VLCap under a contrastive learning VL loss. The experiments and ablation studies on ActivityNet Captions and YouCookII datasets show that our VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.

Submitted 6 August, 2022; v1 submitted 26 June, 2022; originally announced June 2022.

Comments: accepted by The 29th IEEE International Conference on Image Processing (IEEE ICIP) 2022

arXiv:2206.10920 [pdf]

Recognising Affordances in Predicted Futures to Plan with Consideration of Non-canonical Affordance Effects

Authors: Solvi Arnold, Mami Kuroishi, Tadashi Adachi, Kimitoshi Yamazaki

Abstract: We propose a novel system for action sequence planning based on a combination of affordance recognition and a neural forward model predicting the effects of affordance execution. By performing affordance recognition on predicted futures, we avoid reliance on explicit affordance effect definitions for multi-step planning. Because the system learns affordance effects from experience data, the system can foresee not just the canonical effects of an affordance, but also situation-specific side-effects. This allows the system to avoid planning failures due to such non-canonical effects, and makes it possible to exploit non-canonical effects for realising a given goal. We evaluate the system in simulation, on a set of test tasks that require consideration of canonical and non-canonical affordance effects.

Submitted 22 June, 2022; originally announced June 2022.

Comments: 8 pages, 8 figures, video: http://youtu.be/4naJ5IghHcg

ACM Class: I.2.9; I.2.6

arXiv:2206.03801 [pdf, other]

Robust PCA for Subspace Estimation in User-Centric Cell-Free Wireless Networks

Authors: Fabian Göttsch, Noboru Osawa, Takeo Ohseki, Kosuke Yamazaki, Giuseppe Caire

Abstract: We consider a scalable user-centric cell-free massive MIMO network with distributed remote radio units (RUs), enabling macrodiversity and joint processing. Due to the limited uplink (UL) pilot dimension, multiuser interference in the UL pilot transmission phase makes channel estimation a non-trivial problem. We make use of two types of UL pilot signals, sounding reference signal (SRS) and demodulation reference signal (DMRS) pilots, for the estimation of the channel subspace and its instantaneous realization, respectively. The SRS pilots are transmitted over multiple time slots and resource blocks according to a Latin squares based hopping scheme, which aims at averaging out the interference of different SRS co-pilot users. We propose a robust principal component analysis approach for channel subspace estimation from the SRS signal samples, employed at the RUs for each associated user. The estimated subspace is further used at the RUs for DMRS pilot decontamination and instantaneous channel estimation. We provide numerical simulations to compare the system performance using our subspace and channel estimation scheme with the cases of ideal partial subspace/channel knowledge and pilot matching channel estimation. The results show that a system with a properly designed SRS pilot hopping scheme can closely approximate the performance of a genie-aided system.

Submitted 8 June, 2022; originally announced June 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2203.00714

arXiv:2206.03800 [pdf, other]

Optimal User Load and Energy Efficiency in User-Centric Cell-Free Wireless Networks

Authors: Fabian Göttsch, Noboru Osawa, Takeo Ohseki, Kosuke Yamazaki, Giuseppe Caire

Abstract: Cell-free massive MIMO is a variant of multiuser MIMO and massive MIMO, in which the total number of antennas $LM$ is distributed among the $L$ remote radio units (RUs) in the system, enabling macrodiversity and joint processing. Due to pilot contamination and system scalability, each RU can only serve a limited number of users. Obtaining the optimal number of users simultaneously served on one resource block (RB) by the $L$ RUs regarding the sum spectral efficiency (SE) is not a simple challenge though, as many of the system parameters are intertwined. For example, the dimension $τ_p$ of orthogonal Demodulation Reference Signal (DMRS) pilots limits the number of users that an RU can serve. Thus, depending on $τ_p$, the optimal user load yielding the maximum sum SE will vary. Another key parameter is the users' uplink transmit power $P^{\rm ue}_{\rm tx}$, where a trade-off between users in outage, interference and energy inefficiency exists. We study the effect of multiple parameters in cell-free massive MIMO on the sum SE and user outage, as well as the performance of different levels of RU antenna distribution. We provide extensive numerical investigations to illuminate the behavior of the system SE with respect to the various parameters, including the effect of the system load, i.e., the number of active users to be served on any RB. The results show that in general a system with many RUs and few RU antennas yields the largest sum SE, where the benefits of distributed antennas reduce in very dense networks.

Submitted 8 June, 2022; originally announced June 2022.

arXiv:2203.08951 [pdf, other]

Meta-Learning of NAS for Few-shot Learning in Medical Image Applications

Authors: Viet-Khoa Vo-Ho, Kashu Yamazaki, Hieu Hoang, Minh-Triet Tran, Ngan Le

Abstract: Deep learning methods have been successful in solving tasks in machine learning and have made breakthroughs in many sectors owing to their ability to automatically extract features from unstructured data. However, their performance relies on manual trial-and-error processes for selecting an appropriate network architecture, hyperparameters for training, and pre-/post-procedures. Even though it has been shown that network architecture plays a critical role in learning feature representations from data and in the final performance, searching for the best network architecture is computationally intensive and heavily relies on researchers' experience. Automated machine learning (AutoML) and its advanced techniques, e.g., Neural Architecture Search (NAS), have been promoted to address those limitations. Beyond general computer vision tasks, NAS has also motivated various applications in multiple areas including medical imaging. In medical imaging, NAS has made significant progress in improving the accuracy of image classification, segmentation, reconstruction, and more. However, NAS requires the availability of large annotated data, considerable computation resources, and pre-defined tasks. To address such limitations, meta-learning has been adopted in the scenarios of few-shot learning and multiple tasks. In this book chapter, we first present a brief review of NAS by discussing well-known approaches in search space, search strategy, and evaluation strategy. We then introduce various NAS approaches in medical imaging with different applications such as classification, segmentation, detection, reconstruction, etc. Meta-learning in NAS for few-shot learning and multiple tasks is then explained. Finally, we describe several open problems in NAS.

Submitted 16 March, 2022; originally announced March 2022.

Comments: book chapter, in Meta-Learning with Medical Imaging and Health Informatics Applications

arXiv:2203.08942 [pdf, other]

doi 10.1109/ACCESS.2021.3110973

ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation

Authors: Khoa Vo, Kashu Yamazaki, Sang Truong, Minh-Triet Tran, Akihiro Sugimoto, Ngan Le

Abstract: Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos, which is challenging yet plays an important role in many tasks of video analysis and understanding. Despite the great achievement in TAPG, most existing works ignore the human perception of interaction between agents and the surrounding environment by applying a deep learning model as a black-box to the untrimmed videos to extract video visual representation. Therefore, it is beneficial, and can potentially improve the performance of TAPG, to capture these interactions between agents and the environment. In this paper, we propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks: (i) an Agent-Aware Representation Network to obtain both agent-agent and agents-environment relationships in the video representation, and (ii) a Boundary Generation Network to estimate the confidence score of temporal intervals. In the Agent-Aware Representation Network, the interactions between agents are expressed through a local pathway, which operates at a local level to focus on the motions of agents, whereas the overall perception of the surroundings is expressed through a global pathway, which operates at a global level to perceive the effects of agents-environment. Comprehensive evaluations on 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks (i.e., C3D, SlowFast and Two-Stream) show that our proposed ABN robustly outperforms state-of-the-art methods regardless of the employed backbone network on TAPG. We further examine the proposal quality by leveraging proposals generated by our method onto temporal action detection (TAD) frameworks and evaluate their detection performances. The source code can be found at https://github.com/vhvkhoa/TAPG-AgentEnvNetwork.git.

Submitted 16 March, 2022; originally announced March 2022.

Comments: Accepted in the journal of IEEE Access Vol. 9

arXiv:2203.00714 [pdf, other]

Subspace-Based Pilot Decontamination in User-Centric Scalable Cell-Free Wireless Networks

Authors: Fabian Göttsch, Noboru Osawa, Takeo Ohseki, Kosuke Yamazaki, Giuseppe Caire

Abstract: We consider a cell-free wireless system operated in Time Division Duplex (TDD) mode with user-centric clusters of remote radio units (RUs). Since the uplink pilot dimension per channel coherence slot is limited, co-pilot users might incur mutual pilot contamination. In the current literature, it is assumed that the long-term statistical knowledge of all user channels is available. This enables Minimum Mean-Square Error channel estimation or simplified dominant subspace projection, which achieves significant pilot decontamination under certain assumptions on the channel covariance matrices. However, estimating the channel covariance matrix or even just its dominant subspace at all RUs forming a user cluster is not an easy task. In fact, if not properly designed, a piloting scheme for such long-term statistics estimation will also be subject to the contamination problem. In this paper, we propose a new channel subspace estimation scheme explicitly designed for cell-free wireless networks. Our scheme is based on 1) a sounding reference signal (SRS) using Latin squares wideband frequency hopping, and 2) a subspace estimation method based on robust Principal Component Analysis (R-PCA). The SRS hopping scheme ensures that for any user and any RU participating in its cluster, only a few pilot measurements will contain strong co-pilot interference. These few heavily contaminated measurements are (implicitly) eliminated by R-PCA, which is designed to regularize the estimation and discount the "outlier" measurements. Our simulation results show that the proposed scheme achieves almost perfect subspace knowledge, which in turn yields system performance very close to that with ideal channel state information, thus essentially solving the problem of pilot contamination in cell-free user-centric TDD wireless networks.

Submitted 17 November, 2022; v1 submitted 1 March, 2022; originally announced March 2022.

arXiv:2201.04922 [pdf, other]

Uplink-Downlink Duality and Precoding Strategies with Partial CSI in Cell-Free Wireless Networks

Authors: Fabian Göttsch, Noboru Osawa, Takeo Ohseki, Kosuke Yamazaki, Giuseppe Caire

Abstract: We consider a scalable user-centric wireless network with dynamic cluster formation as defined by Björnsson and Sanguinetti. After having shown the importance of dominant channel subspace information for uplink (UL) pilot decontamination and having examined different UL combining schemes in our previous work, here we investigate precoding strategies for the downlink (DL). Distributed scalable DL precoding and power allocation methods are evaluated for different antenna distributions, user densities and UL pilot dimensions. We compare distributed power allocation methods to a scheme based on a particular form of UL-DL duality which is computable by a central processor based on the available partial channel state information. The new duality method achieves almost symmetric "optimistic ergodic rates" for UL and DL while saving considerable computational complexity since the UL combining vectors are reused as DL precoders.

Submitted 17 January, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

Comments: arXiv admin note: text overlap with arXiv:2108.04579

arXiv:2111.02514 [pdf, ps, other]

doi 10.1109/OJCAS.2021.3125894

Energy Efficiency of Uplink Cell-Free Massive MIMO With Transmit Power Control in Measured Propagation Channel

Authors: Thomas Choi, Masaaki Ito, Issei Kanno, Jorge Gomez-Ponce, Colton Bullard, Takeo Ohseki, Kosuke Yamazaki, Andreas F. Molisch

Abstract: Cell-free massive MIMO (CF-mMIMO) provides wireless connectivity for a large number of user equipments (UEs) using access points (APs) distributed across a wide area with high spectral efficiency (SE). The energy efficiency (EE) of the uplink is determined by (i) the transmit power control (TPC) algorithms, (ii) the numbers, configurations, and locations of the APs and the UEs, and (iii) the propagation channels between the APs and the UEs. This paper investigates all three aspects, based on extensive (~30,000 possible AP locations and 128 possible UE locations) channel measurement data at 3.5 GHz. We compare three different TPC algorithms, namely maximization of transmit power (max-power), maximization of minimum SE (max-min SE), and maximization of minimum EE (max-min EE) while guaranteeing a target SE. We also compare various antenna arrangements including fully-distributed and semi-distributed systems, where APs can be located on a regular grid or randomly, and the UEs can be placed in clusters or far apart. Overall, we show that the max-min EE TPC is highly effective in improving the uplink EE, especially when no UE within a set of served UEs is in a bad channel condition and when the BS antennas are fully-distributed.

Submitted 3 November, 2021; originally announced November 2021.

Comments: 12 pages, 12 figures, IEEE Open Journal of Circuits and Systems. arXiv admin note: text overlap with arXiv:2108.02130

Journal ref: 2021 IEEE Open Journal of Circuits and Systems

arXiv:2110.11474 [pdf, other]

AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation

Authors: Khoa Vo, Hyekang Joo, Kashu Yamazaki, Sang Truong, Kris Kitani, Minh-Triet Tran, Ngan Le

Abstract: Humans typically perceive the establishment of an action in a video through the interaction between an actor and the surrounding environment. An action only starts when the main actor in the video begins to interact with the environment, while it ends when the main actor stops the interaction. Despite the great progress in temporal action proposal generation, most existing works ignore the aforementioned fact and leave their model learning to propose actions as a black-box. In this paper, we make an attempt to simulate that ability of a human by proposing Actor Environment Interaction (AEI) network to improve the video representation for temporal action proposals generation. AEI contains two modules, i.e., perception-based visual representation (PVR) and boundary-matching module (BMM). PVR represents each video snippet by taking human-human relations and humans-environment relations into consideration using the proposed adaptive attention mechanism. Then, the video representation is taken by BMM to generate action proposals. AEI is comprehensively evaluated in ActivityNet-1.3 and THUMOS-14 datasets, on temporal action proposal and detection tasks, with two boundary-matching architectures (i.e., CNN-based and GCN-based) and two classifiers (i.e., Unet and P-GCN). Our AEI robustly outperforms the state-of-the-art methods with remarkable performance and generalization for both temporal action proposal generation and temporal action detection.

Submitted 24 October, 2021; v1 submitted 21 October, 2021; originally announced October 2021.

Comments: Accepted in BMVC 2021 (Oral Session)

arXiv:2108.11510 [pdf, other]

Deep Reinforcement Learning in Computer Vision: A Comprehensive Survey

Authors: Ngan Le, Vidhiwar Singh Rathour, Kashu Yamazaki, Khoa Luu, Marios Savvides

Abstract: Deep reinforcement learning augments the reinforcement learning framework and utilizes the powerful representation of deep neural networks. Recent works have demonstrated the remarkable successes of deep reinforcement learning in various domains including finance, medicine, healthcare, video games, robotics, and computer vision. In this work, we provide a detailed review of recent and state-of-the-art research advances of deep reinforcement learning in computer vision. We start with comprehending the theories of deep learning, reinforcement learning, and deep reinforcement learning. We then propose a categorization of deep reinforcement learning methodologies and discuss their advantages and limitations. In particular, we divide deep reinforcement learning into seven main categories according to their applications in computer vision, i.e., (i) landmark localization; (ii) object detection; (iii) object tracking; (iv) registration on both 2D image and 3D image volumetric data; (v) image segmentation; (vi) video analysis; and (vii) other applications. Each of these categories is further analyzed with reinforcement learning techniques, network design, and performance. Moreover, we provide a comprehensive analysis of the existing publicly available datasets and examine source code availability. Finally, we present some open issues and discuss future research directions on deep reinforcement learning in computer vision.

Submitted 25 August, 2021; originally announced August 2021.

arXiv:2108.07936 [pdf]

Calibration Method of the Monocular Omnidirectional Stereo Camera

Authors: Ryota Kawamata, Keiichi Betsui, Kazuyoshi Yamazaki, Rei Sakakibara, Takeshi Shimano

Abstract: Compact and low-cost devices are needed for autonomous driving to image and measure distances to objects 360 degrees around. We have been developing an omnidirectional stereo camera exploiting two hyperbolic mirrors and a single set of a lens and sensor, which makes this camera compact and cost efficient. We establish a new calibration method for this camera considering higher-order radial distortion, detailed tangential distortion, an image sensor tilt, and a lens-mirror offset. Our method reduces the calibration error by 6.0 and 4.3 times for the upper- and lower-view images, respectively. The random error of the distance measurement is 4.9% and the systematic error is 5.7% up to objects 14 meters apart, which is improved almost nine times compared to the conventional method. The remaining distance error is due to a degraded optical resolution of the prototype, which we plan to improve further in future work.

Submitted 17 August, 2021; originally announced August 2021.

Comments: 8 pages, 8 figures, 2 tables, accepted for publication in International Journal of Automotive Engineering

arXiv:2108.04579 [pdf, other]

The Impact of Subspace-Based Pilot Decontamination in User-Centric Scalable Cell-Free Wireless Networks

Authors: Fabian Göttsch, Noboru Osawa, Takeo Ohseki, Kosuke Yamazaki, Giuseppe Caire

Abstract: We consider a scalable user-centric wireless network with dynamic cluster formation as defined by Björnsson and Sanguinetti. Several options for scalable uplink (UL) processing are examined including: i) cluster size and SNR threshold criterion for cluster formation; ii) UL pilot dimension; iii) local detection and global (per cluster) combining. We use a simple model for the channel vector spatial correlation, which captures the fact that the propagation between UEs and RRHs is not isotropic. In particular, we define the ideal performance based on ideal but partial CSI, i.e., the CSI that can be estimated based on the users to antenna heads cluster connectivity. In practice, CSI is estimated from UL pilots, and therefore it is affected by noise and pilot contamination. We show that a very simple subspace projection scheme is able to basically attain the same performance as perfect but partial CSI. This points out that the essential information needed for pilot decontamination reduces effectively to the dominant channel subspaces.

Submitted 10 August, 2021; originally announced August 2021.

arXiv:2107.08323 [pdf, other]

Agent-Environment Network for Temporal Action Proposal Generation

Authors: Viet-Khoa Vo-Ho, Ngan Le, Kashu Yamazaki, Akihiro Sugimoto, Minh-Triet Tran

Abstract: Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding the video context due to the lack of an attention mechanism to express the concept of an action or an agent who performs the action or the interaction between the agent and the environment. Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network (AEN). Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to tell which humans/agents are acting, and (ii) an environment pathway, operating at a global level to tell how the agents interact with the environment. Comprehensive evaluations on 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method robustly outperforms state-of-the-art methods regardless of the employed backbone network.

Submitted 16 March, 2022; v1 submitted 17 July, 2021; originally announced July 2021.

Comments: Accepted in ICASSP 2021

arXiv:2103.09042 [pdf, ps, other]

Invertible Residual Network with Regularization for Effective Medical Image Segmentation

Authors: Kashu Yamazaki, Vidhiwar Singh Rathour, T. Hoang Ngan Le

Abstract: Deep Convolutional Neural Networks (CNNs), e.g., Residual Networks (ResNets), have been used successfully for many computer vision tasks, but are difficult to scale to 3D volumetric medical data. Memory is increasingly often the bottleneck when training 3D CNNs. Recently, invertible neural networks have been applied to significantly reduce activation memory footprint when training neural networks with backpropagation thanks to the invertible functions that allow retrieving its input from its output without storing intermediate activations in memory to perform the backpropagation. Among many successful network architectures, 3D Unet has been established as a standard architecture for volumetric medical segmentation. Thus, we choose 3D Unet as a baseline for a non-invertible network and we then extend it with the invertible residual network. In this paper, we propose two versions of the invertible Residual Network, namely Partially Invertible Residual Network (Partially-InvRes) and Fully Invertible Residual Network (Fully-InvRes). In Partially-InvRes, the invertible residual layer is defined by a technique called additive coupling whereas in Fully-InvRes, both invertible upsampling and downsampling operations are learned based on squeezing (known as pixel shuffle). Furthermore, to avoid the overfitting problem because of less training data, a variational auto-encoder (VAE) branch is added to reconstruct the input volumetric data itself. Our results indicate that by using partially/fully invertible networks as the central workhorse in volumetric segmentation, we not only reduce memory overhead but also achieve compatible segmentation performance compared against the non-invertible 3D Unet. We have demonstrated the proposed networks on various volumetric datasets such as iSeg 2019 and BraTS 2020.

Submitted 16 March, 2021; originally announced March 2021.

arXiv:2103.08137 [pdf]

Cloth Manipulation Planning on Basis of Mesh Representations with Incomplete Domain Knowledge and Voxel-to-Mesh Estimation

Authors: Solvi Arnold, Daisuke Tanaka, Kimitoshi Yamazaki

Abstract: We consider the problem of open-goal planning for robotic cloth manipulation. The core of our system is a neural network trained as a forward model of cloth behaviour under manipulation, with planning performed through backpropagation. We introduce a neural network-based routine for estimating mesh representations from voxel input, and perform planning in mesh format internally. We address the problem of planning with incomplete domain knowledge by means of an explicit epistemic uncertainty signal. This signal is calculated from prediction divergence between two instances of the forward model network and used to avoid epistemic uncertainty during planning. Finally, we introduce logic for handling restriction of grasp points to a discrete set of candidates, in order to accommodate graspability constraints imposed by robotic hardware. We evaluate the system's mesh estimation, prediction, and planning ability on simulated cloth for sequences of one to three manipulations. Comparative experiments confirm that planning on the basis of estimated meshes improves accuracy compared to voxel-based planning, and that epistemic uncertainty avoidance improves performance under conditions of incomplete domain knowledge. Planning time cost is a few seconds. We additionally present qualitative results on robot hardware.

Submitted 12 November, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

Comments: 27 pages, 13 figures

arXiv:2012.02463 [pdf, other]

Offset Curves Loss for Imbalanced Problem in Medical Segmentation

Authors: Ngan Le, Trung Le, Kashu Yamazaki, Toan Duc Bui, Khoa Luu, Marios Savides

Abstract: Medical image segmentation has played an important role in medical analysis and has been widely developed for many clinical applications. Deep learning-based approaches have achieved high performance in semantic segmentation but they are limited to pixel-wise setting and imbalanced classes data problem. In this paper, we tackle those limitations by developing a new deep learning-based model which takes into account both higher feature level i.e. region inside contour, intermediate feature level i.e. offset curves around the contour and lower feature level i.e. contour. Our proposed Offset Curves (OsC) loss consists of three main fitting terms. The first fitting term focuses on pixel-wise level segmentation whereas the second fitting term acts as an attention model which pays attention to the area around the boundaries (offset curves). The third term plays a role as a regularization term which takes the length of boundaries into account. We evaluate our proposed OsC loss on both 2D network and 3D network. Two common medical datasets, i.e. retina DRIVE and brain tumor BRATS 2018 datasets, are used to benchmark our proposed loss performance. The experiments have shown that our proposed OsC loss function outperforms other mainstream loss functions such as Cross-Entropy, Dice, Focal on the most common segmentation networks Unet, FCN.

Submitted 4 December, 2020; originally announced December 2020.

Comments: ICPR 2020

arXiv:2012.02073 [pdf, other]

A Multi-task Contextual Atrous Residual Network for Brain Tumor Detection & Segmentation

Authors: Ngan Le, Kashu Yamazaki, Dat Truong, Kha Gia Quach, Marios Savvides

Abstract: In recent years, deep neural networks have achieved state-of-the-art performance in a variety of recognition and segmentation tasks in medical imaging including brain tumor segmentation. We observe that segmenting a brain tumor faces an imbalanced data problem where the number of pixels belonging to the background class (non tumor pixel) is much larger than the number of pixels belonging to the foreground class (tumor pixel). To address this problem, we propose a multi-task network which is formed as a cascaded structure. Our model consists of two targets, i.e., (i) effectively differentiate the brain tumor regions and (ii) estimate the brain tumor mask. The first objective is performed by our proposed contextual brain tumor detection network, which plays the role of an attention gate and focuses on the region around the brain tumor only while ignoring the far neighbor background which is less correlated to the tumor. The second objective is built upon a 3D atrous residual network and under an encode-decode network in order to effectively segment both large and small objects (brain tumor). Our 3D atrous residual network is designed with a skip connection to enable the gradient from the deep layers to be directly propagated to shallow layers; thus, features of different depths are preserved and used for refining each other. In order to incorporate larger contextual information from volume MRI data, our network utilizes the 3D atrous convolution with various kernel sizes, which enlarges the receptive field of filters. Our proposed network has been evaluated on various datasets including BRATS2015, BRATS2017 and BRATS2018 datasets with both validation set and testing set. Our performance has been benchmarked by both region-based metrics and surface-based metrics. We also have conducted comparisons against state-of-the-art approaches.

Submitted 3 December, 2020; originally announced December 2020.

Comments: Accepted in ICPR 2020

arXiv:2010.15396 [pdf, ps, other]

Channel Estimation and Equalization for CP-OFDM-based OTFS in Fractional Doppler Channels

Authors: Noriyuki Hashimoto, Noboru Osawa, Kosuke Yamazaki, Shinsuke Ibi

Abstract: Orthogonal time frequency and space (OTFS) modulation is a promising technology that satisfies high Doppler requirements for future mobile systems. OTFS modulation encodes information symbols and pilot symbols into the two-dimensional (2D) delay-Doppler (DD) domain. The received symbols suffer from inter-Doppler interference (IDI) in the fading channels with fractional Doppler shifts that are sampled at noninteger indices in the DD domain. IDI has been treated as an unavoidable effect because the fractional Doppler shifts cannot be obtained directly from the received pilot symbols. In this paper, we provide a solution to channel estimation for fractional Doppler channels. The proposed estimation provides new insight into the OTFS input-output relation in the DD domain as a 2D circular convolution with a small approximation. According to the input-output relation, we also provide a low-complexity channel equalization method using the estimated channel information. We demonstrate the error performance of the proposed channel estimation and equalization in several channels by simulations. The simulation results show that in high-mobility environments, the total system utilizing the proposed methods outperforms orthogonal frequency division multiplexing (OFDM) with ideal channel estimation and a conventional channel estimation method using a pseudo sequence.

Submitted 21 January, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

arXiv:1906.09391 [pdf, other]

Model Bridging: Connection between Simulation Model and Neural Network

Authors: Keiichi Kisamori, Keisuke Yamazaki, Yuto Komori, Hiroshi Tokieda

Abstract: The interpretability of machine learning, particularly for deep neural networks, is crucial for decision making in real-world applications. One approach is replacing the un-interpretable machine learning model with a surrogate model, which has a simple structure for interpretation. Another approach is understanding the target system by using a simulation modeled by human knowledge with interpretable simulation parameters. Recently, simulator calibration has been developed based on kernel mean embedding to estimate the simulation parameters as posterior distributions. Our idea is to use a simulation model as an interpretable surrogate model. However, the computational cost of simulator calibration is high owing to the complexity of the simulation model. Thus, we propose a "model-bridging" framework to bridge machine learning models with simulation models by a series of kernel mean embeddings to address these difficulties. The proposed framework enables us to obtain predictions and interpretable simulation parameters simultaneously without the computationally expensive calculations of the simulations. In this study, we apply the proposed framework to essential simulations in the manufacturing industry, such as production simulation and fluid dynamics simulation.

Submitted 21 July, 2020; v1 submitted 22 June, 2019; originally announced June 2019.

arXiv:1809.08159 [pdf, other]

Simulator Calibration under Covariate Shift with Kernels

Authors: Keiichi Kisamori, Motonobu Kanagawa, Keisuke Yamazaki

Abstract: We propose a novel calibration method for computer simulators, dealing with the problem of covariate shift. Covariate shift is the situation where input distributions for training and test are different, and is ubiquitous in applications of simulations. Our approach is based on Bayesian inference with kernel mean embedding of distributions, and on the use of an importance-weighted reproducing kernel for covariate shift adaptation. We provide a theoretical analysis for the proposed method, including a novel theoretical result for conditional mean embedding, as well as empirical investigations suggesting its effectiveness in practice. The experiments include calibration of a widely used simulator for industrial manufacturing processes, where we also demonstrate how the proposed method may be useful for sensitivity analysis of model parameters.

Submitted 18 March, 2020; v1 submitted 21 September, 2018; originally announced September 2018.

arXiv:1408.5661 [pdf, ps, other]

Asymptotic Accuracy of Bayesian Estimation for a Single Latent Variable

Authors: Keisuke Yamazaki

Abstract: In data science and machine learning, hierarchical parametric models, such as mixture models, are often used. They contain two kinds of variables: observable variables, which represent the parts of the data that can be directly measured, and latent variables, which represent the underlying processes that generate the data. Although there has been an increase in research on the estimation accuracy for observable variables, the theoretical analysis of estimating latent variables has not been thoroughly investigated. In a previous study, we determined the accuracy of a Bayes estimation for the joint probability of the latent variables in a dataset, and we proved that the Bayes method is asymptotically more accurate than the maximum-likelihood method. However, the accuracy of the Bayes estimation for a single latent variable remains unknown. In the present paper, we derive the asymptotic expansions of the error functions, which are defined by the Kullback-Leibler divergence, for two types of single-variable estimations when the statistical regularity is satisfied. Our results indicate that the accuracies of the Bayes and maximum-likelihood methods are asymptotically equivalent and clarify that the Bayes method is only advantageous for multivariable estimations.

Submitted 17 April, 2015; v1 submitted 25 August, 2014; originally announced August 2014.

Comments: 28 pages, 3 figures

arXiv:1212.2511 [pdf]

Stochastic complexity of Bayesian networks

Authors: Keisuke Yamazaki, Sumio Watanabe

Abstract: Bayesian networks are now being used in a great many fields, for example, diagnosis of a system, data mining, clustering and so on. In spite of their wide range of applications, the statistical properties have not yet been clarified, because the models are nonidentifiable and non-regular. In a Bayesian network, the set of parameters for a smaller model is an analytic set with singularities in the space of larger ones. Because of these singularities, the Fisher information matrices are not positive definite. In other words, the mathematical foundation for learning was not constructed. In recent years, however, we have developed a method to analyze non-regular models using algebraic geometry. This method revealed the relation between the model's singularities and its statistical properties. In this paper, applying this method to Bayesian networks with latent variables, we clarify the order of the stochastic complexities. Our result claims that the upper bound of those is smaller than the dimension of the parameter space. This means that the Bayesian generalization error is also far smaller than that of a regular model, and that Schwarz's model selection criterion BIC needs to be improved for Bayesian networks.

Submitted 19 October, 2012; originally announced December 2012.

Comments: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

Report number: UAI-P-2003-PG-592-599

arXiv:1211.2293 [pdf]

Performance Evaluation of Treecode Algorithm for N-Body Simulation Using GridRPC System

Authors: Truong Vinh Truong Duy, Katsuhiro Yamazaki, Shigeru Oyanagi

Abstract: This paper is aimed at improving the performance of the treecode algorithm for N-Body simulation by employing the NetSolve GridRPC programming model to exploit the use of multiple clusters. N-Body is a classical problem, and appears in many areas of science and engineering, including astrophysics, molecular dynamics, and graphics. In the simulation of N-Body, the specific routine for calculating the forces on the bodies which accounts for upwards of 90% of the cycles in typical computations is eminently suitable for obtaining parallelism with GridRPC calls. It is divided among the compute nodes by simultaneously calling multiple GridRPC requests to them. The performance of the GridRPC implementation is then compared to that of the MPI version and hybrid MPI-OpenMP version for the treecode algorithm on individual clusters.

Submitted 10 November, 2012; originally announced November 2012.

Comments: 4 pages, 9 figures

arXiv:1211.2292 [pdf]

Hybrid MPI-OpenMP Paradigm on SMP Clusters: MPEG-2 Encoder and N-Body Simulation

Authors: Truong Vinh Truong Duy, Katsuhiro Yamazaki, Kosai Ikegami, Shigeru Oyanagi

Abstract: Clusters of SMP nodes provide support for a wide diversity of parallel programming paradigms. Combining both shared memory and message passing parallelizations within the same application, the hybrid MPI-OpenMP paradigm is an emerging trend for parallel programming to fully exploit distributed shared-memory architecture. In this paper, we improve the performance of MPEG-2 encoder and n-body simulation by employing the hybrid MPI-OpenMP programming paradigm on SMP clusters. The hierarchical image data structure of the MPEG bit-stream is eminently suitable for the hybrid model to achieve multiple levels of parallelism: MPI for parallelism at the group of pictures level across SMP nodes and OpenMP for parallelism within pictures at the slice level within each SMP node. Similarly, the work load of the force calculation which accounts for upwards of 90% of the cycles in typical computations in the n-body simulation is shared among OpenMP threads after ORB domain decomposition among MPI processes. Besides, loop scheduling of OpenMP threads is adopted with appropriate chunk size to provide better load balance of work, leading to enhanced performance. With the n-body simulation, experimental results demonstrate that the hybrid MPI-OpenMP program outperforms the corresponding pure MPI program by average factors of 1.52 on a 4-way cluster and 1.21 on a 2-way cluster. Likewise, the hybrid model offers a performance improvement of 18% compared to the MPI model for the MPEG-2 encoder.

Submitted 10 November, 2012; originally announced November 2012.

Comments: 8 pages, 9 figures, 6 tables

arXiv:1204.2069 [pdf, ps, other]

Asymptotic Accuracy of Distribution-Based Estimation for Latent Variables

Authors: Keisuke Yamazaki

Abstract: Hierarchical statistical models are widely employed in information science and data engineering. The models consist of two types of variables: observable variables that represent the given data and latent variables for the unobservable labels. An asymptotic analysis of the models plays an important role in evaluating the learning process; the result of the analysis is applied not only to theoretical but also to practical situations, such as optimal model selection and active learning. There are many studies of generalization errors, which measure the prediction accuracy of the observable variables. However, the accuracy of estimating the latent variables has not yet been elucidated. For a quantitative evaluation of this, the present paper formulates distribution-based functions for the errors in the estimation of the latent variables. The asymptotic behavior is analyzed for both the maximum likelihood and the Bayes methods.

Submitted 19 February, 2014; v1 submitted 10 April, 2012; originally announced April 2012.

Comments: 25 pages, 2 figures

Showing 1–41 of 41 results for author: Yamazaki, K