Naoto Yokoya

Yokoya Naoto



University of Tokyo




remote sensing


pattern recognition


data fusion


My research interests include image processing for computational imaging and image analysis based on computer vision and machine learning. In particular, my research is focused on intelligent information processing to automatically extract map information, such as land cover labels and elevation models, from remote sensing images acquired by spaceborne and airborne sensors. My work is motivated by applications in Earth observation in order to respond to global challenges through disaster management and environmental assessment for decision making.

Image Processing for Computational Imaging

Computational imaging, which integrates sensing and computation, allows us to acquire information that cannot be obtained by hardware alone and to overcome hardware limitations, such as resolution and noise. Based on machine learning, optimization, and signal processing, we build mathematical models and develop algorithms to recover unknown original signals from incomplete observation data.

Related publications

X. Dong, N. Yokoya, L. Wang, and T. Uezato, ”Learning mutual modulation for self-supervised cross-modal super-resolution,” Proc. ECCV, 2022.
PDF    Quick Abstract

Abstract: Self-supervised cross-modal super-resolution (SR) can overcome the difficulty of acquiring paired training data, but is challenging because only low-resolution (LR) source and high-resolution (HR) guide images from different modalities are available. Existing methods utilize pseudo or weak supervision in LR space and thus deliver results that are blurry or not faithful to the source modality. To address this issue, we present a mutual modulation SR (MMSR) model, which tackles the task by a mutual modulation strategy, including a source-to-guide modulation and a guide-to-source modulation. In these modulations, we develop cross-domain adaptive filters to fully exploit cross-modal spatial dependency and help induce the source to emulate the resolution of the guide and induce the guide to mimic the modality characteristics of the source. Moreover, we adopt a cycle consistency constraint to train MMSR in a fully self-supervised manner. Experiments on various tasks demonstrate the state-of-the-art performance of our MMSR.

W. He, N. Yokoya, X. Yuan, ”Fast hyperspectral image recovery via non-iterative fusion of dual-camera compressive hyperspectral imaging,” IEEE Transactions on Image Processing, vol. 30, pp. 7170-7183, 2021.
PDF    Quick Abstract

Abstract: Coded aperture snapshot spectral imaging (CASSI) is a promising technique to capture the three-dimensional hyperspectral image (HSI) using a single coded two-dimensional (2D) measurement, in which algorithms are used to perform the inverse problem. Due to the ill-posed nature, various regularizers have been exploited to reconstruct the 3D data from the 2D measurement. Unfortunately, the accuracy and computational complexity are unsatisfied. One feasible solution is to utilize additional information such as the RGB measurement in CASSI. Considering the combined CASSI and RGB measurement, in this paper, we propose a new fusion model for the HSI reconstruction. We investigate the spectral low-rank property of HSI composed of a spectral basis and spatial coefficients. Specifically, the RGB measurement is utilized to estimate the coefficients, meanwhile the CASSI measurement is adopted to provide the orthogonal spectral basis. We further propose a patch processing strategy to enhance the spectral low-rank property of HSI. The proposed model neither requires non-local processing or iteration, nor the spectral sensing matrix of the RGB detector. Extensive experiments on both simulated and real HSI dataset demonstrate that our proposed method outperforms previous state-of-the-art not only in quality but also speeds up the reconstruction more than 5000 times.

T. Uezato, D. Hong, N. Yokoya, and W. He, “Guided deep decoder: Unsupervised image pair fusion,” Proc. ECCV (spotlight), 2020.
PDF    Supmat    Code    Quick Abstract

Abstract: The fusion of input and guidance images that have a tradeoff in their information (e.g., hyperspectral and RGB image fusion or pansharpening) can be interpreted as one general problem. However, previous studies applied a task-specific handcrafted prior and did not address the problems with a unified approach. To address this limitation, in this study, we propose a guided deep decoder network as a general prior. The proposed network is composed of an encoder-decoder network that exploits multi-scale features of a guidance image and a deep decoder network that generates an output image. The two networks are connected by feature refinement units to embed the multi-scale features of the guidance image into the deep decoder network. The proposed network allows the network parameters to be optimized in an unsupervised way without training data. Our results show that the proposed network can achieve state-of-the-art performance in various image fusion problems.

Remote Sensing Image Analysis

Remote sensing enables us to observe places that are inaccessible to humans; however, it is difficult to collect enough training data due to the limitations of field surveys and visual interpretation. We work on mapping and 3D reconstruction by using synthetic data from simulations and inaccurate labels with low collection costs as training data. We also work on data fusion based on deep learning to handle multimodal data obtained from different spaceborne sensors in an integrated manner.

Related publications

C. Robinson, K. Malkin, N. Jojic, H. Chen, R. Qin, C. Xiao, M. Schmitt, P. Ghamisi, R. Haensch, and N. Yokoya, ”Global land cover mapping with weak supervision: Outcome of the 2020 IEEE GRSS Data Fusion Contest,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 3185-3199, 2021.
PDF    Quick Abstract

Abstract: This paper presents the scientific outcomes of the 2020 Data Fusion Contest organized by the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society. The 2020 Contest addressed the problem of automatic global land-cover mapping with weak supervision, i.e. estimating high-resolution semantic maps while only low-resolution reference data is available during training. Two separate competitions were organized to assess two different scenarios: 1) high-resolution labels are not available at all and 2) a small amount of high-resolution labels are available additionally to low-resolution reference data. In this paper we describe the DFC2020 dataset that remains available for further evaluation of corresponding approaches and report the results of the best-performing methods during the contest.

D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, and B. Zhang, ”More diverse means better: Multimodal deep learning meets remote-sensing imagery classification,” IEEE Transactions on Geoscience and Remote Sensing, (early access), 2020.
PDF    Code    Quick Abstract

Abstract: Classification and identification of the materials lying over or beneath the Earth's surface have long been a fundamental but challenging research topic in geoscience and remote sensing (RS) and have garnered a growing concern owing to the recent advancements of deep learning techniques. Although deep networks have been successfully applied in single-modality-dominated classification tasks, yet their performance inevitably meets the bottleneck in complex scenes that need to be finely classified, due to the limitation of information diversity. In this work, we provide a baseline solution to the aforementioned difficulty by developing a general multimodal deep learning (MDL) framework. In particular, we also investigate a special case of multi-modality learning (MML) -- cross-modality learning (CML) that exists widely in RS image classification applications. By focusing on ``what'', ``where'', and ``how'' to fuse, we show different fusion strategies as well as how to train deep networks and build the network architecture. Specifically, five fusion architectures are introduced and developed, further being unified in our MDL framework. More significantly, our framework is not only limited to pixel-wise classification tasks but also applicable to spatial information modeling with convolutional neural networks (CNNs). To validate the effectiveness and superiority of the MDL framework, extensive experiments related to the settings of MML and CML are conducted on two different multimodal RS datasets. Furthermore, the codes and datasets will be available at: https://github.com/danfenghong/IEEE_TGRS_MDL-RS, contributing to the RS community.

S. Kunwar, H. Chen, M. Lin, H. Zhang, P. D'Angelo, D. Cerra, S. M. Azimi, M. Brown, G. Hager, N. Yokoya, R. Haensch, and B. Le Saux, ”Large-Scale Semantic 3D Reconstruction: Outcome of the 2019 IEEE GRSS Data Fusion Contest - Part A,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 922-935, 2020.
PDF    Quick Abstract

Abstract: In this paper, we present the scientific outcomes of the 2019 Data Fusion Contest organized by the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society. The 2019 Contest addressed the problem of 3D reconstruction and 3D semantic understanding on a large scale. Several competitions were organized to assess specific issues, such as elevation estimation and semantic mapping from a single view, two views, or multiple views. In this Part A, we report the results of the best-performing approaches for semantic 3D reconstruction according to these various set-ups, while 3D point cloud semantic mapping is discussed in Part B.

Towards a Sustainable Future

We promote projects to solve global issues, including environmental problems, climate change, large-scale natural disasters, and food problems. Our goal is to contribute globally to the realization of the SDGs by solving real-world problems, such as assessing building damage during disasters, estimating biomass and carbon stocks in forests, and mapping crop types, in collaboration with related institutions and researchers in Japan and overseas.

Related publications

B. Adriano, N. Yokoya, J. Xia, H. Miura, W. Liu, M. Matsuoka, and S. Koshimura, ”Learning from multimodal and multitemporal earth observation data for building damage mapping,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 175, pp. 132-143, 2021.
PDF    Quick Abstract

Abstract: Earth observation technologies, such as optical imaging and synthetic aperture radar (SAR), provide excellent means to monitor ever-growing urban environments continuously. Notably, in the case of large-scale disasters (e.g., tsunamis and earthquakes), in which a response is highly time-critical, images from both data modalities can complement each other to accurately convey the full damage condition in the disaster's aftermath. However, due to several factors, such as weather and satellite coverage, it is often uncertain which data modality will be the first available for rapid disaster response efforts. Hence, novel methodologies that can utilize all accessible EO datasets are essential for disaster management. In this study, we have developed a global multisensor and multitemporal dataset for building damage mapping. We included building damage characteristics from three disaster types, namely, earthquakes, tsunamis, and typhoons, and considered three building damage categories. The global dataset contains high-resolution optical imagery and high-to-moderate-resolution multiband SAR data acquired before and after each disaster. Using this comprehensive dataset, we analyzed five data modality scenarios for damage mapping: single-mode (optical and SAR datasets), cross-modal (pre-disaster optical and post-disaster SAR datasets), and mode fusion scenarios. We defined a damage mapping framework for the semantic segmentation of damaged buildings based on a deep convolutional neural network algorithm. We compare our approach to another state-of-the-art baseline model for damage mapping. The results indicated that our dataset, together with a deep learning network, enabled acceptable predictions for all the data modality scenarios.

N. Yokoya, K. Yamanoi, W. He, G. Baier, B. Adriano, H. Miura, and S. Oishi, ”Breaking limits of remote sensing by deep learning from simulated data for flood and debris flow mapping,” IEEE Transactions on Geoscience and Remote Sensing, (early access), 2020.
PDF    Code    Quick Abstract

Abstract: We propose a framework that estimates inundation depth (maximum water level) and debris-flow-induced topographic deformation from remote sensing imagery by integrating deep learning and numerical simulation. A water and debris flow simulator generates training data for various artificial disaster scenarios. We show that regression models based on Attention U-Net and LinkNet architectures trained on such synthetic data can predict the maximum water level and topographic deformation from a remote sensing-derived change detection map and a digital elevation model. The proposed framework has an inpainting capability, thus mitigating the false negatives that are inevitable in remote sensing image analysis. Our framework breaks limits of remote sensing and enables rapid estimation of inundation depth and topographic deformation, essential information for emergency response, including rescue and relief activities. We conduct experiments with both synthetic and real data for two disaster events that caused simultaneous flooding and debris flows and demonstrate the effectiveness of our approach quantitatively and qualitatively. Our code and datasets are available at https://github.com/nyokoya/dlsim.

T. D. Pham, N. Yokoya, T. T. T. Nguyen, N. N. Le, N. T. Ha, J. Xia, W. Takeuchi, T. D. Pham, ”Improvement of mangrove soil carbon stocks estimation in North Vietnam using Sentinel-2 data and machine learning approach,” GIScience & Remote Sensing, vol. 58, no. 1, pp. 68-87, 2021.
PDF    Quick Abstract

Abstract: Quantifying total carbon (TC) stocks in soil across various mangrove ecosystems is key to understanding the global carbon cycle to reduce greenhouse gas emissions. Estimating mangrove TC at a large scale remains challenging due to the difficulty and high cost of soil carbon measurements when the number of samples is high. In the present study, we investigated the capability of Sentinel-2 multispectral data together with a state-of-the-art machine learning (ML) technique, which is a combination of CatBoost regression (CBR) and a genetic algorithm (GA) for feature selection and optimization (the CBR-GA model) to estimate the mangrove soil C stocks across the mangrove ecosystems in North Vietnam. We used the field survey data collected from 177 soil cores. We compared the performance of the proposed model with those of the four ML algorithms, i.e., the extreme gradient boosting regression (XGBR), the light gradient boosting machine regression (LGBMR), the support vector regression (SVR), and the random forest regression (RFR) models. Our proposed model estimated the TC level in the soil as 35.06–166.83 Mg ha−1 (average = 92.27 Mg ha−1) with satisfactory accuracy (R 2 = 0.665, RMSE = 18.41 Mg ha−1) and yielded the best prediction performance among all the ML techniques. We conclude that the Sentinel-2 data combined with the CBR-GA model can improve estimates of the mangrove TC at 10 m spatial resolution in tropical areas. The effectiveness of the proposed approach should be further evaluated for different mangrove soils of the other mangrove ecosystems in tropical and semi-tropical regions.