Specifically, the proposed DSNet has two novel dynamic selection modules, namely the validness migratable convolution (VMC) and regional composite normalization (RCN) modules, which share a dynamic selection mechanism that helps make better use of valid pixels. By replacing vanilla convolution with the VMC module, spatial sampling locations are dynamically selected in the convolution phase, resulting in a more flexible feature extraction process. In addition, the RCN module not only combines several normalization methods but also normalizes the feature regions selectively. Therefore, the proposed DSNet can generate realistic and finely detailed images by adaptively selecting features and normalization styles. Experimental results on three public datasets show that our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively.

Image-text matching, which aims to measure the similarities between images and textual descriptions, has made great progress recently. The key to this cross-modal matching task is to build the latent semantic alignment between visual objects and words. Due to the wide variation in sentence structures, it is very difficult to learn the latent semantic alignment using only global cross-modal features. Many previous methods attempt to learn aligned image-text representations with attention mechanisms but generally ignore the relationships within textual descriptions, which determine whether the words belong to the same visual object. In this paper, we propose a graph attentive relational network (GARN) to learn aligned image-text representations by modeling the relationships between noun phrases in a text for identity-aware image-text matching. In the GARN, we first decompose images and texts into regions and noun phrases, respectively. Then a skip graph neural network (skip-GNN) is proposed to learn effective textual representations, which are a mixture of textual features and relational features.
Finally, a graph attention network is further proposed to obtain the probabilities that the noun phrases belong to the image regions by modeling the relationships between noun phrases. We perform extensive experiments on the CUHK Person Description dataset (CUHK-PEDES), Caltech-UCSD Birds dataset (CUB), Oxford-102 Flowers dataset and Flickr30K dataset to verify the effectiveness of each component in our model. Experimental results show that our approach achieves state-of-the-art results on these four benchmark datasets.

With the rapid development of data collection sources and feature extraction methods, multi-view data have become easy to obtain and have received increasing research attention in recent years. Among the resulting research directions, multi-view clustering (MVC) is a mainstream one and is widely used in data analysis. However, existing MVC methods mainly assume that each sample appears in all the views, without considering the incomplete-view case caused by data corruption, sensor failure, equipment malfunction, etc. In this study, we design and build a generative partial multi-view clustering model with adaptive fusion and cycle consistency, named GP-MVC, to solve the incomplete multi-view problem by explicitly generating the data of missing views. The main idea of GP-MVC is two-fold. First, multi-view encoder networks are trained to learn common low-dimensional representations, followed by a clustering layer to capture the shared cluster structure across multiple views. Second, view-specific generative adversarial networks with multi-view cycle consistency are developed to generate the missing data of one view conditioned on the shared representation given by the other views. These two steps promote each other: the learned common representation facilitates data imputation, and the generated data can further explore the view consistency.
Moreover, a weighted adaptive fusion scheme is implemented to exploit the complementary information among different views. Experimental results on four benchmark datasets are provided to show the effectiveness of the proposed GP-MVC over the state-of-the-art methods.

Rain is a common weather phenomenon that affects environmental monitoring and surveillance systems. According to an established rain model (Garg and Nayar, 2007), scene visibility in the rain varies with the depth from the camera: faraway objects are obscured more by the fog than by the rain streaks. However, existing datasets and methods for rain removal ignore these physical properties, thus limiting the rain removal efficiency on real photos. In this work, we analyze the visual effects of rain subject to scene depth and formulate a rain imaging model that jointly considers rain streaks and fog. We also prepare a dataset called RainCityscapes on real outdoor photos. Furthermore, we design a novel real-time end-to-end deep neural network, which we train to learn depth-guided non-local features and to regress a residual map that produces a rain-free output image. We performed various experiments to visually and quantitatively compare our method with several state-of-the-art methods and show its superiority over them.

Fine-grained 3D shape classification is important for shape understanding and analysis, and it poses a challenging research problem. However, fine-grained 3D shape classification has rarely been studied, due to the lack of fine-grained 3D shape benchmarks. To address this issue, we first introduce a new 3D shape dataset (named the FG3D dataset) with fine-grained class labels, which consists of three categories: airplane, car and chair. Each category consists of several subcategories at a fine-grained level.
According to our experiments on this fine-grained dataset, we find that state-of-the-art methods are significantly limited by the small variance among subcategories in the same category. To resolve this problem, we further propose a novel fine-grained 3D shape classification method named FG3D-Net to capture the fine-grained local details of 3D shapes from multiple rendered views. Specifically, we first train a Region Proposal Network (RPN) to detect the generally semantic parts inside multiple views under the benchmark of generally semantic part detection. Then, we design a hierarchical part-view attention aggregation module to learn a global shape representation by aggregating generally semantic part features, which preserves the local details of 3D shapes. The part-view attention module hierarchically leverages part-level and view-level attention to increase the discriminability of our features. The part-level attention highlights the important parts in each view, while the view-level attention highlights the discriminative views among all the views of the same object. In addition, we integrate a Recurrent Neural Network (RNN) to capture the spatial relationships among sequential views from different viewpoints. Our results on the fine-grained 3D shape dataset show that our method outperforms other state-of-the-art methods. The FG3D dataset is available at https://github.com/liuxinhai/FG3D-Net.

Semantic segmentation is a challenging task that needs to handle large scale variations, deformations, and different viewpoints. In this paper, we develop a novel network named Gated Path Selection Network (GPSNet), which aims to adaptively select receptive fields while maintaining the dense sampling capability. In GPSNet, we first design a two-dimensional SuperNet, which densely incorporates features from growing receptive fields. Then, a Comparative Feature Aggregation (CFA) module is introduced to dynamically aggregate discriminative semantic context.
In contrast to previous works that focus on optimizing sparse sampling locations on regular grids, GPSNet can adaptively harvest free-form dense semantic context information. The derived adaptive receptive fields and dense sampling locations are data-dependent and flexible, and can model various contexts of objects. On two representative semantic segmentation datasets, i.e., Cityscapes and ADE20K, we show that the proposed approach consistently outperforms previous methods without bells and whistles.

Obtaining a high-quality frontal face image from a low-resolution (LR) non-frontal face image is of primary importance for many facial analysis applications. However, mainstream methods focus either on super-resolving near-frontal LR faces or on frontalizing non-frontal high-resolution (HR) faces. It is desirable to perform both tasks seamlessly for daily-life unconstrained face images. In this paper, we present a novel Vivid Face Hallucination Generative Adversarial Network (VividGAN) for simultaneously super-resolving and frontalizing tiny non-frontal face images. VividGAN consists of coarse-level and fine-level Face Hallucination Networks (FHnet) and two discriminators, i.e., Coarse-D and Fine-D. The coarse-level FHnet generates a frontal coarse HR face, and the fine-level FHnet then makes use of the facial component appearance prior, i.e., fine-grained facial components, to attain a frontal HR face image with authentic details. In the fine-level FHnet, we also design a facial component-aware module that adopts facial geometry guidance as clues to accurately align and merge the frontal coarse HR face and the prior information. Meanwhile, two-level discriminators are designed to capture both the global outline of a face image and detailed facial characteristics. The Coarse-D enforces the coarsely hallucinated faces to be upright and complete, while the Fine-D focuses on the fine hallucinated ones for sharper details.
Extensive experiments demonstrate that our VividGAN achieves photo-realistic frontal HR faces, reaching superior performance in downstream tasks, i.e., face recognition and expression classification, compared with other state-of-the-art methods.

Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide visual attention maps. In particular, we address the visual question answering task. We incorporate modern probabilistic deep learning methods, which we further improve by using the gradients for these estimates. This has two-fold benefits: (a) certainty estimates that correlate better with misclassified samples and (b) improved attention maps that provide state-of-the-art results in terms of correlation with human attention regions. The improved attention maps result in consistent improvement for various visual question answering methods. Therefore, the proposed technique can be thought of as a tool for obtaining improved certainty estimates and explanations for deep learning models. We provide a detailed empirical analysis for the visual question answering task on all standard benchmarks and a comparison with state-of-the-art methods.

Integrating deep learning techniques into the video coding framework yields significant improvement over standard compression techniques, especially when super-resolution (up-sampling) is applied as post-processing to down-sampling-based video coding. However, besides up-sampling degradation, the various artifacts introduced by compression make the super-resolution problem more difficult to solve. The straightforward solution is to apply artifact removal techniques before super-resolution. However, some helpful features may be removed along with the artifacts, degrading the super-resolution performance.
To address this problem, we propose an end-to-end restoration-reconstruction deep neural network (RR-DnCNN) using the degradation-aware technique, which entirely solves degradation from compression and sub-sampling. Besides, we prove that the compression degradation produced by the Random Access configuration is rich enough to cover other degradation types, such as Low Delay P and All Intra, for training. Since the straightforward RR-DnCNN, with many layers chained together, learns poorly because of the vanishing gradient problem, we redesign the network architecture to let reconstruction leverage the features captured by restoration through up-sampling skip connections.
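
The data flow of an up-sampling skip connection can be illustrated with a minimal sketch. It is not the RR-DnCNN architecture: the functions `restore` and `reconstruct`, the clamp used in place of learned restoration layers, and the nearest-neighbour up-sampling are all hypothetical stand-ins chosen only to show how features computed at low resolution might be up-sampled and added into the high-resolution reconstruction path.

```python
from typing import List

Grid = List[List[float]]  # a single-channel feature map as nested lists


def upsample_nearest(x: Grid, scale: int = 2) -> Grid:
    """Nearest-neighbour up-sampling: repeat every value `scale` times per axis."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two equally sized feature maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def restore(lr: Grid) -> Grid:
    """Hypothetical restoration branch at low resolution.

    A clamp to [0, 1] stands in for learned artifact-removal layers.
    """
    return [[min(1.0, max(0.0, v)) for v in row] for row in lr]


def reconstruct(lr: Grid, scale: int = 2) -> Grid:
    """Hypothetical reconstruction branch with an up-sampling skip connection.

    The low-resolution restoration features are up-sampled and added to the
    up-sampled input, so the high-resolution path reuses them directly instead
    of relying on one long chain of layers.
    """
    restored = restore(lr)                    # low-resolution features
    base = upsample_nearest(lr, scale)        # main high-resolution path
    skip = upsample_nearest(restored, scale)  # up-sampling skip connection
    return add(base, skip)


if __name__ == "__main__":
    for row in reconstruct([[0.5, 2.0]]):
        print(row)
```

In a trained network the clamp and the plain addition would be replaced by learned convolutional layers; the sketch only shows where the skip path joins the reconstruction path.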