Weakly supervised action localization is a challenging problem in video understanding and action recognition. Existing models usually formulate the training process as direct classification using video-level supervision. They tend to only locate the most ...
Making each modality in multi-modal data contribute is of vital importance to learning a versatile multi-modal model. Existing methods, however, are often dominated by one or a few modalities during model training, resulting in sub-optimal performance. ...
Image denoising is a fundamental problem in computer vision and multimedia computation. Non-local filters are effective for image denoising. But existing deep learning methods that use non-local computation structures are mostly designed for high-level ...
Person search is a time-consuming computer vision task that entails locating and recognizing query people in scene images. Body components are commonly mismatched during matching due to position variation, occlusions, and partially absent body parts, ...
In this article, we present two algorithms that discover the discriminative structures of sketches, given pairs of sketches and photos in sketch-based image retrieval (SBIR) scenarios. Unlike the existing approaches, we aim at the few-shot and domain ...
Texture and shape in fashion, constituting essential elements of garments, characterize the body and surface of the fabric and outline the silhouette of clothing, respectively. The selection of texture and shape plays a critical role in the design process,...
Recent action localization works learn in a weakly supervised manner to avoid the high cost of human labeling. Those works are mostly based on the Multiple Instance Learning framework, where temporal pooling is an indispensable part that usually ...
Single-view three-dimensional (3D) object reconstruction has long been a challenging task. Objects with complex topologies are hard to reconstruct accurately, which makes existing methods suffer from blurring of shape boundaries between ...
In recent years, thanks to the inherent powerful feature representation and learning abilities of the convolutional neural network (CNN), deep CNN-steered single image super-resolution approaches have achieved remarkable performance improvements. However, ...
Deep neural networks have achieved remarkable success in HEVC compressed video quality enhancement. However, most existing multiframe-based methods either deliver unsatisfactory results or consume a significant amount of resources to leverage temporal ...
It is crucial to sample a small portion of relevant frames for efficient video classification. The existing methods mainly develop hand-designed sampling strategies or learn sequential selection policies. However, there are two challenges to be solved. ...
Single-label facial expression recognition (FER), which aims to assign a single expression to each facial image, usually suffers from noisy and incomplete labels, where the manual annotations of some training images are wrong or incomplete ...
During the COVID-19 epidemic, wearing masks has become increasingly common. Traditional occlusion face recognition algorithms are almost ineffective for such heavy mask occlusion. Therefore, it is urgent to improve the recognition performance ...
Deep convolutional neural networks have been demonstrated to be effective for single-image super-resolution in recent years. On the one hand, residual connections and dense connections have been used widely to ease forward information and backward ...
The purpose of image multi-label classification is to predict all the object categories present in an image. Some recent works exploit graph convolutional networks to capture the correlation between labels. Although promising results have been reported, ...
Feature refinement and feature fusion are two key steps in convolutional neural network-based salient object detection (SOD). In this article, we investigate how to utilize multiple guidance mechanisms to better refine and fuse extracted multi-level ...
Temporal action proposal generation aims to localize temporal segments of human activities in videos. Current boundary-based proposal generation methods can generate proposals with precise boundaries but often suffer from the inferior quality of confidence ...
Video processing and analysis have become an urgent task, as a huge number of videos are uploaded online every day (e.g., to YouTube and Hulu). The extraction of representative key frames from videos is important in video processing and analysis since it ...
Residual image and illumination estimation have been proven to be helpful for image enhancement. In this article, we propose a general framework, called RI-GAN, that exploits residual and illumination using generative adversarial networks (GANs). The ...
Image recoloring is an emerging editing technique that can change the color style of an image by modifying pixel values without altering the original image content. With the rapid proliferation of social networks and image editing techniques, recolored ...
Face reenactment aims to generate an animation of a source face using the poses and expressions from a target face. Although recent methods have made remarkable progress by exploiting generative adversarial networks, they are limited in generating high-...
In recent years, many model intellectual property (IP) proof methods for IP protection have been proposed, such as model watermarking and model fingerprinting. However, with the increasing number of models transmitted and deployed on the Internet, quickly ...
To generate the corresponding talking face from speech audio and a face image, it is essential to match the variations in the facial appearance with the speech audio in subtle movements of different face regions. Nevertheless, the facial movements ...