# Computer Vision and Pattern Recognition

## New submissions

[ total of 58 entries: 1-58 ]

### New submissions for Fri, 23 Mar 18

[1]
Title: Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Many classical Computer Vision problems, such as essential matrix computation and pose estimation from 3D to 2D correspondences, can be solved by finding the eigenvector corresponding to the smallest, or zero, eigenvalue of a matrix representing a linear system. Incorporating this in deep learning frameworks would allow us to explicitly encode known notions of geometry, instead of having the network implicitly learn them from data. However, performing eigendecomposition within a network requires the ability to differentiate this operation. Unfortunately, while theoretically doable, this introduces numerical instability in the optimization process in practice.
In this paper, we introduce an eigendecomposition-free approach to training a deep network whose loss depends on the eigenvector corresponding to a zero eigenvalue of a matrix predicted by the network. We demonstrate on several tasks, including keypoint matching and 3D pose estimation, that our approach is much more robust than explicit differentiation of the eigendecomposition. It has better convergence properties and yields state-of-the-art results on both tasks.
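As a rough illustration of the classical step this loss replaces: the eigenvector of the smallest (here, zero) eigenvalue of a symmetric matrix can be recovered by power iteration on a shifted matrix. The sketch below is our own pure-Python illustration, not the paper's method; `smallest_eigvec` and the shift value are hypothetical names/choices.

```python
def matvec(A, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def smallest_eigvec(A, shift, iters=500):
    # Power iteration on B = shift*I - A: B's dominant eigenvector is the
    # eigenvector of A's smallest eigenvalue, provided shift exceeds A's
    # largest eigenvalue.
    n = len(A)
    B = [[(shift if i == j else 0.0) - A[i][j] for j in range(n)]
         for i in range(n)]
    v = normalize([1.0] + [0.0] * (n - 1))
    for _ in range(iters):
        v = normalize(matvec(B, v))
    return v

# A singular 2x2 system: eigenvalues of A are 0 and 2; the
# zero-eigenvalue eigenvector is proportional to (1, 1).
A = [[1.0, -1.0], [-1.0, 1.0]]
v = smallest_eigvec(A, shift=3.0)
```

The recovered `v` satisfies `A v ≈ 0`, which is exactly the kind of quantity a zero-eigenvalue-based loss is built around.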

[2]
Title: Probabilistic Video Generation using Holistic Attribute Control
Comments: arXiv admin note: The affiliation of Andreas Lehrmann should be Disney Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Videos express highly structured spatio-temporal patterns of visual data. A video can be thought of as being governed by two factors: (i) temporally invariant (e.g., person identity), or slowly varying (e.g., activity), attribute-induced appearance, encoding the persistent content of each frame, and (ii) an inter-frame motion or scene dynamics (e.g., encoding evolution of the person executing the action). Based on this intuition, we propose a generative framework for video generation and future prediction. The proposed framework generates a video (short clip) by decoding samples sequentially drawn from a latent space distribution into full video frames. Variational Autoencoders (VAEs) are used as a means of encoding/decoding frames into/from the latent space, and an RNN as a way to model the dynamics in the latent space. We improve video generation consistency through temporally-conditional sampling, and quality by structuring the latent space with attribute controls, ensuring that attributes can be both inferred and conditioned on during learning/generation. As a result, given attributes and/or the first frame, our model is able to generate diverse but highly consistent sets of video sequences, accounting for the inherent uncertainty in the prediction task. Experimental results on Chair CAD, Weizmann Human Action, and MIT-Flickr datasets, along with detailed comparison to the state of the art, verify the effectiveness of the framework.

[3]
Title: T-RECS: Training for Rate-Invariant Embeddings by Controlling Speed for Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

An action should remain identifiable when modifying its speed: consider the contrast between an expert chef and a novice chef each chopping an onion. Here, we expect the novice chef to have a relatively measured and slow approach to chopping when compared to the expert. In general, the speed at which actions are performed, whether slower or faster than average, should not dictate how they are recognized. We explore the erratic behavior caused by this phenomenon in state-of-the-art deep network-based methods for action recognition, in terms of maximum performance and stability in recognition accuracy across a range of input video speeds. By observing the trends in these metrics and summarizing them based on expected temporal behaviour w.r.t. variations in input video speeds, we find two distinct types of network architectures. In this paper, we propose a preprocessing method named T-RECS that extends deep-network-based methods for action recognition to explicitly account for speed variability in the data. We do so by adaptively resampling the inputs to a given model. T-RECS is agnostic to the specific deep-network model; we apply it to four state-of-the-art action recognition architectures: C3D, I3D, TSN, and ConvNet+LSTM. On HMDB51 and UCF101, T-RECS-based I3D models show a peak improvement of at least 2.9% in performance over the baseline, while T-RECS-based C3D models achieve a maximum improvement in stability of 59% over the baseline on the HMDB51 dataset.
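To illustrate the kind of adaptive input resampling T-RECS builds on, a nearest-neighbour temporal resampler can be sketched in a few lines. This is an illustrative toy, not the paper's preprocessing; `resample` and the frame encoding are our own choices.

```python
def resample(frames, rate):
    # Nearest-neighbour temporal resampling: rate > 1 speeds the clip up
    # (fewer frames kept), rate < 1 slows it down (frames repeated).
    n = max(1, round(len(frames) / rate))
    return [frames[min(len(frames) - 1, int(i * rate))] for i in range(n)]

sped_up = resample(list(range(6)), 2.0)   # keeps every other frame
slowed = resample(list(range(3)), 0.5)    # duplicates each frame
```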

[4]
Title: A Unified Framework for Multi-View Multi-Class Object Pose Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

One core challenge in object pose estimation is to ensure accurate and robust performance for large numbers of diverse foreground objects amidst complex background clutter. In this work, we present a scalable framework for accurately inferring six Degree-of-Freedom (6-DoF) pose for a large number of object classes from single or multiple views. To learn discriminative pose features, we integrate three new capabilities into a deep Convolutional Neural Network (CNN): an inference scheme that combines both classification and pose regression based on a uniform tessellation of SE(3), fusion of a class prior into the training process via a tiled class map, and an additional regularization using deep supervision with an object mask. Further, an efficient multi-view framework is formulated to address single-view ambiguity. We show this consistently improves the performance of the single-view network. We evaluate our method on three large-scale benchmarks: YCB-Video, JHUScene-50 and ObjectNet-3D. Our approach achieves competitive or superior performance over the current state-of-the-art methods.

[5]
Title: Fisher Pruning of Deep Nets for Facial Trait Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Although deep nets have resulted in high accuracies for various visual tasks, their computational and space requirements are prohibitively high for inclusion on devices without high-end GPUs. In this paper, we introduce a neuron/filter level pruning framework based on Fisher's LDA which leads to high accuracies for a wide array of facial trait classification tasks, while significantly reducing space/computational complexities. The approach is general and can be applied to convolutional, fully-connected, and module-based deep structures, in all cases leveraging the high decorrelation of neuron activations found in the pre-decision layer and cross-layer deconv dependency. Experimental results on binary and multi-category facial traits from the LFWA and Adience datasets illustrate the framework's comparable/better performance to state-of-the-art pruning approaches and compact structures (e.g. SqueezeNet, MobileNet). Ours successfully maintains comparable accuracies even after discarding most parameters (98%-99% for VGG-16, 82% for GoogLeNet) and with significant FLOP reductions (83% for VGG-16, 64% for GoogLeNet).

[6]
Title: Robust Blind Deconvolution via Mirror Descent
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Numerical Analysis (cs.NA); Machine Learning (stat.ML)

We revisit the Blind Deconvolution problem with a focus on understanding its robustness and convergence properties. Provable robustness to noise and other perturbations is receiving recent interest in vision, from obtaining immunity to adversarial attacks to assessing and describing failure modes of algorithms in mission critical applications. Further, many blind deconvolution methods based on deep architectures internally make use of or optimize the basic formulation, so a clearer understanding of how this sub-module behaves, when it can be solved, and what noise injection it can tolerate is a first order requirement. We derive new insights into the theoretical underpinnings of blind deconvolution. The algorithm that emerges has nice convergence guarantees and is provably robust in a sense we formalize in the paper. Interestingly, these technical results play out very well in practice, where on standard datasets our algorithm yields results competitive with or superior to the state of the art. Keywords: blind deconvolution, robust continuous optimization

[7]
Title: Extended depth-of-field in holographic image reconstruction using deep learning based auto-focusing and phase-recovery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Optics (physics.optics)

Holography encodes the three-dimensional (3D) information of a sample in the form of an intensity-only recording. However, to decode the original sample image from its hologram(s), auto-focusing and phase-recovery are needed, which are in general cumbersome and time-consuming to perform digitally. Here we demonstrate a convolutional neural network (CNN) based approach that simultaneously performs auto-focusing and phase-recovery to significantly extend the depth-of-field (DOF) in holographic image reconstruction. For this, a CNN is trained by using pairs of randomly de-focused back-propagated holograms and their corresponding in-focus phase-recovered images. After this training phase, the CNN takes a single back-propagated hologram of a 3D sample as input to rapidly achieve phase-recovery and reconstruct an in-focus image of the sample over a significantly extended DOF. This deep learning based DOF extension method is non-iterative, and significantly improves the algorithmic time-complexity of holographic image reconstruction from O(nm) to O(1), where n refers to the number of individual object points or particles within the sample volume, and m represents the focusing search space within which each object point or particle needs to be individually focused. These results highlight some of the unique opportunities created by data-enabled statistical image reconstruction methods powered by machine learning, and we believe that the presented approach can be broadly applicable to computationally extend the DOF of other imaging modalities.

[8]
Title: Deep Pose Consensus Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we address the problem of estimating a 3D human pose from a single image, which is important but difficult to solve due to many reasons, such as self-occlusions, wild appearance changes, and inherent ambiguities of 3D estimation from a 2D cue. These difficulties make the problem ill-posed and have led to increasingly complex estimators being required to enhance performance. Most existing methods, however, try to handle this problem with a single complex estimator, which might not be a good solution. To resolve this issue, we propose a multiple-partial-hypothesis-based framework for estimating 3D human pose from a single image, which can be fine-tuned in an end-to-end fashion. We first select several joint groups from a human joint model using the proposed sampling scheme, and estimate the 3D poses of each joint group separately based on deep neural networks. After that, they are aggregated to obtain the final 3D pose using the proposed robust optimization formula. The overall procedure can be fine-tuned in an end-to-end fashion, resulting in better performance. In the experiments, the proposed framework shows state-of-the-art performance on the popular Human3.6M and HumanEva benchmark datasets, demonstrating its effectiveness.

[9]
Title: Single-Shot Bidirectional Pyramid Networks for High-Quality Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent years have witnessed many exciting achievements in object detection using deep learning techniques. Despite significant progress, most existing detectors are designed to detect objects with relatively low-quality localization, i.e., they are often trained with the Intersection over Union (IoU) threshold set to 0.5 by default, which can yield low-quality or even noisy detections. It remains an open challenge to devise and train a high-quality detector that achieves more precise localization (i.e., IoU$>$0.5) without sacrificing detection performance. In this paper, we propose a novel single-shot detection framework of Bidirectional Pyramid Networks (BPN) towards high-quality object detection, which consists of two novel components: (i) a Bidirectional Feature Pyramid structure for more effective and robust feature representations; and (ii) a Cascade Anchor Refinement to gradually refine the quality of predesigned anchors for more effective training. Our experiments show that the proposed BPN achieves the best performance among single-stage object detectors on both the PASCAL VOC and MS COCO datasets, especially for high-quality detections.
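For reference, the IoU criterion that such quality thresholds are defined over can be computed for axis-aligned boxes as follows (a standard computation, not code from the paper; the `(x1, y1, x2, y2)` box convention is our assumption):

```python
def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2) corner coordinates.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents are clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two unit boxes offset by half their width overlap in a 1x1 region out of a union of 7, giving IoU 1/7, well below the usual 0.5 training threshold.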

[10]
Title: PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model
Comments: Person detection and pose estimation, segmentation and grouping
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417.

[11]
Title: Unsupervised Adversarial Learning of 3D Human Pose from 2D Joint Locations
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The task of three-dimensional (3D) human pose estimation from a single image can be divided into two parts: (1) two-dimensional (2D) human joint detection from the image and (2) estimating a 3D pose from the 2D joints. Herein, we focus on the second part, i.e., a 3D pose estimation from 2D joint locations. The problem with existing methods is that they require either (1) a 3D pose dataset or (2) 2D joint locations in consecutive frames taken from a video sequence. We aim to solve these problems. For the first time, we propose a method that learns a 3D human pose without any 3D datasets. Our method can predict a 3D pose from 2D joint locations in a single image. Our system is based on generative adversarial networks, and the networks are trained in an unsupervised manner. Our primary idea is that, if the network can predict a 3D human pose correctly, the 3D pose that is projected onto a 2D plane should not collapse even if it is rotated perpendicularly. We evaluated the performance of our method using the Human3.6M and MPII datasets and showed that our network can predict a 3D pose well even if the 3D dataset is not available during training.
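The consistency check described above (projecting a rotated 3D pose back onto a 2D plane) can be sketched as follows. This is an illustrative orthographic model with our own function names, not the authors' implementation:

```python
import math

def rotate_y(points3d, theta):
    # Rotate a 3D pose about the vertical (y) axis by theta radians.
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points3d]

def project(points3d):
    # Orthographic projection onto the image (xy) plane: drop depth.
    return [(x, y) for x, y, z in points3d]

# A toy 3-joint "pose"; a discriminator would judge whether this
# reprojection still looks like a plausible 2D pose.
pose = [(0.0, 1.0, 0.2), (0.1, 0.5, -0.1), (-0.1, 0.0, 0.0)]
view = project(rotate_y(pose, math.pi / 2))
```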

[12]
Title: Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The aim of image captioning is to have machines generate captions that describe image contents as humans do. Despite many efforts, generating discriminative captions for images remains non-trivial. Most traditional approaches imitate language structure patterns and thus tend to fall into a stereotype of replicating frequent phrases or sentences, neglecting unique aspects of each image. In this work, we propose an image captioning framework with a self-retrieval module as training guidance, which encourages generating discriminative captions. It brings unique advantages: (1) the self-retrieval guidance can act as a metric and an evaluator of caption discriminativeness to assure the quality of generated captions; (2) the correspondence between generated captions and images is naturally incorporated in the generation process without human annotations, and hence our approach can utilize a large amount of unlabeled images to boost captioning performance with no additional laborious annotations. We demonstrate the effectiveness of the proposed retrieval-guided method on the MS-COCO and Flickr30k captioning datasets, and show its superior captioning performance with more discriminative captions.

[13]
Title: Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Multi-people tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problem of occluded targets. To this end, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible solutions for joints that are not visible. We propose a new end-to-end architecture composed of four branches (\textit{visible heatmaps}, \textit{occluded heatmaps}, \textit{part affinity fields} and \textit{temporal affinity fields}) fed by a \textit{time linker} feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created the largest computer-graphics dataset to date for people tracking in urban scenarios (about 500,000 frames, more than 10 million body poses) by exploiting a photorealistic videogame. Our architecture, trained on virtual data, exhibits good generalization capabilities on public real tracking benchmarks when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.

[14]
Title: Prioritized Multi-View Stereo Depth Map Generation Using Confidence Prediction
Comments: This paper was accepted to ISPRS Journal of Photogrammetry and Remote Sensing (this https URL) on March 21, 2018. The official version will be made available on ScienceDirect (this https URL)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this work, we propose a novel approach to prioritize the depth map computation of multi-view stereo (MVS) to obtain compact 3D point clouds of high quality and completeness at low computational cost. Our prioritization approach operates before the MVS algorithm is executed and consists of two steps. In the first step, we aim to find a good set of matching partners for each view. In the second step, we rank the resulting view clusters (i.e. key views with matching partners) according to their impact on the fulfillment of desired quality parameters such as completeness, ground resolution and accuracy. In addition to geometric analysis, we use a novel machine learning technique for training a confidence predictor. The purpose of this confidence predictor is to estimate the chances of a successful depth reconstruction for each pixel in each image for one specific MVS algorithm based on the RGB images and the image constellation. The underlying machine learning technique does not require any ground truth or manually labeled data for training, but instead adapts ideas from depth map fusion for providing a supervision signal. The trained confidence predictor allows us to evaluate the quality of image constellations and their potential impact on the resulting 3D reconstruction, and thus builds a solid foundation for our prioritization approach. In our experiments, we are thus able to reach more than 70% of the maximal reachable quality fulfillment using only 5% of the available images as key views. For evaluating our approach within and across different domains, we use two completely different scenarios, i.e. cultural heritage preservation and reconstruction of single family houses.

[15]
Title: Dichromatic Gray Pixel for Camera-agnostic Color Constancy
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a novel statistical color constancy method, especially suitable for camera-agnostic color constancy, i.e. the scenario where nothing is known a priori about the capturing devices. The method, called Dichromatic Gray Pixel, or DGP, relies on a novel gray pixel detection algorithm derived using the Dichromatic Reflection Model. DGP is suitable for camera-agnostic color constancy since varying devices are set to make achromatic pixels look gray under standard neutral illumination. In the camera-agnostic scenario, the proposed method outperforms both state-of-the-art learning-based and statistical methods on standard benchmarks. DGP is simple, literally dozens of lines of code, and fast, processing a 1080p image in 0.4 seconds with unoptimized MATLAB code on an Intel i7 2.5 GHz CPU.

[16]
Title: What do Deep Networks Like to See?
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

We propose a novel way to measure and understand convolutional neural networks by quantifying the amount of input signal they let in. To do this, an autoencoder (AE) was fine-tuned on gradients from a pre-trained classifier with fixed parameters. We compared the reconstructed samples from AEs that were fine-tuned on a set of image classifiers (AlexNet, VGG16, ResNet-50, and Inception v3) and found substantial differences. The AE learns which aspects of the input space to preserve and which ones to ignore, based on the information encoded in the backpropagated gradients. Measuring the changes in accuracy when the signal of one classifier is used by a second one, a relation of total order emerges. This order depends directly on each classifier's input signal but it does not correlate with classification accuracy or network size. Further evidence of this phenomenon is provided by measuring the normalized mutual information between original images and auto-encoded reconstructions from different fine-tuned AEs. These findings break new ground in the area of neural network understanding, opening a new way to reason, debug, and interpret their results. We present four concrete examples in the literature where observations can now be explained in terms of the input signal that a model uses.

[17]
Title: Found a good match: should I keep searching? - Accuracy and Performance in Iris Matching Using 1-to-First Search
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Iris recognition is used in many applications around the world, with enrollment sizes as large as over one billion persons in India's Aadhaar program. Large enrollment sizes can require special optimizations in order to achieve fast database searches. One such optimization that has been used in some operational scenarios is 1:First search. In this approach, instead of scanning the entire database, the search is terminated when the first sufficiently good match is found. This saves time, but ignores potentially better matches that may exist in the unexamined portion of the enrollments. At least one prominent and successful border-crossing program used this approach for nearly a decade, in order to allow users a fast "token-free" search. Our work investigates the search accuracy of 1:First and compares it to the traditional 1:N search. Several different scenarios are considered, aiming to emulate real environments as closely as possible: a range of enrollment sizes, closed- and open-set configurations, two iris matchers, and different permutations of the galleries. Results confirm the expected accuracy degradation using 1:First search, and also allow us to identify acceptable working parameters where significant search time reduction is achieved while maintaining accuracy similar to 1:N search.
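The difference between the two search strategies can be sketched in a few lines; the toy gallery, distances, and function names below are ours, not any deployed matcher:

```python
def search_1_to_first(gallery, probe, dist, threshold):
    # 1:First search: terminate at the first enrollee whose distance to
    # the probe falls below the match threshold.
    for identity, template in gallery:
        if dist(probe, template) <= threshold:
            return identity
    return None

def search_1_to_n(gallery, probe, dist):
    # 1:N search: scan the whole gallery and return the best match.
    return min(gallery, key=lambda entry: dist(probe, entry[1]))[0]

dist = lambda a, b: abs(a - b)  # stand-in for an iris-code distance
gallery = [("alice", 0.90), ("bob", 0.30), ("carol", 0.05)]
first = search_1_to_first(gallery, 0.0, dist, threshold=0.40)  # "bob"
best = search_1_to_n(gallery, 0.0, dist)                       # "carol"
```

Here 1:First accepts "bob" and never examines "carol", the true best match, which is exactly the accuracy/speed trade-off the paper measures.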

[18]
Title: Densely Connected Pyramid Dehazing Network
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

We propose a new end-to-end single image dehazing method, called Densely Connected Pyramid Dehazing Network (DCPDN), which can jointly learn the transmission map, the atmospheric light, and dehazing all together. The end-to-end learning is achieved by directly embedding the atmospheric scattering model into the network, thereby ensuring that the proposed method strictly follows the physics-driven scattering model for dehazing. Inspired by dense networks, which can maximize the information flow along features from different levels, we propose a new edge-preserving densely connected encoder-decoder structure with a multi-level pyramid pooling module for estimating the transmission map. This network is optimized using a newly introduced edge-preserving loss function. To further incorporate the mutual structural information between the estimated transmission map and the dehazed result, we propose a joint discriminator based on a generative adversarial network framework to decide whether the corresponding dehazed image and estimated transmission map are real or fake. An ablation study is conducted to demonstrate the effectiveness of each module, evaluated on both the estimated transmission map and the dehazed result. Extensive experiments demonstrate that the proposed method achieves significant improvements over the state-of-the-art methods. Code will be made available at: https://github.com/hezhangsprinter

[19]
Title: PlaneMatch: Patch Coplanarity Prediction for Robust RGB-D Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce a novel RGB-D patch descriptor designed for detecting coplanar surfaces in SLAM reconstruction. The core of our method is a deep convolutional neural net that takes in RGB, depth, and normal information of a planar patch in an image and outputs a descriptor that can be used to find coplanar patches from other images. We train the network on 10 million triplets of coplanar and non-coplanar patches, and evaluate on a new coplanarity benchmark created from commodity RGB-D scans. Experiments show that our learned descriptor outperforms alternatives extended for this new task by a significant margin. In addition, we demonstrate the benefits of coplanarity matching in a robust RGB-D reconstruction formulation. We find that coplanarity constraints detected with our method are sufficient to obtain reconstruction results comparable to state-of-the-art frameworks on most scenes, and to outperform other methods on standard benchmarks when combined with a simple keypoint method.

[20]
Title: A Smoke Removal Method for Laparoscopic Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In laparoscopic surgery, image quality can be severely degraded by surgical smoke, which not only introduces error into the image processing (used in image-guided surgery), but also reduces visibility for the surgeons. In this paper, we propose to enhance laparoscopic images by decomposing them into an unwanted smoke part and an enhanced part using a variational approach. The proposed method relies on the observation that smoke has low contrast and low inter-channel differences. A cost function is defined based on this prior knowledge and is solved using an augmented Lagrangian method. The obtained unwanted smoke component is then subtracted from the original degraded image, resulting in the enhanced image. The obtained quantitative scores in terms of the FADE, JNBM and RE metrics show that our proposed method performs rather well. Furthermore, qualitative visual inspection of the results shows that it removes smoke effectively from the laparoscopic images.

[21]
Title: Group Sparsity Residual with Non-Local Samples for Image Denoising
Journal-ref: International Conference on Acoustics, Speech and Signal Processing 2018
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Inspired by group-based sparse coding, the recently proposed group sparsity residual (GSR) scheme has demonstrated superior performance in image processing. However, one challenge in GSR is to estimate the residual by using a proper reference of the group-based sparse coding (GSC), which is desired to be as close to the truth as possible. Previous works utilized estimations from other algorithms (e.g., GMM or BM3D), which are either not accurate or too slow. In this paper, we propose to use Non-Local Samples (NLS) as the reference in the GSR regime for image denoising, and thus term our method GSR-NLS. More specifically, we first obtain a good estimation of the group sparse coefficients via image nonlocal self-similarity, and then solve the GSR model by an effective iterative shrinkage algorithm. Experimental results demonstrate that the proposed GSR-NLS not only outperforms many state-of-the-art methods, but also offers a competitive advantage in speed.
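The shrinkage step at the heart of such iterative shrinkage algorithms is the soft-thresholding operator, sketched here for scalar coefficients (a standard operator, not the paper's full GSR-NLS solver):

```python
def soft_threshold(x, tau):
    # Soft-thresholding (shrinkage): the proximal operator of the l1
    # norm. Coefficients within [-tau, tau] are zeroed; the rest are
    # shrunk toward zero by tau.
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

# Applied element-wise to a vector of sparse-coding coefficients:
shrunk = [soft_threshold(c, 0.5) for c in [2.0, 0.3, -0.1, -1.5]]
```

Small coefficients (likely noise) are suppressed to zero while large ones survive, which is what makes repeated shrinkage steps denoise within each group.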

[22]
Title: Buried object detection from B-scan ground penetrating radar data using Faster-RCNN
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we adapt the Faster-RCNN framework for the detection of underground buried objects (i.e. hyperbola reflections) in B-scan ground penetrating radar (GPR) images. Due to the lack of real data for training, we propose to incorporate more simulated radargrams generated from different configurations using the gprMax toolbox. Our designed CNN is first pre-trained on the grayscale Cifar-10 database. Then, the Faster-RCNN framework based on the pre-trained CNN is trained and fine-tuned on both real and simulated GPR data. Preliminary detection results show that the proposed technique can provide significant improvements compared to classical computer vision methods, and hence is quite promising for dealing with this kind of specific GPR data, even with few training samples.

[23]
Title: Guided Image Inpainting: Replacing an Image Region by Pulling Content from Another Image
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep generative models have shown success in automatically synthesizing missing image regions using surrounding context. However, users cannot directly decide what content to synthesize with such approaches. We propose an end-to-end network for image inpainting that uses a different image to guide the synthesis of new content to fill the hole. A key challenge addressed by our approach is synthesizing new content in regions where the guidance image and the context of the original image are inconsistent. We conduct four studies that demonstrate our results yield more realistic image inpainting results over seven baselines.

[24]
Title: A Comprehensive Analysis of Deep Regression
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning has revolutionized data science, and recently its popularity has grown exponentially, as has the number of papers employing deep networks. Vision tasks such as human pose estimation did not escape this methodological change. The large number of deep architectures has led to a plethora of methods that are evaluated under different experimental protocols. Moreover, small changes in the architecture of the network, or in the data pre-processing procedure, together with the stochastic nature of the optimization methods, lead to notably different results, making it extremely difficult to identify methods that significantly outperform others. Therefore, when proposing regression algorithms, practitioners proceed by trial and error. This situation motivated the current study, in which we perform a systematic evaluation and a statistical analysis of the performance of vanilla deep regression -- short for convolutional neural networks with a linear regression top layer. To the best of our knowledge, this is the first comprehensive analysis of deep regression techniques. We perform experiments on three vision problems and report confidence intervals for the median performance as well as the statistical significance of the results, if any. Surprisingly, the variability due to different data pre-processing procedures generally eclipses the variability due to modifications in the network architecture.

[25]
Title: Clustering-driven Deep Embedding with Pairwise Constraints
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, there has been increasing interest in leveraging the competence of neural networks to analyze data. In particular, new clustering methods that employ deep embeddings have been presented. In this paper, we depart from centroid-based models and suggest a new framework, called Clustering-driven deep embedding with PAirwise Constraints (CPAC), for non-parametric clustering using a neural network. We present a clustering-driven embedding based on a Siamese network that encourages pairs of data points to output similar representations in the latent space. Our pair-based model allows augmenting the information with labeled pairs to constitute a semi-supervised framework. Our approach is based on analyzing the losses associated with each pair to refine the set of constraints. We show that clustering performance increases when using this scheme, even with a limited amount of user queries. We present state-of-the-art results on different types of datasets and compare our performance to parametric and non-parametric techniques.
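
A sketch of the kind of pairwise-constraint loss a Siamese embedding like this can use; this is a generic contrastive formulation over must-link / cannot-link pairs, not necessarily CPAC's exact loss, and `margin` is an assumed hyperparameter:

```python
import numpy as np

def pairwise_constraint_loss(za, zb, must_link, margin=1.0):
    # za, zb: latent embeddings of the two points of each pair (Siamese branches)
    # must_link: 1 if the pair should cluster together, 0 if it should not
    d = np.linalg.norm(za - zb, axis=1)
    pull = must_link * d ** 2                                   # similar pairs: pulled together
    push = (1 - must_link) * np.maximum(0.0, margin - d) ** 2   # dissimilar pairs: pushed at least `margin` apart
    return (pull + push).mean()

za = np.array([[0.0, 0.0], [0.0, 0.0]])
zb = np.array([[0.1, 0.0], [2.0, 0.0]])
loss = pairwise_constraint_loss(za, zb, must_link=np.array([1, 0]))
```

Labeled pairs from user queries can be mixed into `must_link` to obtain the semi-supervised variant the abstract describes.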

[26]
Title: Towards Universal Representation for Unseen Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Learning (cs.LG); Multimedia (cs.MM)

Unseen Action Recognition (UAR) aims to recognise novel action categories without training examples. While previous methods focus on inner-dataset seen/unseen splits, this paper proposes a pipeline using a large-scale training source to achieve a Universal Representation (UR) that can generalise to a more realistic Cross-Dataset UAR (CD-UAR) scenario. We first address UAR as a Generalised Multiple-Instance Learning (GMIL) problem and discover 'building-blocks' from the large-scale ActivityNet dataset using distribution kernels. Essential visual and semantic components are preserved in a shared space to achieve the UR that can efficiently generalise to new datasets. Predicted UR exemplars can be improved by a simple semantic adaptation, and then an unseen action can be directly recognised using UR during the test. Without further training, extensive experiments demonstrate significant improvements on the UCF101 and HMDB51 benchmarks.

[27]
Title: Branched Generative Adversarial Networks for Multi-Scale Image Manifold Learning
Comments: Submitted to ECCV 2018. 17 pages, 12 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce BranchGAN, a novel training method that enables unconditioned generative adversarial networks (GANs) to learn image manifolds at multiple scales. What is unique about BranchGAN is that it is trained in multiple branches, progressively covering both the breadth and depth of the network, as resolutions of the training images increase to reveal finer-scale features. Specifically, each noise vector, as input to the generator network, is explicitly split into several sub-vectors, each corresponding to and trained to learn image representations at a particular scale. During training, we progressively "de-freeze" the sub-vectors, one at a time, as a new set of higher-resolution images is employed for training and more network layers are added. A consequence of such an explicit sub-vector designation is that we can directly manipulate and even combine latent (sub-vector) codes that are associated with specific feature scales. Experiments demonstrate the effectiveness of our training method in multi-scale, disentangled learning of image manifolds and synthesis, without any extra labels and without compromising quality of the synthesized high-resolution images. We further demonstrate two new applications enabled by BranchGAN.
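
The noise sub-vector mechanics described above can be sketched as follows; the split sizes, mask-based "de-freezing", and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def split_noise(z, num_scales):
    # split each noise vector into equal sub-vectors, one per image scale
    return np.split(z, num_scales, axis=1)

def defreeze_mask(z_dim, num_scales, active_scales):
    # 1 for sub-vectors already "de-frozen" (in use at the current resolution),
    # 0 for sub-vectors still frozen, awaiting higher-resolution training stages
    mask = np.zeros(z_dim)
    sub = z_dim // num_scales
    mask[:active_scales * sub] = 1.0
    return mask

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 12))            # batch of noise vectors
subs = split_noise(z, num_scales=3)     # 3 sub-vectors of length 4 each
m = defreeze_mask(12, 3, active_scales=2)
z_active = z * m                        # only the first two scales contribute
```

Because each sub-vector is tied to one scale, the codes can be manipulated or recombined per scale, which is what enables the disentangled editing applications the abstract mentions.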

[28]
Title: KonIQ-10k: Towards an ecologically valid and large-scale IQA database
Comments: Image database, image quality assessment, diversity sampling, crowdsourcing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The main challenge in applying state-of-the-art deep learning methods to predict image quality in-the-wild is the relatively small size of existing quality-scored datasets. The reason for the lack of larger datasets is the massive resources required in generating diverse and publishable content. We present a new systematic and scalable approach to create large-scale, authentic and diverse image datasets for Image Quality Assessment (IQA). We show how we built an IQA database, KonIQ-10k, consisting of 10,073 images, on which we performed very large-scale crowdsourcing experiments in order to obtain reliable quality ratings from 1,467 crowd workers (1.2 million ratings). We argue for its ecological validity by analyzing the diversity of the dataset, by comparing it to state-of-the-art IQA databases, and by checking the reliability of our user studies.

[29]
Title: Group Normalization
Authors: Yuxin Wu, Kaiming He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained on ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform or compete with its BN-based counterparts for object detection and segmentation on COCO, and for video classification on Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
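
The abstract notes that GN takes only a few lines of code; a minimal NumPy sketch of the described computation, assuming NCHW layout, with `gamma`/`beta` the usual per-channel affine parameters:

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    # x: (N, C, H, W); normalize over each group of C // num_groups channels
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)   # per-sample, per-group statistics
    var = xg.var(axis=(2, 3, 4), keepdims=True)     # -- no batch dimension involved
    xg = (xg - mean) / np.sqrt(var + eps)
    x = xg.reshape(n, c, h, w)
    # per-channel affine transform, as in BN
    return x * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
y = group_norm(x, np.ones(8), np.zeros(8), num_groups=4)
```

Because the statistics are computed per sample rather than across the batch, the output is identical whatever the batch size, which is the property the abstract emphasizes.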

[30]
Title: Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Learning (cs.LG)

We present a method for generating colored 3D shapes from natural language. To this end, we first learn joint embeddings of freeform text descriptions and colored 3D shapes. Our model combines and extends learning by association and metric learning approaches to learn implicit cross-modal connections, and produces a joint representation that captures the many-to-many relations between language and physical properties of 3D shapes such as color and shape. To evaluate our approach, we collect a large dataset of natural language descriptions for physical 3D objects in the ShapeNet dataset. With this learned joint embedding we demonstrate text-to-shape retrieval that outperforms baseline approaches. Using our embeddings with a novel conditional Wasserstein GAN framework, we generate colored 3D shapes from text. Our method is the first to connect natural language text with realistic 3D objects exhibiting rich variations in color, texture, and shape detail. See video at https://youtu.be/zraPvRdl13Q

[31]
Title: Generalized Scene Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

A new passive approach called Generalized Scene Reconstruction (GSR) enables "generalized scenes" to be effectively reconstructed. Generalized scenes are defined to be "boundless" spaces that include non-Lambertian, partially transmissive, textureless and finely-structured matter. A new data structure called a plenoptic octree is introduced to enable efficient (database-like) light and matter field reconstruction in devices such as mobile phones, augmented reality (AR) glasses and drones. To satisfy threshold requirements for GSR accuracy, scenes are represented as systems of partially polarized light, radiometrically interacting with matter. To demonstrate GSR, a prototype imaging polarimeter is used to reconstruct (in generalized light fields) highly reflective, hail-damaged automobile body panels. Follow-on GSR experiments are described.

### Cross-lists for Fri, 23 Mar 18

[32]  arXiv:1803.08207 (cross-list from q-bio.QM) [pdf]
Title: Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs
Comments: 17 pages, 5 figures, to appear in the RECOMB 2018 conference proceedings
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Cryo-electron microscopy (cryoEM) is fast becoming the preferred method for protein structure determination. Particle picking is a significant bottleneck in the solving of protein structures from single particle cryoEM. Hand labeling sufficient numbers of particles can take months of effort and current computationally based approaches are often ineffective. Here, we frame particle picking as a positive-unlabeled classification problem in which we seek to learn a convolutional neural network (CNN) to classify micrograph regions as particle or background from a small number of labeled positive examples and many unlabeled examples. However, model fitting with very few labeled data points is a challenging machine learning problem. To address this, we develop a novel objective function, GE-binomial, for learning model parameters in this context. This objective uses a newly-formulated generalized expectation criteria to learn effectively from unlabeled data when using minibatched stochastic gradient descent optimizers. On a high-quality publicly available cryoEM dataset and a difficult unpublished dataset supplied by the Shapiro lab, we show that CNNs trained with this objective classify particles accurately with very few positive training examples and outperform EMAN2's byRef method by a large margin even with fewer labeled training examples. Furthermore, we show that incorporating an autoencoder improves generalization when very few labeled data points are available. We also compare our GE-binomial method with other positive-unlabeled learning methods never before applied to particle picking. We expect our particle picking tool, Topaz, based on CNNs trained with GE-binomial, to be an essential component of single particle cryoEM analysis and our GE-binomial objective function to be widely applicable to positive-unlabeled classification problems.
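
The GE-binomial objective itself is not spelled out in the abstract; for context, a standard objective for the same positive-unlabeled setting is the non-negative PU risk estimator of Kiryo et al. (one of the baseline family the paper compares against, not the paper's method), sketched here with an assumed logistic loss and an assumed class prior:

```python
import numpy as np

def nn_pu_risk(scores_p, scores_u, prior,
               loss=lambda s, y: np.log1p(np.exp(-y * s))):
    # Non-negative PU risk estimator (Kiryo et al.) -- NOT the paper's GE-binomial.
    # scores_p: classifier scores on labeled positives; scores_u: on unlabeled data.
    r_p_pos = loss(scores_p, +1).mean()   # positives treated as positive
    r_p_neg = loss(scores_p, -1).mean()   # positives treated as negative
    r_u_neg = loss(scores_u, -1).mean()   # unlabeled treated as negative
    # clamp the estimated negative risk at zero to avoid it going negative
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)

rng = np.random.default_rng(0)
scores_p = rng.normal(2.0, 1.0, size=100)    # scores on labeled particle regions
scores_u = rng.normal(0.0, 1.0, size=1000)   # scores on unlabeled micrograph regions
risk = nn_pu_risk(scores_p, scores_u, prior=0.1)
```

Either kind of PU objective lets the CNN learn from a handful of labeled particles plus abundant unlabeled micrograph crops, which is the data regime the abstract describes.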

[33]  arXiv:1803.08375 (cross-list from cs.NE) [pdf, other]
Title: Deep Learning using Rectified Linear Units (ReLU)
Comments: 7 pages, 11 figures, 9 tables
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)

We introduce the use of rectified linear units (ReLU) as the classification function in a deep neural network (DNN). Conventionally, ReLU is used as an activation function in DNNs, with Softmax function as their classification function. However, there have been several studies on using a classification function other than Softmax, and this study is an addition to those. We accomplish this by taking the activation of the penultimate layer $h_{n - 1}$ in a neural network, then multiply it by weight parameters $\theta$ to get the raw scores $o_{i}$. Afterwards, we threshold the raw scores $o_{i}$ by $0$, i.e. $f(o) = \max(0, o_{i})$, where $f(o)$ is the ReLU function. We provide class predictions $\hat{y}$ through argmax function, i.e. argmax $f(x)$.
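
The described classification head is easy to sketch in NumPy; `h` and `theta` follow the abstract's notation, while the shapes below are illustrative assumptions:

```python
import numpy as np

def relu_classify(h, theta):
    # raw scores o_i from the penultimate activation h and weights theta
    o = h @ theta
    # threshold the raw scores at 0: f(o) = max(0, o)
    f = np.maximum(0.0, o)
    # class prediction y_hat via argmax over the ReLU-thresholded scores
    return np.argmax(f, axis=1)

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 16))       # batch of penultimate-layer activations
theta = rng.normal(size=(16, 10))  # weights of the final layer (10 classes)
y_hat = relu_classify(h, theta)
```

Note that whenever at least one raw score is positive, the argmax over ReLU scores coincides with the argmax over the raw scores; the difference from a Softmax head lies in the training loss, not in the decision rule.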

[34]  arXiv:1803.08420 (cross-list from cs.HC) [pdf, other]
Title: Incremental Color Quantization for Color-Vision-Deficient Observers Using Mobile Gaming Data
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

The sizes of compressed images depend on their spatial resolution (number of pixels) and on their color resolution (number of color quantization levels). We introduce DaltonQuant, a new color quantization technique for image compression that cloud services can apply to images destined for a specific user with known color vision deficiencies. DaltonQuant improves compression in a user-specific but reversible manner thereby improving a user's network bandwidth and data storage efficiency. DaltonQuant quantizes image data to account for user-specific color perception anomalies, using a new method for incremental color quantization based on a large corpus of color vision acuity data obtained from a popular mobile game. Servers that host images can revert DaltonQuant's image requantization and compression when those images must be transmitted to a different user, making the technique practical to deploy on a large scale. We evaluate DaltonQuant's compression performance on the Kodak PC reference image set and show that it improves compression by an additional 22%-29% over the state-of-the-art compressors TinyPNG and pngquant.

### Replacements for Fri, 23 Mar 18

[35]  arXiv:1610.03151 (replaced) [pdf, other]
Title: FaceVR: Real-Time Facial Reenactment and Eye Gaze Control in Virtual Reality
Comments: Video: this https URL Presented at Siggraph'18
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[36]  arXiv:1708.02288 (replaced) [pdf, other]
Title: Beyond Low-Rank Representations: Orthogonal Clustering Basis Reconstruction with Optimized Graph Structure for Multi-view Spectral Clustering
Authors: Yang Wang, Lin Wu
Comments: Accepted to appear in Neural Networks, Elsevier, on 9th March 2018
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[37]  arXiv:1710.01218 (replaced) [pdf, ps, other]
Title: Reducing Complexity of HEVC: A Deep Learning Approach
Comments: 17 pages, with 12 figures and 7 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[38]  arXiv:1711.04226 (replaced) [pdf, other]
Title: AON: Towards Arbitrarily-Oriented Text Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[39]  arXiv:1711.05908 (replaced) [pdf, other]
Title: NISP: Pruning Networks using Neuron Importance Score Propagation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[40]  arXiv:1711.06454 (replaced) [pdf, other]
Title: Separating Style and Content for Generalized Style Transfer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[41]  arXiv:1711.06721 (replaced) [pdf, other]
Title: Learning SO(3) Equivariant Representations with Spherical CNNs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[42]  arXiv:1712.00617 (replaced) [pdf, other]
Title: From Pixels to Object Sequences: Recurrent Semantic Instance Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[43]  arXiv:1712.01358 (replaced) [pdf, other]
Title: Long-Term Visual Object Tracking Benchmark
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[44]  arXiv:1801.04261 (replaced) [pdf, other]
Title: Deep saliency: What is learnt by a deep network about saliency?
Comments: Accepted paper in 2nd Workshop on Visualisation for Deep Learning in the 34th International Conference On Machine Learning
Journal-ref: 2nd Workshop on Visualisation for Deep Learning, ICML 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[45]  arXiv:1801.04651 (replaced) [pdf, other]
Title: Deep Net Triage: Analyzing the Importance of Network Layers via Structural Compression
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[46]  arXiv:1801.06397 (replaced) [pdf, other]
Title: What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation?
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[47]  arXiv:1801.07892 (replaced) [pdf, other]
Title: Generative Image Inpainting with Contextual Attention
Comments: Accepted in CVPR 2018; add CelebA-HQ results; open sourced; interactive demo available: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[48]  arXiv:1802.01873 (replaced) [pdf, other]
Title: Every Smile is Unique: Landmark-Guided Diverse Smile Generation
Comments: IEEE International Conference on Computer Vision and Pattern Recognition, 2018
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[49]  arXiv:1802.02679 (replaced) [pdf, other]
Title: A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels
Journal-ref: IEEE Winter Conf. on Applications of Computer Vision 2018
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[50]  arXiv:1802.08936 (replaced) [pdf, other]
Title: A Dataset To Evaluate The Representations Learned By Video Prediction Models
Comments: Accepted to ICLR 2018 Workshop Track. Fixed Figure 2
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[51]  arXiv:1803.00391 (replaced) [src]
Title: Image Dataset for Visual Objects Classification in 3D Printing
Comments: It is not accepted and the work needed major reversion and improvement
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[52]  arXiv:1803.04108 (replaced) [pdf, other]
Title: Style Aggregated Network for Facial Landmark Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[53]  arXiv:1803.05753 (replaced) [pdf, other]
Title: What Catches the Eye? Visualizing and Understanding Deep Saliency Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[54]  arXiv:1803.06252 (replaced) [pdf, other]
Title: Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model
Comments: To appear in IAPR International Workshop on Document Analysis Systems 2018 (DAS 2018)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
[55]  arXiv:1803.07015 (replaced) [pdf]
Title: Live Target Detection with Deep Learning Neural Network and Unmanned Aerial Vehicle on Android Mobile Device
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[56]  arXiv:1803.07125 (replaced) [pdf, other]
Title: Local Binary Pattern Networks
Comments: 14 pages, 10 figures, 6 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[57]  arXiv:1803.07624 (replaced) [pdf, other]
Title: Dynamic Sampling Convolutional Neural Networks