Computer Vision and Pattern Recognition

New submissions

[ total of 39 entries: 1-39 ]

New submissions for Thu, 23 Mar 17

[1]
Title: Simple Online and Realtime Tracking with a Deep Association Metric
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms. In this paper, we integrate appearance information to improve the performance of SORT. Due to this extension we are able to track objects through longer periods of occlusions, effectively reducing the number of identity switches. In the spirit of the original framework we place much of the computational complexity into an offline pre-training stage where we learn a deep association metric on a large-scale person re-identification dataset. During online application, we establish measurement-to-track associations using nearest neighbor queries in visual appearance space. Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.
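As a rough illustration of measurement-to-track association by nearest neighbor queries in appearance space, the sketch below matches detections to tracks using the smallest cosine distance between a detection's embedding and any embedding stored for a track. The function names, the greedy matching order, and the `max_distance` threshold are assumptions made for this example; the abstract does not specify them, and the embeddings would in practice come from the learned association metric.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def associate(track_galleries, detections, max_distance=0.2):
    """Greedily match tracks to detections by nearest-neighbor distance.

    track_galleries: dict mapping an integer track id to a list of
        appearance embeddings previously seen for that track.
    detections: list of appearance embeddings for the current frame.
    max_distance: hypothetical gating threshold (an assumption).
    Returns a dict mapping track id to detection index.
    """
    candidates = []
    for track_id, gallery in track_galleries.items():
        for det_idx, det in enumerate(detections):
            # Nearest-neighbor query: closest stored embedding of this track.
            dist = min(cosine_distance(g, det) for g in gallery)
            if dist <= max_distance:
                candidates.append((dist, track_id, det_idx))

    # Accept the closest pairs first; each track and detection is used once.
    candidates.sort()
    matches, used_detections = {}, set()
    for dist, track_id, det_idx in candidates:
        if track_id in matches or det_idx in used_detections:
            continue
        matches[track_id] = det_idx
        used_detections.add(det_idx)
    return matches


galleries = {1: [[1.0, 0.0], [0.9, 0.1]], 2: [[0.0, 1.0]]}
frame_detections = [[0.0, 1.0], [1.0, 0.05]]
print(associate(galleries, frame_detections))
```

Greedy matching is used here only to keep the example short; an optimal assignment solver could be substituted without changing the distance computation.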

[2]
Title: IOD-CNN: Integrating Object Detection Networks for Event Recognition
Comments: submitted to IEEE International Conference on Image Processing 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Many previous methods have shown the importance of considering semantically relevant objects for performing event recognition, yet none of the methods have exploited the power of deep convolutional neural networks to directly integrate relevant object information into a unified network. We present a novel unified deep CNN architecture which integrates architecturally different, yet semantically-related object detection networks to enhance the performance of the event recognition task. Our architecture allows the sharing of the convolutional layers and a fully connected layer which effectively integrates event recognition, rigid object detection and non-rigid object detection.

[3]
Title: No Fuss Distance Metric Learning using Proxies
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship -- an anchor point $x$ is similar to a set of positive points $Y$, and dissimilar to a set of negative points $Z$, and a loss defined over these distances is minimized.
While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc., but even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy-loss improves on state-of-the-art results for three standard zero-shot learning datasets, by up to 15 percentage points, while converging three times as fast as other triplet-based losses.
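To make the idea of triplets built from an anchor data point and proxy points concrete, here is a minimal sketch. It assumes one proxy per class, squared Euclidean distance, and a hinge loss with a margin; none of these choices is stated in the abstract, and the function and parameter names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proxy_triplet_loss(anchor, label, proxies, margin=0.1):
    """Hinge triplet loss where positives and negatives are proxies.

    anchor: embedding of a data point.
    label: class of the anchor; its proxy acts as the positive.
    proxies: dict mapping each class to one proxy embedding
        (one proxy per class is an assumption of this sketch).
    margin: hypothetical hinge margin.
    """
    positive = squared_distance(anchor, proxies[label])
    losses = [
        max(0.0, positive - squared_distance(anchor, proxy) + margin)
        for cls, proxy in proxies.items()
        if cls != label
    ]
    # Average over all (anchor, positive proxy, negative proxy) triplets.
    return sum(losses) / len(losses)


class_proxies = {"a": [0.0, 0.0], "b": [1.0, 0.0]}
print(proxy_triplet_loss([0.1, 0.0], "a", class_proxies))
print(proxy_triplet_loss([0.1, 0.0], "b", class_proxies))
```

Because every anchor is compared only against the small set of proxies, no triplet mining over the data set is needed in this formulation.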

[4]
Title: Episode-Based Active Learning with Bayesian Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)

We investigate different strategies for active learning with Bayesian deep neural networks. We focus our analysis on scenarios where new, unlabeled data is obtained episodically, as is commonly encountered in mobile robotics applications. An evaluation of different strategies for acquisition, updating, and final training on the CIFAR-10 dataset shows that incremental network updates with final training on the accumulated acquisition set are essential for best performance, while limiting the amount of required human labeling labor.

[5]
Title: PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding
Comments: 10 pages, submitted to ICCV 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Although many 3D human activity benchmarks have been proposed, most existing action datasets focus on action recognition tasks for segmented videos. There is a lack of standard large-scale benchmarks, especially for currently popular data-hungry deep learning based methods. In this paper, we introduce a new large-scale benchmark (PKU-MMD) for continuous multi-modality 3D human action understanding that covers a wide range of complex human activities with well annotated information. PKU-MMD contains 1076 long video sequences in 51 action categories, performed by 66 subjects in three camera views. It contains almost 20,000 action instances and 5.4 million frames in total. Our dataset also provides multi-modality data sources, including RGB, depth, Infrared Radiation and Skeleton. With different modalities, we conduct extensive experiments on our dataset in terms of two scenarios and evaluate different methods by various metrics, including a newly proposed evaluation protocol, 2D-AP. We believe this large-scale dataset will benefit future research on action detection for the community.

[6]
Title: Spatially-Varying Blur Detection Based on Multiscale Fused and Sorted Transform Coefficients of Gradient Magnitudes
Comments: Accepted at CVPR 2017
Journal-ref: 2017 IEEE Conference on Computer Vision and Pattern Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The detection of spatially-varying blur without having any information about the blur type is a challenging task. In this paper, we propose a novel effective approach to address the blur detection problem from a single image without requiring any knowledge about the blur type, level, or camera settings. Our approach computes blur detection maps based on a novel High-frequency multiscale Fusion and Sort Transform (HiFST) of gradient magnitudes. The evaluations of the proposed approach on a diverse set of blurry images with different blur types, levels, and contents demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods qualitatively and quantitatively.

[7]
Title: Knowledge Transfer for Melanoma Screening with Deep Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Knowledge transfer impacts the performance of deep learning -- the state of the art for image classification tasks, including automated melanoma screening. Deep learning's greed for large amounts of training data poses a challenge for medical tasks, which we can alleviate by recycling knowledge from models trained on different tasks, in a scheme called transfer learning. Although much of the best art on automated melanoma screening employs some form of transfer learning, a systematic evaluation was missing. Here we investigate the presence of transfer, from which task the transfer is sourced, and the application of fine tuning (i.e., retraining of the deep learning model after transfer). We also test the impact of picking deeper (and more expensive) models. Our results favor deeper models, pre-trained over ImageNet, with fine-tuning, reaching an AUC of 80.7% and 84.5% for the two skin-lesion datasets evaluated.

[8]
Title: Deep Photo Style Transfer
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper introduces a deep-learning approach to photographic style transfer that handles a large variety of image content while faithfully transferring the reference style. Our approach builds upon recent work on painterly transfer that separates style from the content of an image by considering different layers of a neural network. However, as is, this approach is not suitable for photorealistic style transfer. Even when both the input and reference images are photographs, the output still exhibits distortions reminiscent of a painting. Our contribution is to constrain the transformation from the input to the output to be locally affine in colorspace, and to express this constraint as a custom CNN layer through which we can backpropagate. We show that this approach successfully suppresses distortion and yields satisfying photorealistic style transfers in a broad variety of scenarios, including transfer of the time of day, weather, season, and artistic edits.

[9]
Title: Video Frame Interpolation via Adaptive Convolution
Comments: CVPR 2017, this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video frame interpolation typically involves two steps: motion estimation and pixel synthesis. Such a two-step approach heavily depends on the quality of motion estimation. This paper presents a robust video frame interpolation method that combines these two steps into a single process. Specifically, our method considers pixel synthesis for the interpolated frame as local convolution over two input frames. The convolution kernel captures both the local motion between the input frames and the coefficients for pixel synthesis. Our method employs a deep fully convolutional neural network to estimate a spatially-adaptive convolution kernel for each pixel. This deep neural network can be directly trained end to end using widely available video data without any difficult-to-obtain ground-truth data like optical flow. Our experiments show that the formulation of video interpolation as a single convolution process allows our method to gracefully handle challenges like occlusion, blur, and abrupt brightness change and enables high-quality video frame interpolation.
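The formulation of pixel synthesis as local convolution over two input frames can be sketched as follows. The per-pixel kernels are passed in directly here; in the described method they would be estimated by the network. Splitting the kernel into one half per frame, the grayscale list-of-lists image format, and the border clamping are assumptions of this example.

```python
def interpolate_pixel(frame1, frame2, y, x, kernel1, kernel2):
    """Synthesize one interpolated pixel by local convolution.

    frame1, frame2: grayscale images as lists of rows of floats.
    kernel1, kernel2: k x k weights applied to the patches centred
        at (y, x) in frame1 and frame2 respectively. Together they
        stand in for the spatially-adaptive kernel of this pixel.
    Coordinates outside the image are clamped to the border
    (an assumption of this sketch).
    """
    k = len(kernel1)
    radius = k // 2
    height, width = len(frame1), len(frame1[0])
    value = 0.0
    for dy in range(k):
        for dx in range(k):
            yy = min(max(y + dy - radius, 0), height - 1)
            xx = min(max(x + dx - radius, 0), width - 1)
            value += kernel1[dy][dx] * frame1[yy][xx]
            value += kernel2[dy][dx] * frame2[yy][xx]
    return value


first = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
second = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0], [70.0, 80.0, 90.0]]
# Kernel 1 samples the centre pixel; kernel 2 samples the right neighbour,
# which is how a kernel can encode a one-pixel horizontal motion.
centre = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
right = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]
print(interpolate_pixel(first, second, 1, 1, centre, right))
```

The position of the non-zero weights captures the local motion, while their magnitudes act as the blending coefficients, so both roles are carried by the same kernel.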

[10]
Title: Joint Intermodal and Intramodal Label Transfers for Extremely Rare or Unseen Classes
Comments: The paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. It will appear in a future issue.
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a label transfer model from texts to images for image classification tasks. The problem of image classification is often much more challenging than text classification. On one hand, labeled text data is more widely available than labeled images for classification tasks. On the other hand, text data tends to have natural semantic interpretability, and it is often more directly related to class labels. In contrast, image features are not directly related to the concepts inherent in class labels. One of our goals in this paper is to develop a model for revealing the functional relationships between text and image features so as to directly transfer intermodal and intramodal labels to annotate the images. This is implemented by learning a transfer function as a bridge to propagate the labels between two multimodal spaces. However, the intermodal label transfers could be undermined by blindly transferring the labels of noisy texts to annotate images. To mitigate this problem, we present an intramodal label transfer process, which complements the intermodal label transfer by transferring the image labels instead when relevant text is absent from the source corpus. In addition, we generalize the intermodal label transfer to the zero-shot learning scenario where there are only text examples available to label unseen classes of images without any positive image examples. We evaluate our algorithm on an image classification task and show its effectiveness with respect to the other compared algorithms.

[11]
Title: Deeply-Supervised CNN for Prostate Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Prostate segmentation from Magnetic Resonance (MR) images plays an important role in image guided intervention. However, the lack of a clear boundary, specifically at the apex and base, and the huge variation of shape and texture between the images from different patients make the task very challenging. To overcome these problems, in this paper, we propose a deeply supervised convolutional neural network (CNN) utilizing the convolutional information to accurately segment the prostate from MR images. The proposed model can effectively detect the prostate region with additional deeply supervised layers compared with other approaches. Since some information will be abandoned after convolution, it is necessary to pass the features extracted from early stages to later stages. The experimental results show that significant segmentation accuracy improvement has been achieved by our proposed method compared to other reported approaches.

[12]
Title: Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a novel approach, called Deep MANTA (Deep Many-Tasks), for many-task vehicle analysis from a given image. A robust convolutional network is introduced for simultaneous vehicle detection, part localization, visibility characterization and 3D dimension estimation. Its architecture is based on a new coarse-to-fine object proposal that boosts the vehicle detection. Moreover, the Deep MANTA network is able to localize vehicle parts even if these parts are not visible. At inference time, the network's outputs are used by a real time robust pose estimation algorithm for fine orientation estimation and 3D vehicle localization. We show in experiments that our method outperforms monocular state-of-the-art approaches on vehicle detection, orientation and 3D location tasks on the very challenging KITTI benchmark.

[13]
Title: An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose an end-to-end approach to the natural language object retrieval task, which localizes an object within an image according to a natural language description, i.e., referring expression. Previous works divide this problem into two independent stages: first, compute region proposals from the image without the exploration of the language description; second, score the object proposals with regard to the referring expression and choose the top-ranked proposals. The object proposals are generated independently from the referring expression, which makes the proposal generation redundant and even irrelevant to the referred object. In this work, we train an agent with deep reinforcement learning, which learns to move and reshape a bounding box to localize the object according to the referring expression. We incorporate both the spatial and temporal context information into the training procedure. By simultaneously exploiting local visual information, the spatial and temporal context and the referring language a priori, the agent selects an appropriate action to take at each time. A special action is defined to indicate when the agent finds the referred object, and to terminate the procedure. We evaluate our model on various datasets, and our algorithm significantly outperforms the compared algorithms. Notably, the accuracy improvements of our method over the recent methods GroundeR and SCRC on the ReferItGame dataset are 7.67% and 18.25%, respectively.

[14]
Title: Can you tell where in India I am from? Comparing humans and computers on fine-grained race face classification
Comments: 9 pages, 5 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Faces form the basis for a rich variety of judgments in humans, yet the underlying features remain poorly understood. Although fine-grained distinctions within a race might more strongly constrain possible facial features used by humans than in the case of coarse categories such as race or gender, such fine-grained distinctions are relatively less studied. Fine-grained race classification is also interesting because even humans may not be perfectly accurate on these tasks. This allows us to compare errors made by humans and machines, in contrast to standard object detection tasks where human performance is nearly perfect. We have developed a novel face database of close to 1650 diverse Indian faces labeled for fine-grained race (South vs North India) as well as for age, weight, height and gender. We then asked close to 130 human subjects to categorize each face as belonging to a Northern or Southern state in India. We then compared human performance on this task with that of computational models trained on the ground-truth labels. Our main results are as follows: (1) Humans are highly consistent (average accuracy: 63.6%), with some faces being consistently classified with > 90% accuracy and others consistently misclassified with < 30% accuracy; (2) Models trained on ground-truth labels showed slightly worse performance (average accuracy: 62%) but showed higher accuracy (72.2%) on faces classified with > 80% accuracy by humans. This was true for models trained on simple spatial and intensity measurements extracted from faces as well as deep neural networks trained on race or gender classification; (3) Using overcomplete banks of features derived from each face part, we found that mouth shape was the single largest contributor towards fine-grained race classification, whereas distances between face parts were the strongest predictor of gender.

[15]
Title: Neural Ctrl-F: Segmentation-free Query-by-String Word Spotting in Handwritten Manuscript Collections
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we approach the problem of segmentation-free query-by-string word spotting for handwritten documents. In other words, we use methods inspired by computer vision and machine learning to search for words in large collections of digitized manuscripts. In particular, we are interested in historical handwritten texts, which are often far more challenging than modern printed documents. This task is important, as it provides people with a way to quickly find what they are looking for in large collections that are tedious and difficult to read manually. To this end, we introduce an end-to-end trainable model based on deep neural networks that we call Ctrl-F-Net. Given a full manuscript page, the model simultaneously generates region proposals, and embeds these into a distributed word embedding space, where searches are performed. We evaluate the model on common benchmarks for handwritten word spotting, outperforming the previous state-of-the-art segmentation-free approaches by a large margin, and in some cases even segmentation-based approaches. One interesting real-life application of our approach is to help historians to find and count specific words in court records that are related to women's sustenance activities and division of labor. We provide promising preliminary experiments that validate our method on this task.

[16]
Title: Predicting Deeper into the Future of Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

The ability to predict and therefore to anticipate the future is an important attribute of intelligence. It is also of utmost importance in real-time systems, e.g. in robotics or autonomous driving, which depend on visual scene understanding for decision making. While prediction of the raw RGB pixel values in future video frames has been studied in previous work, here we focus on predicting semantic segmentations of future frames. More precisely, given a sequence of semantically segmented video frames, our goal is to predict segmentation maps of not yet observed video frames that lie up to a second or further in the future. We develop an autoregressive convolutional neural network that learns to iteratively generate multiple frames. Our results on the Cityscapes dataset show that directly predicting future segmentations is substantially better than predicting and then segmenting future RGB frames. Our models predict trajectories of cars and pedestrians much more accurately (25%) than baselines that copy the most recent semantic segmentation or warp it using optical flow. Prediction results up to half a second in the future are visually convincing, the mean IoU of predicted segmentations reaching two thirds of the real future segmentations.
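The iterative, autoregressive generation of multiple future frames can be sketched as a simple rollout loop in which each prediction is appended to the context used for the next step. The `predict_next` callable stands in for the trained network, and the `context` length of 4 is an assumption of this example, not a value given in the abstract.

```python
def rollout(predict_next, observed, num_future, context=4):
    """Autoregressively predict `num_future` future segmentation maps.

    predict_next: hypothetical callable mapping a list of the most
        recent `context` frames to the next frame.
    observed: list of already segmented frames.
    Each predicted frame is fed back as input for later steps.
    """
    frames = list(observed)
    predictions = []
    for _ in range(num_future):
        next_frame = predict_next(frames[-context:])
        predictions.append(next_frame)
        frames.append(next_frame)
    return predictions


# Toy stand-in for a network: frames are integers and the "model"
# simply increments the most recent one.
def toy_model(recent_frames):
    return recent_frames[-1] + 1


print(rollout(toy_model, [1, 2, 3, 4], num_future=3))
```

Because predictions are fed back in, errors can accumulate over the rollout, which is consistent with results being reported separately for shorter and longer horizons.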

[17]
Title: Classifying Symmetrical Differences and Temporal Change in Mammography Using Deep Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We investigate the addition of symmetry and temporal context information to a deep Convolutional Neural Network (CNN) with the purpose of detecting malignant soft tissue lesions in mammography. We employ a simple linear mapping that takes the location of a mass candidate and maps it to either the contra-lateral or prior mammogram, and Regions Of Interest (ROI) are extracted around each location. We subsequently explore two different architectures: (1) a fusion model employing two datastreams where both ROIs are fed to the network during training and testing and (2) a stage-wise approach where a single ROI CNN is trained on the primary image and subsequently used as feature extractor for both primary and symmetrical or prior ROIs. A 'shallow' Gradient Boosted Tree (GBT) classifier is then trained on the concatenation of these features and used to classify the joint representation. Results show a significant increase in performance using the first architecture and symmetry information, but only marginal gains in performance using temporal data and the other setting. We feel the results are promising and can be greatly improved when more temporal data becomes available.

[18]
Title: In Defense of the Triplet Loss for Person Re-Identification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

In the past few years, the field of computer vision has gone through a revolution fueled mainly by the advent of large datasets and the adoption of deep convolutional neural networks for end-to-end learning. The person re-identification subfield is no exception to this, thanks to the notable publication of the Market-1501 and MARS datasets and several strong deep learning approaches. Unfortunately, a prevailing belief in the community seems to be that the triplet loss is inferior to using surrogate losses (classification, verification) followed by a separate metric learning step. We show that, for models trained from scratch as well as pretrained ones, using a variant of the triplet loss to perform end-to-end deep metric learning outperforms any other published method by a large margin.

Cross-lists for Thu, 23 Mar 17

[19]  arXiv:1703.07655 (cross-list from cs.NE) [pdf, other]
Title: ASP: Learning to Forget with Adaptive Synaptic Plasticity in Spiking Neural Networks
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

A fundamental feature of learning in animals is the "ability to forget" that allows an organism to perceive, model and make decisions from disparate streams of information and adapt to changing environments. Against this backdrop, we present a novel unsupervised learning mechanism ASP (Adaptive Synaptic Plasticity) for improved recognition with Spiking Neural Networks (SNNs) for real time on-line learning in a dynamic environment. We incorporate an adaptive weight decay mechanism with the traditional Spike Timing Dependent Plasticity (STDP) learning to model adaptivity in SNNs. The leak rate of the synaptic weights is modulated based on the temporal correlation between the spiking patterns of the pre- and post-synaptic neurons. This mechanism helps in gradual forgetting of insignificant data while retaining significant, yet old, information. ASP, thus, maintains a balance between forgetting and immediate learning to construct a stable-plastic self-adaptive SNN for continuously changing inputs. We demonstrate that the proposed learning methodology addresses catastrophic forgetting while yielding significantly improved accuracy over the conventional STDP learning method for digit recognition applications. Additionally, we observe that the proposed learning model automatically encodes selective attention towards relevant features in the input data while eliminating the influence of background noise (or denoising), further improving the robustness of ASP learning.

Replacements for Thu, 23 Mar 17

[20]  arXiv:1610.02391 (replaced) [pdf, other]
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Learning (cs.LG)
[21]  arXiv:1611.07890 (replaced) [pdf, other]
Title: Image-based localization using LSTMs for structured feature correlation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[22]  arXiv:1612.00534 (replaced) [pdf, other]
Title: Object Detection via Aspect Ratio and Context Aware Region-based Convolutional Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[23]  arXiv:1612.06096 (replaced) [pdf, other]
Title: X-ray In-Depth Decomposition: Revealing The Latent Structures
Comments: Under review at MICCAI 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[24]  arXiv:1612.07528 (replaced) [pdf, other]
Title: Handwriting recognition using Cohort of LSTM and lexicon verification with extremely large lexicon
Comments: 28 pages, paper submitted to Pattern Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[25]  arXiv:1701.02470 (replaced) [pdf]
Title: Methods for Mapping Forest Disturbance and Degradation from Optical Earth Observation Data: a Review
Comments: This is the Authors' accepted version only! The final version of this paper can be located at Springer.com as part of the Current Forestry Reports (2017) 3: 32. doi:10.1007/s40725-017-0047-2
Journal-ref: Current Forestry Reports 2017
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[26]  arXiv:1702.00505 (replaced) [pdf, other]
Title: Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications Using HyperMapper
Comments: 10 pages, Keywords: design space exploration, machine learning, computer vision, SLAM, embedded systems, GPU, crowd-sourcing
Journal-ref: 31st IEEE International Parallel and Distributed Processing Symposium May 29 - June 2, 2017 Orlando, Florida USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG); Performance (cs.PF)
[27]  arXiv:1702.00783 (replaced) [pdf, other]
Title: Pixel Recursive Super Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
[28]  arXiv:1703.01976 (replaced) [pdf, other]
Title: Incorporating the Knowledge of Dermatologists to Convolutional Neural Networks for the Diagnosis of Skin Lesions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[29]  arXiv:1703.02437 (replaced) [pdf, other]
Title: PathTrack: Fast Trajectory Annotation with Path Supervision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Multimedia (cs.MM)
[30]  arXiv:1703.04590 (replaced) [pdf, other]
Title: Learning Background-Aware Correlation Filters for Visual Tracking
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[31]  arXiv:1703.05830 (replaced) [pdf, other]
Title: Automatically identifying wild animals in camera trap images with deep learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
[32]  arXiv:1703.05884 (replaced) [pdf, other]
Title: Need for Speed: A Benchmark for Higher Frame Rate Object Tracking
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[33]  arXiv:1703.06211 (replaced) [pdf, other]
Title: Deformable Convolutional Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[34]  arXiv:1703.06246 (replaced) [pdf, other]
Title: Towards Context-aware Interaction Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[35]  arXiv:1703.06935 (replaced) [pdf, other]
Title: Fast Spectral Ranking for Similarity Search
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[36]  arXiv:1703.07255 (replaced) [pdf, other]
Title: ZM-Net: Real-time Zero-shot Image Manipulation Network
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Learning (cs.LG); Machine Learning (stat.ML)
[37]  arXiv:1607.03961 (replaced) [pdf, other]
Title: Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays
Subjects: Data Structures and Algorithms (cs.DS); Computer Vision and Pattern Recognition (cs.CV)
[38]  arXiv:1608.04644 (replaced) [pdf, other]
Title: Towards Evaluating the Robustness of Neural Networks
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
[39]  arXiv:1609.05993 (replaced) [pdf, other]
Title: Reducing Drift in Visual Odometry by Inferring Sun Direction Using a Bayesian Convolutional Neural Network
Comments: To appear in the proceedings of the International Conference on Robotics and Automation, Singapore, May 29 to June 3, 2017
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)