Highly Accurate and Efficient Scene Understanding via Mixture of Neural Sub-Network Architecture Search


Grant Data
Project Title
Highly Accurate and Efficient Scene Understanding via Mixture of Neural Sub-Network Architecture Search
Principal Investigator
Professor Luo, Ping (Principal Investigator (PI))
Duration
42 months
Start Date
2020-07-01
Amount
1,067,005
Keywords
Deep Neural Network, Differentiable Learning, Learning-to-meta-learn, Neural Architecture Search, Scene Image Understanding
Discipline
Vision, Signal and Image Processing
Panel
Engineering (E)
HKU Project Code
27208720
Grant Type
Early Career Scheme (ECS)
Funding Year
2020
Status
Completed
Objectives
1. Develop New Approaches for Hardware- and Domain-aware Scene Understanding. We will design highly accurate and fast algorithms for scene understanding tasks such as scene segmentation and dense object detection. These tasks matter in many vision systems, such as a visual navigation system (VNS); for example, a VNS can locate a geographical position, "2 Queen's Road, HK", from a single image of the Cheung Kong Center by identifying the streets and buildings in it. Scene understanding is crucial on many hardware platforms (e.g. autonomous vehicles, robots, mobile phones) and data domains (e.g. streets in different cities). Because different domains show large appearance variations and different platforms offer different resources (e.g. FLOPs, memory/storage), no single existing CNN can achieve both high accuracy and low runtime latency across all domains and platforms: a small CNN loses accuracy, while a large CNN loses efficiency. Existing CNNs also have a severe limitation: once trained for a specific domain and hardware platform, a CNN is fixed and cannot be transferred to others. Previous work relied on human experts to manually design different CNN architectures for different domains and platforms, making them redundant, suboptimal, and inefficient. Our Goal is to automatically learn a CNN architecture for scene understanding that is trained only once and re-allocates its computation and memory at test time, adapting to multiple hardware platforms and domains with high accuracy and low latency. This differs fundamentally from existing CNNs, which are restricted to a single domain and hardware platform after training; it will change the CNN pipeline at its root, giving this project substantial impact on both applications and theory.

2. Develop New Methods for Neural Architecture Search (NAS). Existing CNNs have three critical problems that make them inadequate for scene understanding in challenging scenarios. (1) Different CNNs are designed by human experts for different domains and hardware platforms, making them redundant, inaccurate, and inefficient. (2) Although prior NAS methods can jointly learn the network architecture and parameters, they mainly search the wiring between basic blocks, where each block contains a convolutional layer, a normalization layer, and an activation function; this limits the CNN's representation capacity. (3) The foundation of a CNN is feed-forward computation, with a forward flow that produces the segmentation and a backward flow that updates the network parameters; after NAS training these flows are fixed at test time, preventing the learned architecture from adapting to new scenarios. Our Goal is to develop a novel NAS algorithm, termed Mixture of Subnetwork Architecture Search (mixSAS). It automatically learns a single CNN architecture containing many sub-networks for different domains and hardware platforms. These sub-networks are fully shared to avoid redundancy, and at test time an optimal sub-network can be activated in an unsupervised manner, improving both efficiency and accuracy; the speedup in segmentation could exceed 80 times on large images of 1024×1024 pixels. A new basic block will be developed to learn arbitrary convolutions (e.g. dilated and depthwise convolutions), normalizations, and poolings in a differentiable, end-to-end manner; it improves the CNN's representation power and reduces runtime and memory by an order of magnitude. The learned architecture also allows different data samples to use different operations (i.e. instance-based selection), which previous NAS methods cannot achieve. A minimal illustrative sketch of such a differentiable block is given after the objectives.
3. Build a Benchmark for Scene Understanding across Data Domains and Hardware Platforms. Existing benchmarks focus on a single domain (e.g. Cityscapes, built in Europe). We will build a new benchmark by combining prior datasets from multiple domains to evaluate existing methods as well as the proposed ones, and we will increase the data scale by collecting, cleaning, and labeling more data. The runtime and accuracy of different deep neural networks and operations will be evaluated on different hardware platforms using this benchmark, which will facilitate further innovative research.

4. Perform Theoretical Analysis for mixSAS. We will analyze the convergence and generalization ability of the proposed mixSAS using partial differential equations (PDEs) and statistical mechanics. The results will be of wide interest because they could explain which operations a CNN should use for a given data distribution (e.g. a Gaussian mixture) and objective function (e.g. cross-entropy), shedding light on neural network design and explainability.

5. System Integration. We will build a scene segmentation system and a dense object detection system that are highly accurate, fast, and adaptable to CPUs, GPUs, and mobile phones, as well as to data captured in Asia and Europe. The outcomes will be a major breakthrough in deep learning applications and will break the accuracy and efficiency bottlenecks of CNNs deployed across different hardware platforms and data domains. Because we fundamentally change the forward and backward computation of deep neural networks, this work will inspire further innovations in related research.

6. Extension. The proposed mixSAS will be extended to learn deep network architectures for medical image analysis, such as organ segmentation, to evaluate its generalization ability to new tasks.

7. Publications and Codes. We plan to publish 10-12 journal/conference papers in venues such as TPAMI, ICML, ICLR, NIPS, and CVPR, file 2-3 patents, and deliver 1-2 software libraries.
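For concreteness, the following is a minimal, hypothetical sketch in PyTorch of the kind of differentiable mixed-operation block described in Objective 2: each block keeps several candidate operations (plain, dilated, and depthwise convolutions, pooling, identity) and learns a soft, per-sample selection over them end-to-end. The candidate list, the gating head, and all hyper-parameters are illustrative assumptions and do not describe the actual mixSAS design.

```python
# Illustrative sketch only: a differentiable mixed-operation block in the spirit
# of Objective 2. All design choices below are hypothetical, not the mixSAS code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedBlock(nn.Module):
    """Selects among candidate operations (convolutions, pooling, identity) with
    architecture weights learned end-to-end by gradient descent."""

    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            # plain 3x3 convolution
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True)),
            # dilated 3x3 convolution
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True)),
            # depthwise-separable convolution
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
                          nn.Conv2d(channels, channels, 1, bias=False),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True)),
            # average pooling and identity (skip)
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Global architecture parameters: one logit per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))
        # Tiny gating head that makes the choice depend on the input sample
        # (the "instance-based" selection mentioned in Objective 2).
        self.gate = nn.Linear(channels, len(self.ops))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = x.mean(dim=(2, 3))                    # (N, C) global average pool
        logits = self.alpha + self.gate(pooled)        # (N, num_ops)
        weights = F.softmax(logits, dim=-1)            # soft, differentiable selection
        outs = torch.stack([op(x) for op in self.ops], dim=1)  # (N, num_ops, C, H, W)
        return (weights[:, :, None, None, None] * outs).sum(dim=1)


if __name__ == "__main__":
    block = MixedBlock(channels=16)
    y = block(torch.randn(2, 16, 32, 32))
    print(y.shape)  # torch.Size([2, 16, 32, 32])
```

At deployment, the soft weights could be hardened (e.g. by keeping only the argmax operation per block) so that a single sub-network runs for each input, which is one plausible reading of the "activate an optimal sub-network at test time" idea in Objectives 1 and 2; the project's actual mechanism may differ.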