The last few years have witnessed tremendous growth in the demand for wireless services and a significant increase in the number of mobile subscribers. A recent data traffic forecast from Cisco reported that global mobile data traffic reached 1.2 zettabytes per year in 2016, and that global IP traffic will increase nearly threefold over the next 5 years. Based on these predictions, a 127-fold increase in IP traffic is expected from 2005 to 2021. It is also anticipated that mobile data traffic will reach 3.3 zettabytes per year by 2021, and that the number of mobile-connected devices will reach 3.5 per capita.
With such demands for higher data rates and better quality of service (QoS), fifth generation (5G) standardization initiatives, whose initial phase was specified in June 2018 under the umbrella of Long Term Evolution (LTE) Release 15, have been under active investigation. In particular, the International Telecommunication Union (ITU) has identified three usage scenarios (service categories) for 5G wireless networks: (i) enhanced mobile broadband (eMBB), (ii) ultra-reliable and low latency communications (uRLLC), and (iii) massive machine type communications (mMTC). The wide variety of applications envisioned for beyond-5G wireless networks has created the need for novel and more flexible physical layer (PHY) technologies that can provide higher spectral and energy efficiency, as well as reduced-complexity transceiver implementations.
We present an image captioning framework that generates captions under a given topic. The topic candidates are extracted from the caption corpus. A given image's topics are then selected from these candidates by a CNN-based multi-label classifier. The input to the caption generation model is an image-topic pair, and the output is a caption of the image. For this purpose, a cross-modal embedding is learned for the images, topics, and captions. In the proposed framework, the topic, caption, and image are organized in a hierarchical structure, which is preserved in the embedding space by using the order-embedding method. The caption embedding is upper bounded by the corresponding image embedding and lower bounded by the topic embedding. The lower bound pushes the images and captions about the same topic closer together in the embedding space. A bidirectional caption-image retrieval task conducted on the learned embedding space achieves state-of-the-art performance on the MS-COCO and Flickr30K datasets, demonstrating the effectiveness of the embedding method. To generate a caption for an image, an embedding vector is sampled from the region bounded by the embeddings of the image and the topic, and a language model then decodes it into a sentence as the output. The lower bound set by the topic shrinks the output space of the language model, which may help the model learn to match images and captions better. Experiments on the image captioning task on the MS-COCO and Flickr30K datasets validate the usefulness of this framework by showing that different topics lead to different captions describing specific aspects of the given image, and that the quality of the generated captions is higher than that of a control model that takes no topic as input. In addition, the proposed method is competitive with many state-of-the-art methods in terms of standard evaluation metrics.
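The bounding idea described above can be illustrated with a minimal sketch. It assumes that "upper bounded" and "lower bounded" mean a coordinate-wise ordering of the embedding vectors and that the bounded region is the axis-aligned box between the topic and image embeddings; the abstract does not spell out either detail. The function names `order_violation` and `sample_between` and the toy vectors are hypothetical, chosen only for illustration, and this is not the authors' implementation.

```python
import numpy as np


def order_violation(lower, upper):
    """Squared size of the coordinates where `lower` exceeds `upper`.

    Returns 0.0 exactly when lower <= upper holds in every coordinate.
    """
    return float(np.sum(np.maximum(0.0, lower - upper) ** 2))


def sample_between(topic, image, rng):
    """Draw one vector uniformly from the box spanned by the two embeddings."""
    low = np.minimum(topic, image)
    high = np.maximum(topic, image)
    return rng.uniform(low, high)


# Toy 3-dimensional embeddings (made-up values, not learned ones).
topic = np.array([0.1, 0.2, 0.0])
image = np.array([0.9, 0.8, 0.5])

rng = np.random.default_rng(0)
caption = sample_between(topic, image, rng)

print(order_violation(topic, caption))  # 0.0: caption respects the topic bound
print(order_violation(caption, image))  # 0.0: caption respects the image bound
print(order_violation(image, topic))    # positive: the order is reversed
```

In a trained model, a penalty of this kind would presumably be part of the loss that arranges topic, caption, and image embeddings in order, and the sampled vector would be passed to the language model for decoding; both steps are outside the scope of this sketch.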