In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayes risk (MBR) criterion, one of the discriminative criteria, into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding.
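To make the MBR idea concrete, the following is a minimal sketch of MBR decoding over an N-best list. It is a generic textbook formulation, not the method of the work summarized above. Each candidate's expected risk is the posterior-weighted word-level edit distance to every hypothesis in the list, and the candidate with the lowest risk is selected. The function names `edit_distance` and `mbr_decode`, and the example hypotheses and posteriors, are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mbr_decode(nbest):
    """Pick the hypothesis with the lowest expected edit distance.

    nbest: list of (tokens, posterior) pairs; posteriors are renormalized
    over the list, which stands in for the full hypothesis space.
    Returns (best_tokens, risks), with one risk value per candidate.
    """
    total = sum(p for _, p in nbest)
    risks = []
    for cand, _ in nbest:
        risk = sum(p / total * edit_distance(ref, cand) for ref, p in nbest)
        risks.append(risk)
    best = min(range(len(nbest)), key=risks.__getitem__)
    return nbest[best][0], risks


# Hypothetical 3-best list with made-up posteriors (illustration only).
nbest = [
    ("the cat sat".split(), 0.40),
    ("the cat sad".split(), 0.35),
    ("a cat sad".split(), 0.25),
]
best, risks = mbr_decode(nbest)
print(" ".join(best), risks)
```

In this toy list the highest-posterior hypothesis is not the MBR choice: the second hypothesis has a lower expected word error because it is closer, on average, to the other candidates.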
Emotional voice conversion (VC) aims to convert a neutral voice to an emotional one while retaining the linguistic information and speaker identity. We note that decoupling emotional features from other speech information (such as content and speaker identity) is the key to achieving promising performance. Some recent attempts at speech representation decoupling on neutral speech do not work well on emotional speech, due to the more complex entanglement of acoustic properties in the latter.
Detection of speech and music signals in isolated and overlapped conditions is an essential preprocessing step for many audio applications. Speech signals have wavy and continuous harmonics, while music signals exhibit horizontally linear and discontinuous harmonic patterns. Music signals also contain more percussive components than speech signals, manifested as vertical striations in the spectrograms.
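The horizontal-versus-vertical distinction described above can be illustrated with median filtering of a magnitude spectrogram, a generic harmonic/percussive separation technique rather than the detector of the summarized work. The sketch below assumes a toy spectrogram stored as a list of rows (frequency bins) by columns (time frames); `median_filter_1d`, `enhance`, and the toy data are hypothetical.

```python
from statistics import median


def median_filter_1d(values, width):
    """Sliding median with windows truncated at the sequence edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def enhance(spec, width=5):
    """Return (harmonic, percussive) enhanced copies of a spectrogram.

    Filtering along time keeps horizontal ridges (sustained harmonics);
    filtering along frequency keeps vertical ridges (percussive onsets).
    """
    n_freq, n_time = len(spec), len(spec[0])
    harmonic = [median_filter_1d(row, width) for row in spec]
    columns = [median_filter_1d([spec[f][t] for f in range(n_freq)], width)
               for t in range(n_time)]
    percussive = [[columns[t][f] for t in range(n_time)]
                  for f in range(n_freq)]
    return harmonic, percussive


# Toy 7x7 spectrogram: one sustained tone (row 2) and one click (column 4).
spec = [[0.0] * 7 for _ in range(7)]
for t in range(7):
    spec[2][t] = 1.0
for f in range(7):
    spec[f][4] = 1.0

harm, perc = enhance(spec)
```

On the toy input, the time-direction filter retains the sustained tone and suppresses the click, while the frequency-direction filter does the opposite.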
A key challenge of image splicing detection is how to localize complete tampered regions without false alarms. Although current forgery detection approaches have achieved promising performance, the completeness of the localized regions and the false alarm rate are often overlooked. In this paper, we argue that insufficient use of the splicing boundary is a main reason for poor accuracy. To tackle this problem, we propose an Edge-enhanced Transformer (ET) for tampered region localization. Specifically, to capture rich tampering traces, a two-branch edge-aware transformer is built to integrate the splicing edge clues into the forgery localization network, generating forgery features and edge features.
In this letter, we propose a novel solution to the problem of single image super-resolution at multiple scaling factors, with a single network architecture. In applications where only a detail needs to be super-resolved, traditional solutions must use as input either the low-resolution detail, thus losing the information about the context, or the whole low-resolution image, from which the desired output detail is then cropped, which is wasteful in terms of computation and storage.
Active reconfigurable intelligent surfaces (RISs) are a novel and promising technology that allows controlling the radio propagation environment while compensating for the product path loss along the RIS-assisted path. In this letter, we consider the classical radar detection problem and propose to use an active RIS to get a second independent look at a prospective target illuminated by the radar transmitter.
Model selection is an omnipresent problem in signal processing applications. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are the most commonly used solutions to this problem. These criteria have been found to perform satisfactorily in many cases and have had a dominant role in the model selection literature since their introduction several decades ago, despite numerous attempts to dethrone them. Model selection can be viewed as a multiple hypothesis testing problem.
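As a reminder of how the two criteria work, the sketch below applies the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of samples, and L the maximized likelihood. It compares a constant model with a straight-line model under an assumed Gaussian noise model. The data, the function names, and the choice of candidate models are hypothetical and are not taken from the work summarized above.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, with variance set to RSS / n."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_lik, k):
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik


def constant_residuals(xs, ys):
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]


def line_residuals(xs, ys):
    """Residuals of the closed-form least-squares straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up data: a line plus small fixed perturbations (illustration only).
xs = list(range(10))
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, -0.15, 0.1, -0.1]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

scores = {}
# k counts the regression coefficients plus the noise variance.
for name, k, fit in [("constant", 2, constant_residuals),
                     ("line", 3, line_residuals)]:
    ll = gaussian_log_likelihood(fit(xs, ys))
    scores[name] = {"aic": aic(ll, k), "bic": bic(ll, k, len(xs))}

selected = min(scores, key=lambda m: scores[m]["bic"])
```

Both criteria trade goodness of fit against the parameter count; BIC penalizes each extra parameter more heavily than AIC once n exceeds about 7, since ln n > 2 there.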
The algorithms based on the technique of optimal
Mask-based lensless cameras offer a novel design for imaging systems by replacing the lens in a conventional camera with a coded mask layer. Each pixel of the lensless camera encodes the information of the entire 3D scene. Existing methods for 3D reconstruction from lensless measurements suffer from poor spatial and depth resolution.
Recently, self-supervised learning (SSL) from unlabelled speech data has gained increased attention in the automatic speech recognition (ASR) community. Typical SSL methods include autoregressive predictive coding (APC), wav2vec 2.0, and hidden unit BERT (HuBERT). However, SSL models are biased toward the pretraining data. When SSL models are finetuned with data from another domain, domain shift occurs and might limit knowledge transfer to downstream tasks.
Speech self-supervised learning has attracted much attention due to its promising performance in multiple downstream tasks, and has become a new growth engine for speech recognition in low-resource languages. In this paper, we exploit and analyze a series of wav2vec pre-trained models for speech recognition in 15 low-resource languages in the OpenASR21 Challenge.
Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
The papers in this special section focus on self-supervised learning for speech and audio processing. A current trend in the machine learning community is the adoption of self-supervised approaches to pretrain deep networks. Self-supervised learning utilizes proxy-supervised learning tasks (or pretext tasks), for example, distinguishing parts of the input signal from distractors or reconstructing masked input segments conditioned on unmasked segments, to obtain training data from unlabeled corpora.
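The masked-reconstruction pretext task mentioned above can be sketched as a data-preparation step: spans of an unlabeled sequence are hidden, and the hidden values become the training targets. The sketch below uses scalar "frames" for brevity, where real systems typically operate on feature vectors. The function name `make_masked_example` and its parameters (`span`, `num_spans`, `mask_value`, `seed`) are hypothetical and do not correspond to any specific model's recipe.

```python
import random


def make_masked_example(frames, span=3, num_spans=2, mask_value=0.0, seed=0):
    """Build one (inputs, targets) pair from an unlabeled sequence.

    Randomly chosen spans are replaced by mask_value in the inputs;
    targets maps each masked index to the original frame, so a model
    can be trained to reconstruct it from the unmasked context.
    Spans may overlap, so the number of masked frames can vary.
    """
    rng = random.Random(seed)
    n = len(frames)
    masked = set()
    for _ in range(num_spans):
        start = rng.randrange(0, n - span + 1)
        masked.update(range(start, start + span))
    inputs = [mask_value if i in masked else x for i, x in enumerate(frames)]
    targets = {i: frames[i] for i in sorted(masked)}
    return inputs, targets


# Stand-in for a sequence of acoustic frames (illustration only).
frames = [float(i + 1) for i in range(12)]
inputs, targets = make_masked_example(frames)
```

No labels are involved: the supervision signal comes entirely from the sequence itself, which is what allows such tasks to use unlabeled corpora.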