Machine Learning Modems: How ML will change how we specify and design next generation communication systems

CTN Issue: March 2018

So, I hear you all ask, are we all out of a job? Are communication engineers another victim of the rise of the machine? Well, the answer in this article is thankfully no, not yet. But Nathan, Ben, and Tim point us toward a fascinating new way to specify and design communication systems that may forever change the way we standardize, design, and field-optimize our products. Time to take that “basics of ML” online course you have been putting off for the last year! But first, read this article. Comments, and recommendations for good classes in ML, are welcome as always in the comments section at the end of the article.

Alan Gatherer,  Editor-in-Chief

Adapting the Wireless Physical Layer from End-to-End using Channel Autoencoders and Deep Learning

Nathan West
Nathan West, Principal Engineer at DeepSig Inc.
Ben Hilburn
Ben Hilburn, Director of Engineering at DeepSig Inc.
Tim O’Shea
Dr. Tim O’Shea, CTO at DeepSig Inc. & Research Scientist at Virginia Tech


The Growth of Degrees of Freedom in Communications Systems

Communications systems have evolved rapidly over the last 40 years, with continual advances in modulation, error correction, and other physical layer techniques to improve system performance.  As algorithm complexity, the number of devices, antennas, and bands, resource allocation and scheduling, the complexity of integrated radio chips, and the total number of system parameters have all continued to steadily increase, the task of globally optimizing across all of these degrees of freedom has become increasingly daunting.

Radio algorithm design today is often a highly complex and compartmentalized process: error correction codes, massive MIMO algorithms, synchronization and estimation routines, and other techniques are typically analyzed one at a time during design and then concatenated to form complete modem systems.  Jointly optimizing complete modem systems across these subcomponents can be very difficult, if not intractable, using strictly analytic probabilistic methods, especially when seeking to account for complex channel models and effects such as hardware non-linearities and structured interference, which aren’t easily captured by traditional noise or fading models.  Even the more rigorous models specified for testing ITU/3GPP systems greatly simplify the effects of fading and do not accurately capture many effects seen in modern deployments [7]; at the same time, they are already prohibitively complex and tedious to work with when attempting to analyze their combined effects on cascades of signal processing functions, and to jointly adapt and optimize each of these subsystems.

In general, these trends have led to very complex communications systems in which we come very close to the single-user Shannon capacity under simple channel assumptions, but have not yet fully attained the performance gains available when all degrees of freedom are considered.  Several key areas with potential for improvement include: computational complexity reductions through merging and approximating existing functions; improved performance by leveraging error feedback through the PHY; better compensation for structured sources of distortion and interference; improved algorithms for maximizing multi-user and multi-antenna capacity; and better scheduling and assignment of limited resources within shared systems.

How to Cope with Complexity: A Learning Approach to Physical Layer Design

Jointly optimizing the full chain of physical layer signal processing algorithms between transmit bits and receive bits (end-to-end optimization), in the presence of realistic radio effects, distortion, interference, and other impairments, has never quite been practical.  Subsets of this problem have been heavily studied, such as joint optimization of receiver functions (e.g. synchronization, equalization, symbol estimation, and decoding) using iterative methods [10] to improve performance, but generally at high computational cost and with relatively simplified effects models.

In contrast, in recent years, primarily in the field of computer vision, end-to-end optimization of highly complex cascades of image processing algorithms based on data and global loss functions has become both tractable and state of the art, largely through the collection of machine learning (ML) techniques collectively known as deep learning.  The joint estimation of entire decision manifolds relating pixel-level vision tasks, object detection and labeling, and regression tasks such as lane tracking has been demonstrated to be feasible and state of the art in applications such as self-driving cars.  This use of data, simulation, and experience to guide the end-to-end synthesis of efficient solutions to highly complex software and algorithmic tasks has been increasingly adopted in a number of critical information processing fields, and was recently termed “Software 2.0” in an eloquently written article [11] by Andrej Karpathy, Director of AI at Tesla.

Two years ago, in [2], we applied this same ideology to the problem of physical layer communications system synthesis, casting the “fundamental problem of communication” [12] as an end-to-end optimized machine learning task: we simply optimize a large nonlinear neural network with many degrees of freedom against a single high-level objective, reconstructing a random input message, transmitted and received over a channel, as well as possible at the output of a receiver.  This task of input reconstruction within a neural network is what's known as an autoencoder (AE).

Autoencoders and Modern Nonlinear Neural Network Structures

Recently in the field of deep learning, autoencoders have been used to accomplish similar-sounding tasks on images, where a series of nonlinear layers of neurons comprising an encoder and a decoder network are optimized to transform each set of input values into a learned hidden representation (typically of lower dimension), and then back into a reconstruction of the original input values.  When the hidden-layer representation takes the form of a set of distributions rather than values, this is referred to as a variational autoencoder (VAE).  These approaches can have many nice properties, such as denoising effects which remove non-structural components of an input example, and can produce very useful nonlinear sparse representations of signals, both for compression and for learning latent features which represent complex higher-dimensional properties of the input examples.  Neural network architectures used within the encoder and decoder networks vary widely, but in their simplest form they comprise a set of sequential multiply-accumulate operations with weight (w) and bias (b) values, interleaved with nonlinear activation functions such as the rectified linear unit (given by f(x) = max(0, x)).  One simple such network is shown below, where several fully connected layers propagate values “forward” from input to output using a set of network parameters (θ), comprised in this case of w and b, which are updated iteratively during training.
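To make this concrete, the forward pass of such a network can be sketched in a few lines of plain Python. This is a toy sketch with made-up layer sizes and random, untrained parameters, not a production implementation:

```python
import random

def relu(x):
    # rectified linear unit: f(x) = max(0, x), applied element-wise
    return [max(0.0, v) for v in x]

def dense(x, w, b):
    # one fully connected layer: out_j = sum_i x[i] * w[i][j] + b[j]
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

random.seed(0)
# tiny autoencoder shape: 4 inputs -> 3 hidden units -> 4 outputs
w1 = [[random.gauss(0, 0.1) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 3
w2 = [[random.gauss(0, 0.1) for _ in range(4)] for _ in range(3)]
b2 = [0.0] * 4

x = [1.0, -0.5, 0.25, 0.0]            # example input values
hidden = relu(dense(x, w1, b1))       # learned hidden representation
y = dense(hidden, w2, b2)             # reconstruction of x (untrained here)
```

Here θ is simply the collection {w1, b1, w2, b2}; training (described next) adjusts these values so that y approximates x.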

The process of training the network parameters involves computing a loss function (L), which in this case measures a distance between input values and output values, and then minimizing this loss with respect to all of the parameters in the network.  This can be accomplished using stochastic gradient descent (SGD), or one of its many variations (such as Adam [13]) which often converge more quickly for large networks.  The distance metric is often a mean squared error (MSE), but for categorical problems, such as the choice of transmitted codewords or other discrete classes, cross-entropy can be an effective choice when combined with an appropriate output activation function (e.g. SoftMax).  Training proceeds over a series of iterations or epochs, at each step computing the gradient dL/dθ for each element of θ and updating its value with some learning rate (η), so that the parameters at epoch i are given by θ_i = θ_{i-1} - η·dL/dθ.  This gradient can be computed using the chain rule backwards through the network, and is often referred to as a “backwards” pass.  Many of these fundamentals have been unchanged in neural networks for many years, but recent advances in computational power, SGD efficacy, regularization, and nonlinearity choice have compounded to make this training process feasible on much larger networks.  The use of all of these modern enhancements together on larger networks is now typically termed deep learning.
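For intuition, the update rule above can be demonstrated on a toy one-parameter loss, here L(θ) = (θ - 3)², whose gradient is known in closed form. This is a hedged illustration only; real networks have millions of parameters and estimate gradients from mini-batches of data:

```python
# Gradient descent on L(theta) = (theta - 3)^2, with dL/dtheta = 2*(theta - 3)
theta = 0.0       # initial parameter value
eta = 0.1         # learning rate
for epoch in range(100):
    grad = 2.0 * (theta - 3.0)    # dL/dtheta at the current theta
    theta = theta - eta * grad    # theta_i = theta_{i-1} - eta * dL/dtheta
# theta converges geometrically toward the minimizer theta = 3
```

In a full network the same update is applied to every element of θ, with each gradient obtained via the chain-rule backwards pass rather than analytically.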

The Channel Autoencoder

We construct a channel autoencoder by inserting a channel model, representative of the impairments in a communication system, into the hidden layer of a traditional autoencoder or variational autoencoder, and by choosing a set of bits or codewords (s), which comprises our desired message to send and reconstruct, as our input and output.  By structuring an autoencoder in this way, training forces the network to learn a transmit signal representation (x) on some basis (I/Q samples, OFDM carriers, etc.) which can be best decoded from its received form (y) to estimate the value of s.  In contrast to typical autoencoder usage, we may not care about compressing s to form x; instead we simply seek a representation suitable for RF transmission and resilient to the effects present on the relevant channel (e.g. noise, fading, offsets).  In this way, we can jointly optimize the full encoding or modulation process along with the decoding or demodulation process in an end-to-end fashion which simply seeks to reconstruct s as well as possible, thereby minimizing the bit or symbol error rate of the system.
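The pipeline described above can be sketched in plain Python as follows. This is an untrained toy model with hypothetical layer sizes: a one-hot message s passes through a random encoder to a power-normalized I/Q symbol x, through an AWGN channel to y, and through a random decoder to a probability distribution over messages; training would minimize the resulting cross-entropy loss via SGD:

```python
import math
import random

random.seed(1)
M = 32      # 5-bit messages: s in {0, ..., 31}
N_IQ = 2    # transmit representation x: one complex (I/Q) sample

def matvec(w, x, b):
    # fully connected layer: out_j = sum_i w[j][i] * x[i] + b[j]
    return [sum(wj[i] * x[i] for i in range(len(x))) + b[j]
            for j, wj in enumerate(w)]

def rand_layer(n_out, n_in):
    w = [[random.gauss(0, 0.3) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

enc_w, enc_b = rand_layer(N_IQ, M)   # encoder: one-hot s -> I/Q symbol x
dec_w, dec_b = rand_layer(M, N_IQ)   # decoder: received y -> logits over s

def transmit(s):
    one_hot = [1.0 if i == s else 0.0 for i in range(M)]
    x = matvec(enc_w, one_hot, enc_b)
    # average-power normalization so the channel SNR is well defined
    p = math.sqrt(sum(v * v for v in x) / len(x)) or 1.0
    return [v / p for v in x]

def awgn(x, sigma):
    # additive white Gaussian noise channel: y = x + n
    return [v + random.gauss(0, sigma) for v in x]

def receive(y):
    logits = matvec(dec_w, y, dec_b)
    # SoftMax: probability that each candidate message s was sent
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    z = sum(e)
    return [v / z for v in e]

s = 7
probs = receive(awgn(transmit(s), sigma=0.1))
loss = -math.log(probs[s])   # cross-entropy loss minimized during training
```

With N_IQ = 2 the encoder can only learn a single-symbol constellation; larger transmit representations (blocks of samples or OFDM carriers) follow the same pattern with wider layers.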

The general form of this channel autoencoder is shown above, and described in significantly more depth in [2].  In its simplest form, the channel effects module may comprise only additive white Gaussian noise; however, highly nonlinear or stochastic functions and effects, such as amplifier approximation models or interference sources, can be modeled here, leveraging a wealth of in-depth domain knowledge on wireless propagation effects and how to model them using digital signal processing (making in-depth wireless and DSP domain knowledge still an important and central skill for working with such systems, even when embracing ML heavily to optimize them).  There is, however, a caveat: for back-propagation and joint end-to-end SGD optimization of such a network, the channel effects expressions must be differentiable, so that both forward and backward passes through the channel and networks can be readily computed during optimization.
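The differentiability caveat is easy to see for the AWGN case: with the noise realization held fixed, the channel is just an additive shift, so its Jacobian is the identity and gradients from the receiver's loss flow back to the encoder unchanged. A small numerical sketch, using a finite difference in place of automatic differentiation:

```python
import random

random.seed(2)

def awgn(x, noise):
    # AWGN channel with a fixed noise realization: y = x + n
    return [xi + ni for xi, ni in zip(x, noise)]

# For a fixed noise draw, dy_i/dx_i = 1, so SGD can reach the encoder
# parameters through the channel. Verify numerically:
x = [0.5, -1.2]
noise = [random.gauss(0, 0.1) for _ in x]
eps = 1e-6
y0 = awgn(x, noise)
y1 = awgn([x[0] + eps, x[1]], noise)
deriv = (y1[0] - y0[0]) / eps   # numerically ~ 1.0
```

Effects like amplifier nonlinearities or phase noise must likewise be expressed as differentiable functions of x for this gradient path to exist; truly black-box channels are what motivate the gradient-free and adversarial approaches discussed later.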

Below we show one such solution, using a channel autoencoder to find a hidden representation, or modulation/encoding, for a 5-bit message s over a slightly nonlinear channel with Gaussian noise, AM/AM and AM/PM distortion, and phase noise within the channel model.

This solution is elegant in that it converges to a very effective solution very quickly, using a relatively simple SGD-based approach, for a complicated set of channel impairments for which arriving at an exact optimal solution with more traditional analytic optimization methods would be extremely cumbersome and time consuming.  It's also important to note that, in the grand scheme of impairments faced over a real wireless channel, this is still a very simple channel model: it considers only single-symbol optimization (as opposed to blocks of symbols or larger codewords) and a single transmit and receive antenna, and does not include any interferers or other effects.

This approach, however, opens up a whole new set of problems: how do I choose my network architecture and hyper-parameter settings, how do I optimally train such a network using SGD, at which signal-to-noise ratio and state of the stochastic impairment models should I conduct my training, and so forth.  These are largely still open problems which are unlikely to have exact solutions in the near term, although many of them can be addressed quite practically using approximate and iterative solutions.  Many of these problems are shared across the entire field of machine learning, with best practices that can be followed and rigorous statistical analysis applied across many domains of application, but a number of them are communications-specific and pose excellent open research problems within our field.

There are, however, a number of very nice new properties to this solution as well.  First, the network’s computational complexity is defined by the size of the network, so accuracy and computational performance can be quite readily traded off through architecture selection.  This computational complexity is also typically lower, since the network approximates an end-to-end manifold, or solution to the mapping problem, rather than partitioning it into rigid, human-interpretable interfaces between cascaded functions (for example, distinct AWGN-centric constellation mapping and predistortion stages).  Finally, the solution it learns is inherently feedforward and parallel: network values can be computed with data parallelism, and no iteration or loops are needed, helping it map rapidly onto low-clock-rate, energy-efficient concurrent architectures while ensuring that low, fixed-latency algorithms emerge.

Scaling Complexity and Solving Remaining Problems

This approach to physical layer design has begun to receive much wider interest and application over the past year.  The works in [1-5,8-9,14-15], among others, further this basic approach and consider the increased complexity of multi-antenna systems, the ability to learn to precode based on knowledge of channel state information, performance over highly nonlinear or non-traditional channel mediums, and how such a fundamental system approach can be realized, vetted, and optimized in over-the-air deployed configurations.  Each of these considers a new set of channel effects and system constraints appropriate to a specific communications system, and requires significant domain knowledge and modeling to construct, but leverages largely the same fundamental approach: optimizing end-to-end, across all of the degrees of complexity introduced into the model, to find a globally optimized and tailored encoding solution which maximizes performance over the channel.

Numerous challenges and opportunities remain in this relatively new and wide-open area of investigation.  Tuning systems over the air and over black-box channel functions, for which no closed-form analytic expression (or corresponding gradient) is known, remains a challenge, although we have proposed several solutions to this.  Error feedback and the orchestration of training in distributed, deployed systems remain challenges for online adaptation.  Scaling learned codeword sizes up to very large blocks, establishing best practices for architecture selection and training procedures, and many other areas remain important for the maturation, ultimate deployment, and full realization of this class of system.

The capability of this approach to produce approximate solutions for quite complex combinations of impairment effects is, however, already relevant to many possible communications system applications.  Machine learning is not replacing traditional communications and digital signal processing any time soon, if ever, as domain knowledge, understanding, and effects modeling are absolutely critical to its effective use.  However, ML is rapidly becoming a mandatory and central skill set for any communications engineer seeking to optimize complex real-world systems, as it is for virtually every other quantitative field on earth.  While exact solutions to simplified models have long been the prevailing approach for many problems in communications, it is important to repeat, as we trade off model complexity for solution exactness and continue to build better and better approximate systems, the timeless mantra of George Box: “all models are wrong, but some are useful.”  Even exact solutions to communications problems have always been wrong, but as we closely capture, model, and optimize for more and more of the effects truly present in nature, the resulting system solutions will continue to become more and more useful.

References

  1. T. O’Shea, T. Roy, N. West, and B. Hilburn, “Physical Layer Communications System Design Over-the-Air Using Adversarial Networks,” under submission.
  2. O’Shea, Timothy, and Jakob Hoydis. "An introduction to deep learning for the physical layer." IEEE Transactions on Cognitive Communications and Networking 3.4 (2017): 563-575.
  3. H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Communications Letters, vol. 7, no. 1, pp. 114 -117, February 2018.
  4. A. Felix, S. Cammerer, S. Dorner, J. Hoydis, and S. ten Brink, “OFDM-Autoencoder for End-to-End Learning of Communications Systems,” submitted to SPAWC 2018.
  5. S. Dorner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep Learning Based Communication Over the Air,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132-143, February 2018.
  6. E. Björnson, “A View of the Way Forward in 5G From Academia,” ComSoc Technology News https://www.comsoc.org/ctn/view-way-forward-5g-academia
  7. H. Asplund, K. Larsson and P. Okvist, "How Typical is the "Typical Urban" channel model?," VTC Spring 2008 - IEEE Vehicular Technology Conference, Singapore, 2008, pp. 340-343. https://www.ericsson.com/assets/local/publications/conference-papers/typical_urban_channel_model.pdf
  8. Lee, Hoon, Inkyu Lee, and Sang Hyun Lee. "Deep learning based transceiver design for multi-colored VLC systems." Optics express 26.5 (2018): 6222-6238
  9. Huang, Sihao, and Haowen Lin. "Fully Optical Spacecraft Communications: Implementing an Omnidirectional PV-Cell Receiver and 8Mb/s LED Visible Light Downlink with Deep Learning Error Correction." arXiv preprint arXiv:1709.03222 (2017).
  10. Wymeersch, Henk. Iterative receiver design. Vol. 234. Cambridge: Cambridge University Press, 2007.
  11. Karpathy, Andrej, “Software 2.0”, https://medium.com/@karpathy/software-2-0-a64152b37c35
  12. Shannon, Claude Elwood. "A mathematical theory of communication." Bell System Technical Journal 27.3 (1948): 379-423.
  13. Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
  14. O'Shea, Timothy J., Tugba Erpek, and T. Charles Clancy. "Deep learning based MIMO communications." arXiv preprint arXiv:1707.07980 (2017).
  15. Farsad, Nariman, and Andrea Goldsmith. "Detection algorithms for communication systems using deep learning." arXiv preprint arXiv:1705.08044 (2017).


Statements and opinions given in a work published by the IEEE or the IEEE Communications Society are the expressions of the author(s). Responsibility for the content of published articles rests upon the author(s), not the IEEE nor the IEEE Communications Society.