Petar Popovski, Editor in Chief of IEEE JSAC
Published: 25 Jan 2022
From January 2022, IEEE JSAC starts to publish the blog “Selected Ideas in Communications”. In the initial period the blog posts will appear monthly. Each post presents one article chosen from the JSAC issue published in the current month. Besides highlighting the value and the research contribution of the article, the blog aims to reflect the nuanced nature of research and of the reviewing process: under the tip of every successfully published article lies a hidden iceberg of doubts and imperfections, often pointed out by the reviewers. In this spirit, the authors of the article are asked to reflect upon the potential weak points of their work and to comment on the criticism they received during the review process.
Live Wireless Streaming with Transformers
The January 2022 issue of IEEE JSAC is the fourth issue of the Series on Machine Learning in Communications. From the many high-quality papers that have found creative ways to apply the methods of Machine Learning (ML) to communications, the blog highlights the following work:
S. Wang, S. Bi and Y. -J. A. Zhang, "Deep Reinforcement Learning With Communication Transformer for Adaptive Live Streaming in Wireless Edge Networks," in IEEE Journal on Selected Areas in Communications, vol. 40, no. 1, pp. 308-322, Jan. 2022, doi: 10.11
The paper deals with live wireless streaming, a scenario that came to prominence with the emergence of social media and streaming services such as TikTok and Twitch. Live streaming through social media upends the long-standing assumption that downlink traffic requires far more resources than the uplink, as the mobile user now produces content rather than consuming it. The setup treated in this paper goes a step further and considers the end-to-end uplink and downlink transmission, with intermediate transcoding of the video content at an edge server. The scenario is shown in Figure 1, where a user sends a live video stream in the uplink and a group of followers is connected to the same Base Station, which features Mobile Edge Computing (MEC). Two major sources of randomness make this scenario challenging:
- The dynamics of the wireless links of the streamer and the followers lead to variable and unpredictable communication rates.
- The group of followers changes over time, as mobile users may join and leave the live video streaming session unpredictably.
The edge server receives the frames from the streamer and transcodes them to match the quality requests and screen resolutions of the followers. This transcoding consumes computing energy, which can be controlled through the choice of CPU frequency. The objective is to design an online algorithm that balances two requirements: meeting the streaming quality expectations of the followers while minimizing the power consumption at the provider. With so many random factors with unknown statistics, and so many levers to control, the solution invokes reinforcement learning.
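The trade-off between follower experience and provider energy can be sketched as a per-step reward. The sketch below is illustrative only, not the paper's actual model: it assumes a logarithmic QoE utility per follower and the cubic CPU power model common in the MEC literature, and the names `w` and `kappa` are hypothetical parameters.

```python
# Hedged sketch of the quality-vs-power trade-off (not the paper's exact model).
# Assumptions: per-follower QoE grows with delivered bitrate (log utility is a
# common proxy), and transcoding power grows cubically with CPU frequency.
import math

def step_reward(follower_bitrates_mbps, cpu_freq_ghz, w=0.5, kappa=0.1):
    """Reward = total follower QoE minus weighted computing power.

    follower_bitrates_mbps: bitrate actually delivered to each follower
    cpu_freq_ghz: transcoding CPU frequency chosen by the controller
    w: QoE-vs-energy trade-off weight (illustrative)
    kappa: effective switched-capacitance coefficient (illustrative)
    """
    qoe = sum(math.log(1.0 + r) for r in follower_bitrates_mbps)
    power = kappa * cpu_freq_ghz ** 3   # dynamic CPU power ~ kappa * f^3
    return qoe - w * power

# A higher CPU frequency transcodes faster but costs more energy, so for the
# same delivered bitrates the reward is lower:
r_low  = step_reward([2.0, 4.0], cpu_freq_ghz=1.0)
r_high = step_reward([2.0, 4.0], cpu_freq_ghz=3.0)
```

An online controller would pick the CPU frequency (and transcoding bitrates) each step so that the cumulative reward is maximized under the unknown channel and follower dynamics.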
However, Deep Reinforcement Learning (DRL) cannot be applied straightforwardly due to the changing group of followers. Instead, the authors propose a framework based on actor-critic learning and the use of transformers. Transformers became popular in machine learning through their extraordinary performance in Natural Language Processing (NLP). In this paper the authors apply them in a thoughtful way to the communication problem, capturing the dynamics of the network conditions and user requirements. Their key insight is that a few semantic tokens can describe the current condition of the network and its direction of change. The results demonstrate a clear superiority of the proposed approach over baseline schemes.
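Why self-attention copes with a changing follower group can be seen in a minimal sketch: the same weights process a set of per-user tokens regardless of how many users are present. The dimensions and names below are illustrative, not the paper's architecture.

```python
# Minimal single-head self-attention over per-user tokens (illustrative).
# The same learned matrices handle 3 followers or 12, because attention
# operates on a set of tokens rather than a fixed-size input vector.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # token dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(tokens):
    """Self-attention over n per-user tokens of shape (n, d)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))   # (n, n) pairwise weights
    return scores @ v                        # (n, d) encoded state

for n in (3, 7, 12):                    # follower count may change per step
    out = attend(rng.standard_normal((n, d)))
    assert out.shape == (n, d)
```

A plain feed-forward actor or critic would instead need a fixed input size and thus retraining whenever followers join or leave.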
Reflections by the Authors
JSAC: What do you think is the weak point of your approach?
I think scalability is a weak point that we would like to improve in the future. To capture the multi-scale temporal correlation of heterogeneous system features, we propose to use a Transformer instead of a multi-layer perceptron or recurrent neural networks in the actor and critic modules of DRL. However, the complexity of the proposed Transformer does not scale well in systems with a large number of users. First, the complexity of each layer of the Transformer scales as O(n²), where n is the number of users. Moreover, the number of layers increases with n due to the higher-dimensional input.
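The quadratic term can be seen directly by counting the pairwise score entries an attention layer computes over the user tokens; the toy count below (with a hypothetical head count) is purely illustrative.

```python
# Toy illustration of the O(n^2) per-layer attention cost: each head forms
# an n x n matrix of pairwise attention scores over n user tokens.
def attention_score_entries(n_users, n_heads=4):
    """Number of pairwise score entries one attention layer computes."""
    return n_heads * n_users * n_users

# Doubling the number of users quadruples the per-layer score count:
small = attention_score_entries(50)     # 4 * 50 * 50
large = attention_score_entries(100)    # 4 * 100 * 100
```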
Moreover, the proposed approach considers a two-tier edge computing system, where the transcoder is deployed at the edge server. Moving forward, we see surging interest in multi-tier computing architectures, where computing tasks can be processed anywhere in the device, fog, and cloud computing continuum. Notably, learning where to deploy the transcoder in the edge/cloud is a hard task for the proposed DRL approach. Our preliminary results suggest a large variance of the reward, which makes it hard for the proposed learning process to converge.
JSAC: What has been criticized by the reviewers during the review process and how did you address that?
The reviews were in general very positive. The reviewers paid particular attention to the design of the Transformer. One reviewer asked: "What is the physical meaning of the tokenizer and tokens, and why they are needed for transformer design." Another asked us to “clarify its potential applications to general communication problems, such as in B5G/6G communication systems”.
Our response to the first comment: “In NLP, tokenization is a way of separating a piece of text into smaller units called tokens. After tokenization, the text is embedded as a vector. The tokens are regarded as the minimal semantic unit and are fed to the multi-head attention layers for feature extraction.”
To address the second comment, we note that a key challenge in Transformer learning for general communication problems is to design communication tokens as the minimal semantic unit. For example, tokenization of structured data (e.g., channel gain, bandwidth) is totally different from tokenization of unstructured data (e.g., text in NLP, images in CV). The proposed Transformer comprises a tokenizer, namely group layer normalization, to create the communication tokens, and a Transformer to encode the tokens into an encoded state. In Section VII of the paper, we briefly discuss some potential applications of the Transformer.
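The tokenization idea for structured data can be sketched as follows, assuming (as a simplification of the group layer normalization described above, not the paper's exact implementation) that each group of heterogeneous features is normalized independently so that groups on very different physical scales become comparable tokens.

```python
# Hedged sketch of a group-wise normalization tokenizer for structured
# communication features. Each feature group (e.g., channel gains in dB,
# bandwidths in MHz) is normalized separately to zero mean and unit variance,
# yielding comparable "communication tokens" for the Transformer.
import numpy as np

def group_layer_norm(feature_groups, eps=1e-5):
    """Normalize each feature group independently into a token."""
    tokens = []
    for g in feature_groups:
        g = np.asarray(g, dtype=float)
        tokens.append((g - g.mean()) / np.sqrt(g.var() + eps))
    return tokens

# Features on very different scales become comparable after tokenization:
channel_gains_db = [-80.0, -95.0, -70.0]
bandwidth_mhz    = [10.0, 20.0, 40.0]
tok_gain, tok_bw = group_layer_norm([channel_gains_db, bandwidth_mhz])
```

Without such group-wise normalization, features with large numeric ranges would dominate the attention scores regardless of their semantic importance.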
This paper is a humble attempt to apply Transformer techniques to mobile edge network problems. We hope our work can shed light on utilizing advanced ML/NLP/CV techniques in the communications community.
Statements and opinions given in a work published by the IEEE or the IEEE Communications Society are the expressions of the author(s). Responsibility for the content of published articles rests upon the author(s), not the IEEE or the IEEE Communications Society.