FPGA Prototyping of a STAR-Based Time-Delay Estimator for 5G Radio Access

Haithem Haggui\(^1\), Faouzi Bellili\(^2\) and Sofiène Affes\(^3\)

\(^1\)INRS-EMT, 800, De La Gauchetière Ouest, Suite 6900, Montréal, QC, Canada
\(^2\)University of Toronto, 10 Kings College Road Toronto, ON, M5S 3G4, Canada

Emails: {haggui, affes}@emt.inrs.ca, faouzi.bellili@utoronto.ca

Abstract—Code-domain non-orthogonal multiple access (NOMA), a much more sophisticated and efficient generalization of code division multiple access (CDMA), is a promising candidate for future 5G transceivers. Precisely, the CDMA Spatio-Temporal Array-Receiver (STAR) transceiver lends itself to very flexible reconfiguration and adaptation to most NOMA-type radio access technologies. It also possesses among other assets extremely high temporal synchronization capabilities. In this paper, we tackle the hardware feasibility of STAR and provide a proof of concept for a STAR-based time-delay estimator (TDE) through an FPGA-based real-time operational prototype running on a MiniBee Software-Defined-Radio (SDR) platform. We propose a modular, versatile, and reconfigurable architecture for the most basic and simplest “canonic” version of STAR at very low usage of FPGA resources, thereby paving the way for the quick implementation of extended and more complex configurations of this powerful transceiver. The real-time performances of the new prototype in terms of time-delay tracking accuracy compared to the original reference MATLAB version confirm the high precision and robustness of our new design to quantization errors and to all other hardware implementation imperfections.

Index Terms: STAR, TDE, post-correlation model (PCM), 5G, NOMA, CDMA, FPGA, SDR.

I. INTRODUCTION

Massive connectivity and low latency are key features of the emergent hyper-connected society. Future 5G wireless networks need to satisfy these requirements through high data rates and bandwidth efficiency. Furthermore, the ability to cope with severe multipath and interference effects within such high-density networks is highly sought in potential candidates for 5G radio access technologies. In this context, non-orthogonal multiple access (NOMA) [1], a much more sophisticated and efficient generalization of CDMA [2], stands out today as a promising alternative to the orthogonal multiple access (OMA) widely adopted in 4G, due to its improved spectral efficiency, its low transmission latency [3], and its robustness to multipath effects [4]. NOMA schemes share all of the available time/frequency resources and rely on either power-domain or code-domain multiplexing. The latter requires high synchronization performance and accurate time-delay (TD) estimation. From this perspective, STAR [5] lends itself to very flexible reconfiguration and adaptation to most NOMA-type radio access technologies [2]. It also possesses, among other assets, extremely high temporal synchronization capabilities [6]. In this work, we study the hardware feasibility of STAR by producing a hardware prototype of this extremely powerful and promising transceiver that i) supports all signal combining techniques, including the most sophisticated schemes among today’s resurgent interference cancellation (IC) techniques, ii) easily adapts to almost all multi-carrier, multi-code, multi-rate, Rx/Tx multi-antenna, and diversity vs. multiplexing configurations, and iii) inherently operates simultaneously as one of the most accurate online channel sounders [2, 7, 8, 9].

Unlike [10] which only gives an overview of the resource requirements for hardware implementation, in this paper we will go through the whole physical implementation process to come up with an FPGA-based real-time operational prototype for a STAR-based TDE.

Modern software-defined radios (SDRs) lend themselves to efficient prototyping of wireless transceiver designs. An SDR combines excellent FPGA computing capabilities with integrated reconfigurable RF cards. Furthermore, recent SDR platforms such as BeeCube’s [11] come with their own integrated user-friendly environments. They allow user-defined VHDL blocks to be used side by side with the FPGA builder’s predefined IP-core libraries.

This paper is structured as follows. In Section II, we give a brief overview of the STAR receiver structure with a particular emphasis on its TD tracking building block, which stands out as both the theoretical and experimental cornerstone of our hardware prototype. Then, in Section III, we go through the physical implementation process. Section IV is entirely dedicated to the prototype’s HDL design, the characteristics of the proposed hardware architecture, and the design synchronism challenges tackled therein. In Section V, we analyze the FPGA resource usage and timing performance. In Section VI, we assess the accuracy of TD tracking in real time, before concluding in Section VII.

II. STAR OVERVIEW

What is original about the STAR receiver [5], the cornerstone of our online TDE, is its ability to perform blind channel identification and equalization simultaneously at an attractively low order of complexity, making it highly suitable for hardware implementation. The synchronization performance of STAR was already demonstrated to surpass widely used state-of-the-art techniques [4], particularly in scenarios with high Doppler and fast multipath delay drifts. This assessment was made over real-world radio-channel measurements [6], and it is mainly due to STAR’s simple and intuitive spatio-temporal data formulation, called the post-correlation model (PCM). The idea behind PCM consists in applying the correlation or matched filtering step and collecting its output over the symbol period before block-processing the resulting post-correlation observation data. This key pre-processing step significantly reduces intersymbol interference (ISI), thereby revealing the spatio-temporal structure of the channel. Exploiting such valuable information about the channel structure is of high interest for both the TD acquisition and tracking steps, as shown by software simulations in the original work on STAR [5] and its enhanced version [4].

Since we are interested in exploiting STAR as a TDE technique, we will focus in the rest of this paper on the accuracy of its TD estimation and tracking capabilities. We consider a CDMA system with $M$ receiving antennas in which a spread differential binary phase-shift-keying (DBPSK) sequence $b_n$ is transmitted over $P$ paths in a multipath fading environment. The processing gain is defined as $L = T/T_c$ where $T$ is the bit duration and $T_c$ is the chip pulse duration. The block diagram in Fig. 1 illustrates STAR’s building blocks, upon which our hardware design entirely relies. All of its operations are detailed in Table I.

Therein, PCM is obtained by despreading the observation vector $\tilde{Y}_n$ and yields the underlying data formulation, wherein $\tilde{H}_n$ is the space-time propagation vector of the channel, $\tilde{s}_n$ is the useful signal component, and $\tilde{N}_n$ is an additive noise term. A rough estimate of the channel, denoted $\hat{H}_{n+1}$, is then tracked by an unconstrained least-mean-square (LMS)-type decision feedback identification (UDFI) scheme described by the UDFI equation in Table I, where $\tilde{b}_n$ is the hard version of the soft symbol estimate $\hat{s}_n$. Owing to the manifold structure of $\tilde{H}_{n}$ revealed by PCM, the constrained decision feedback identification (CDFI) equation provides a structure-fitted or regularized estimate of the channel $\hat{H}_{n+1}$ to be used, afterwards, in maximum-ratio combining (MRC) for symbol detection.
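
As a rough illustration of the decision-feedback loop described above, the following sketch combines an MRC soft estimate, a hard decision, and an LMS-type channel update on a post-correlation vector. It is only a sketch under stated assumptions: the function names (`mrc_soft_estimate`, `lms_df_update`, `track`) are hypothetical, the soft estimate is normalized by the squared norm of the current channel estimate, and the hard decision is fed back in the update. The exact equations of Table I may differ, and the structure-fitting (CDFI) step and the differential decoding of DBPSK are left out.

```python
def mrc_soft_estimate(h, z):
    """Maximum-ratio combining: project z onto h, normalized by ||h||^2 (assumed scaling)."""
    num = sum((hk.conjugate() * zk).real for hk, zk in zip(h, z))
    den = sum(abs(hk) ** 2 for hk in h)
    return num / den


def lms_df_update(h, z, b_hat, mu):
    """LMS-type decision-feedback step: h + mu * (z - h * b_hat) * b_hat."""
    return [hk + mu * (zk - hk * b_hat) * b_hat for hk, zk in zip(h, z)]


def track(Z, h0, mu):
    """Run the loop over a sequence Z of post-correlation vectors.

    Returns the final channel estimate and the list of hard decisions.
    """
    h, decisions = list(h0), []
    for z in Z:
        # Hard decision on the MRC soft estimate, then feed it back.
        b_hat = 1.0 if mrc_soft_estimate(h, z) >= 0 else -1.0
        h = lms_df_update(h, z, b_hat, mu)
        decisions.append(b_hat)
    return h, decisions
```

With correct decisions, each update moves the estimate a fraction `mu` of the way toward the instantaneous observation `z * b_hat`, so `mu` trades tracking speed against noise averaging.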

This “structure fitting” step is performed through an analysis/synthesis paradigm in which, following a parametric approach, the overall channel is first expressed explicitly in terms of various other parameters that physically describe the propagation medium. Those parameters are then estimated and plugged into the parametric model in order to reconstruct/synthesize the overall channel. Analysis starts with a space-time separation (STSEP) step in which the spatial dimension of the channel, $\hat{J}_{n+1}$, is isolated from its time-response matrix $D_{n}$. The latter is then updated as detailed in the time matrix update (TMU) equation in Table I. Time-delays are then updated using linear regressions of the phase variations between columns of the column-by-column FFT of $D_{n}$ and $\hat{D}_{n}$, as explained in the TD update (TDU) equations in Table I. The parameters $\mu$, $\eta$ and $\xi$ are adaptation step-sizes.
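
The principle behind a phase-based delay update can be sketched as follows: a delay shift of $\delta$ chips multiplies bin $f$ of a length-$L$ DFT by $e^{-j2\pi f\delta/L}$, so a linear regression of the phase difference against the frequency index recovers $\delta$. The code below is a minimal illustration of that idea on a single column, not the TDU equations of Table I. The helper names (`dft`, `band_limited_pulse`, `delay_offset`), the test pulse, and the number of regression bins are assumptions made for the example.

```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform (stands in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def band_limited_pulse(tau, n, n_bins=4):
    """Hypothetical test column: periodic band-limited pulse delayed by tau samples."""
    weights = {f: math.exp(-(f / 3.0) ** 2) for f in range(-n_bins, n_bins + 1)}
    return [sum(w * cmath.exp(2j * math.pi * f * (k - tau) / n)
                for f, w in weights.items())
            for k in range(n)]


def delay_offset(col_ref, col_new, n_bins=4):
    """Delay of col_new relative to col_ref (in samples) from the phase slope."""
    n = len(col_ref)
    f_ref, f_new = dft(col_ref), dft(col_new)
    freqs = list(range(-n_bins, n_bins + 1))
    phases = [cmath.phase(f_new[f % n] * f_ref[f % n].conjugate()) for f in freqs]
    # Least-squares line through the origin: slope = sum(f * phi) / sum(f^2).
    slope = sum(f * p for f, p in zip(freqs, phases)) / sum(f * f for f in freqs)
    return -slope * n / (2 * math.pi)
```

Restricting the regression to the low-frequency bins where the pulse carries energy keeps the phase differences small enough to avoid wrapping for sub-chip delay updates.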

III. PHYSICAL IMPLEMENTATION

A. System Specifications

As a starting configuration, our hardware design will embody a “canonic” single-input single-output ($M = 1$) system transmitting a DBPSK sequence $b_n$ over $P = 1$ path with a linearly-varying time delay $\tau$ following a drift $d\tau/dt$ of 0.049 ppm and initially set at $10^7T_c$. The maximum number of resolvable paths is fixed to $P_{\text{max}} = 5$ and the processing gain is $L = 32$. We adopt a Gaussian channel model to generate, offline, the observation vector $Y_n$ corresponding to $50\times10^3$ symbols. These observations are stored in the FPGA RAMs once the device is programmed, then fed to the design in real time while TD tracking is being processed. In this tracking scenario, we set the SNR to 6 dB, the tracking frequency to $n_{1D} = 4$ symbols and the smoothing factors to $\mu = 0.0498$, $\eta = 0.1221$ and $\xi = 0.5$.
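
To give a feel for the scale of the drift in this scenario, the sketch below generates a linearly drifting delay sampled once per symbol. It assumes that a drift of 0.049 ppm means the delay changes by $0.049 \times 10^{-6}$ chips per elapsed chip; the function name `delay_trajectory` and the initial delay passed to it are placeholders chosen for illustration.

```python
def delay_trajectory(tau0_chips, drift_ppm, L, n_symbols):
    """Delay (in chips) at each symbol index for a linear drift.

    Assumption: drift_ppm is expressed in chips of delay per million
    elapsed chips, and each symbol spans L chips.
    """
    per_symbol = drift_ppm * 1e-6 * L
    return [tau0_chips + per_symbol * n for n in range(n_symbols)]


# Placeholder initial delay of 0 chips; L = 32 and 50e3 symbols as in the text.
trajectory = delay_trajectory(0.0, 0.049, 32, 50_000)
total_drift = trajectory[-1] - trajectory[0]   # about 0.078 chips over the run
```

Under this interpretation, the delay moves by less than a tenth of a chip over the whole run, so the tracker must resolve changes far smaller than one chip period.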

Our hardware design should be highly reconfigurable to ensure an easy extension to larger configurations. For this reason, the parameters $M$, $P$, $P_{\text{max}}$, $L$ and the initial set of TDs should be generic parameters in the HDL model and hence adjustable by reprogramming the FPGA.

![Fig. 1. Block diagram of STAR’s time-delay tracking.](image1)

B. Computational Complexity and Implementation Technology

The original work on STAR [5] assessed the complexity required by the multipath tracking step of STAR and found it to be in the order of $O(MPL)$. This complexity was reduced in [4] by updating the channel identification at a certain tracking frequency $n_{1D}$ lower than the symbol rate without a significant loss in performance. For instance, according to [4], at a chip rate of 4.096 Mcps, STAR with $M = 2$ antennas and $P = 3$ paths would require a maximum of $10^3$ million operations per second (Mops) for a large processing gain ($L = 256$). Such an attractively low complexity load is easily manageable by modern FPGA families. These intensive computing engines are able to parallelize STAR’s matrix manipulations and block computations, ensuring high processing rates. For example, the XC6VSX475T device from the Xilinx Virtex-6 FPGA family is able to process up to 200 giga multiply-accumulates per second (GMACs) when operating at 100 MHz thanks to its 2016 advanced DSP48E1 slices.
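
The quoted peak throughput follows from the slice count and the clock rate, assuming each DSP48E1 slice completes one multiply-accumulate per clock cycle:

$$2016\ \text{slices} \times 1\ \frac{\text{MAC}}{\text{slice}\cdot\text{cycle}} \times 100 \times 10^{6}\ \frac{\text{cycles}}{\text{s}} = 201.6 \times 10^{9}\ \text{MAC/s} \approx 200\ \text{GMAC/s}$$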

IV. HDL DESIGN

A. Top-Down Methodology

The top-down methodology is a widely used strategy to manage complexity. We adopt it to implement our STAR-based TDE on FPGA: the algorithm is partitioned into less complex blocks which, in turn, are divided into nested sub-blocks. The first level of partitioning yields a set of algorithmic units such as despreader PCM, maximum-ratio combining (MRC), and Analysis. Once the MATLAB floating-point version of the algorithm is ready and the design specifications are set, the HDL model for each operation in Table I is produced within the MiniBee SDR platform [11], a highly efficient rapid prototyping platform built around a Xilinx Virtex-6 FPGA. Therein, the BEEcube Platform Studio (BPS) environment gives access to the Xilinx System Generator (XSG) IP blockset and allows us to import our own low-level VHDL blocks when higher flexibility is required. Using a set of HW/SW shared IP resources, behavioral simulation is then performed to prepare the automatic process of logic synthesis, place & route and bit-file generation. The final step is an in-system verification that compares the design’s real-time performance with that of its MATLAB counterpart.

B. Hardware Architecture

The hardware architecture of our STAR-based TDE needs to adjust in real time to dynamic channel structures. Therefore, we have developed a modular, versatile, and auto-reconfigurable hardware architecture in order to facilitate its future extension with a path-management unit and larger system configurations. As illustrated in Fig. 2, within this architecture, each sub-block is implemented as an independent self-controlled processor that communicates with the rest of the design sub-blocks through the STAR control unit (SCU), which is responsible for the activation and handshaking of interdependent sub-blocks. The input/output signals are stored in local RAMs accessible forward and backward by the various communicating sub-blocks.

![Fig. 2. Hardware architecture.](image2)

1) Datapath design: Based on the top-down methodology, the STAR datapath design is the translation into HDL of the functional algorithmic units illustrated in Fig. 1 and detailed in Table I. An HDL sub-block is validated once the difference between its output and the output of its MATLAB reference falls below an acceptable threshold. In general, there is always a residual error that is tightly related to the fixed-point representation of the operators, i.e., the word-lengths used to represent the integer and fractional parts of the quantized inputs, outputs, and intermediate datapath signals. In our design, word-lengths and fixed-point positions are reconfigurable parameters that we optimize and validate by behavioral simulations. Adjusting them, however, requires reprogramming the FPGA.
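
The validation criterion described above can be mimicked in software by quantizing reference values to a signed fixed-point format and checking the worst-case deviation against a threshold. This is only an illustrative sketch: the 16-bit word with 12 fractional bits, the threshold, and the function names `quantize` and `validate` are hypothetical and are not the word-lengths or tools used in the actual design.

```python
def quantize(x, word_bits, frac_bits):
    """Round x to a signed fixed-point value and saturate on overflow.

    The integer code is limited to [-2^(word_bits-1), 2^(word_bits-1) - 1].
    Python's round() rounds ties to the nearest even integer.
    """
    scale = 1 << frac_bits
    code = round(x * scale)
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, code)) / scale


def validate(reference, word_bits, frac_bits, threshold):
    """Compare floating-point reference values against their quantized versions."""
    worst = max(abs(r - quantize(r, word_bits, frac_bits)) for r in reference)
    return worst <= threshold, worst


# Example: the three step-sizes quoted in Section III, hypothetical 16.12 format.
ok, worst_error = validate([0.0498, 0.1221, 0.5], 16, 12, threshold=2 ** -13)
```

Without saturation, the rounding error is bounded by half a least-significant bit, i.e. $2^{-(\text{frac\_bits}+1)}$, which is why the example threshold is set to $2^{-13}$.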

2) Control Unit: In contrast to a conventional instruction-based software program, a hardware design needs to be synchronized at a very low level, namely the register-transfer level (RTL). As for the datapath, we adopted a top-down approach to handle this complex task using efficient synchronization tools, namely finite-state machines (FSMs). As mentioned previously, the SCU is the general conductor of the design. Moreover, each sub-block is locally self-synchronized by its own integrated FSM, including the “Analysis” sub-block, which integrates 3 independent blocks as shown in Table I. The SCU and “Analysis” FSMs are illustrated in Fig. 3.
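
The start/done handshaking performed by a control unit of this kind can be modeled in software as a small state machine. The sketch below is a behavioral illustration only: the state names, their order, and the `start`/`done` signals are assumptions made for the example and do not reproduce the FSMs of Fig. 3.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    DESPREAD = auto()
    MRC = auto()
    ANALYSIS = auto()


# Hypothetical activation order of the sub-blocks within one processing pass.
ORDER = [State.DESPREAD, State.MRC, State.ANALYSIS]


def scu_step(state, start, done):
    """Next state after one clock edge.

    start: user/request pulse that launches a pass from IDLE.
    done:  handshake flag raised by the currently active sub-block.
    """
    if state is State.IDLE:
        return State.DESPREAD if start else State.IDLE
    if not done:
        return state  # keep waiting for the active sub-block to finish
    i = ORDER.index(state)
    return ORDER[i + 1] if i + 1 < len(ORDER) else State.IDLE
```

Because each sub-block signals its own completion, the controller does not need to know how many clock cycles a sub-block takes, which is what keeps the blocks independently replaceable.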

V. FPGA RESOURCES USAGE AND TIMING PERFORMANCES

Once behavioral simulation is validated, the Xilinx ISE synthesis tool translates the HDL blocks into logic gates. The latter are created inside reconfigurable slices through a bit-stream which defines the status of each configuration bit so as to create our design within the FPGA. During this process, Xilinx ISE provides a place & route report on FPGA resource usage and a timing report that compares the achieved timing performance with the predefined timing constraints. Table II summarizes these reports for our target device Xilinx Virtex 6 xc6vsx475t -2ff1759.

As shown in this table, FPGA Slice Logic usage for the current configuration of the STAR-based TDE scheme is relatively low. No more than 1% of the registers and 2% of the look-up tables (LUTs) have been used. As for Specific Features, 74% of the Block RAMs have been consumed, in addition to 3% of the global buffers (BUFG/BUFGCTRLs) and 2% of the high-performance dedicated DSP units (DSP48E1s). Additionally, the maximum delay between two consecutive synchronous elements is 9.898 ns allowing a maximum frequency of 100.110 MHz. Therefore, the preset timing constraint (a 100 MHz clock rate) has been entirely met. Such an FPGA usage confirms the efficiency of our design approach and keeps us optimistic regarding future extensions of the prototype within the same SDR platform.

VI. REAL-TIME IN-SYSTEM PERFORMANCES

After FPGA programming, the final step of the physical implementation is in-system real-time testing. During this phase, our STAR TDE technique is processed in real time on the FPGA with the specifications already detailed in Section III. Hence, the observation signal, stored offline inside the FPGA, is fed into the design in real time once a START pulse is received from the user. The TD estimates are then stored in a BPS shared HW/SW Block-RAM to be analyzed offline afterwards in the MATLAB environment. Fig. 4 illustrates the real-time FPGA-based TDE and its offline MATLAB-based counterpart, both compared to the true drifting path, which is also depicted there as a reference. The estimation error of our TDE technique is below $10^{-2} T_c$ and its latency is 2705 clock periods, i.e., 27.05 µs at the 100 MHz clock rate, thereby validating our HDL prototype and its implementation.
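
The latency figure follows directly from the cycle count and the 100 MHz clock rate set as the timing constraint in Section V. It can also be compared with the 1 ms URLLC latency target quoted in this section:

$$t_{\text{latency}} = \frac{2705\ \text{cycles}}{100 \times 10^{6}\ \text{cycles/s}} = 27.05\ \mu\text{s}, \qquad \frac{27.05\ \mu\text{s}}{1\ \text{ms}} \approx 2.7\%$$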

Considering a future 5G radio access deployment, the current performance of this starting configuration of our TDE prototype is largely satisfactory. In fact, extensions to multiple-input multiple-output (MIMO) systems and multipath configurations are straightforward because of the low FPGA resource usage and the high reconfigurability offered by our design. Furthermore, the 1 ms 5G target latency for ultra-reliable low latency communications (URLLC) is reasonably achievable given the current latency of the design. As for 5G real-world data rates, the specifications call for a per-user download speed of 100 Mbps, a speed that keeps us optimistic for 5G.

VII. CONCLUSION

In order to showcase STAR’s high potential and adaptability to next-generation networks, we have presented an FPGA-based proof of concept for a STAR-based TDE. The outlined prototype is the product of a crucial step in the process of elaborating an over-the-air demo operational in real-world channel propagation conditions. We have shown that our efficient implementation approach and our proposed hardware architecture lead to a highly modular and reconfigurable prototype achieving very high TD tracking accuracy at very low usage of FPGA resources.

REFERENCES

“Maximum likelihood time delay estimation from single- and multi-carrier DSSS multipath MIMO transmissions for future 5G networks,” to appear in IEEE Trans. on Wireless Comm., 2017.