Ingeniería y Ciencia ISSN:1794-9165 | ISSN-e: 2256-4314 ing. cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015. http://www.eafit.edu.co/ingciencia This article is licensed under a Creative Commons Attribution 4.0 by

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case Alexander López-Parrado 1 and Jaime Velasco Medina

2

Received: 06-04-2015 | Accepted: 02-07-2015 | Online: 31-07-2015 MSC: 65T50, 68W20, 68W25 | PACS: 02.30.Nw, 33.20.Ea, 89.20.Ff doi:10.17230/ingciencia.11.22.4

Abstract In this paper we present an optimized software implementation (sFFT-4.0) of the recently developed Nearly Optimal Sparse Fast Fourier Transform (sFFT) algorithm for the noisy case. First, we developed a modified version of the Nearly Optimal sFFT algorithm for the noisy case, this modified algorithm solves the accuracy issues of the original version by modifying the flat window and the procedures; and second, we implemented the modified algorithm on a multicore platform composed of eight cores. The experimental results on the cluster indicate that the developed implementation is faster than direct calculation using FFTW library under certain conditions of sparseness and signal size, and it improves the execution times of previous implementations like sFFT-2.0. To the best knowledge of the authors, the developed implementation is the first one of the Nearly Optimal sFFT algorithm for the noisy case. Key words: Sparse Fourier Transform; multicore programming; computer cluster. 1 2

Universidad del Quindío, Armenia, Quindío, Colombia, [email protected] Universidad del Valle, Cali, Valle, Colombia, [email protected]

Universidad EAFIT

73|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

Implementación software eficiente de la Transformada de Fourier Escasa casi óptima para el caso con ruido Resumen En este artículo se presenta una implementación software optimizada (sFFT4.0) del algoritmo Transformada Rápida de Fourier Escasa (sFFT) Casi Óptimo para el caso con ruido. En primer lugar, se desarrolló una versión modificada del algoritmo sFFT Casi Óptimo para el caso con ruido, esta modificación resuelve los problemas de exactitud de la versión origial al modificar la ventana plana y los procedimientos; y en segundo lugar, se implementó el algoritmo modificado en una plataforma multinúcleo compuesta de ocho núcleos. Los resultados experimentales en el agrupamiento de computadores muestran que la implementación desarrollada es más rápida que el cálculo directo usando la biblioteca FFTW bajo ciertas condiciones de escasés y tamaño de señal, y mejora los tiempos de ejecución de implementaciones previas como sFFT-2.0. Al mejor conocimiento de los autores, la implementación desarrollada es la primera del algoritmo sFFT Casi Óptimo para el caso con ruido. Palabras clave: Transformada de Fourier Escasa; programación multi núcleo; agrupamiento de computadoras.

1

Introduction

The sFFT term refers to a family of algorithms which allow the estimation of the Discrete Fourier Transform (DFT) of a sparse signal, faster than the FFT algorithms found in the literature [1],[2],[3],[4]; in this case, it is assumed that the signal is sparse or approximately sparse in the DFT domain. In the one hand, researchers from the Massachusetts Institute of Technology (MIT) presented two sFFT algorithms [2] which improve the runtime over all the previous developments [1],[5],[6],[7] including the most optimized conventional FFT algorithms like FFTW [8]; the first algorithm is intended for the noiseless case, and the second algorithm is intended for the noisy case. On the other hand, to the best of our knowledge there are only four software implementations of the MIT sFFT algorithms reported in literature; the first one was developed by the algorithm authors for the first version of the sFFT algorithm [1]; the second one is an optimized implementation of

|74

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

the first version of the sFFT algorithm [9]; the third one is a GPU-based implementation of the first version of the sFFT algorithm [10]; and the fourth one is an optimized implementation of the Nearly Optimal sFFT algorithm for the noiseless case [11]. Therefore, there is no software implementation reported in literature of the Nearly Optimal sFFT algorithm for the noisy case, which is of practical interest for scientific researchers. Therefore considering the above, the main contribution of this paper is the development of a modified version of the Nearly Optimal sFFT algorithm and its optimized software implementation on a multicore environment provided by a Beowulf cluster. The modified algorithm has an improved accuracy when compared with the original version described in [2], by modifying the flat window and the procedures; to the best of our knowledge, the modified algorithm is the first implementation of the Nearly Optimal sFFT algorithm for the noisy case, and it is very suitable for hardware implementation using ASICs or FPGAs. The rest of the paper is organized as follows: Section 2 describes the Nearly Optimal sFFT algorithm, section 3 presents the modified version of the Nearly Optimal sFFT algorithm, section 4 presents experimental results, and section 5 presents the conclusions and future work.

2

Nearly optimal Sparse Fast Fourier Transform algorithm

In this section, we initially present some concepts about sFFT, then we describe the Nearly Optimal sFFT algorithm, and finally we show some software simulations to verify the basic concepts. Given a discrete time signal x ∈ CN of length N , its N -point Discrete Fourier Transform (DFT) ˆ ∈ CN is defined in Eq. 1. x ˆk = x

1 X xn ω kn , k ∈ [N ] N

(1)

n∈[N ]

Where N is a power of two, [N ] denotes the set of indexes {0, 1, . . . , N −1}, and ω = e−i2π/N is the N -th root of unity. In this case, the number of ˆ is named the sparsity order K and it is non-zero elements of the vector x defined in Eq. 2. ˆ K = | supp (x)| (2) ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

75|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

ˆ is the set of indexes of the non-zero elements of the vector Where supp (x) ˆ Then, a time domain signal x is sparse in the DFT domain if K 0, α > 0, | supp (G)| = O(B/α log (N/δ)), where the conditions described in Eq. 4 are satisfied [2]. ˆ ′ = 1 f or |i| ≤ (1 − α)n/(2B) G i ˆ ′ = 0 f or |i| ≥ n/(2B) G i ˆ ′ ∈ [0, 1] f or all i G i ˆ ′ − G| ˆ ∞ Rloc /2} if Q 6= ∅ then l′ = lj + minq∈Q qw t ; else l′ = ⊥; end end return l′

ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

81|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

This procedure performs the adjustment of the frequency location using five processing stages: the first one sets the initial conditions, the second one calculates the hashes-error, the third one calculates the angles of the hashes-error and candidate frequency bins, the fourth one performs the voting stage, and the fifth one locates the frequency bins. The setting of initial conditions calculates the value of the location threshold s, and clears the vote counters of the t candidate adjustments for the B candidate frequency ˆ uˆ′ , bins. The hashes calculation stage obtains Rloc couples of the form u, which are calculated from the signal x by using the procedure HashToBins with pseudo-random permutation parameters of the form (σ, a) and (σ, a + β), respectively. The angle calculation stage obtains the vector c, which is ˆ and the hashes uˆ′ . The voting the angle differences between the hashes u stage increments the vote counter vj,q corresponding to the j-th frequency bin and q-th adjustment, if Eq. 9 is satisfied. min{|θj,q − cj |, 2π − |θj,q − cj |} < sπ

(9)

Where θj,q is the angle of the j-th candidate frequency bin for the q-th adjustment, that is, θj,q is related to the q-th candidate frequency adjustment qw/t. Additionally, the above calculation converges if β is chosen snt at random from the set { snt 4w , . . . , 2w } with small enough threshold s. Finally, the frequency location stage selects the minimum q from the set Q = {q ∈ [t] | vj,q > Rloc /2}; thus the estimated j-th permuted-frequency bin is refined using Eq. 10 l′ = lj + min q∈Q

qw t

(10)

2.2.4 Procedure EstimateValues This procedure, presented in Alg. ˆ J , and it has the following 5, calculates the DFT estimation adjustment w input parameters: the time domain signal x; the instantaneous estimation ˆ the parameter B; the flat window parameters δ and α; the set of zˆ of x; located sparse components L; the number of sparse components to estimate K ′ ; and the number of estimation iterations Rest .

|82

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

Algorithm 5: Estimate values function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log 2(N )]}, δ ∈ R, α ∈ R, L ∈ NO(B) , K ′ ∈ N+ , Rest ∈ N+ ′ ˆ J ∈ Cmin{| supp(L)|,K } Output: w ˆ B, δ, α L, K ′ , Rest ) procedure EstimateV alues(x, z, ˆ j = 0 ∀ j ∈ [| supp(L)|]; w // Main loop for i ∈ [Rest ] do Choose ai uniformly at random from [N ]; Choose σi uniformly at random from the set of odd numbers in [N ]; ˆ ˆ σi , ai , δ, α); u=HashT oBins(x, B, z, for j ∈ [| supp(L)|] do ˆ hr (Lj ,σi ,N,B)modB ω −σi ai Lj uˆ′ i,j = u end end for j ∈ [| supp(L)|] do // Median for real and imaginary parts separately across the i dimension ˆ j = mediani uˆ′ i,j w end ˆ J k2 J = arg max| supp(J )|=K ′ kw ˆJ return w

ˆ J using This procedure calculates the DFT estimation adjustment w three processing stages: the first one calculates the Rest sets of hashes-error from the signal x by using the procedure HashToBins and considering different pseudo-random permutation parameters; the second one separately calculates the median of the real and imaginary parts of the calculated hashes-error by only considering the set of located sparse frequency bins in L [1],[7], and by cancelling the pseudo-random spectral permutation; and ˆJ . the third one saves the K ′ most energetic components w

3

Modified nearly optimal Sparse Fast Fourier Transform algorithm

In this section we describe the Modified Nearly Optimal sFFT algorithm and its optimized software implementation named sFFT-4.0, this implementation presents an improved accuracy compared with the original description [2]. ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

83|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

3.1

Description of the modified nearly optimal sFFT algorithm

The modified sFFT algorithm is accomplished by designing a different flat window and by performing several modifications to the procedures described in section 2. 3.1.1 Modified flat window In this case, we designed a flat window of the form (1/B, 1/2B, δ, O(B log (N/δ))) as the one described in [1], and considering the time domain and DFT domain representations. The time domain representation of the flat window is described in Eq. 11, and this window does not depend on the parameter α and its support only depends on the parameters B and δ.

Gn+ N modN = 2Ce

−2π 2 (n−N/2)2 σg 2

2

sinc(2C(n − N/2)), n ∈ [N ]

(11)

In this case, the flat window in time domain has a small support given by O(B log (N/δ)) and it is designed with an ideal low-pass filter using a Gaussian window with finite duration to truncate the impulse response. The cutoff frequency of the low-pass filter is 2C, where C is defined in Eq. 12 and the standard deviation of the Gaussian window is defined in Eq. 13.

C=

σg = 2B

1 2B

p 2 log(N/δ)

(12)

(13)

The DFT domain representation the flat window is described in Eq. 14.

|84

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

ˆ′ N = G k+ 2 modN p 1, if |k − N/2| ≤ N (C − p2 log(1/δ)/σg ) 0, if |k − N/2| ≥ N (C + 2 log(1/δ)/σg ) , k ∈ [N ] (14) ncdf(σg ((k − N/2)/N + C)) −ncdf(σg ((k − N/2)/N − C)), otherwise

ˆ′ Where the √ vector G is the approximated flat window and ncdf(x) = erfc(−x/ 2)/2 is the Normal Cumulative Distribution Function [13]. In ˆ ′ satisfies |G ˆ ′ − G| ˆ ∞ < δ, and it has a flat super-pass rethis case, G p gion, a transition band of width g , and a total bandp p 2N 2 log(1/δ)/σ width of BWGˆ ′ = N/B(1 + 2 log(1/δ)/ 2 log(N/δ)) which satisfies BWGˆ ′ ≤ 2N/B. 3.1.2 Modified procedure HashToBins This modified procedure is shown in Alg. 6 by considering the designed flat window. Algorithm 6: Modified hash to bins function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log 2(N )]}, σ ∈ {2c + 1 | c ∈√[N/2]}, a ∈ [N ], G ∈ RB⌈log(N/δ)⌉ , ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ G ˆ ∈ CB Output: u ˆ ′) ˆ B, σ, a, G, G procedure HashT oBins(x, z, uj = 0 ∀ j ∈ [B]; // Spectral Permutation and Sub-sampling for j ∈ {N − | supp(G)|/2 , N + | supp(G)|/2 − 1} do ujmodB = ujmodB + xσ(j−a) mod N Gj−N +| supp(G)|/2 ; end // Sub-sampled DFT ˆ = FFTB (u); u // Efficient convolution calculation ˆ do for j ∈ supp(z) ˆ′ ˆj ω σaj ; ˆ hf (j,σ,N,B)modB = u ˆ hf (j,σ,N,B)modB − G u |of (j,σ,N,B)| z ˆ′ ˆ h (j,σ,N,B)modB = u ˆ h (j,σ,N,B)modB − G zˆj ω σaj ; u c

c

|oc (j,σ,N,B)|

end

ˆ return u

In this case, due to the designed flat window has a doubled bandwidth, each sparse component has two hashes located in the offsets of (j, σ, N, B) and oc (j, σ, N, B) given by the Eqs. 15 and 16, respectively. ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

85|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

of (j, σ, N, B) = πp (j, σ, N ) − hf (j, σ, N, B)N/B

(15)

oc (j, σ, N, B) = πp (j, σ, N ) − hc (j, σ, N, B)N/B

(16)

Where hf (j, σ, N, B) = ⌊πp (j, σ, N )B/N ⌋ is the floor-hash function and hc (j, σ, N, B) = ⌈πp (j, σ, N )B/N ⌉ is the ceil-hash function; hf (j, σ, N, B) ˆ for mod B and hc (j, σ, N, B) mod B are indexes of the two hashes in u a single sparse component of x. The proposed flat window improves the accuracy of the algorithm by solving the problem of the zero-value hashes, and it reduces the sampling cost by considering a smaller support. 3.1.3 Modified procedures LocateSignal and LocateInner These modified procedures are presented in Alg. 7 and Alg. 8, respectively. In this case, we reduced the number of modified procedure HashToBins used by the modified procedure LocateInner by fixing the parameter a of the pseudo-random spectral permutation; this allows the usage of the same ˆ for all the iterations of the modified procedure LocateInner, hashes-error u thus the sampling cost and the speed are improved. Algorithm 7: Modified locate signal function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log√2(N )]}, ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ , s ∈ R, G ∈ RB⌈log(N/δ)⌉ , G Rloc ∈ N+ Output: L ∈ NO(B) ˆ ′ , s, Rloc ) ˆ B, G, G procedure LocateSignal(x, z, Choose a uniformly at random a from the set [N ]; Choose σ uniformly at random from the set of odd numbers in [N ]; (0) lj = jN/B ∀ j ∈ [B]; w0 = N/B; t = log2 N ; t′ = t/4; Dmax = ⌈logtprime (w0 + 1)⌉; ˆ ′ ); ˆ ˆ B, σ, a, G, G u=HashT oBins(x, z, // Main loop for D ∈ [Dmax ] do ˆ ′ , s, Rloc , ˆ B, σ, a, G, G l(D+1) =LocateInner(x, z, ˆ l(D) ); w0 /(t′ )D , t, u, end (Dmax )

L = unique{σ −1 ⌊lj

+ 0.5⌋|j ∈ [B]};

return L

|86

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

Algorithm 8: Modified locate signal inner function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log 2(N )]}, σ ∈ {2c + 1 | c ∈√[N/2]}, a ∈ [N ], G ∈ RB⌈log(N/δ)⌉ , ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ , s ∈ R, Rloc ∈ N+ , w ∈ R, G ˆ ∈ CB , l ∈ NO(B) t ∈ N+ , u ′ Output: l ∈ NO(B) ˆ ′ , s, Rloc , w, t, u, ˆ B, σ, a, G, G ˆ procedure LocateInner(x, z, l) vj,q = 0 ∀ (j, q) ∈ [B] × [t]; // Main loop for i ∈ [Rloc ] do Choose β uniformly at random from the set t sN t {⌊ sN 4w ⌋, . . . , ⌊ 2w ⌋}; ′ ˆ ˆ ′ ); ˆ σ, a + β, G, G u =HashT oBins(x, B, z, for j ∈ [B] do if lj 6= ⊥ then ˆ j /uˆ′ j ; rj = u cj = arctan2(ℑ{rj }, ℜ{rj }); if cj < 0 then cj = cj + 2π; end for q ∈ [t] do mj,q = lj + q+1/2 t w; 2πβm θj,q = N j,q mod 2π; if min{|θj,q − cj |, 2π − |θj,q − cj |} < sπ then vj,q = vj,q + 1; end end end end end for j ∈ [B] do Q = {q ∈ [t] | vj,q > Rloc /2} if Q 6= ∅ then l′ = lj + minq∈Q qw t ; else l′ = ⊥; end end return l′

ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

87|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

3.1.4 Modified procedure EstimateValues This modified procedure, presented in Alg. 9, improves the estimation of the DFT values by adding a correction factor that cancels the effect of the windowing in DFT domain. Algorithm 9: Modified estimate values function. Input: x ∈ CN , zˆ ∈ CN√ , G ∈ RB⌈log(N/δ)⌉ , ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ , B ∈ {2c | c ∈ [log 2(N )]}, G L ∈ NO(B) , K ′ ∈ N+ , Rest ∈ N+ ′ ′ ˆ J ∈ Cmin{| supp(L)|,K } , Jˆ ∈ Nmin{| supp(L)|,K } } Output: {w ˆ ′ ,B, L, K ′ , Rest ) ˆ G, G procedure EstimateV alues(x, z, ˆ j = 0 ∀ j ∈ [| supp(L)|]; w // Main loop for i ∈ [Rest ] do Choose ai uniformly at random from [N ]; Choose σi uniformly at random from the set of odd numbers in [N ]; ˆ ′ ); ˆ ˆ σi , ai , G, G u=HashT oBins(x, B, z, for j ∈ [| supp(L)|] do ˆ′ ˆ hr (Lj ,σi ,N,B)modB ω −σi ai Lj /G uˆ′ i,j = u |or (Lj ,σi ,N,B)| end end for j ∈ [| supp(L)|] do // Median for real and imaginary parts separately across the i dimension ˆ j = mediani uˆ′ i,j w end ˆ J k2 J = arg max| supp(J )|=min{| supp(L)|,K ′ } kw ˆ J , J} return {w

3.1.5 Modified procedure sFFT Taking into account the above improvements, the proposed modified Nearly Optimal sFFT algorithm is presented in Alg. 10, and it has the following input parameters: the time domain signal x, the sparsity order K, the parameter δ of the flat window [2], the location threshold s, and the number of estimation iterations (Rest ). The modified algorithm simplifies the calculation of the parameters B and K by halving their values during odd-numbered iterations (r mod 2 = 1), also the modified algorithm improves the accuracy of the DFT estimaˆ tion z.

|88

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

Algorithm 10: Modified sFFT algorithm. Input: x ∈ CN , K ∈ N+ , δ ∈ R, s ∈ R , Rest ∈ N+ Output: zˆ ∈ CN procedure sF F T (x , K, δ, s, Rest ) RsF F T = ⌊log2 (K)⌋; B = 2⌈log2 (K)⌉+1 ; Rloc = ⌊log2 log2 N ⌋; ˆ ′ by using equations 11, 12, 13, and Calculate flat window G, G 14; zˆj = 0 ∀ j ∈ [N ]; // Main loop for r ∈ [RsF F T ] do // Location of Components ˆ ′ , s, Rloc ); ˆ B, G, G L= LocateSignal(x, z, // Estimation of Components ˆ ′ , L, 3K, ˆ B, G, G {wJ , J} =EstimateV alues(x, z, Rest ); zˆLJ = zˆLJ + wJ // Reduction of Window Support if (B > 2) ∧ (r mod 2 = 1) then B = B/2; K = ⌊K/2⌋; ˆ ′ by using equations 11, 12, Calculate flat window G, G 13, and 14; end end return zˆ

4

Experimental results

We developed a software implementation of the modified sFFT algorithm named sFFT-4.0 [1],[2] by using the modified procedures HashToBins, LocateSignal, LocateInner, and EstimateValues. The Linux-based software R implementation was developed using C language, the Intel C Compiler, OpenMP [14], and the Intel MKL library; it was tested on one node of the Adroit cluster at Princeton University by using eight cores; and it was R integrated with MATLAB by using MEX-files shared libraries. We used OpenMP in some portions of the code in order to execute up to eight calls of the procedure HashToBins in parallel, hence we parallelized the main forloops of the procedures LocateInner and EstimateValues. The source code of sFFT-4.0 is available at https://sourceforge.net/projects/sfft40/. Finally, in order to test the performance of sFFT-4.0 implementation; ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

89|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

first, we developed some comparisons of sFFT-4.0 implementation against the previous sFFT implementations AAFFT and sFFT-2.0 in terms of runtime and accuracy versus Signal to Noise Ratio (SNR); and second, we present the achieved improvements when the multicore architecture is used for sFFT-4.0 implementation.

4.1

Comparison tests

Figure 1(a) shows simulation results of the runtime versus N for K = 50; where, red plot represents the runtime of sFFT-4.0 using a single core, the blue plot represents the runtime of sFFT-2.0 [1], and the green plot represents the runtime of AAFFT [7]. Even though our implementation is running on a single core, it is faster than AAFFT for all N values and faster than sFFT-2.0 for N ≥ 218 . Figure 1(b) shows the simulation results of the runtime versus K for N = 222 ; in this case, sFFT-4.0 is faster than AAFFT and sFFT-2.0 for all K values. sFFT−/4.0/2.0/AAFFT runtime vs. sparsity order for N = 222

2

10

sFFT−4.0 one core sFFT−2.0 AAFFT

sFFT−/4.0/2.0/AAFFT runtime vs. signal size for K = 50

1

10

sFFT−4.0 one core sFFT−2.0 AAFFT

1

10 Runtime [seconds]

Runtime [seconds]

0

10

−1

10

0

10

−1

10

−2

10

−2

10

−3

10

214

215

216

217

218

219

220

Signal size N

(a)

221

222

223

224

2

10

3

10 Sparsity order K

4

10

(b)

Figure 1: 1(a) Comparison of sFFT-4.0 runtime versus N for K = 50 with two sFFT implementations. 1(b) Comparison of sFFT-4.0 runtime versus K for N = 222 with two sFFT implementations.

Figure 2 shows the simulation results of the l1 -error versus SNR for N = 222 and K = 50; where the blue plot represents the error for sFFT-4.0 and the red plot represents the error for sFFT-2.0, the results for AAFFT

|90

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

are not worthy and were not included due to this algorithm is extremely sensitive to noise. sFFT−2.0/4.0 error vs. SNR (K = 50, N=222)

2

10

sFFT−2.0 sFFT−4.0 0

Error per large frequency

10

−2

10

−4

10

−6

10

−8

10

−10

10

−20

0

20

40 60 SNR [dB]

80

100

120

Figure 2: sFFT-4.0 error versus SNR for N = 222 and K = 50.

From 2 it can be seen that for values of SNR greater than −3 dB sFFT4.0 has improved accuracy when compared with sFFT-2.0.

4.2

Multicore tests

Figure 3(a) shows simulation results of the runtime versus N for K = 50; where, red plot represents the runtime of FFTW3 [8], the blue plot represents the runtime of sFFT-4.0 when one core is used, and the the green plot represents the runtime of sFFT-4.0 when eight cores are used. In this case, sFFT-4.0 is faster than FFTW3 for N above to 216 in the eight cores case; however, there is no a significant speed improvement by using the multicore architecture, this is due to the low sparsity of signal that leads to for-loops with small number of iterations which do not take advantage of the multicore architecture, hence the OpenMP’s overhead times have a strong influence on the overall execution time [14]. Figure 3(b) shows the simulation results of the runtime versus K for N = 222 ; in this case, sFFT-4.0 is faster than FFTW3 for K below to 2500 when a single core is used. From Figure 3(b), it can be noted that there is a significant speed improvement by using the multicore architecture due ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

91|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

the increased sparsity of signal; in this case we achieved an acceleration near to 4×. 22

sFFT−4.0 runtime vs. sparsity order for N = 2

0

10 sFFT−4.0 runtime vs. signal size for K = 50

1

10

Runtime [seconds]

Runtime [seconds]

FFTW3 sFFT−4.0 one core sFFT−4.0 eight cores

0

10

−1

10

−2

10

−1

10

FFTW3 sFFT−4.0 one core sFFT−4.0 eight cores

−3

10

−2

10

−4

10

214

215

216

217

218

219

220

Signal size N

(a)

221

222

223

224

2

10

3

10 Sparsity order K

4

10

(b)

Figure 3: 3(a) sFFT-4.0 runtime versus N for K = 50. 3(b) sFFT-4.0 runtime versus K for N = 222 .

Finally, the modified sFFT algorithm uses the following arithmetic operators: FFT, multiply-accumulate, trigonometric functions (sine, cosine, atan, atan2), modular inversion based on the Fermat’s Little Theorem, among others; these operators can be easily mapped to hardware, thus this advantage could facilitate the implementation of the modified sFFT algorithm on an FPGA or an ASIC.

5

Conclusions

In this paper we present a modified Nearly Optimal sFFT algorithm for the noisy case, this algorithm reduces the sampling cost and corrects the zero-hash issue of the original algorithm by doubling the bandwidth of the flat window and by modifying the original procedures. Additionally, we developed an efficient software implementation of the modified Nearly Optimal sFFT algorithm using a multicore platform, under certain conditions this implementation is faster than the optimized FFT library FFTW3 and previous sFFT implementations. To the best knowledge of the authors, the developed implementation is the first one of the Nearly Optimal sFFT algorithm for the noisy case reported in literature. Finally, the future work

|92

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

will be addressed to perform an efficient hardware implementation of the modified Nearly Optimal sFFT algorithm using an FPGA.

Acknowledgement Alexander López-Parrado thanks Colciencias for the scholarship and Universidad del Quindío for the study commission.

References [1] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Simple and practical algorithm for sparse Fourier transform,” in ACM-SIAM Symposium on Discrete Algorithms (SODA). Kyoto, Japan: SIAM, 2012, pp. 1183–1194. [Online]. Available: http://dl.acm.org/citation.cfm?id=2095116.2095209 74, 76, 77, 83, 84, 89, 90 [2] ——, “Nearly optimal sparse Fourier transform,” in Proceedings of the 44th symposium on Theory of Computing (STOC), New York, USA, May 2012, pp. 563–578. 74, 75, 76, 77, 78, 79, 81, 83, 88, 89 [3] P. Indyk, M. Kapralov, and and Eric Price, “(Nearly) Sample-Optimal Sparse Fourier Transform,” in ACM-SIAM Symposium on Discrete Algorithms (SODA), Portland, USA, 2014. 74, 76 [4] H. Hassanieh, L. Shi, O. Abari, E. Hamed, and D. Katabi, “GHz-wide sensing and decoding using the sparse Fourier transform,” in INFOCOM, 2014 Proceedings IEEE. IEEE, 2014, pp. 2256–2264. [Online]. Available: http://dx.doi.org/10.1109/INFOCOM.2014.6848169 74 [5] A. C. Gilbert, S. Muthukrishnan, and M. J. Strauss, “Improved time bounds for near-optimal sparse Fourier representations,” in Proceedings of SPIE Wavelets XI, San Diego, USA, 2005, pp. 1–15. 74, 76 [6] A. C. Gilbert, Y. Li, E. Porat, and M. J. Strauss, “Approximate sparse recovery: optimizing time and measurements,” in Proceedings of the 42nd ACM symposium on Theory of computing, Cambridge, USA, 2012. 74, 76 [7] A. C. Gilbert, M. J. Strauss, and J. A. Tropp, “A Tutorial on Fast Fourier Sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 57–66, 2008. [Online]. Available: http://dx.doi.org/10.1109/MSP.2007.915000 74, 76, 83, 90 ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

93|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

[8] M. Frigo and S. G. Johnson, FFTW, 3rd ed., Massachusetts Institute of Technology, 2012. [Online]. Available: http://www.fftw.org/ 74, 91 [9] C. Wang, M. Araya-Polo, S. Chandrasekaran, A. St-Cyr, B. Chapman, and D. Hohl, “Parallel Sparse FFT,” in Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, New York, NY, USA, 2013, pp. 10:1—-10:8. [Online]. Available: http://dx.doi.org/10.1145/2535753.2535764 75 [10] J. Hu, Z. Wang, Q. Qiu, W. Xiao, and D. J. Lilja, “Sparse Fast Fourier Transform on GPUs and Multi-core CPUs,” in Computer Architecture and High Performance Computing (SBAC-PAD), 2012 IEEE 24th International Symposium on, Oct. 2012, pp. 83–91. [Online]. Available: http://dx.doi.org/10.1109/SBAC-PAD.2012.34 75 [11] J. Schumacher, “High Performance Sparse Fast Fourier Transform,” Master Thesis, ETH Zurich, Department of Computer Science, 2013. [Online]. Available: http://goo.gl/3alHvS 75 [12] A. Dutt and V. Rokhlin, “Fast Fourier transforms for nonequispaced data, II,” Applied and Computational Harmonic Analysis, vol. 2, no. 1, pp. 85–100, 1995. [Online]. Available: http://dx.doi.org/10.1006/acha.1995.1007 76 [13] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 10th ed. Washington, DC, USA: Dover, 1972. 85 [14] OpenMP, OpenMP Application Program Interface, OpenMP Architecture Review Board Std., 2013. 89, 91

|94

Ingeniería y Ciencia

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case Alexander López-Parrado 1 and Jaime Velasco Medina

2

Received: 06-04-2015 | Accepted: 02-07-2015 | Online: 31-07-2015 MSC: 65T50, 68W20, 68W25 | PACS: 02.30.Nw, 33.20.Ea, 89.20.Ff doi:10.17230/ingciencia.11.22.4

Abstract In this paper we present an optimized software implementation (sFFT-4.0) of the recently developed Nearly Optimal Sparse Fast Fourier Transform (sFFT) algorithm for the noisy case. First, we developed a modified version of the Nearly Optimal sFFT algorithm for the noisy case, this modified algorithm solves the accuracy issues of the original version by modifying the flat window and the procedures; and second, we implemented the modified algorithm on a multicore platform composed of eight cores. The experimental results on the cluster indicate that the developed implementation is faster than direct calculation using FFTW library under certain conditions of sparseness and signal size, and it improves the execution times of previous implementations like sFFT-2.0. To the best knowledge of the authors, the developed implementation is the first one of the Nearly Optimal sFFT algorithm for the noisy case. Key words: Sparse Fourier Transform; multicore programming; computer cluster. 1 2

Universidad del Quindío, Armenia, Quindío, Colombia, [email protected] Universidad del Valle, Cali, Valle, Colombia, [email protected]

Universidad EAFIT

73|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

Implementación software eficiente de la Transformada de Fourier Escasa casi óptima para el caso con ruido Resumen En este artículo se presenta una implementación software optimizada (sFFT4.0) del algoritmo Transformada Rápida de Fourier Escasa (sFFT) Casi Óptimo para el caso con ruido. En primer lugar, se desarrolló una versión modificada del algoritmo sFFT Casi Óptimo para el caso con ruido, esta modificación resuelve los problemas de exactitud de la versión origial al modificar la ventana plana y los procedimientos; y en segundo lugar, se implementó el algoritmo modificado en una plataforma multinúcleo compuesta de ocho núcleos. Los resultados experimentales en el agrupamiento de computadores muestran que la implementación desarrollada es más rápida que el cálculo directo usando la biblioteca FFTW bajo ciertas condiciones de escasés y tamaño de señal, y mejora los tiempos de ejecución de implementaciones previas como sFFT-2.0. Al mejor conocimiento de los autores, la implementación desarrollada es la primera del algoritmo sFFT Casi Óptimo para el caso con ruido. Palabras clave: Transformada de Fourier Escasa; programación multi núcleo; agrupamiento de computadoras.

1

Introduction

The sFFT term refers to a family of algorithms which allow the estimation of the Discrete Fourier Transform (DFT) of a sparse signal, faster than the FFT algorithms found in the literature [1],[2],[3],[4]; in this case, it is assumed that the signal is sparse or approximately sparse in the DFT domain. In the one hand, researchers from the Massachusetts Institute of Technology (MIT) presented two sFFT algorithms [2] which improve the runtime over all the previous developments [1],[5],[6],[7] including the most optimized conventional FFT algorithms like FFTW [8]; the first algorithm is intended for the noiseless case, and the second algorithm is intended for the noisy case. On the other hand, to the best of our knowledge there are only four software implementations of the MIT sFFT algorithms reported in literature; the first one was developed by the algorithm authors for the first version of the sFFT algorithm [1]; the second one is an optimized implementation of

|74

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

the first version of the sFFT algorithm [9]; the third one is a GPU-based implementation of the first version of the sFFT algorithm [10]; and the fourth one is an optimized implementation of the Nearly Optimal sFFT algorithm for the noiseless case [11]. Therefore, there is no software implementation reported in literature of the Nearly Optimal sFFT algorithm for the noisy case, which is of practical interest for scientific researchers. Therefore considering the above, the main contribution of this paper is the development of a modified version of the Nearly Optimal sFFT algorithm and its optimized software implementation on a multicore environment provided by a Beowulf cluster. The modified algorithm has an improved accuracy when compared with the original version described in [2], by modifying the flat window and the procedures; to the best of our knowledge, the modified algorithm is the first implementation of the Nearly Optimal sFFT algorithm for the noisy case, and it is very suitable for hardware implementation using ASICs or FPGAs. The rest of the paper is organized as follows: Section 2 describes the Nearly Optimal sFFT algorithm, section 3 presents the modified version of the Nearly Optimal sFFT algorithm, section 4 presents experimental results, and section 5 presents the conclusions and future work.

2

Nearly optimal Sparse Fast Fourier Transform algorithm

In this section, we initially present some concepts about sFFT, then we describe the Nearly Optimal sFFT algorithm, and finally we show some software simulations to verify the basic concepts. Given a discrete time signal x ∈ CN of length N , its N -point Discrete Fourier Transform (DFT) ˆ ∈ CN is defined in Eq. 1. x ˆk = x

1 X xn ω kn , k ∈ [N ] N

(1)

n∈[N ]

Where N is a power of two, [N ] denotes the set of indexes {0, 1, . . . , N −1}, and ω = e−i2π/N is the N -th root of unity. In this case, the number of ˆ is named the sparsity order K and it is non-zero elements of the vector x defined in Eq. 2. ˆ K = | supp (x)| (2) ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

75|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

ˆ is the set of indexes of the non-zero elements of the vector Where supp (x) ˆ Then, a time domain signal x is sparse in the DFT domain if K 0, α > 0, | supp (G)| = O(B/α log (N/δ)), where the conditions described in Eq. 4 are satisfied [2]. ˆ ′ = 1 f or |i| ≤ (1 − α)n/(2B) G i ˆ ′ = 0 f or |i| ≥ n/(2B) G i ˆ ′ ∈ [0, 1] f or all i G i ˆ ′ − G| ˆ ∞ Rloc /2} if Q 6= ∅ then l′ = lj + minq∈Q qw t ; else l′ = ⊥; end end return l′

ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

81|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

This procedure performs the adjustment of the frequency location using five processing stages: the first one sets the initial conditions, the second one calculates the hashes-error, the third one calculates the angles of the hashes-error and candidate frequency bins, the fourth one performs the voting stage, and the fifth one locates the frequency bins. The setting of initial conditions calculates the value of the location threshold s, and clears the vote counters of the t candidate adjustments for the B candidate frequency ˆ uˆ′ , bins. The hashes calculation stage obtains Rloc couples of the form u, which are calculated from the signal x by using the procedure HashToBins with pseudo-random permutation parameters of the form (σ, a) and (σ, a + β), respectively. The angle calculation stage obtains the vector c, which is ˆ and the hashes uˆ′ . The voting the angle differences between the hashes u stage increments the vote counter vj,q corresponding to the j-th frequency bin and q-th adjustment, if Eq. 9 is satisfied. min{|θj,q − cj |, 2π − |θj,q − cj |} < sπ

(9)

Where θj,q is the angle of the j-th candidate frequency bin for the q-th adjustment, that is, θj,q is related to the q-th candidate frequency adjustment qw/t. Additionally, the above calculation converges if β is chosen snt at random from the set { snt 4w , . . . , 2w } with small enough threshold s. Finally, the frequency location stage selects the minimum q from the set Q = {q ∈ [t] | vj,q > Rloc /2}; thus the estimated j-th permuted-frequency bin is refined using Eq. 10 l′ = lj + min q∈Q

qw t

(10)

2.2.4 Procedure EstimateValues This procedure, presented in Alg. ˆ J , and it has the following 5, calculates the DFT estimation adjustment w input parameters: the time domain signal x; the instantaneous estimation ˆ the parameter B; the flat window parameters δ and α; the set of zˆ of x; located sparse components L; the number of sparse components to estimate K ′ ; and the number of estimation iterations Rest .

|82

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

Algorithm 5: Estimate values function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log 2(N )]}, δ ∈ R, α ∈ R, L ∈ NO(B) , K ′ ∈ N+ , Rest ∈ N+ ′ ˆ J ∈ Cmin{| supp(L)|,K } Output: w ˆ B, δ, α L, K ′ , Rest ) procedure EstimateV alues(x, z, ˆ j = 0 ∀ j ∈ [| supp(L)|]; w // Main loop for i ∈ [Rest ] do Choose ai uniformly at random from [N ]; Choose σi uniformly at random from the set of odd numbers in [N ]; ˆ ˆ σi , ai , δ, α); u=HashT oBins(x, B, z, for j ∈ [| supp(L)|] do ˆ hr (Lj ,σi ,N,B)modB ω −σi ai Lj uˆ′ i,j = u end end for j ∈ [| supp(L)|] do // Median for real and imaginary parts separately across the i dimension ˆ j = mediani uˆ′ i,j w end ˆ J k2 J = arg max| supp(J )|=K ′ kw ˆJ return w

ˆ J using This procedure calculates the DFT estimation adjustment w three processing stages: the first one calculates the Rest sets of hashes-error from the signal x by using the procedure HashToBins and considering different pseudo-random permutation parameters; the second one separately calculates the median of the real and imaginary parts of the calculated hashes-error by only considering the set of located sparse frequency bins in L [1],[7], and by cancelling the pseudo-random spectral permutation; and ˆJ . the third one saves the K ′ most energetic components w

3

Modified nearly optimal Sparse Fast Fourier Transform algorithm

In this section we describe the Modified Nearly Optimal sFFT algorithm and its optimized software implementation named sFFT-4.0, this implementation presents an improved accuracy compared with the original description [2]. ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

83|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

3.1

Description of the modified nearly optimal sFFT algorithm

The modified sFFT algorithm is accomplished by designing a different flat window and by performing several modifications to the procedures described in section 2. 3.1.1 Modified flat window In this case, we designed a flat window of the form (1/B, 1/2B, δ, O(B log (N/δ))) as the one described in [1], and considering the time domain and DFT domain representations. The time domain representation of the flat window is described in Eq. 11, and this window does not depend on the parameter α and its support only depends on the parameters B and δ.

Gn+ N modN = 2Ce

−2π 2 (n−N/2)2 σg 2

2

sinc(2C(n − N/2)), n ∈ [N ]

(11)

In this case, the flat window in time domain has a small support given by O(B log (N/δ)) and it is designed with an ideal low-pass filter using a Gaussian window with finite duration to truncate the impulse response. The cutoff frequency of the low-pass filter is 2C, where C is defined in Eq. 12 and the standard deviation of the Gaussian window is defined in Eq. 13.

C=

σg = 2B

1 2B

p 2 log(N/δ)

(12)

(13)

The DFT domain representation the flat window is described in Eq. 14.

|84

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

ˆ′ N = G k+ 2 modN p 1, if |k − N/2| ≤ N (C − p2 log(1/δ)/σg ) 0, if |k − N/2| ≥ N (C + 2 log(1/δ)/σg ) , k ∈ [N ] (14) ncdf(σg ((k − N/2)/N + C)) −ncdf(σg ((k − N/2)/N − C)), otherwise

ˆ′ Where the √ vector G is the approximated flat window and ncdf(x) = erfc(−x/ 2)/2 is the Normal Cumulative Distribution Function [13]. In ˆ ′ satisfies |G ˆ ′ − G| ˆ ∞ < δ, and it has a flat super-pass rethis case, G p gion, a transition band of width g , and a total bandp p 2N 2 log(1/δ)/σ width of BWGˆ ′ = N/B(1 + 2 log(1/δ)/ 2 log(N/δ)) which satisfies BWGˆ ′ ≤ 2N/B. 3.1.2 Modified procedure HashToBins This modified procedure is shown in Alg. 6 by considering the designed flat window. Algorithm 6: Modified hash to bins function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log 2(N )]}, σ ∈ {2c + 1 | c ∈√[N/2]}, a ∈ [N ], G ∈ RB⌈log(N/δ)⌉ , ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ G ˆ ∈ CB Output: u ˆ ′) ˆ B, σ, a, G, G procedure HashT oBins(x, z, uj = 0 ∀ j ∈ [B]; // Spectral Permutation and Sub-sampling for j ∈ {N − | supp(G)|/2 , N + | supp(G)|/2 − 1} do ujmodB = ujmodB + xσ(j−a) mod N Gj−N +| supp(G)|/2 ; end // Sub-sampled DFT ˆ = FFTB (u); u // Efficient convolution calculation ˆ do for j ∈ supp(z) ˆ′ ˆj ω σaj ; ˆ hf (j,σ,N,B)modB = u ˆ hf (j,σ,N,B)modB − G u |of (j,σ,N,B)| z ˆ′ ˆ h (j,σ,N,B)modB = u ˆ h (j,σ,N,B)modB − G zˆj ω σaj ; u c

c

|oc (j,σ,N,B)|

end

ˆ return u

In this case, due to the designed flat window has a doubled bandwidth, each sparse component has two hashes located in the offsets of (j, σ, N, B) and oc (j, σ, N, B) given by the Eqs. 15 and 16, respectively. ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

85|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

of (j, σ, N, B) = πp (j, σ, N ) − hf (j, σ, N, B)N/B

(15)

oc (j, σ, N, B) = πp (j, σ, N ) − hc (j, σ, N, B)N/B

(16)

Where hf (j, σ, N, B) = ⌊πp (j, σ, N )B/N ⌋ is the floor-hash function and hc (j, σ, N, B) = ⌈πp (j, σ, N )B/N ⌉ is the ceil-hash function; hf (j, σ, N, B) ˆ for mod B and hc (j, σ, N, B) mod B are indexes of the two hashes in u a single sparse component of x. The proposed flat window improves the accuracy of the algorithm by solving the problem of the zero-value hashes, and it reduces the sampling cost by considering a smaller support. 3.1.3 Modified procedures LocateSignal and LocateInner These modified procedures are presented in Alg. 7 and Alg. 8, respectively. In this case, we reduced the number of modified procedure HashToBins used by the modified procedure LocateInner by fixing the parameter a of the pseudo-random spectral permutation; this allows the usage of the same ˆ for all the iterations of the modified procedure LocateInner, hashes-error u thus the sampling cost and the speed are improved. Algorithm 7: Modified locate signal function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log√2(N )]}, ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ , s ∈ R, G ∈ RB⌈log(N/δ)⌉ , G Rloc ∈ N+ Output: L ∈ NO(B) ˆ ′ , s, Rloc ) ˆ B, G, G procedure LocateSignal(x, z, Choose a uniformly at random a from the set [N ]; Choose σ uniformly at random from the set of odd numbers in [N ]; (0) lj = jN/B ∀ j ∈ [B]; w0 = N/B; t = log2 N ; t′ = t/4; Dmax = ⌈logtprime (w0 + 1)⌉; ˆ ′ ); ˆ ˆ B, σ, a, G, G u=HashT oBins(x, z, // Main loop for D ∈ [Dmax ] do ˆ ′ , s, Rloc , ˆ B, σ, a, G, G l(D+1) =LocateInner(x, z, ˆ l(D) ); w0 /(t′ )D , t, u, end (Dmax )

L = unique{σ −1 ⌊lj

+ 0.5⌋|j ∈ [B]};

return L

|86

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

Algorithm 8: Modified locate signal inner function. Input: x ∈ CN , zˆ ∈ CN , B ∈ {2c | c ∈ [log 2(N )]}, σ ∈ {2c + 1 | c ∈√[N/2]}, a ∈ [N ], G ∈ RB⌈log(N/δ)⌉ , ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ , s ∈ R, Rloc ∈ N+ , w ∈ R, G ˆ ∈ CB , l ∈ NO(B) t ∈ N+ , u ′ Output: l ∈ NO(B) ˆ ′ , s, Rloc , w, t, u, ˆ B, σ, a, G, G ˆ procedure LocateInner(x, z, l) vj,q = 0 ∀ (j, q) ∈ [B] × [t]; // Main loop for i ∈ [Rloc ] do Choose β uniformly at random from the set t sN t {⌊ sN 4w ⌋, . . . , ⌊ 2w ⌋}; ′ ˆ ˆ ′ ); ˆ σ, a + β, G, G u =HashT oBins(x, B, z, for j ∈ [B] do if lj 6= ⊥ then ˆ j /uˆ′ j ; rj = u cj = arctan2(ℑ{rj }, ℜ{rj }); if cj < 0 then cj = cj + 2π; end for q ∈ [t] do mj,q = lj + q+1/2 t w; 2πβm θj,q = N j,q mod 2π; if min{|θj,q − cj |, 2π − |θj,q − cj |} < sπ then vj,q = vj,q + 1; end end end end end for j ∈ [B] do Q = {q ∈ [t] | vj,q > Rloc /2} if Q 6= ∅ then l′ = lj + minq∈Q qw t ; else l′ = ⊥; end end return l′

ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

87|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

3.1.4 Modified procedure EstimateValues This modified procedure, presented in Alg. 9, improves the estimation of the DFT values by adding a correction factor that cancels the effect of the windowing in DFT domain. Algorithm 9: Modified estimate values function. Input: x ∈ CN , zˆ ∈ CN√ , G ∈ RB⌈log(N/δ)⌉ , ˆ ′ ∈ RN ⌈1/(2B)+ 2 log(1/δ)/σg )⌉ , B ∈ {2c | c ∈ [log 2(N )]}, G L ∈ NO(B) , K ′ ∈ N+ , Rest ∈ N+ ′ ′ ˆ J ∈ Cmin{| supp(L)|,K } , Jˆ ∈ Nmin{| supp(L)|,K } } Output: {w ˆ ′ ,B, L, K ′ , Rest ) ˆ G, G procedure EstimateV alues(x, z, ˆ j = 0 ∀ j ∈ [| supp(L)|]; w // Main loop for i ∈ [Rest ] do Choose ai uniformly at random from [N ]; Choose σi uniformly at random from the set of odd numbers in [N ]; ˆ ′ ); ˆ ˆ σi , ai , G, G u=HashT oBins(x, B, z, for j ∈ [| supp(L)|] do ˆ′ ˆ hr (Lj ,σi ,N,B)modB ω −σi ai Lj /G uˆ′ i,j = u |or (Lj ,σi ,N,B)| end end for j ∈ [| supp(L)|] do // Median for real and imaginary parts separately across the i dimension ˆ j = mediani uˆ′ i,j w end ˆ J k2 J = arg max| supp(J )|=min{| supp(L)|,K ′ } kw ˆ J , J} return {w

3.1.5 Modified procedure sFFT Taking into account the above improvements, the proposed modified Nearly Optimal sFFT algorithm is presented in Alg. 10, and it has the following input parameters: the time domain signal x, the sparsity order K, the parameter δ of the flat window [2], the location threshold s, and the number of estimation iterations (Rest ). The modified algorithm simplifies the calculation of the parameters B and K by halving their values during odd-numbered iterations (r mod 2 = 1), also the modified algorithm improves the accuracy of the DFT estimaˆ tion z.

|88

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

Algorithm 10: Modified sFFT algorithm. Input: x ∈ CN , K ∈ N+ , δ ∈ R, s ∈ R , Rest ∈ N+ Output: zˆ ∈ CN procedure sF F T (x , K, δ, s, Rest ) RsF F T = ⌊log2 (K)⌋; B = 2⌈log2 (K)⌉+1 ; Rloc = ⌊log2 log2 N ⌋; ˆ ′ by using equations 11, 12, 13, and Calculate flat window G, G 14; zˆj = 0 ∀ j ∈ [N ]; // Main loop for r ∈ [RsF F T ] do // Location of Components ˆ ′ , s, Rloc ); ˆ B, G, G L= LocateSignal(x, z, // Estimation of Components ˆ ′ , L, 3K, ˆ B, G, G {wJ , J} =EstimateV alues(x, z, Rest ); zˆLJ = zˆLJ + wJ // Reduction of Window Support if (B > 2) ∧ (r mod 2 = 1) then B = B/2; K = ⌊K/2⌋; ˆ ′ by using equations 11, 12, Calculate flat window G, G 13, and 14; end end return zˆ

4

Experimental results

We developed a software implementation of the modified sFFT algorithm named sFFT-4.0 [1],[2] by using the modified procedures HashToBins, LocateSignal, LocateInner, and EstimateValues. The Linux-based software R implementation was developed using C language, the Intel C Compiler, OpenMP [14], and the Intel MKL library; it was tested on one node of the Adroit cluster at Princeton University by using eight cores; and it was R integrated with MATLAB by using MEX-files shared libraries. We used OpenMP in some portions of the code in order to execute up to eight calls of the procedure HashToBins in parallel, hence we parallelized the main forloops of the procedures LocateInner and EstimateValues. The source code of sFFT-4.0 is available at https://sourceforge.net/projects/sfft40/. Finally, in order to test the performance of sFFT-4.0 implementation; ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

89|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

first, we developed some comparisons of sFFT-4.0 implementation against the previous sFFT implementations AAFFT and sFFT-2.0 in terms of runtime and accuracy versus Signal to Noise Ratio (SNR); and second, we present the achieved improvements when the multicore architecture is used for sFFT-4.0 implementation.

4.1

Comparison tests

Figure 1(a) shows simulation results of the runtime versus N for K = 50; where, red plot represents the runtime of sFFT-4.0 using a single core, the blue plot represents the runtime of sFFT-2.0 [1], and the green plot represents the runtime of AAFFT [7]. Even though our implementation is running on a single core, it is faster than AAFFT for all N values and faster than sFFT-2.0 for N ≥ 218 . Figure 1(b) shows the simulation results of the runtime versus K for N = 222 ; in this case, sFFT-4.0 is faster than AAFFT and sFFT-2.0 for all K values. sFFT−/4.0/2.0/AAFFT runtime vs. sparsity order for N = 222

2

10

sFFT−4.0 one core sFFT−2.0 AAFFT

sFFT−/4.0/2.0/AAFFT runtime vs. signal size for K = 50

1

10

sFFT−4.0 one core sFFT−2.0 AAFFT

1

10 Runtime [seconds]

Runtime [seconds]

0

10

−1

10

0

10

−1

10

−2

10

−2

10

−3

10

214

215

216

217

218

219

220

Signal size N

(a)

221

222

223

224

2

10

3

10 Sparsity order K

4

10

(b)

Figure 1: 1(a) Comparison of sFFT-4.0 runtime versus N for K = 50 with two sFFT implementations. 1(b) Comparison of sFFT-4.0 runtime versus K for N = 222 with two sFFT implementations.

Figure 2 shows the simulation results of the l1 -error versus SNR for N = 222 and K = 50; where the blue plot represents the error for sFFT-4.0 and the red plot represents the error for sFFT-2.0, the results for AAFFT

|90

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

are not worthy and were not included due to this algorithm is extremely sensitive to noise. sFFT−2.0/4.0 error vs. SNR (K = 50, N=222)

2

10

sFFT−2.0 sFFT−4.0 0

Error per large frequency

10

−2

10

−4

10

−6

10

−8

10

−10

10

−20

0

20

40 60 SNR [dB]

80

100

120

Figure 2: sFFT-4.0 error versus SNR for N = 222 and K = 50.

From 2 it can be seen that for values of SNR greater than −3 dB sFFT4.0 has improved accuracy when compared with sFFT-2.0.

4.2

Multicore tests

Figure 3(a) shows simulation results of the runtime versus N for K = 50; where, red plot represents the runtime of FFTW3 [8], the blue plot represents the runtime of sFFT-4.0 when one core is used, and the the green plot represents the runtime of sFFT-4.0 when eight cores are used. In this case, sFFT-4.0 is faster than FFTW3 for N above to 216 in the eight cores case; however, there is no a significant speed improvement by using the multicore architecture, this is due to the low sparsity of signal that leads to for-loops with small number of iterations which do not take advantage of the multicore architecture, hence the OpenMP’s overhead times have a strong influence on the overall execution time [14]. Figure 3(b) shows the simulation results of the runtime versus K for N = 222 ; in this case, sFFT-4.0 is faster than FFTW3 for K below to 2500 when a single core is used. From Figure 3(b), it can be noted that there is a significant speed improvement by using the multicore architecture due ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

91|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

the increased sparsity of signal; in this case we achieved an acceleration near to 4×. 22

sFFT−4.0 runtime vs. sparsity order for N = 2

0

10 sFFT−4.0 runtime vs. signal size for K = 50

1

10

Runtime [seconds]

Runtime [seconds]

FFTW3 sFFT−4.0 one core sFFT−4.0 eight cores

0

10

−1

10

−2

10

−1

10

FFTW3 sFFT−4.0 one core sFFT−4.0 eight cores

−3

10

−2

10

−4

10

214

215

216

217

218

219

220

Signal size N

(a)

221

222

223

224

2

10

3

10 Sparsity order K

4

10

(b)

Figure 3: 3(a) sFFT-4.0 runtime versus N for K = 50. 3(b) sFFT-4.0 runtime versus K for N = 222 .

Finally, the modified sFFT algorithm uses the following arithmetic operators: FFT, multiply-accumulate, trigonometric functions (sine, cosine, atan, atan2), modular inversion based on the Fermat’s Little Theorem, among others; these operators can be easily mapped to hardware, thus this advantage could facilitate the implementation of the modified sFFT algorithm on an FPGA or an ASIC.

5

Conclusions

In this paper we present a modified Nearly Optimal sFFT algorithm for the noisy case, this algorithm reduces the sampling cost and corrects the zero-hash issue of the original algorithm by doubling the bandwidth of the flat window and by modifying the original procedures. Additionally, we developed an efficient software implementation of the modified Nearly Optimal sFFT algorithm using a multicore platform, under certain conditions this implementation is faster than the optimized FFT library FFTW3 and previous sFFT implementations. To the best knowledge of the authors, the developed implementation is the first one of the Nearly Optimal sFFT algorithm for the noisy case reported in literature. Finally, the future work

|92

Ingeniería y Ciencia

Alexander López-Parrado, Jaime Velasco-Medina

will be addressed to perform an efficient hardware implementation of the modified Nearly Optimal sFFT algorithm using an FPGA.

Acknowledgement Alexander López-Parrado thanks Colciencias for the scholarship and Universidad del Quindío for the study commission.

References [1] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Simple and practical algorithm for sparse Fourier transform,” in ACM-SIAM Symposium on Discrete Algorithms (SODA). Kyoto, Japan: SIAM, 2012, pp. 1183–1194. [Online]. Available: http://dl.acm.org/citation.cfm?id=2095116.2095209 74, 76, 77, 83, 84, 89, 90 [2] ——, “Nearly optimal sparse Fourier transform,” in Proceedings of the 44th symposium on Theory of Computing (STOC), New York, USA, May 2012, pp. 563–578. 74, 75, 76, 77, 78, 79, 81, 83, 88, 89 [3] P. Indyk, M. Kapralov, and and Eric Price, “(Nearly) Sample-Optimal Sparse Fourier Transform,” in ACM-SIAM Symposium on Discrete Algorithms (SODA), Portland, USA, 2014. 74, 76 [4] H. Hassanieh, L. Shi, O. Abari, E. Hamed, and D. Katabi, “GHz-wide sensing and decoding using the sparse Fourier transform,” in INFOCOM, 2014 Proceedings IEEE. IEEE, 2014, pp. 2256–2264. [Online]. Available: http://dx.doi.org/10.1109/INFOCOM.2014.6848169 74 [5] A. C. Gilbert, S. Muthukrishnan, and M. J. Strauss, “Improved time bounds for near-optimal sparse Fourier representations,” in Proceedings of SPIE Wavelets XI, San Diego, USA, 2005, pp. 1–15. 74, 76 [6] A. C. Gilbert, Y. Li, E. Porat, and M. J. Strauss, “Approximate sparse recovery: optimizing time and measurements,” in Proceedings of the 42nd ACM symposium on Theory of computing, Cambridge, USA, 2012. 74, 76 [7] A. C. Gilbert, M. J. Strauss, and J. A. Tropp, “A Tutorial on Fast Fourier Sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 57–66, 2008. [Online]. Available: http://dx.doi.org/10.1109/MSP.2007.915000 74, 76, 83, 90 ing.cienc., vol. 11, no. 22, pp. 73–94, julio-diciembre. 2015.

93|

Efficient Software Implementation of the Nearly Optimal Sparse Fast Fourier Transform for the Noisy Case

[8] M. Frigo and S. G. Johnson, FFTW, 3rd ed., Massachusetts Institute of Technology, 2012. [Online]. Available: http://www.fftw.org/ 74, 91 [9] C. Wang, M. Araya-Polo, S. Chandrasekaran, A. St-Cyr, B. Chapman, and D. Hohl, “Parallel Sparse FFT,” in Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, New York, NY, USA, 2013, pp. 10:1—-10:8. [Online]. Available: http://dx.doi.org/10.1145/2535753.2535764 75 [10] J. Hu, Z. Wang, Q. Qiu, W. Xiao, and D. J. Lilja, “Sparse Fast Fourier Transform on GPUs and Multi-core CPUs,” in Computer Architecture and High Performance Computing (SBAC-PAD), 2012 IEEE 24th International Symposium on, Oct. 2012, pp. 83–91. [Online]. Available: http://dx.doi.org/10.1109/SBAC-PAD.2012.34 75 [11] J. Schumacher, “High Performance Sparse Fast Fourier Transform,” Master Thesis, ETH Zurich, Department of Computer Science, 2013. [Online]. Available: http://goo.gl/3alHvS 75 [12] A. Dutt and V. Rokhlin, “Fast Fourier transforms for nonequispaced data, II,” Applied and Computational Harmonic Analysis, vol. 2, no. 1, pp. 85–100, 1995. [Online]. Available: http://dx.doi.org/10.1006/acha.1995.1007 76 [13] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 10th ed. Washington, DC, USA: Dover, 1972. 85 [14] OpenMP, OpenMP Application Program Interface, OpenMP Architecture Review Board Std., 2013. 89, 91

|94

Ingeniería y Ciencia