Image Hiding on Audio Subband Based On Centroid in Frequency Domain

Audio watermarking is a mechanism for hiding data in audio. The data hiding methods used in this paper are Lifting Wavelet Transform (LWT), Fast Fourier Transform (FFT), Centroid and Quantization Index Modulation (QIM). The first step is to segment the host audio into several frames; the selected sub-band is then transformed by the FFT from the time domain to the frequency domain. The centroid process finds the center of frequency, which is used as the insertion location to obtain a more stable output. The embedding process is done by QIM. With adjusted parameters, the watermarking system achieves imperceptibility with Signal to Noise Ratio (SNR) > 21 dB and Mean Opinion Score (MOS) > 3.8, at a capacity of 86.13 bps. In addition, for most of the attacked watermarked audio files, the method is resistant to several attacks, such as Low Pass Filter (LPF) with fco > 6 kHz, Band Pass Filter (BPF) with fco 50 Hz to 6 kHz, Linear Speed Change (LSC) and MP4 compression, with a Bit Error Rate (BER) of less than 20%.


INTRODUCTION
With the development of information and communication technology and the globalization of the internet, everyone can access content, especially audio, with almost limitless freedom. These developments improve the productivity of data transfer. However, the nature of the internet makes audio piracy impossible to control entirely. Irresponsible parties can obtain audio content freely, modify it and use it for their own advantage, which harms the original owner of the data. Thus, we need a method that protects the copyright of audio content, namely audio watermarking.
Digital watermarking is a method of embedding data into digital multimedia content. The content can be image, video or audio, while the embedded data can be an identity or unique data in the form of text or images (Hartung & Kutter, 1999). Several parameters define the quality of a watermarking result (Singh & Chadha, 2013):
1. Imperceptibility: the watermark cannot be heard by the human ear; it can only be detected by specific signal processing on a machine. The measured parameters are Signal to Noise Ratio (SNR) and Mean Opinion Score (MOS).
2. Robustness: how well the extracted watermark matches the original watermark after the watermarked audio undergoes attacks such as common signal processing, geometric signal attacks, or compression. Bit Error Rate (BER) is the parameter representing robustness.
3. Payload: the amount of watermark data embedded into the host audio, denoted C, in bits per second (bps).
Embedding a watermark in the frequency domain of audio has been reported by several authors using various methods. In (Budiman, Suksmono, & Danudirdjo, 2017), the authors present a design of an audio watermarking system based on Fast Fourier Transform (FFT) combined with Lifting Wavelet Transform (LWT) and Spread Spectrum (SS), but they did not describe the system's robustness against MP4 compression. In (Fan & Wang, 2008), with Discrete Cosine Transform (DCT) and Centroid combined, the authors obtain satisfying extraction results. However, among the attacks tested, MP3 compression (48 kbps) gives the lowest extraction value, resulting in unacceptable robustness. In (Budiman, Suksmono, & Danudirdjo, 2016), the authors compared FFT and DCT performance using a Fibonacci embedding method: with DCT the watermark reaches perfect robustness for all frame lengths, and with FFT it reaches BER = 0 for frame lengths greater than 256, but the authors did not describe the robustness against any attacks. In (Fallahpour & Megías, 2015), the authors presented a high-capacity audio watermarking method that modifies several FFT spectrum magnitudes using Fibonacci characteristics. They showed that the method obtains a high payload of up to 3 kbps with good imperceptibility and good robustness against common audio signal processing, including MP3 compression, but only for MP3 rates of at least 64 kbps.
Quantization Index Modulation (QIM) is a popular embedding method that was first introduced in (Chen & Wornell, 2001). In audio watermarking research, QIM has been developed further by combining it with other transform or decomposition methods, such as wavelet decomposition (Hu, Chen, & Hsu, 2014) or (Novamizanti, Budiman, & Wibowo, 2018), transformation to the frequency domain (Lei, Soon, & Tan, 2013), SVD (Agradriya, Perdana, Safitri, & Novamizanti, 2017) or QR decomposition (Dhar, 2014). In recent years, researchers have found that a watermark can also be embedded at the centroid location of the host audio, especially in the frequency domain. In (Hongxia & Mingquan, 2010), the authors proposed calculating the audio centroid in the time domain and embedding the watermark in a hybrid domain, but they stated that their research was for fragile audio watermarking. In (Hongxia & Mingquan, 2010), the authors used DCT to transform the audio to the frequency domain and calculated the centroid in the frequency domain before embedding the watermark into it by QIM. They described only the robustness against noise of 58 dB, Low Pass Filter (LPF) at 19.8 kHz, resampling (11 kHz and 22 kHz), echo, and MP3 with a minimum rate of 48 kbps. In (Zhang, Liu, & Huang, 2012), the authors used LWT and DCT before calculating the centroid and embedding with QIM; however, they did not describe the robustness against attacks completely. For example, the MP3 attack was carried out at only one rate, and the rate value was not stated.
In this paper, we propose an audio watermarking system that embeds the watermark at a centroid location in the frequency domain using the QIM method. We select this approach because a signal is highly robust in the frequency domain and its value at the centroid location is more stable, which also increases robustness.
The embedding process has five steps. First, the host audio is segmented into several frames, and LWT splits each frame into a high subband containing the high frequencies and a low subband containing the low frequencies; the LWT stage selects which subband will be embedded. Second, FFT transforms the host signal from the time domain to the frequency domain, in which the signal is more robust. Third, the centroid process finds the central point of the frequency as the insertion location, resulting in a more stable output. Fourth, the watermark data is embedded by using Quantization Index Modulation. Fifth, after the watermark is embedded, the Inverse FFT (IFFT) and Inverse LWT (ILWT) return the watermarked audio to the time domain, where the SNR and Objective Difference Grade (ODG) are calculated.
The extraction process is mostly the same as the embedding process. First, the watermarked audio is framed and transformed using LWT to obtain the subband that was used for embedding. Second, the FFT transforms the subband to the frequency domain. Third, the centroid of each frame is calculated. Fourth, the watermark is extracted by using QIM. Finally, in the fifth step, the extracted watermark data is compared with the original watermark data to calculate the BER as the measure of watermarking robustness.
The purpose of combining these methods is to obtain audio watermarking performance with high imperceptibility and robustness against attacks. The LWT-FFT combination before the centroid calculation selects a signal with high power, in a robust domain, with a high capacity for the watermark to be embedded. The frequency domain is a robust domain for information hiding, while QIM is a watermarking method that is suitable for hiding data in a high-power signal and also has good imperceptibility. Centroid is chosen for calculating the embedding location in the frequency domain because it is a statistical calculation that yields a robust value, similar to an averaging calculation.

Lifting Wavelet Transform (LWT)
LWT is usually used to decrease processing time and memory requirements. It has several advantages over the conventional wavelet transform: (a) the LWT calculation is more efficient because its complexity is lower than that of the Discrete Wavelet Transform (DWT), (b) it needs less memory than the conventional wavelet transform, (c) it makes a non-linear wavelet decomposition easy to build, and (d) it has localization features in frequency, which reduce the weakness of the conventional wavelet transform (Dhar, 2014). The main principle of LWT is to build a new wavelet with several advantages over the conventional wavelet. The LWT process for audio consists of the following steps (Sweldens, 1997):
1. Split/Decomposition divides the data into two parts, odd (xo) and even (xe). The original data x(n) is divided into even and odd samples with the following formulas:

$$x_e(n) = x(2n) \tag{1}$$
$$x_o(n) = x(2n + 1) \tag{2}$$

2. Predict (P) uses a function that approximates the data set. The odd elements of the data set are replaced by the difference between the approximated data and the actual data, and the remaining elements become the input for the next step of the transform. After the data is divided into the odd part (xo) and the even part (xe), the output of the wavelet function (high-pass filter), denoted d(n), is calculated, with xe(n) used to predict xo(n), as follows:

$$d(n) = x_o(n) - P[x_e(n)] \tag{3}$$

3. Update (U) replaces the even samples with average values. The result becomes the input of the next step of the wavelet transform. The odd elements are also rewritten in the original data set, forming the filter. The output of the scaling function (low-pass filter) is denoted c(n) and is calculated as follows:

$$c(n) = x_e(n) + U[d(n)] \tag{4}$$
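The split, predict and update steps can be illustrated with a minimal Python sketch. It assumes the simplest (Haar-style) choice of operators, P[xe(n)] = xe(n) and U[d(n)] = d(n)/2; the paper does not state which wavelet is used, so these operators and the function names `lwt_haar` and `ilwt_haar` are illustrative assumptions only.

```python
def lwt_haar(x):
    """One level of a Haar-style lifting transform (split, predict, update).

    Assumption: P[xe(n)] = xe(n) and U[d(n)] = d(n) / 2. These operators are
    chosen for illustration and are not taken from the paper.
    """
    if len(x) % 2:
        raise ValueError("input length must be even")
    even = x[0::2]  # split: xe(n) = x(2n)
    odd = x[1::2]   # split: xo(n) = x(2n + 1)
    # predict: detail (high-pass) coefficients d(n) = xo(n) - P[xe(n)]
    d = [o - e for o, e in zip(odd, even)]
    # update: approximation (low-pass) coefficients c(n) = xe(n) + U[d(n)]
    c = [e + di / 2 for e, di in zip(even, d)]
    return c, d


def ilwt_haar(c, d):
    """Invert lwt_haar by undoing update, predict and split in reverse order."""
    even = [ci - di / 2 for ci, di in zip(c, d)]
    odd = [di + e for di, e in zip(d, even)]
    x = [0.0] * (2 * len(c))
    x[0::2] = even
    x[1::2] = odd
    return x


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
low, high = lwt_haar(signal)     # low-pass and high-pass subbands
restored = ilwt_haar(low, high)  # perfect reconstruction of the input
```

With these operators the low-pass subband holds the pairwise averages of the input, and the inverse transform recovers the input exactly, which is the property that lets the watermarked subband be mapped back to the time domain.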

Fast Fourier Transform (FFT)
The Fast Fourier Transform is an efficient algorithm for computing the Discrete Fourier Transform (DFT); it calculates the DFT with relatively low complexity and a fast calculation time. The formula that changes the signal from the time domain to the frequency domain is as follows (Neyman, Pradnyana, & Sitohang, 2014):

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2 \pi k n / N}$$

where X(k) is the transformed value in the frequency domain, x(n) is the digital media block value, N is the number of data samples to be transformed to the frequency domain, n is the sample index in the time domain and k is the sample index in the frequency domain.
The formula of the Inverse FFT, which changes the signal back from the frequency domain to the time domain, is as follows (Neyman et al., 2014):

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2 \pi k n / N}$$
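As a hedged illustration of the forward and inverse transform pair, the sketch below evaluates the DFT definition directly using only the Python standard library. It is not an FFT (it runs in O(N²) time), but it produces the same values an FFT would; in practice a library routine such as `numpy.fft.fft` would be used instead. The function names `dft` and `idft` and the sample frame are illustrative.

```python
import cmath


def dft(x):
    """Direct evaluation of X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def idft(X):
    """Direct evaluation of x(n) = (1/N) * sum_k X(k) * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]


frame = [1.0, 2.0, 0.0, -1.0]                # a short time-domain frame
spectrum = dft(frame)                        # complex frequency-domain values
recovered = [v.real for v in idft(spectrum)]  # back to the time domain
```

The round trip returns the original frame up to floating-point error, which is why a magnitude modified in the frequency domain can be carried back to the time domain by the inverse transform.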

Centroid
The peak of the spectrum is named the Sub-band Spectrum Centroid (SSC). The peak of the spectrum is influenced relatively little when the watermark is embedded. The centroid represents the center of the energy distribution of each audio frame. The spectrum centroid of each frame is calculated as (Chu & Champagne, 2008):

$$C = \frac{\sum_{k=1}^{n} k\, |X[k]|}{\sum_{k=1}^{n} |X[k]|}$$

where C is the spectrum centroid, k is the index, |X[k]| is the amplitude of the signal, and n is the number of samples.
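The centroid calculation can be sketched in a few lines of Python as the amplitude-weighted mean of the bin index. This is a minimal sketch that assumes 1-based bin indices and a plain amplitude weighting; the function name `spectral_centroid` and the sample magnitudes are hypothetical.

```python
def spectral_centroid(magnitudes):
    """Amplitude-weighted mean of the bin index (bins are numbered from 1).

    Assumption: plain amplitude weighting; other weightings (for example
    power weighting) are possible and would give a different centroid.
    """
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    weighted = sum(k * m for k, m in enumerate(magnitudes, start=1))
    return weighted / total


magnitudes = [0.0, 1.0, 2.0, 1.0, 0.0]  # magnitude spectrum of one frame
centroid = spectral_centroid(magnitudes)
centroid_bin = round(centroid)           # candidate embedding location
```

Because the centroid is a weighted average over the whole frame, a small change to a single bin moves it only slightly, which matches the stability argument made for choosing it as the embedding location.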
1.4 Quantization Index Modulation (QIM)
Quantization Index Modulation is one of the most widely used methods for inserting watermark data into host audio. QIM can be applied in the time domain or the frequency domain. In QIM embedding (Hu et al., 2014), the amplitude of the original audio signal is replaced by a quantized amplitude that is selected according to the original watermark bit. The candidate quantization levels are indexed by the integers 0, ±1, ±2, … and are determined by the quantization index and the number of quantization bits. In extraction, the watermark bit is recovered from the quantized amplitude.
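The idea behind QIM can be shown with a generic two-quantizer sketch in Python. It is not the exact formula of (Hu et al., 2014): it assumes that bit 0 maps to even multiples of a step `delta` and bit 1 to odd multiples, and it takes `delta` directly as a parameter rather than deriving it from the number of quantization bits. The names `qim_embed` and `qim_extract` are illustrative.

```python
def qim_embed(value, bit, delta):
    """Quantize value to the nearest level of the form (2k + bit) * delta.

    Assumption: generic QIM with two interleaved lattices, not the paper's
    exact embedding formula.
    """
    k = round((value - bit * delta) / (2 * delta))
    return (2 * k + bit) * delta


def qim_extract(value, delta):
    """Recover the bit from the parity of the nearest multiple of delta."""
    return round(value / delta) % 2


delta = 0.25                      # hypothetical quantization step
host_value = 0.37                 # e.g. a magnitude at the centroid location
marked0 = qim_embed(host_value, 0, delta)  # nearest even multiple of delta
marked1 = qim_embed(host_value, 1, delta)  # nearest odd multiple of delta

# A disturbance smaller than delta / 2 does not change the extracted bit.
bit_after_noise = qim_extract(marked1 + 0.1, delta)
```

A larger step makes the embedded bit survive stronger disturbances but moves the host value further, which is the robustness versus imperceptibility trade-off discussed in the results.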

METHODOLOGY
The purpose of this paper is to design and analyze the performance of audio watermarking using the LWT-FFT method, with Centroid-based location determination and QIM as the insertion technique. The initial step is to design an audio watermarking block diagram, which consists of an embedding process and an extraction process. The final step is to calculate the performance of the audio watermarking system. The performance parameters are SNR and ODG as the imperceptibility parameters, BER as the robustness parameter, MOS as the subjective imperceptibility parameter, and C as the watermark capacity parameter. The research methodology is as follows: research problem identification, which is described in the first four paragraphs of Section 1; data preparation, which is described at the beginning of Section 3; embedding and extraction process design, which is described in this section; and the experiment, which is described in Section 3, especially in Subsection 3.1.

Embedding Process
The embedding process inserts the watermark data into the host audio. Its steps are shown in Figure 1 and described below:
1. The host audio goes through a framing process, in which the whole audio is divided into audio frames. The framing process also limits the audio duration. The initial sample and the final sample of each frame are determined from the frame length.
2. Each frame, still in the time domain, is decomposed by LWT into a low-frequency subband xL(n) and a high-frequency subband xH(n).
3. xL(n) is transformed into the frequency domain by FFT, because the centroid process is carried out in the frequency domain. The output is called X(k).
4. X(k) forms a square matrix. The centroid method is used to find the center point amplitude of the signal, giving the output X(c).
5. Read the binary image as w(m,n) and, in the pre-processing block, reshape it into one dimension of size 1 × M to match the audio dimension. A value of "0" represents black and a value of "1" represents white. This one-dimensional watermark is denoted w(n).
6. Embed w(n) into X(c) with the QIM process by using Equations (9) and (10). The output of this process is called Xw(c).
7. Insert the modified magnitude Xw(c) back at the centroid location in the frequency domain to form Xw(k).
8. Apply IFFT and ILWT to return the watermarked audio to the time domain.

Extraction Process
The extraction process recovers the watermark from the watermarked audio. Its steps are described below:
1. Frame the watermarked audio and decompose each frame by LWT to obtain the subband that was used for embedding, xLw(n).
2. Convert xLw(n) from the time domain to Xw(k) in the frequency domain with the FFT process.
3. Xw(k) forms a square matrix. Apply the centroid method to find the center point amplitude of the signal, which locates the embedded watermark. The output is denoted Xw(c).
4. After the watermark location is detected by the centroid process, extract the watermark from the component value by the QIM method. This produces the output w'(n).
5. In the post-processing block, convert w'(n) from one dimension (1D) to two dimensions (2D), then convert each bit into a pixel, denoted w'(m,n).
6. Calculate the Bit Error Rate (BER) value.
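The framing, watermark reshaping and BER steps listed above can be sketched in Python as follows. This is a minimal sketch that assumes non-overlapping frames (a trailing partial frame is dropped) and a row-by-row reshape of the binary image; the paper's exact framing equation is not reproduced here, and all function names are hypothetical.

```python
def split_frames(samples, frame_len):
    """Split samples into non-overlapping frames (assumed framing scheme)."""
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def flatten_image(image):
    """Pre-processing: reshape a binary image w(m,n) into a 1-D sequence w(n)."""
    return [bit for row in image for bit in row]


def reshape_bits(bits, width):
    """Post-processing: reshape extracted bits w'(n) back into rows w'(m,n)."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def bit_error_rate(original, extracted):
    """Fraction of watermark bits that differ after extraction."""
    if len(original) != len(extracted):
        raise ValueError("bit sequences must have the same length")
    errors = sum(a != b for a, b in zip(original, extracted))
    return errors / len(original)


frames = split_frames(list(range(10)), frame_len=4)
watermark = [[1, 0, 1],
             [0, 1, 0]]                    # 0 = black pixel, 1 = white pixel
bits = flatten_image(watermark)            # 1 x M sequence, M = 6
extracted_bits = [1, 0, 0, 0, 1, 0]        # hypothetical result, one bit flipped
ber = bit_error_rate(bits, extracted_bits)
extracted_image = reshape_bits(extracted_bits, width=3)
```

In this toy case one of six bits is wrong, so the BER is 1/6; a BER of 0 means the extracted image is identical to the original watermark.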

RESULT AND DISCUSSION
In this section, we describe the results and analysis of the process from the previous section. In the experiment, we use five host audio files in *.wav format, each about one minute long. The host audio files are host.wav, piano.wav, guitar.wav, drums.wav, and bass.wav.

System Parameters Testing
In this subsection, we discuss finding the best parameters before the attacks and the optimization process. Four parameters are used: level of decomposition (N), length of frame (Nframe), number of quantization bits (Nbit) and threshold. Table 1 shows the best parameters for the attack process, tested with host.wav as the default host audio. With these initial parameters, the audio watermarking process gives SNR = 37.92 dB, BER = 0 and Capacity = 1.34 bit/s. With these parameters, the watermark extracted without any attack is exactly the same as the original watermark. Then, using those parameters, we attack the watermarked audio with several types of attacks. Eight types of attacks are used in the robustness test of this audio watermarking system: LPF, Band Pass Filter (BPF), noise, resampling, time scale modification (TSM), linear speed change (LSC), MP3 compression and MP4 compression. In this experiment, 5 different host audio files are used: host.wav, piano.wav, guitar.wav, drums.wav, and bass.wav. Table 2 shows that the average BER of the extracted result for each watermarked audio is about 0.34 to 0.37, which means poor watermark quality. With the initial parameters, the data cannot survive the various attacks (Low Pass Filter, Band Pass Filter, Noise, Resampling, Time Scale Modification, Linear Speed Change, MP3 and MP4 Compression). Next, we select 5 host audio files, with one attack for each file, for parameter optimization. This optimization is performed to obtain better robustness than the unoptimized case for the 5 host audio files with the selected attacks shown in the highlighted cells of Table 2.

Optimized Parameter
After the 5 host audio files are attacked, we choose 5 combinations of host audio and attack to be optimized. The combinations are chosen based on a BER value that still leaves room for optimization (BER < 0.4). The samples are Resampling 16k, MP2 Compression 64k, MP3 Compression 128k, MP4 Compression 32k and BPF with cut-off 100-6k. The results for the 5 optimized parameter sets are shown in Table 3. Robustness increases after optimization, which means that the BER decreases. However, when robustness increases, the other performance measures decrease. For example, for the BPF attack, the BER decreases from 0.37 to 0.25 after optimization and, as a consequence, the SNR decreases from 63.2 dB to 30.44 dB. This happens because there is a trade-off between BER and SNR. Based on Tables 3 and 4, we choose the parameters obtained for the BPF attack with cut-off 100-6k as the optimized parameters for all attacks, because their average BER is the lowest. These selected parameters are then used in the next experiment to measure the robustness against all attacks. The best optimized parameters are shown in Table 5. Compared with Table 2, Table 6 shows that the overall robustness is much better. With the adjusted parameters, the system is mostly more robust to the various attacks, with the average BER per host decreasing to the range 0.12 to 0.23. The most robust watermarked audio is guitar.wav, with the lowest average BER of 0.12, and the weakest is bass.wav, with the highest average BER of 0.23.

3.5 Watermarked Audio Quality
To measure the quality of the audio after the watermark has been inserted, we perform subjective and objective measurements. The subjective measurement is the Mean Opinion Score (MOS). The objective measurement is the Signal to Noise Ratio (SNR). The MOS value is rated by 30 respondents who listen to the 5 original host audio files and the 5 watermarked audio files. SNR is computed by the program with the following formula:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n} x(n)^2}{\sum_{n} \left( x(n) - x_w(n) \right)^2}$$

where x(n) is the host audio and x_w(n) is the watermarked audio. The audio quality in SNR and MOS is displayed in Table 8.
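The objective measurement can be sketched in Python using the standard definition of SNR, the ratio of the host signal power to the power of the embedding distortion, expressed in dB. The function name `snr_db` and the sample values are illustrative, and the paper's own implementation may differ in detail.

```python
import math


def snr_db(host, watermarked):
    """SNR in dB between the host audio and the watermarked audio."""
    signal_power = sum(x * x for x in host)
    noise_power = sum((x - y) ** 2 for x, y in zip(host, watermarked))
    if noise_power == 0:
        return math.inf  # identical signals: no distortion at all
    return 10 * math.log10(signal_power / noise_power)


host = [1.0, -1.0, 1.0, -1.0]
watermarked = [1.1, -1.0, 1.0, -1.0]  # one sample changed by the embedding
quality = snr_db(host, watermarked)   # about 26 dB for this toy example
```

A higher SNR means the watermarked audio is closer to the host, so a stronger embedding (for example a larger quantization step) lowers the SNR while usually lowering the BER.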

3.6 Audio Watermarking Performance Comparison
To understand how well this method performs, we compare it with the previous methods in (Fan & Wang, 2008) and (Hongxia & Mingquan, 2010). Both previous methods also use the centroid to calculate the watermark location, but they use different techniques for pre-processing the host audio. Table 9 displays the performance comparison in terms of imperceptibility, robustness and capacity. NA means not available. In Table 9, our method has the largest watermark capacity, 86.13 bps, with acceptable imperceptibility. However, our method obtains lower robustness and lower imperceptibility than the previous methods; the high watermark capacity is paid for with lower robustness and imperceptibility. Nevertheless, the robustness and imperceptibility of our method are still in the acceptable subjective range, as described in Sections 3.4 and 3.5.

CONCLUSION
In this paper, we combined Lifting Wavelet Transform, Fast Fourier Transform, Centroid calculation and QIM as the embedding method. The proposed system has good robustness against several attacks for most of the host audio files, with a BER of less than 20% taken as acceptable robustness. The attacks against which most host audio files are robust include LPF with a cut-off of 9k, resampling at a rate of 16k, TSM of 1%, LSC for all criteria, MP3 compression at rates above 64 kbps and MP4 compression at rates above 32 kbps. The imperceptibility is also good for all types of host audio, with SNR ranging from 21.24 dB to 31.26 dB. From the survey, MOS is also in a good range, between 3.87 and 4.14. The capacity, or payload, of the watermark embedded in the audio with the optimized parameters is high, at 86.13 bps.
ELKOMIKA - 41

ACKNOWLEDGEMENT
This research is supported by Research and Community Service of Universitas Telkom.
While the audio frames are still in the time domain, they are decomposed by LWT. Decomposing an image produces four bands of coefficient matrices, namely Low-Low (LL), Low-High (LH), High-Low (HL), and High-High (HH), whereas decomposing audio produces only low-pass coefficients (L) and high-pass coefficients (H). The lifting scheme is proposed to reduce the calculation time: LWT simplifies the problem by analyzing it directly in the integer domain, so that it computes more efficiently and requires only a small memory space. The output of this process is the LWT coefficients of x(n), consisting of the low-frequency part xL(n) and the high-frequency part xH(n). xL(n) is then transformed into the frequency domain by the Fast Fourier Transform (FFT).

Figure 1. Embedding Process

Table 2. Robustness test results with initial parameters

Table 3. Comparison Before and After Optimization

Robustness Test Results from Optimized Parameter
The watermarked audio produced with the input parameters in Table 5 was attacked with all types of attacks. The BER of bass.wav is 0.37 before optimization and 0.191 after optimization, which means that the BER decreases by more than 40% after optimization. The results of watermark extraction are shown in Table 6.

Table 6. Results of Optimization for All Attacks

Table 8. MOS and SNR for 5 types of host audio

Objectively, the watermarked audio with the highest quality is piano.wav with SNR = 31.26 dB, but subjectively the highest quality is guitar.wav with MOS = 4.14. The MOS value depends on the listeners' hearing ability, so the small discrepancy between the subjective and objective measurements shown in the table above is reasonable.

Table 9. Performance Comparison with Previous Methods