Friday, March 25, 2011

How I tried to improve WIFI reception in the garage

By: Bob Casiano



Figure 1: The Squirrel Project


“Squirrel!!” moment. While working on the ATX-FPS-v0 project, I realized I needed to improve my WIFI signal strength. After reviewing several Google-listed projects using a Pringles can, a soup can, etc., I decided to go with a coffee can, since I happened to have one nearby. This project will be posted later this week in my projects section. While working on what I thought would be a quick and dirty project, the software I downloaded to monitor signal strength turned out to have an infinite loop in the DLL it installed on my laptop! So, after rebooting (for a Windows update, of course), I got the blue screen of death with an explanation about the DLL. Three hours later, and pretty much out of patience to work on any project, I hit F8, reset the Windows startup to the previous working configuration, and set the virus checker to run after reboot. That took another 2.5 hours to complete! During that time I was able to trim, mow, and vacuum up the leaves on my sidewalks and driveway, and take care of a neighbor's dog while they are hunting this weekend. So, I wasted a day screwing around with an unimportant side project when I could have done something more constructive, like fixing my fence. The next time I have a “Squirrel” moment, I’ll remind myself about the WIFI episode.

Monday, March 21, 2011

Ptv k feed code software


Ptv k feed code software: a feed reader with artificial intelligence; create a new RSS feed from several others; syndicate your web content in 4 easy steps; and more.

Download here


Saturday, March 19, 2011

High liquid level activated switch

A method for activating a relay when the liquid level exceeds a predetermined level is shown here. DC voltage is required for driving the relay; the AC drive used for the LED and speaker circuits cannot be used here. Pin 9 of the IC can be used to solve this problem: a capacitor connected from this pin to ground keeps the internal output transistor steadily ON whenever the probe resistance goes higher than the reference resistor. External transistor Q1 is connected to the collector of the internal transistor, and the load (the relay) is connected at the collector of Q1. When the probe is not touching the water, it is equivalent to an open circuit, so the probe resistance will be many megohms, which is greater than Rref (13K). The internal transistor will be switched ON, and Q1, whose base is connected to the collector of the internal transistor, will be OFF, keeping the relay inactive. When the reverse occurs (the fluid level touches the probe), the internal transistor is switched OFF, which in turn switches Q1 ON and activates the relay. The load connected through the relay, whether a pump, lamp, alarm, solenoid valve or anything else, is then driven. Resistor R3 limits the collector current of the internal transistor, while resistor R4 protects the IC from transients.
Circuit diagram: high liquid level activated relay switch
Probe: The probe used here can be any metal rod, with the size and shape of your choice. The tank must be made of metal and should be properly grounded. For non-metal tanks, fix a small metal contact at the bottom level of the tank and ground it. The probe must be placed at the level you want to monitor.

Notes.
  • The circuit can be assembled on a Perf board.
  • I used 12V DC for powering the circuit.
  • Maximum supply voltage LM1830 can handle is 28V.
  • The tweeter I used was of a 16 ohm type.
  • The relay I used is a 200 ohm/12V type.
  • Maximum load current Q1 (2N2222) can handle is 800mA (see the quick coil-current check after these notes).
  • The switching current/voltage ratings of the relay must be according to the load you want to drive using it.
  • It is recommended to mount the IC on a holder.
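A quick sanity check on the figures above (assuming the 12V supply and the 200 ohm relay coil listed in the notes): the relay coil current is roughly I = V / R = 12V / 200Ω = 60mA, which is well within the 800mA limit of the 2N2222 used for Q1.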

Read more: circuitstoday.com/liquid-level-indicator

LM1830 based liquid level indicator circuit

LM1830 is a monolithic integrated circuit that can be used in liquid level indicator / control systems. Manufactured by National Semiconductor, the LM1830 can detect the presence or absence of polar fluids. Circuits based on this IC require a minimal number of external components, and an AC signal is passed through the sensing probe immersed in the fluid. Using an AC signal for detection prevents electrolysis, which makes the probes long lasting. The IC is capable of driving an LED, a high impedance tweeter, or a low power relay at its output.
Circuit diagram: low liquid level indicator (LED)
The circuit of a low liquid level indicator with LED is shown above. Capacitor Ct sets the frequency of the internal oscillator. With the given value of C1 the frequency will be around 6 kHz. Capacitor Cb couples the oscillator output to the probe and ensures that no DC signal is applied to the probe. The circuit detects the fluid level by comparing the probe-to-ground resistance with the internal reference resistor Rref.
When the probe-to-ground resistance goes above Rref, the oscillator output is coupled to the base of the internal output transistor, making it conduct. The LED connected to the collector (between pin 12 and Vcc) is driven. Since the base of the transistor is driven by the oscillator, the transistor is actually being switched at the oscillator’s output frequency at a 50% duty cycle. There is no problem in driving the LED with an AC signal, and this method is very useful when it comes to using a loudspeaker as the indicator. Loudspeakers can be driven only by AC signals; a DC signal will not produce any sound out of the speaker. The circuit diagram of a liquid level indicator using a loudspeaker at its output is shown below. The circuit is similar to the first circuit except that the LED is replaced by a loudspeaker and the load current limiting resistor is changed from 1.2K to 1.5K.
Circuit diagram: low liquid level warning (audio)
Notes.
  • The circuit can be assembled on a Perf board.
  • I used 12V DC for powering the circuit.
  • Maximum supply voltage LM1830 can handle is 28V.
  • The tweeter I used was of a 16 ohm type.
  • The relay I used is a 200 ohm/12V type.
  • Maximum load current Q1 (2N2222) can handle is 800mA.
  • The switching current/voltage ratings of the relay must be according to the load you want to drive using it.
  • It is recommended to mount the IC on a holder.

Read more: circuitstoday.com/liquid-level-indicator

Saturday, March 12, 2011

Speaker Recognition - Voice Recognition


By: Parker Evans

PROJECT SUMMARY

The goal for our project was to be able to distinguish, in real time, between various speakers based only on audio input. In the literature, this problem is known as speaker recognition, or, more precisely, speaker identification. A good deal of progress has been made in this field, but it is still an open research topic, and by no means a solved problem. Speaker recognition is applicable to many fields, including but not limited to artificial intelligence, cryptography, and national security.

PROJECT DESCRIPTION

Introduction
The goal for our project was to be able to distinguish, in real time, between various speakers based only on audio input. In the literature, this problem is known as speaker recognition, or, more precisely, speaker identification. A good deal of progress has been made in this field, but it is still an open research topic, and by no means a solved problem. Speaker recognition is applicable to many fields, including but not limited to artificial intelligence, cryptography, and national security.
Theory of Operation
Human speech, when analyzed in the frequency domain, reveals complicated, yet well understood features, which can be used to identify the speaker. These features are derived from the mechanics of speech generation, which we shall introduce briefly here. The treatment here will be greatly simplified, and the interested reader is encouraged to consult the references for a more complete introduction.
Speech begins with the generation of an airstream, usually by the lungs and diaphragm [1] – a process called initiation. This air then passes through the larynx (voice box), where it is modulated by the glottis (vocal cords). This step is called phonation or voicing [2][3], and is responsible for the generation of pitch and tone. Finally, the modulated air is filtered by the mouth, nose, and throat – a process called articulation – and the resultant pressure wave excites the air [4].
As a consequence of this physiology, we notice several characteristics of the frequency domain spectrum of speech. First of all, the oscillation of the glottis results in an underlying fundamental frequency and a series of harmonics at multiples of this fundamental [5]. This is shown in the figure below, where we have plotted a brief audio waveform for the phoneme ‘ah’ and its magnitude spectrum. The fundamental frequency (113 Hz) and its harmonics appear as spikes in the spectrum.
Figure 1: Waveform and magnitude spectrum of the recorded ‘ah’ phoneme (fundamental at 113 Hz)

The location of the fundamental frequency is speaker dependent, and is a function of the dimensions and tension of the vocal cords. For adults it usually falls between 100 and 250 Hz, and females average significantly higher than males [5].
Further analysis of the audio spectrum shows that nearly all energy resides in the band between DC and 4kHz, and beyond 10kHz there is virtually no energy whatsoever. Incidentally, this is one of the primary motivations for the common sampling rate of 8kHz. In our case, we wanted higher fidelity audio, so we chose the (more than suitable) sampling rate of 32kHz.
Yet more detailed investigation shows that different phonemes (particularly vowels) have characteristic regions of high energy in the spectrum. These peak regions are called formants, and their locations can be used to extract both phoneme and (at a higher resolution) speaker information. The vowel formants are well studied and their particular locations are available for reference6. Considering again the ‘ah’ phoneme, the reference indicates that the first and second formants are centered at 700 Hz and 1150 Hz respectively. The audio spectrogram for our recorded ‘ah’ is shown in the figures below, where we have used a smaller FFT to de-emphasize the fundamental and its harmonics. (A spectrogram is simply a time series of FFTs taken on consecutive blocks of samples. We include both a surface and heatmap below). We see that there is significant energy in the predicted frequency bands.
Figure 2: Spectrogram (surface and heatmap) of the recorded ‘ah’ phoneme
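To make the parenthetical definition above concrete, here is a minimal C sketch of a spectrogram: a power spectrum computed on consecutive (here, overlapping) blocks of samples. The block size, hop size, and naive DFT are illustrative choices on our part, not the project's implementation.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define BLOCK 256   /* samples per block (illustrative)        */
#define HOP   128   /* step between consecutive blocks         */

/* Naive DFT power spectrum of one block: |X[k]|^2 for k = 0..BLOCK/2-1. */
static void block_power(const float *x, float *power)
{
    for (int k = 0; k < BLOCK / 2; k++) {
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < BLOCK; n++) {
            float angle = -2.0f * (float)M_PI * k * n / BLOCK;
            re += x[n] * cosf(angle);
            im += x[n] * sinf(angle);
        }
        power[k] = re * re + im * im;
    }
}

/* A spectrogram is just this power spectrum repeated on consecutive
 * blocks of the audio signal; each frame is one column of the heatmap. */
static void spectrogram(const float *audio, int nsamples,
                        float spec[][BLOCK / 2], int *nframes)
{
    int frame = 0;
    for (int start = 0; start + BLOCK <= nsamples; start += HOP)
        block_power(audio + start, spec[frame++]);
    *nframes = frame;
}
```

Each row of spec then corresponds to one vertical slice of the heatmap shown above.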

In order to quantify the amount of energy in the various regions of the spectrum, it would be natural to design a series of overlapping filters that cover the active region of the spectrum. It turns out that this technique is used quite often, and leads us to our primary analysis tool – the mel cepstrum. Consider the triangular filter bank shown below.
Figure 3: Triangular mel filter bank

The filters are designed so as to be equally spaced in the mel domain, a logarithmic scale related to frequency by [7]:
Figure 4: Mel scale as a function of frequency

or inversely
Figure 5: Frequency as a function of the mel scale (inverse relation)
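For reference, the standard conversion in both directions (we assume this is the form shown in the two figures above, since it satisfies the 0 and 1000 fixed points described below) is

m = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \qquad f = 700\left(10^{m/2595} - 1\right)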

The mel scale mimics the logarithmic perception of pitch as experienced by humans, and is designed such that 0 and 1000 are equal to themselves in both the mel and frequency scales [7]. This is demonstrated in the plot below, which is simply a graphical representation of the formula above.
Figure 6: Graphical representation of the mel scale formula

Once the mel filters have been designed, one can calculate a vector of mel weights for a brief audio sample as the inner product of the audio FFT and the mel filters. The number of mel filters to use is a system parameter, but the literature indicates that 12 is typically sufficient for speaker recognition. The last step that is taken is to compute the discrete cosine transform of the logarithm of the mel weight vector. The resulting vector is called the mel-frequency cepstrum (MFC), and the individual components are the mel-frequency cepstral coefficients (MFCCs) [8]. It is not completely clear to the authors why this last step is taken, but given that it is commonly used, we decided to implement it. The full process block diagram, then, is given below.
Figure 7: Full MFCC computation block diagram
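To make the diagram concrete, here is a minimal C sketch of the steps after the FFT: triangular mel filter energies computed as inner products with the power spectrum, a logarithm, and a DCT. The sizes, constants, and function names are illustrative assumptions, not the project's melshift/dct code.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define NFFT     1024            /* FFT length (illustrative)      */
#define NSPEC    (NFFT / 2)      /* usable power-spectrum bins     */
#define FS       32000.0f        /* sampling rate in Hz            */
#define NFILTERS 12              /* number of mel filters / MFCCs  */

/* Standard frequency (Hz) <-> mel conversions. */
static float hz_to_mel(float f) { return 2595.0f * log10f(1.0f + f / 700.0f); }
static float mel_to_hz(float m) { return 700.0f * (powf(10.0f, m / 2595.0f) - 1.0f); }

/* Compute NFILTERS MFCCs from a power spectrum of NSPEC bins. */
void mfcc(const float power[NSPEC], float out[NFILTERS])
{
    /* 1. Filter edges equally spaced on the mel scale, 0 Hz .. FS/2. */
    float edges[NFILTERS + 2];
    float mel_max = hz_to_mel(FS / 2.0f);
    for (int i = 0; i < NFILTERS + 2; i++)
        edges[i] = mel_to_hz(mel_max * i / (NFILTERS + 1));

    /* 2. Triangular filter energies: inner product of the spectrum with
     *    each triangular filter, followed by a log.                     */
    float logmel[NFILTERS];
    for (int j = 0; j < NFILTERS; j++) {
        float lo = edges[j], mid = edges[j + 1], hi = edges[j + 2];
        float energy = 0.0f;
        for (int k = 0; k < NSPEC; k++) {
            float f = k * FS / NFFT;          /* center frequency of bin k */
            float w = 0.0f;
            if (f > lo && f <= mid)      w = (f - lo) / (mid - lo);
            else if (f > mid && f < hi)  w = (hi - f) / (hi - mid);
            energy += w * power[k];
        }
        logmel[j] = logf(energy + 1e-10f);    /* avoid log(0) */
    }

    /* 3. DCT-II of the log filter energies gives the MFCCs. */
    for (int i = 0; i < NFILTERS; i++) {
        float sum = 0.0f;
        for (int j = 0; j < NFILTERS; j++)
            sum += logmel[j] * cosf((float)M_PI * i * (j + 0.5f) / NFILTERS);
        out[i] = sum;
    }
}
```

In the project the power spectrum comes from the hardware FFT, so only these later steps would run on the NIOS II.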

The windowing function referenced in the diagram is typically a Hamming or Hanning window, and is used to reduce smearing of peaks in the frequency domain. The improvement realized by introducing these windows was unclear to the authors, so we implemented them as parameters to the system which could either be used or left out.
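For reference, the standard Hamming window of length N (one of the two windows mentioned above) is

w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad n = 0, \ldots, N-1

and the Hanning window has the same shape with coefficients 0.5 and 0.5.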
Once the MFC has been calculated, the problem of classifying the speaker still remains. Fortunately, the problem is now formulated in the well studied language of machine learning, and the relevant techniques can be applied. Much of our work, then, was in training and evaluating various predictive models, which we will now briefly introduce.
The first (and perhaps simplest) classification method is nearest neighbor search. This technique attempts to classify a vector according to the identity of the nearest point in n-dimensional space, where n here refers to the length of the mel cepstrum vector. In the more general case, the k nearest neighbors are found, and classification is based on the plurality winner of the k points. The distance metric used to evaluate nearness of points is typically chosen to be in the class of Lp norms, given by [9]
Figure 8: The Lp norm distance metric
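Written out, the usual form of this family of metrics (which we assume is what the original figure showed) for two n-dimensional vectors x and y is

d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}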

where we recognize the L2-norm to be the traditional Euclidean distance. As a caveat when performing nearest-neighbor search, one must be careful to normalize each dimension to prevent large magnitude dimensions from overwhelming smaller magnitude ones. Thus, it is common to compute z-scores for each vector component based on the mean and variance for each dimension.
The second primary technique that we used for cepstrum classification is the artificial neural network. The study of neural networks is very large and we will not attempt to introduce it here. Since we wished to run the network in real time, we decided to use one of the simplest neural network variants – the perceptron network. Such a network computes hidden layers of nodes by computing sums of weighted inputs, and then generates an output prediction based on a similar sum of weighted outputs from the hidden nodes. Consider the following canonical node.
Figure 9: A canonical perceptron node

For the perceptron network, the output y is computed as
Figure 10: Perceptron node output equation
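From the description of the symbols below, the node output presumably takes the standard perceptron form

y = \tanh\left( b + \sum_i w_i x_i \right)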

where the x vector is the input, b is the offset bias, and the w vector represents input weights. The hyperbolic tangent, shown below, is used to “tame” outliers, which would otherwise lead to large outputs.
Figure 11: The hyperbolic tangent transfer function

High Level Design
The algorithm which has just been presented is quite complicated, and has many “moving parts”. Thus it was expected from the outset that a significant portion of the computation would be done in C on the NIOS II core. However, it is desirable in terms of speed, and more fully utilizes the available hardware, to do processing in Verilog. Conveniently, we discovered that Altera had already written a fully functional, parameterizable FFT hardware module. Also available in hardware was a module to control the audio codec, with which the authors were already quite familiar. Thus, in reference to the block diagram presented above, we decided that a natural separation of work would be to perform all steps up to and including the FFT in hardware, and then to compute the remaining steps using the NIOS II processor.
Verilog Design
The first step in hardware was to capture and store a sequence of audio samples such that they could be passed to the FFT controller. Capturing samples is a straightforward process which involves communicating with the audio codec. The codec controller was already provided to us and is available in the code listing. The codec registers discussed in the data sheet were configured to meet our particular application requirements, and are summarized below. The most notable changes from the typical setup are the reduced sampling rate of 32kHz (and the corresponding de-emphasis) and the microphone input instead of line input.
Figure 12: Audio codec register configuration

The next step was to store a block sequence of samples such that they could be passed to the FFT controller. We tried several storage methods before settling on one that worked well. The simplest approach here would be to store the samples in registers. But at a sample width of 16 bits, a lengthy block begins to quickly use up logic elements, which is undesirable for a large design. In response to this, we next attempted to use off-chip SRAM. Although this method worked in principle, it became increasingly difficult to arbitrate reading and writing. In addition, SRAM suffers from the inconvenience of setting up write enable and address a cycle before performing data operations. The final approach we settled on, then, was to use dedicated on-chip storage elements via the two-port RAM megafunction. This method enjoys the benefits of both previous methods, without their shortcomings. Specifically, the RAM megafunction handles arbitration, so that the designer can read and write at the same time, but does not use up valuable logic elements.
The next step in the block diagram is to perform windowing on the audio sample block. In our implementation, this step is actually performed as the audio block is being filled, and is thus not a separate step. The weight value applied to the incoming sample is provided by a ROM of precomputed values. The MATLAB script to compute this ROM is provided in the code listing.
Once the windowed audio sample block has been obtained, we must now interface with the FFT megafunction. The bulk of the hardware design is in this interface module. The FFT megafunction data sheet is listed in the references section and discusses the various possible parameterizations of the module. As one would expect, the megafunction allows you to specify the FFT length, bit width, and whether it is a forward or inverse FFT. Of particular interest, the megafunction also allows you to specify one of four implementations, which determine the extent to which you may provide streaming input samples. For example, the “Streaming” data flow architecture allows you to continuously provide input samples, whereas the “Block” architecture requires that you wait for the FFT to complete and the output buffer to be clear before another transform can begin. Fortunately, the interface signals (which adhere to the Avalon ST specifications) remain virtually unchanged between the various architectures. Thus, we wrote the controller before having decided which implementation we would use.
The particulars of the controller design follow from the timing diagram provided in the data sheet, and are not terribly instructive to discuss. The interested reader may find the commented code in the code listing appendix. Suffice it to say that there are two state machines, one for the input samples and one for the output samples, which clock data into and out of the megafunction by asserting the Avalon signals as prescribed by the data sheet. A transform is initiated by filling a RAM with input samples and then signaling the controller to start. When the transform is complete, a done signal is raised and the output samples are available in a two port RAM similar to that used to store the input. As a convenience to the user, the three relevant clock rates (input sample RAM clock, controller state machine clock, and output sample RAM clock) may all be different. Achieving this goal was the source of many frustrations, but the result is pretty rad.
NIOS II Design
The NIOS II that we used for our application was actually rather simple (though powerful). We employed the largest and most powerful CPU in order that our software classification algorithms would run as quickly as possible. In addition, we phase locked the NIOS to 100 MHz from the base clock of 50 MHz. When parameterizing the CPU, we made sure that it had plenty of instruction and data cache (4KB of each) and we enabled hardware division. We also enabled the hardware floating point multiply and divide. We used a JTAG UART interface and LCD for display during debugging, but they are not used for the final applications. SDRAM was used for storing the program because it is large enough to hold our code base. We were also not sure how much code we would be writing so we thought we would be safe and use the largest memory storage for program memory.
Communication between the CPU and hardware is performed using only parallel I/O (PIO) ports. The first set of PIOs are for the interface with the FFT controller. One bit is required to signal the FFT controller to begin sampling and applying the FFT transform (this is a single line output PIO called FFTStart). Another bit is driven by the FFT controller and indicates that a transform is complete (this is a single line input PIO called FFTDone). In addition, we need a way to read the transform results from the RAM in the FFT controller (this RAM stores the FFT power spectrum after a transform is complete). We accomplish this using two PIO busses. The first (FFTAddr) is an output bus that specifies the FFT RAM address to read. The address is a number between 0 and 511 (the first half of the 1024 point FFT result) and is thus 9 bits wide. The second (FFTPower) is an input bus that returns the 16 bit value at the specified address in the RAM. Finally, we have a PIO input bus from the FFT controller which returns the exponent associated with the last FFT. (The exponent is an output from the FFT megafunction and indicates the overall gain necessary to apply to the transform in order for it to be the proper magnitude.) The exponent was employed here to detect whether there was anything being spoken (or any noise) by detecting deviations from a silence value of the exponent. For our FFT implementation it is 6 bits wide.
Figure 13: DE2 user interface (switch, button, and LED assignments)

The DE2 user interface (UI) is outlined above. Board signals are communicated with the NIOS via PIOs (excluding reset, which is only used in hardware). Application Mode selection consists of two switches encoded as shown. This is accomplished using a 2-bit PIO bus input named AppSwitches. For nearest neighbor speaker identification we require two additional inputs. First, we need a signal that indicates whether the system is supposed to use the audio input for training or testing. The training/testing signal is implemented using a single line PIO input called Train (the signal is negated for positive logic). Second, we need an input to specify the ID of the thing being trained. The ID is a 4 line PIO bus called VowelID. We arbitrarily decided to allow 4 different training “classes” (these could be speakers, vowels, etc). The bus is driven by 4 switches on the DE2 encoded as one-hot. That is, we assume only one switch is enabled at a time during training (to be safe, we also employ a priority encoding scheme in software). We chose 4 adjacent switches so that we could light the LEDs above each one when in identification mode. The final IO is for LED feedback, and consists of a 4 bit PIO output bus (VowelLEDs) for the nearest neighbor identification. We only ever activated one of the bits on this bus to light one of the LEDs above the switches used for training. For the other application modes, we needed a nice way to indicate “accepted” or “not accepted.” For instance, to tell whether something spoken is a vowel or not, or whether the person speaking is the training target (in our case Parker). This was done using two PIO single bit output lines, one connected to a green LED (cleverly named GreenLED) and one to a red LED (RedLED). As one would expect, the green LED indicates “accepted” and the red, “not accepted.”
The final SOPC builder setup for the NIOS can be seen below.
Figure 14: Final SOPC Builder configuration for the NIOS II system

Interfacing the SOPC with hardware was straightforward and can be seen in the Verilog code of DE2_TOP.v below. Essentially, we hooked up the PIOs as described above, made a 100MHz PLL line for the clock and hooked that up, hooked our reset button up to the active low reset, and hooked all the IO appropriately to our FFT controller. We also had to hook up the SDRAM signals and make sure to phase delay the SDRAM clock by 3ns as described in various documentation.
NIOS II C Application
The code running on the NIOS II executes our algorithms on the raw FFT power spectrum data. The full commented code is presented in the code listing, but we will summarize it here. At the highest level, the code runs in an infinite loop. At the beginning of the loop, an FFT transform is initiated in hardware and the code waits for the signal that it is complete. The waiting is performed by spinning on the PIO FFTDone flag. Once the FFT is complete, we check the FFT exponent to see if appreciable input was recorded on the Mic input line. This tells us whether someone is speaking (although one often has to speak up for it to register). We determined experimentally that the value of the exponent when subjected to silence is 62 (unsigned) and we check whether the unsigned exponent value is less than this to determine if we should continue to the analysis. If we decide that there is only silence on the line, we clear all LED feedback that may be currently active (we don’t want feedback if there is silence on the line). If there is a deviation from the silence value, we continue with the body of the processing.
The first step is to load the power spectrum into memory and convert it to floating point (the conversion is not strictly necessary, but was convenient for our analysis since our algorithms were floating point algorithms and we were not too pressed for time). Since we included hardware floating point in our CPU, floating point operations are quick. We use the PIO interface that we created to set up each address (0-511) and then read the value of the spectrum on the data line. We cast this to a short (since it is 16 bits) and then cast to a floating point number storing the result in a floating point array. At this point, we have the power spectrum of the FFT (of something that is non-silence) in memory. Now we can proceed to process it.
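For illustration, a sketch of this start/wait/read sequence using the Altera HAL PIO macros might look like the following; the *_BASE identifiers, the start-pulse behavior, and the threshold check are assumptions on our part, not the project's actual names.

```c
#include "system.h"                  /* *_BASE addresses from SOPC Builder (assumed names) */
#include "altera_avalon_pio_regs.h"  /* IORD/IOWR PIO access macros                        */

#define SPECTRUM_LEN     512
#define SILENCE_EXPONENT 62          /* exponent value observed for silence                */

static float spectrum[SPECTRUM_LEN];

/* Trigger one FFT, wait for it, and copy the power spectrum into memory.
 * Returns 0 if the exponent indicates silence, 1 otherwise.              */
static int capture_spectrum(void)
{
    /* Pulse the start PIO and spin on the done flag. */
    IOWR_ALTERA_AVALON_PIO_DATA(FFTSTART_BASE, 1);
    IOWR_ALTERA_AVALON_PIO_DATA(FFTSTART_BASE, 0);
    while (!IORD_ALTERA_AVALON_PIO_DATA(FFTDONE_BASE))
        ;

    /* Skip the analysis if the exponent says there is only silence. */
    unsigned exponent = IORD_ALTERA_AVALON_PIO_DATA(FFTEXP_BASE);
    if (exponent >= SILENCE_EXPONENT)
        return 0;

    /* Read each of the 512 power-spectrum values over the PIO busses. */
    for (int addr = 0; addr < SPECTRUM_LEN; addr++) {
        IOWR_ALTERA_AVALON_PIO_DATA(FFTADDR_BASE, addr);
        short raw = (short)IORD_ALTERA_AVALON_PIO_DATA(FFTPOWER_BASE);
        spectrum[addr] = (float)raw;
    }
    return 1;
}
```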
We begin our processing by shifting the spectrum to the Mel scale using a call to our “melshift” routine with the current power spectrum and several parameters, including the length of the spectrum (512), the destination array to store the shifted spectrum, the number of mel frequencies to produce (the first 12), and the sampling rate (32kHz) at which the samples fed to the FFT were taken. The shifting is described in the theory section above and the details of how we accomplished this algorithmically can be seen in the code listing below.
We next compute the discrete cosine transform of these spectral points to obtain the MFCCs using the “dct” function, passing a pointer to the Mel shifted power spectrum (the first 12 components anyway), a pointer to an array to store the results, and the number of cepstral coefficients to produce (again 12). As with the shifting, this is described in the theory section above and can be seen in detail in the code listing. The DCT computation is less efficient than other known methods but as we only need 12 points for our particular application, this was considered acceptable. At this point in the code, we have an array of 12 floating point MFCCs. We can now perform the selected application (based on the application mode switches).
The first (0) application is APP_VOWEL_RECOGNITION. This is an application which tries to decide whether the current thing being input through the Mic is a spoken vowel. The first thing this does in the case statement is to clear any LED feedback for another application. This application uses the green and red LEDs for feedback as described in the NIOS II Design section above. Thus we want to clear the LED feedback for the nearest neighbor application. We then call a function “vowelCheck” which determines if the MFCC array represents a vowel or not and provides the appropriate feedback. Then the program loops to the beginning again. The “vowelCheck” routine is actually rather involved. We perform the vowel detection using a two layer feed forward neural network that we trained in Matlab using the Neural Network Toolbox’s nftool application. The resulting network has the form shown below.
Figure 15: Structure of the vowel recognition neural network

We took the input weights and layer weights from the Matlab structure as well as the input node biases and output bias. All of this is stored in memory and can be seen at the top of the program with appropriate commenting (“Neural net for vowels”). We also emulate the transfer function for each node which is a hyperbolic tangent. The routine for the hyperbolic tangent (tanh) is taken from the math.h standard library package. All the vowelCheck function actually does is evaluate the neural net for the input MFCC vector and say yes or no based on an arbitrary cutoff that we specified (a value from the net greater than 0.4 means a vowel, otherwise not). The evaluation of the net is done by first normalizing the array of MFCCs to values in the range of [-1,1]. This is done using the data that we trained on and some helpful coefficients that we generated in Matlab for this purpose. The reason this is done is because Matlab’s nftool automatically preprocesses training data to be in this range when it is training the network. Thus we needed to do this to have appropriate data for the net to classify. We then weight each input by the input weight and sum them together for each input neuron. Next, we add the bias and apply the transfer function (tanh) on the output of this function to obtain the output of the neurons. For the output of the network we apply the weights of each input neuron to its output and sum all of these together, then add the output bias and apply the transfer function. This is our network output. Then we just check if the output is above 0.4 and output a 1 for yes and a 0 for no. We then use this as logic values from the function and do the feedback as described previously.
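A condensed sketch of this style of evaluation is given below: normalize the MFCCs into [-1, 1], evaluate a tanh hidden layer, evaluate a tanh output node, and threshold at 0.4. The weight, bias, and normalization arrays are placeholders standing in for the values exported from Matlab, and the hidden-layer size is an illustrative guess rather than the project's actual topology.

```c
#include <math.h>

#define NIN     12     /* MFCCs per frame                              */
#define NHIDDEN 5      /* hidden-layer neurons (illustrative)          */

/* Placeholder constants; the real values were exported from Matlab.   */
static const float in_min[NIN]         = {0};   /* per-input minimum   */
static const float in_max[NIN]         = {0};   /* per-input maximum   */
static const float w_in[NHIDDEN][NIN]  = {{0}}; /* input weights       */
static const float b_in[NHIDDEN]       = {0};   /* input-node biases   */
static const float w_out[NHIDDEN]      = {0};   /* layer weights       */
static const float b_out               = 0.0f;  /* output bias         */

/* Returns 1 if the MFCC frame is classified as a vowel, 0 otherwise. */
int vowel_check(const float mfcc[NIN])
{
    float x[NIN], hidden[NHIDDEN];

    /* 1. Normalize each coefficient into [-1, 1], as nftool does for its
     *    training data (guard against zero range for the placeholders). */
    for (int i = 0; i < NIN; i++) {
        float range = in_max[i] - in_min[i];
        x[i] = range > 0.0f ? 2.0f * (mfcc[i] - in_min[i]) / range - 1.0f
                            : mfcc[i];
    }

    /* 2. Hidden layer: weighted sum plus bias, then tanh transfer function. */
    for (int j = 0; j < NHIDDEN; j++) {
        float sum = b_in[j];
        for (int i = 0; i < NIN; i++)
            sum += w_in[j][i] * x[i];
        hidden[j] = tanhf(sum);
    }

    /* 3. Output node: weighted sum of hidden outputs plus bias, then tanh. */
    float out = b_out;
    for (int j = 0; j < NHIDDEN; j++)
        out += w_out[j] * hidden[j];
    out = tanhf(out);

    /* 4. Arbitrary cutoff: above 0.4 counts as a vowel. */
    return out > 0.4f;
}
```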
The second (1) application is APP_VOWEL_IDENTIFICATION. This is a little misleading because it does not have to be vowels, it just seems to work best with them. This application uses three functions which were written to do 1-Nearest-Neighbor machine learning which is discussed in the background section above. The first of these functions is the “initdbase” function which clears all of the previously stored MFCCs and is called at the beginning of the program. Essentially this means we have to reset the system to clear the training samples. We could have chosen to do this a different way, such as when another application was run, but it was decided that this would be more overhead than it was worth. At the beginning of this application’s case statement we clear the red/green LED feedback because we are going to use the 4 vowel identification LEDs instead. We then look to see if the training button is being held (this is a PIO input which is high when the fourth button on the DE2 is being pressed). If it is, then we should use the current MFCC array for training the 1NN method. We then look to see what the highest numbered active training ID switch is. If none of them are up, we do nothing. Otherwise, we call the “addtrain” function with the number of the switch (3 for the highest, 0 for the lowest). The “addtrain” function essentially just adds the MFCC array to a database with the associated label (specified by the ID switches) if there is room in the database (there is a maximum number of examples per ID). The function also stores some other values so that normalizing the coefficients and finding the closest “point” will be easier and quicker when we do classification. This code can be seen in full detail in the code listing below if the reader is interested to see exactly what values are calculated and stored during training. Finally, in this application, if the user is not pressing the training button, we try to classify the MFCC array to one of the IDs used in training. This is accomplished by calling the “nearestneighbor” function with the MFCC array. This function does the grunt work of the algorithm. It first normalizes the coefficients by finding the mean, variance, and standard deviation of each coefficient seen so far and then normalizing with these values by taking the difference of each of the values from the mean and dividing by the standard deviation. Once the coefficients are normalized, we then find the normalized trained coefficient array with the lowest “Manhattan Distance” from the provided MFCC array. We then return the label on this MFCC array as the label to the provided example. The details of the implementation of the algorithm to find the trained MFCC array with minimum distance can be seen in the code listing below. This algorithm is called 1 Nearest Neighbor because it classifies a new example with the same label as its one nearest neighbor.
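The following compact sketch (not the project's code) illustrates the same 1-nearest-neighbor idea: store labeled MFCC frames, z-score each coefficient using the statistics of the stored examples, and return the label of the example with the smallest Manhattan distance. The array sizes are illustrative assumptions.

```c
#include <math.h>
#include <float.h>

#define NCOEFF    12    /* MFCCs per example               */
#define MAX_TRAIN 64    /* maximum stored training frames  */

static float train_vec[MAX_TRAIN][NCOEFF];
static int   train_id[MAX_TRAIN];
static int   ntrain = 0;

/* Store one labeled MFCC frame in the database (training). */
void addtrain(const float mfcc[NCOEFF], int id)
{
    if (ntrain >= MAX_TRAIN) return;
    for (int i = 0; i < NCOEFF; i++)
        train_vec[ntrain][i] = mfcc[i];
    train_id[ntrain++] = id;
}

/* Classify an MFCC frame by 1-nearest-neighbor with Manhattan distance. */
int nearestneighbor(const float mfcc[NCOEFF])
{
    if (ntrain == 0) return -1;

    /* Per-coefficient mean and standard deviation of the stored examples,
     * used to compute z-scores so no single coefficient dominates.        */
    float mean[NCOEFF], sd[NCOEFF];
    for (int i = 0; i < NCOEFF; i++) {
        float m = 0.0f, v = 0.0f;
        for (int t = 0; t < ntrain; t++) m += train_vec[t][i];
        m /= ntrain;
        for (int t = 0; t < ntrain; t++) {
            float d = train_vec[t][i] - m;
            v += d * d;
        }
        mean[i] = m;
        sd[i]   = sqrtf(v / ntrain) + 1e-10f;   /* avoid divide-by-zero */
    }

    /* Find the stored example with the smallest L1 (Manhattan) distance. */
    int   best_id   = -1;
    float best_dist = FLT_MAX;
    for (int t = 0; t < ntrain; t++) {
        float dist = 0.0f;
        for (int i = 0; i < NCOEFF; i++)
            dist += fabsf((mfcc[i] - mean[i]) / sd[i]
                          - (train_vec[t][i] - mean[i]) / sd[i]);
        if (dist < best_dist) {
            best_dist = dist;
            best_id   = train_id[t];
        }
    }
    return best_id;
}
```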
The third (2) application is APP_SPEAKER_VERIFICATION. This is our attempt at verifying a speaker. We used Parker as the speaker to be verified and trained a three layer feed forward neural network in Matlab in the same manner as the vowel recognition network (but with a slightly different tool – nntool instead of nftool). This network, shown below, has 5 input neurons, 4 inner neurons, and one output neuron.
Figure 16: Initial three-layer speaker verification network (5 input neurons, 4 inner neurons, 1 output neuron)

After testing with this network, we realized that it was not performing well and had to replace it. In order to improve our results, we decided to build a larger data set of speech samples. Jordan and I each spoke for 10-15 minutes generating hundreds of samples representing positive and negative examples of “Parker” speech. We then trained a simpler network but with more input nodes (20 input nodes) which has the same structure as the one for identifying vowels. We put the network into the code in place of the three layer network described above.
We trained the network on only MFCC arrays that pass the vowel recognition test. We noticed through our testing that vowels seem to work best for classification. Sounds like “sh” and plosives have rather undefined and noisy spectra. After training the network, we took the weights, biases, and necessary constants from Matlab as before and made a function “classify” which evaluated the network. Additionally, the output was deemed to indicate a positive “Parker” verification when it was above 0.9. The code in the case statement for this application simply clears the nearest neighbor feedback as does the vowel recognition application and then calls this method on the MFCC array, and then if it sees a 1, lights the green (verified) LED, and otherwise lights the red LED.
Finally, we added an application (3) which combines the one nearest neighbor approach and the neural network in order to get a better verification result. Essentially what it does is call the classify method on the current MFCC array. If the classification neural network says that it is Parker, we ask the nearestneighbor method whether the MFCC array’s nearest neighbor has an ID of zero. This means that this application is intended to be used as follows: train ID 0 as Parker and then ID 1, 2, and 3 with other people, then use the application as if it is the same as the verification application above. This application has the same LED feedback as the verification application and has the same training interface as the vowel identification application.
Additionally, we found it useful throughout debugging to have the program print out various values and arrays of coefficients so our code is littered with debugging code. This code however is encompassed in preprocessor directives and is only included for certain debugging levels defined at the top of our code. Thus for non-debugging versions, we set the DEBUG define to 0, and none of this code is actually compiled in (and thus not executed).
Results

Vowel Recognition

The results for the vowel recognition application mode are summarized in the table below. Each of the authors uttered the phonemes in the first column several times (or for a reasonable duration), and the classification results are given in the body columns.
Figure 17: Vowel recognition results by phoneme and speaker

The truth table is easily computed and is shown below. We can see that the false positive rate is kept very low, as was desired, while keeping the true positive to true negative ratio at a healthy 2:1.
Figure 18: Vowel recognition truth table

The results can be seen in the two tables above. Basically, it was fairly successful on the consonants and vowels with which we trained the neural net and slightly less successful on those that we didn’t such as “uh.” Also, the vowel “ee” seemed to prove difficult for our vowel recognizer. Essentially, what we took away from this part of the project was a tool which helped us to find vowels. The vowel recognizer was very good at not saying a non-vowel was a vowel which meant we could rely on it to pass mostly vowels. Since we used this in other parts of the project to filter vowels out of speech, this was very good. We could then assume that most of the MFCC arrays that were classified as vowels were in fact vowels and could be treated as such for further analysis.

Vowel Identification

The results for the vowel identification application mode are summarized in the misclassification matrix below. The nearest neighbor database was trained on only one of the authors’ voices (Jordan’s) but was tested on both authors. The results are nearly perfect for both authors.
Figure 19: Vowel identification misclassification matrix

This application was the most successful of all of our applications. The table above speaks for itself. We do not know for sure why it performs so well other than to say that there are probably very distinct differences between vowels in terms of frequency spectrum analysis. The nearest neighbor algorithm was able to pick up on these differences with very few points of training data.

Speaker Verification

Figure 20: False positive vs. true positive rate for the neural network speaker verification

The graph above shows a comparison between the false positive and true positive rate of the neural network speaker classification. Many of the results of the scheme can be extracted from this chart. The method proved promising, as the number of false positives was quite low in comparison with the number of true positives at any given level of cutoff for the classification (based on the value output from the verification neural network). As we said before, we used 0.9 as our cutoff, but this was a rather arbitrary choice. One good thing about this cutoff is that there is a fairly low false positive rate which is what we wanted. The technique was not perfect however and there were many false negatives. Throughout the verification development, we were committed to the idea that, in a recognition system, false positives are much worse than false negatives (although we can imagine in something like a phone banking system having to say the same pass phrase over and over again could be very annoying).
In an attempt to improve the false positive rate of our speaker verification system, we used a fourth application which combined the one nearest neighbor approach with the neural network for speaker verification (of Parker). The application is described above. The result of one of our runs with this system was a true positive rate of 34/143 samples of Parker speech or 24% and a false positive rate of 16/143 samples of Jordan’s speech or 11%.
Conclusions
We experienced mixed success with our various experiments. On the one hand, the vowel tests appear to work very well, especially the vowel identification. This may be due to the small database of vowels being used, but we expect that if extended to encompass further phonemes, the performance would not suffer terribly. On the other hand, the speaker identification/recognition experiments performed poorly in general. This might be casually attributed to the inherent difficulty of the task. After all, humans still occasionally make mistakes when faced with this task, and we are highly evolved to handle it. But this is a rather unsatisfactory conclusion. We may propose some more specific reasons. For instance, the speaker identification application is built on top of the vowel recognition application, and thus suffers from any problems of this base task. It is also possible that we are not using the appropriate number of cepstral coefficients, or that the filters do not cover the optimal region of the spectrum. These are experiments that could have been performed given more time. Finally, it is possible that the underlying FFT length and audio sampling rate are not conducive to capturing the best periods of speech. We did however experience what we would deem modest success in this pursuit and also were able to make some interesting applications to perform related functions like vowel recognition and identification.
Given more time, we would have liked to pursue more of our suspicions about what was causing our poor performance. In addition, we would have liked to try more machine learning techniques. Specifically, Hidden Markov Models (HMMs) and Support Vector Machines (SVMs) may have performed well on this problem. Also, dynamic neural networks with feedback and time delay would have been very interesting to work with. It seems that they are used in many related problems where time varying patterns are classified and it would have been fun to attempt an implementation in software or hardware for this particular application. If we could build a library of such techniques, it is also possible that the problem would have cracked when subjected to a medley of solutions. As a final note, we would have liked to implement our own version of neural network training instead of relying on MATLAB’s training (whose implementation specifics we are still unsure of).
We feel that the project was a success in terms of the knowledge and experience gained as well as some noteworthy results. While our results did not necessarily live up to our expectations at the outset of the project, we are happy with what we have accomplished. We are confident that were we to begin a speech processing project in the future, we would be much better prepared.
ref: eeweb.com/project/parker_evans/speaker-recognition