Adaptive noise cancelling in speech signals

While working on our dialog system (http://habrahabr.ru/post/235763/), we ran into a problem that at first glance looked insurmountable: in real-world conditions, ASR performance was significantly lower than expected. One of the factors that invariably hurts performance is background noise, which takes many forms. In our experiments, the hardest noise for ASR to cope with was city-street noise and the noise of a crowd.

It became clear that the problem had to be solved, or the voice system simply would not deliver real value.

The original plan was to find an analytical solution for these two specific types of noise and neutralize them. But while experimenting with several algorithms we found that, first, the speech gets noticeably distorted, and second, our training set contained many more types of noise. On top of that, we did not want (relatively) clean speech to be modified either, since that hurts the word error rate during recognition.

We needed an adaptive scheme.

Deep neural networks are an excellent solution when an analytical form of a function is hard or impossible to find. And that was exactly the function we wanted: one that transforms a noisy speech signal into a clean one. Here is what we ended up with:
  1. The model works before the filter bank. The alternative was to remove noise after the filter bank (e.g., in mel space), which would also have run faster. But we wanted a model whose output can then be listened to by a person or fed to an external ASR system; that way we can denoise any voice stream. (A rough sketch of this kind of pipeline follows the list.)
  2. The model is adaptive. It removes all kinds of background noise, including music, shouting, a truck driving past, or the whine of a circular saw, which compares favorably with other commercially available systems. And if the signal is not noisy, it leaves it practically unchanged. In other words, it is a speech enhancement model: it restores clear speech from a degraded signal. One interesting follow-up experiment would be to see whether such a model can recover studio-quality voice, going from 8 kHz to 44 kHz. We do not expect much, but it may surprise us.
  3. This particular model works only at 8 kHz. This is because at the start of training we had plenty of 8 kHz material. We now have enough 44 kHz material and can train a model for any rate if needed, but for now our area of interest lies in telecommunications.
  4. Training time (after all our optimizations) is about two weeks on 40 cores. The first version trained for almost two months on the same hardware. Moreover, the model keeps improving its error rate on the test data, so we keep training it.
  5. Processing speed on a relatively modern CPU is now about 20x real time on a single core. A 20-core server with hyper-threading can process about 700 audio streams simultaneously, almost 90 Mbit/s of raw data.
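
To make item 1 a bit more concrete, here is a minimal sketch under our own assumptions (the frame size, context width, layer sizes and activations are invented and are not the production model): a network maps a context window of noisy STFT magnitudes to a clean magnitude for the centre frame, and the noisy phase is reused for reconstruction. Incidentally, the 90 Mbit/s figure is consistent with 700 streams of 8 kHz, 16-bit PCM: 700 × 8000 × 16 bit/s is roughly 90 Mbit/s.

    # Sketch of a magnitude-spectrum denoiser at 8 kHz (assumed architecture,
    # not the actual model described in the article).
    import numpy as np
    import torch
    import torch.nn as nn
    from scipy.signal import stft, istft

    SR = 8000          # the model in the article works at 8 kHz
    N_FFT = 256        # 32 ms frames; an assumption
    HOP = 128
    CONTEXT = 5        # frames of left/right context; an assumption

    class SpectralDenoiser(nn.Module):
        def __init__(self, n_bins=N_FFT // 2 + 1, context=CONTEXT, hidden=1024):
            super().__init__()
            in_dim = n_bins * (2 * context + 1)
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_bins), nn.Softplus(),  # magnitudes are non-negative
            )

        def forward(self, frames):
            # frames: (batch, (2*context+1) * n_bins) flattened magnitude windows
            return self.net(frames)

    def denoise(wave: np.ndarray, model: SpectralDenoiser) -> np.ndarray:
        """Denoise an 8 kHz mono signal with a trained model."""
        _, _, spec = stft(wave, fs=SR, nperseg=N_FFT, noverlap=N_FFT - HOP)
        mag, phase = np.abs(spec), np.angle(spec)          # keep the noisy phase
        padded = np.pad(mag, ((0, 0), (CONTEXT, CONTEXT)), mode="edge")
        windows = np.stack([padded[:, t:t + 2 * CONTEXT + 1].T.ravel()
                            for t in range(mag.shape[1])])
        with torch.no_grad():
            clean_mag = model(torch.from_numpy(windows).float()).numpy().T
        # reconstruct with the estimated magnitude and the original noisy phase
        _, out = istft(clean_mag * np.exp(1j * phase), fs=SR,
                       nperseg=N_FFT, noverlap=N_FFT - HOP)
        return out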

We went a little further and implemented a network based on complex numbers. The idea is to restore not only the amplitude but also the phase, which otherwise has to be taken from the noisy signal, and the noisy phase degrades the quality of the restored audio if you later want to listen to it. So we will soon have a "high quality" option, at the cost of roughly a 2x slowdown in processing speed. For ASR use it makes no difference, of course.
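
To make the complex-number idea concrete, here is a minimal sketch (an assumption on our part, not the actual network; the mask formulation and layer sizes are invented) in which the model predicts a complex mask for each noisy STFT frame, so the phase of the output comes from the network rather than from the noisy input:

    # Sketch of a complex-mask denoiser (assumed, not the article's network).
    import torch
    import torch.nn as nn

    class ComplexMaskNet(nn.Module):
        def __init__(self, n_bins=129, hidden=1024):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(2 * n_bins, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.mask = nn.Linear(hidden, 2 * n_bins)   # real + imaginary parts

        def forward(self, noisy_spec: torch.Tensor) -> torch.Tensor:
            # noisy_spec: (batch, n_bins) complex STFT frames
            x = torch.cat([noisy_spec.real, noisy_spec.imag], dim=-1)
            m_re, m_im = self.mask(self.body(x)).chunk(2, dim=-1)
            mask = torch.complex(m_re, m_im)
            return mask * noisy_spec   # complex multiplication corrects the phase too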

A little later we will move to our new recurrent model, which shows a sometimes dramatic improvement in the quality of the reconstructed voice thanks to a substantially wider context. The relatively small window of the current model leads to small artifacts at transitions between different types of noise, or when the noise profile overlaps with the speech profile (a child crying in the background).
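
As a rough idea of what such a recurrent variant could look like (again an assumed sketch with invented dimensions, not our actual model), an LSTM running over the whole utterance replaces the fixed context window, so slowly changing noise such as music or a crying child is easier to track:

    # Sketch of a recurrent denoiser with utterance-wide context (assumed).
    import torch
    import torch.nn as nn

    class RecurrentDenoiser(nn.Module):
        def __init__(self, n_bins=129, hidden=512):
            super().__init__()
            self.rnn = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
            self.out = nn.Sequential(nn.Linear(hidden, n_bins), nn.Softplus())

        def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
            # noisy_mag: (batch, frames, n_bins) magnitude spectrogram
            h, _ = self.rnn(noisy_mag)
            return self.out(h)          # clean magnitude estimate per frame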

I would like to show a few interesting images:
[Input / output: three before-and-after image pairs]

For those who want to play around with the model, welcome to our website (http://sapiensapi.com/#speechapidemo). There you can upload your own file or take a ready-made one, add noise to it, or remove noise from it. The interface is quite simple.

For enthusiasts, we offer free API testing through Mashape (http://www.mashape.com/tridemax/speechapi).
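
A hypothetical example of calling the API from Python: the endpoint path, the "audio" field name and the assumption that raw audio bytes come back are all made up for illustration, the actual contract is documented on the Mashape page above.

    # Hypothetical call to the denoising API via Mashape; endpoint and
    # parameters are assumptions, check the Mashape listing for the real ones.
    import requests

    MASHAPE_KEY = "your-mashape-key"
    URL = "https://tridemax-speechapi.p.mashape.com/denoise"   # hypothetical endpoint

    with open("noisy_8khz.wav", "rb") as f:
        resp = requests.post(URL,
                             headers={"X-Mashape-Key": MASHAPE_KEY},
                             files={"audio": f})
    resp.raise_for_status()
    with open("clean_8khz.wav", "wb") as out:
        out.write(resp.content)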

If you have any questions, email me at tridemax@sapiensapi.com, or ask in the comments.