December 5th, 2003

Voice Turing test; need help with sox/audio processing

We're working on adding Turing tests to our create.bml sign-up page, to determine if the user is a human or a bot.

Bots used to just be a problem in theory, but now with blog-spammer-bots going around spamming journals with revenue-generating porn links, theory is reality, and they're a pain. Invite codes stop spam bots for now (since we rate-limit anon comments by IP, and other comments at a different rate by user), but once we remove invite codes, we need a way to prevent spammers from writing programs which go in a loop, generating accounts, spamming until blocked, generating new accounts, spamming, etc.

The primary mechanism we'll use (which is easy) is a blurry image with text, which the user has to type in to prove they're human. We're going to use Authen::Captcha, modified a bit to fit our setup. It's not as pretty as the CAPTCHA Project's Gimpy, but it'll do.

Now, the harder system (for visually impaired users) is audio Turing tests. We can't just generate audio clips from known samples (numbers/letters), because then it's easy to have a human categorize the closed set of samples once, and have a computer just match against them to figure out the answer. What we have to do is generate the sound sample, then randomly distort it with white noise, echoes, reverbs, etc.

We'll be using just the numbers 1-9 in the audio samples to make it easier on visually impaired users for whom English isn't a first language. Most people can count to ten in a few different languages, even if they can't speak them. We might even provide a clean reference clip of counting 1 to 9 so non-English speakers can learn.

Anyway, moving on to technical material. We'll be using festival and sox to generate the speech and distort it, respectively:

Generate speech to file:
$ cat - | festival
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'wav)
(Parameter.set 'Audio_Required_Rate 8000)
(Parameter.set 'Audio_Command "cp -v $FILE /tmp/speech")
(SayText "1 2 3 4 5")
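
To make the generation step concrete, here is a rough sketch of picking a random digit string and producing the matching Festival commands. The helper names random_digits and festival_script are made up for illustration, and reading from /dev/urandom is just one assumed way to get randomness. The script only prints the Festival commands; piping them into festival (as in the session above) is left as a comment.

```shell
#!/bin/sh
# Sketch only: random_digits and festival_script are hypothetical helpers,
# not part of festival or of our codebase.

# Print N random digits from 1-9, separated by single spaces.
random_digits() {
    n=$1
    LC_ALL=C tr -dc '1-9' < /dev/urandom | head -c "$n" | sed 's/./& /g; s/ $//'
}

# Print the Festival commands that speak text $1 and copy the audio to $2.
festival_script() {
    text=$1
    out=$2
    cat <<EOF
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'wav)
(Parameter.set 'Audio_Required_Rate 8000)
(Parameter.set 'Audio_Command "cp -v \$FILE $out")
(SayText "$text")
EOF
}

digits=$(random_digits 5)
festival_script "$digits" /tmp/speech
# To actually synthesize, pipe it in:
#   festival_script "$digits" /tmp/speech | festival
```

The digit string would also need to be stored server-side so the user's answer can be checked later; that part is not shown.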

Distort it: (with random parameters)
sox /tmp/speech/audiofile_3432 -t ossdsp /dev/dsp echo 1 0.7 200 0.3 vibro 10 0.5
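
As a sketch of what "random parameters" could look like, the snippet below builds a sox command line of the same shape as the one above, with a randomly chosen echo delay and decay and a random vibro speed and depth. The helpers rand_between and sox_command are hypothetical, the input name speech.wav is a placeholder, and the ranges (50-500 ms delay, 0.1-0.9 decay, vibro speed 5-20) are guesses rather than tuned values. It only prints the command, so nothing here depends on sox being installed.

```shell
#!/bin/sh
# Sketch only: rand_between and sox_command are hypothetical helpers,
# and the parameter ranges are untested guesses.

# Print a random integer in the inclusive range $1..$2.
rand_between() {
    lo=$1
    hi=$2
    # Two random bytes read as one unsigned 16-bit number.
    r=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
    echo $(( $lo + $r % ($hi - $lo + 1) ))
}

# Print a sox command for input file $1 with randomized effect settings.
sox_command() {
    src=$1
    delay=$(rand_between 50 500)      # echo delay in milliseconds
    decay="0.$(rand_between 1 9)"     # echo decay, 0.1 to 0.9
    speed=$(rand_between 5 20)        # vibro speed
    depth="0.$(rand_between 1 9)"     # vibro depth, 0.1 to 0.9
    echo "sox $src -t ossdsp /dev/dsp echo 1 0.7 $delay $decay vibro $speed $depth"
}

sox_command speech.wav
```

In practice the output would go to a file served to the user rather than to /dev/dsp; playing to the sound card is only handy for listening while experimenting.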

Unfortunately, sox's echo/vibro/other effects seem to be adding a long stretch of silence at the end, after which you hear the second half of the file again, sometimes multiple times. Instead of hearing just "1 2 3 4 5" it's "1 2 3 4 5 ..... 4 5 .... 4 5". And this seems to happen regardless of the echo settings.

Anybody good with sox and/or audio processing and care to help?