SoX in CTF works differently from Audacity: there’s no waveform to look at, no spectrogram pane to click around in. Everything comes from reading the terminal output and knowing what each number is telling you. TJCTF 2024’s “beep-boop-robot” challenge was the one that convinced me to keep sox -n stat in my audio first-pass checklist — because it told me within thirty seconds that I was looking at a single-tone signal, not the DTMF encoding I’d wasted an hour assuming it was.
The challenge: TJCTF 2024 beep-boop-robot
The download was a file called robot.wav. The challenge name “beep-boop” immediately made me think DTMF, the dual-tone multi-frequency encoding used by phone keypads. DTMF uses specific tone pairs (697+1209 Hz, 770+1336 Hz, etc.) to represent digits 0–9 and a few symbols. There are tools built specifically for DTMF decoding, but before reaching for them it’s worth letting sox stat tell you whether a two-tone signal is even plausible.
First, the file header check:
$ soxi robot.wav
Input File : 'robot.wav'
Channels : 1
Sample Rate : 11050
Precision : 8-bit
Duration : 00:00:25.51 = 281892 samples = 25510.5882 CDDA sectors
File Size : 281936
Bit Rate : 88.4k
Sample Encoding: 8-bit Unsigned Integer PCM
8-bit, mono, 11050 Hz. The sample rate is notable: 11050 Hz is non-standard, close to but not equal to 11025 Hz, which is a quarter of the 44100 Hz CD rate. It can represent tones up to 5525 Hz (Nyquist), which comfortably covers both DTMF frequencies and Morse code carriers.
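The same header fields can be read without SoX. The following is a minimal sketch using Python's standard wave module; the function name describe_wav and the dictionary keys are my own choices for illustration, not part of any tool mentioned here.

```python
import wave


def describe_wav(path):
    """Read the header fields that soxi reports, plus the Nyquist limit."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        frames = w.getnframes()
        return {
            "channels": w.getnchannels(),
            "sample_rate": rate,
            "bits": 8 * w.getsampwidth(),   # sample width is given in bytes
            "frames": frames,
            "duration_s": frames / rate,
            "nyquist_hz": rate / 2,         # highest representable frequency
        }
```

Calling describe_wav("robot.wav") on a file with the header shown above should report one channel, 8 bits, and a Nyquist limit of 5525 Hz.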
How sox stat killed the DTMF hypothesis
Running sox stat analyzes the entire audio stream and outputs statistics:
$ sox robot.wav -n stat
Samples read: 281,892
Length (seconds): 25.510588
Scaled by: 128.000000
Maximum amplitude: 0.929688
Minimum amplitude: -0.937500
Midline amplitude: -0.003906
Mean norm: 0.226330
Mean amplitude: -0.001873
RMS amplitude: 0.388809
Maximum delta: 0.531250
Minimum delta: -0.531250
Mean delta: 0.000000
RMS delta: 0.218124
Rough frequency: 483
Volume adjustment: 1.075630
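The key number is Rough frequency: 483 Hz. This is a coarse, whole-file estimate rather than a spectral peak (SoX derives it from the ratio of RMS delta to RMS amplitude, so treat it as a hint, not a measurement), and I read it the way I would a zero-crossing rate: crossings divided by two times the duration. For DTMF, you’d expect to see a rough frequency roughly between the two tones being mixed, somewhere in the 1000–1400 Hz range for the low-high tone pairs. 483 Hz is lower than any single DTMF tone; the lowest is 697 Hz.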
What does 483 Hz mean for a file whose dominant frequency is actually ~1000 Hz (the spectrogram in the next section confirms the carrier)? The signal is being turned on and off: Morse code. When the signal is OFF, there are no zero crossings. A 1000 Hz carrier with roughly 50% duty cycle produces about 1000 zero crossings per second instead of 2000, which gives a rough frequency of ~500 Hz. The 483 Hz result is consistent with a 1000 Hz tone that’s on for about half the recording.
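The zero-crossing arithmetic in this paragraph can be checked on a synthetic signal. This is a minimal sketch of that reasoning only, not SoX's own estimator; the helper names and the 50 ms keying interval are assumptions made up for the demonstration.

```python
import math


def zero_crossing_rate_hz(samples, rate):
    """Sign changes over the whole signal, divided by two times the duration."""
    crossings = 0
    prev = 0
    for s in samples:
        sign = (s > 0) - (s < 0)
        if sign != 0:                      # exact zeros (silence) are skipped
            if prev != 0 and sign != prev:
                crossings += 1
            prev = sign
    return crossings / (2 * len(samples) / rate)


def gated_tone(rate, freq, seconds, on_s=None):
    """Sine tone; if on_s is set, alternate on_s of tone with on_s of silence."""
    out = []
    for i in range(int(rate * seconds)):
        t = i / rate
        keyed_on = on_s is None or int(t / on_s) % 2 == 0
        out.append(math.sin(2 * math.pi * freq * t + 0.1) if keyed_on else 0.0)
    return out


steady = zero_crossing_rate_hz(gated_tone(11050, 1000, 2.0), 11050)
keyed = zero_crossing_rate_hz(gated_tone(11050, 1000, 2.0, on_s=0.05), 11050)
print(round(steady), round(keyed))  # close to 1000, and close to half of that
```

The steady tone comes out near its true frequency, while the tone keyed on for half the time comes out near half of it, which is the pattern the paragraph describes.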
The other stat values that confirm this: RMS amplitude of 0.389 (high — this is a loud, clear tone, not a subtle modulation) and Mean norm of 0.226 (the average absolute amplitude is about 23% of maximum, consistent with a signal that’s ON for roughly half the time).
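As a rough cross-check, the duty cycle can be backed out of those two figures. The sketch below assumes an idealised signal: a clean sine at the peak amplitude from the stat output, keyed fully on or fully off. Real keying has ramps and 8-bit quantisation, so the result is only a ballpark figure.

```python
import math


def duty_from_mean_norm(mean_norm, peak):
    """Mean |sin| over a cycle is 2/pi, scaled by the fraction of time on."""
    return mean_norm / (peak * 2 / math.pi)


def duty_from_rms(rms, peak):
    """A sine's mean square is peak**2 / 2, scaled by the fraction of time on."""
    return 2 * (rms / peak) ** 2


peak = 0.9375  # magnitude of "Minimum amplitude" in the stat output above
print(round(duty_from_mean_norm(0.226330, peak), 2))  # about 0.38
print(round(duty_from_rms(0.388809, peak), 2))        # about 0.34
```

Under these assumptions both figures put the tone on for somewhere between a third and two fifths of the recording, in the same neighbourhood as the half-on reading rather than a precise match.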
Confirming the carrier frequency with sox spectrogram
The stat output points to Morse code, but doesn’t confirm the carrier frequency. For that, generate a spectrogram:
$ sox robot.wav -n spectrogram -o robot_spec.png
The spectrogram for beep-boop-robot shows exactly one horizontal band — a sharp line at approximately 1000 Hz. It’s not a continuous line; it appears as a series of segments, some short (dots) and some longer (dashes). The background is clean: no harmonics, no frequency spread, no noise floor activity above 1100 Hz. This is a textbook single-tone Morse code signal.
Compare this to what spectrogram steganography looks like: hidden images show up as patterned texture across a wide frequency range, often at the top of the spectrogram in a band that wasn’t used for actual audio. If that’s what you’re looking for and the spectrogram is a single clean line, the answer is in the time domain, not the frequency domain.
The DTMF Rabbit Hole: why the name was misleading
“Beep boop” in CTF context is a strong signal for DTMF or dial-tone encoding. That association caused me to spend about an hour on a dead end before running sox stat. Here’s the diagnostic path I should have used from the start:
- Run sox file.wav -n stat and check Rough frequency
- If rough frequency is in the 700–1600 Hz range and RMS is high → possible DTMF (two simultaneous tones)
- If rough frequency is much lower than expected carrier → on/off modulation → Morse code
- If rough frequency is near expected carrier and RMS is low → possible LSB steganography (carrier intact, low-level modification)
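To apply this checklist across several files, it helps to pull the numbers out of the stat text programmatically. Here is a minimal sketch that parses the "Label: value" lines into a dictionary; the function names are hypothetical, and the carrier frequency has to come from elsewhere (for example, a spectrogram).

```python
import re


def parse_sox_stat(text):
    """Turn 'Label: value' lines from `sox ... -n stat` output into floats."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z ()]+?)\s*:\s*(-?[\d.,]+)\s*$", line)
        if m:
            label = " ".join(m.group(1).split()).lower()   # normalise spacing
            stats[label] = float(m.group(2).replace(",", ""))
    return stats


def carrier_ratio(stats, carrier_hz):
    """Rough frequency as a fraction of a carrier measured some other way."""
    return stats["rough frequency"] / carrier_hz


sample = """Length (seconds): 25.510588
RMS amplitude: 0.388809
Rough frequency: 483
"""
print(carrier_ratio(parse_sox_stat(sample), 1000.0))
```

A ratio well below 1 is the "much lower than expected carrier" case from the list. Where to draw the line is a judgment call that this sketch leaves to the reader.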
For DTMF specifically, the rough frequency calculation breaks down because two simultaneous tones create complex zero crossing patterns. The reliable test for DTMF is looking at the frequency spectrum for the two characteristic peaks. sox file.wav -n stat -freq will dump a raw power spectrum as text, but the spectrogram makes it immediately visible: DTMF has two horizontal bands, not one.
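The two-peak test can also be run numerically. The sketch below uses the Goertzel recurrence, a standard way to measure power at a handful of known frequencies, to check the eight DTMF tones. It is an illustration I am adding, not what was used on this challenge, and the function names are made up.

```python
import math

DTMF_LOW = (697, 770, 852, 941)
DTMF_HIGH = (1209, 1336, 1477, 1633)


def goertzel_power(samples, rate, freq):
    """Squared magnitude of the signal's component at `freq` (Goertzel)."""
    coeff = 2 * math.cos(2 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def tone_share(samples, rate, freq):
    """Fraction of total energy at `freq`: about 1.0 for a pure tone there."""
    energy = sum(x * x for x in samples)
    if energy == 0:
        return 0.0
    return goertzel_power(samples, rate, freq) / (energy * len(samples) / 2)


def strongest_dtmf_pair(samples, rate):
    """Strongest low-group and high-group DTMF frequency in the samples."""
    low = max(DTMF_LOW, key=lambda f: goertzel_power(samples, rate, f))
    high = max(DTMF_HIGH, key=lambda f: goertzel_power(samples, rate, f))
    return low, high
```

strongest_dtmf_pair always returns some pair, so check tone_share as well. A real DTMF digit puts roughly half its energy in each of the two tones, whereas a single 1000 Hz carrier leaves almost nothing at any of the eight frequencies.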
Extracting timing data for Morse decoding
Once you’ve identified the file as Morse code, the next step is extracting the on/off timing. A Python decoder using 25ms analysis windows confirmed the message structure. The output began with “HI HOW ARE YOU DOING. THE FLAG IS TJCTF{…”, and the Morse preface before the actual flag is a common CTF touch that makes the signal look less obviously competitive. Seeing “TJCTF{” in the output confirmed I’d decoded the right pattern. SoX can help with timing extraction too: the silence effect can strip the silent stretches from the stream, so the Length that stat reports afterwards approximates the total ON time:
$ sox robot.wav -n silence 1 0.05 1% -1 0.05 1% stat 2>&1
For more precise timing, I used Python’s wave module to compute amplitude over 25ms windows and extract the on/off pattern directly. The timing units in the file were:
- Short burst (~50–75ms): dot
- Long burst (~150–175ms): dash
- Short gap (~25ms): between elements in the same letter
- Medium gap (~100–125ms): between letters
- Long gap (~375–400ms): between words
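The windowed approach can be sketched as follows. This is a reconstruction under stated assumptions, not the decoder actually used: the half-of-peak threshold and the 100 ms, 75 ms and 250 ms cut-offs are values I picked to sit between the timing bands listed above, and synth_morse is a hypothetical helper that generates test audio.

```python
import math

CODES = (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- "
         "-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..").split()
MORSE = dict(zip(CODES, "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))


def window_levels(samples, rate, window_s=0.025):
    """Mean absolute amplitude of each consecutive window."""
    size = max(1, int(rate * window_s))
    return [sum(abs(s) for s in samples[i:i + size]) / size
            for i in range(0, len(samples) - size + 1, size)]


def on_off_runs(levels, threshold):
    """Collapse per-window levels into [is_on, window_count] runs."""
    runs = []
    for level in levels:
        is_on = level > threshold
        if runs and runs[-1][0] == is_on:
            runs[-1][1] += 1
        else:
            runs.append([is_on, 1])
    return runs


def decode_morse(samples, rate, window_s=0.025):
    """Decode letters A-Z from centred samples (silence at zero)."""
    levels = window_levels(samples, rate, window_s)
    threshold = 0.5 * max(levels)            # assumed: half the loudest window
    text, symbol = [], ""
    for is_on, count in on_off_runs(levels, threshold):
        ms = round(count * window_s * 1000)
        if is_on:
            symbol += "-" if ms >= 100 else "."   # assumed dot/dash cut-off
        elif ms >= 75 and symbol:                 # letter gap or longer
            text.append(MORSE.get(symbol, "?"))
            symbol = ""
            if ms >= 250:                         # assumed word-gap cut-off
                text.append(" ")
    if symbol:
        text.append(MORSE.get(symbol, "?"))
    return "".join(text).strip()


def synth_morse(text, rate=8000, unit_s=0.05, freq=1000.0):
    """Hypothetical test signal: standard 1/3/7-unit Morse timing."""
    inverse = {letter: code for code, letter in MORSE.items()}
    n = int(rate * unit_s)
    on = [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]
    off = [0.0] * n
    out = []
    for word in text.split():
        for letter in word:
            for mark in inverse[letter]:
                out += on * (3 if mark == "-" else 1) + off
            out += off * 2    # letter gap totals 3 units
        out += off * 4        # word gap totals 7 units
    return out


print(decode_morse(synth_morse("HI YOU"), 8000))
```

For a real 8-bit unsigned WAV, the samples would first need centring (subtract 128 from each byte) so that silence sits at zero before the mean absolute amplitude is taken.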
The message decoded as a phrase followed by the flag. Since this is a tool article rather than a full writeup, the important thing is the identification pipeline, not the specific content of what the Morse encoded.
SoX commands that matter in CTF: a practical reference
File information
# Detailed header info
$ soxi input.wav
# Statistics (rough frequency, RMS, amplitude range)
$ sox input.wav -n stat
# stat prints its report on stderr; redirect it when piping to grep or a file
$ sox input.wav -n stat 2>&1
Spectrogram generation
# Basic spectrogram (PNG)
$ sox input.wav -n spectrogram -o output.png
# Wider view for longer files
$ sox input.wav -n spectrogram -x 1200 -y 600 -o output.png
# High-frequency resolution for spotting subtle patterns
$ sox input.wav -n spectrogram -w Kaiser -y 1024 -o output.png
Format conversion and manipulation
# Convert sample rate
$ sox input.wav -r 44100 output.wav
# Extract one channel from stereo (left = 1, right = 2)
$ sox input.wav output.wav remix 1
$ sox input.wav output.wav remix 2
# Reverse audio (for reversed-speech challenges)
$ sox input.wav output.wav reverse
# Speed up or slow down without pitch change
$ sox input.wav output.wav tempo 0.5 # half speed
$ sox input.wav output.wav tempo 2.0 # double speed
Silence detection
# Remove silence from start and end
$ sox input.wav output.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse
# Split on silence (useful for separating Morse code segments);
# the second "1 0.1 1%" ends each segment at the next 0.1 s of silence
$ sox input.wav segment.wav silence 1 0.1 1% 1 0.1 1% : newfile : restart
SoX vs Audacity vs FFmpeg: when to use which
| Task | Best tool | Why not the others |
|---|---|---|
| Quick stat check (rough frequency, RMS) | SoX (-n stat) | Audacity requires loading the GUI; FFmpeg stat output is less readable for audio analysis |
| CLI spectrogram PNG generation | SoX (-n spectrogram) | Audacity is GUI-only; FFmpeg can generate spectrograms but requires more flags and separate display step |
| Visual spectrogram exploration (zoom, scroll) | Audacity | SoX generates static PNG; can’t interact with it |
| Detailed frequency analysis (identify exact Hz) | Audacity (frequency plot) | SoX rough frequency is approximate; FFmpeg plot requires additional rendering |
| Format conversion, batch processing | FFmpeg | More codec support than SoX; better for video/audio containers |
| Stereo channel extraction | SoX (remix) | Clean one-liner vs Audacity’s menu navigation |
| Reversing audio | SoX (reverse) | Fastest for CLI; Audacity is easier for selective reversal |
| LSB audio steganography | Neither — use Audacity visual + Python | SoX and FFmpeg don’t decode LSB; need manual bit extraction |
My workflow in CTF: soxi first for the header, then sox -n stat for the rough frequency and amplitude stats. If rough frequency is anomalously low or high, that’s a signal worth following before opening anything in a GUI. Audacity comes out only when I need to visually explore the spectrogram at a specific frequency range or time position — tasks that SoX’s static output can’t support.
Common audio CTF patterns and how SoX identifies them
Morse code
Rough frequency significantly below carrier frequency. Spectrogram shows single band with segmented horizontal pattern. Amplitude alternates between full-on and full-off. High RMS amplitude (signal is loud and clear). Confirmed in TJCTF 2024 beep-boop-robot.
DTMF encoding
Two frequency bands visible in spectrogram simultaneously. Rough frequency in the 800–1400 Hz range. Spectrogram pattern shows column-like blobs (tones held for digit duration, then silence between digits). The decoder tools multimon-ng -t WAV -a DTMF or online DTMF decoders handle extraction.
Spectrogram steganography
Auditory content sounds normal but spectrogram shows unusual patterning at high frequencies (often above 15 kHz where human hearing is weak). Rough frequency and RMS from sox stat will look normal — no statistical anomaly. The tell is purely visual in the spectrogram. This is one situation where Audacity’s interactive spectrogram is better than SoX’s static PNG: you need to zoom into the high-frequency range and see if there’s an image hidden there.
Reversed audio
Audio that sounds like backwards speech or noise. No statistical anomaly in sox stat. Run sox input.wav output.wav reverse and listen to the result. Sometimes combined with speed changes — if reversed audio still sounds garbled, try sox output.wav final.wav tempo 0.5.
Stereo channel split
Challenge delivers a stereo file where each channel contains different data. soxi will show Channels: 2. Extract and analyze each channel separately:
$ sox stereo.wav left.wav remix 1
$ sox stereo.wav right.wav remix 2
$ sox left.wav -n stat 2>&1
$ sox right.wav -n stat 2>&1
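If the per-channel data is headed for a Python script anyway, the split can be done there too. This is a minimal sketch that assumes 16-bit PCM input; split_channels is a name chosen for illustration.

```python
import array
import sys
import wave


def split_channels(path):
    """Return one list of samples per channel from a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = w.getnchannels()
        data = array.array("h")
        data.frombytes(w.readframes(w.getnframes()))
    if sys.byteorder == "big":
        data.byteswap()          # WAV stores samples little-endian
    # Frames are interleaved: L R L R ... so stride by the channel count
    return [list(data[ch::n_channels]) for ch in range(n_channels)]
```

For a stereo file, the first returned list is the left channel and the second is the right, matching remix 1 and remix 2 above.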
Further Reading
SoX handles command-line audio analysis, but for challenges that require visual spectrogram exploration, Audacity in CTF covers spectrogram forensics in detail — including how to distinguish time-domain patterns from frequency-domain hidden content, with a worked example from picoCTF Morse Code.
For a broader overview of forensics tools across file types, CTF Forensics Tools: The Ultimate Guide for Beginners places SoX in context alongside image and binary analysis tools.
If the audio challenge involves an image file with a WAV secretly embedded inside it, binwalk in CTF explains how to extract it before audio analysis begins.