CTF Audio Challenges: A Practical SoX Combat Guide

SoX in CTF works differently from Audacity: there’s no waveform to look at, no spectrogram pane to click around in. Everything comes from reading the terminal output and knowing what each number is telling you. TJCTF 2024’s “beep-boop-robot” challenge was the one that convinced me to keep sox -n stat in my audio first-pass checklist — because it told me within thirty seconds that I was looking at a single-tone signal, not the DTMF encoding I’d wasted an hour assuming it was.

The challenge: TJCTF 2024 beep-boop-robot

The download was a file called robot.wav. The challenge name “beep-boop” immediately made me think of DTMF, the dual-tone multi-frequency encoding used by phone keypads. DTMF represents each key (digits 0–9 plus a few symbols) as a pair of simultaneous tones, for example 697+1209 Hz or 770+1336 Hz. There are tools built specifically for DTMF decoding, but before reaching for them it is worth letting sox stat check whether the signal looks like DTMF at all.

First, the file header check:

$ soxi robot.wav
Input File     : 'robot.wav'
Channels       : 1
Sample Rate    : 11050
Precision      : 8-bit
Duration       : 00:00:25.51 = 281892 samples = 25510.5882 CDDA sectors
File Size      : 281936
Bit Rate       : 88.4k
Sample Encoding: 8-bit Unsigned Integer PCM

8-bit, mono, 11050 Hz. The sample rate is notable: 11050 Hz is non-standard, close to but not equal to 11025 Hz (one quarter of the 44100 Hz CD rate). It can represent tones up to 5525 Hz (the Nyquist limit), which covers both the DTMF frequencies and typical Morse code carriers.
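The same header facts can be pulled from Python when a challenge needs scripting. This is a minimal sketch using the standard wave module; the function name wav_summary and the dictionary keys are my own choices, not anything SoX defines, and it only handles plain PCM WAV files.

```python
import wave


def wav_summary(path):
    """Return basic header facts for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
            # Highest frequency the sample rate can represent.
            "nyquist_hz": rate / 2,
        }
```

Calling wav_summary("robot.wav") should agree with the soxi output on channels, sample rate, and duration; a mismatch would suggest a damaged or misleading header.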

How sox stat killed the DTMF hypothesis

Running sox stat analyzes the entire audio stream and outputs statistics:

$ sox robot.wav -n stat

Samples read:                   281,892
Length (seconds):             25.510588
Scaled by:                   128.000000
Maximum amplitude:             0.929688
Minimum amplitude:            -0.937500
Midline amplitude:            -0.003906
Mean    norm:                  0.226330
Mean    amplitude:            -0.001873
RMS     amplitude:             0.388809
Maximum delta:                 0.531250
Minimum delta:                -0.531250
Mean    delta:                 0.000000
RMS     delta:                 0.218124
Rough   frequency:                  483
Volume adjustment:             1.075630

The key number is Rough frequency: 483 Hz. SoX labels this value as rough for a reason, and a practical way to read it is as a zero-crossing estimate: the number of zero crossings divided by two times the duration. For DTMF, that model predicts a value between the two tones being mixed, somewhere in the 700–1600 Hz span covered by the low and high tone groups. 483 Hz is lower than any individual DTMF tone (the lowest is 697 Hz).

What does 483 Hz mean for a file whose dominant frequency is actually ~1000 Hz? The signal is being switched on and off, which is how Morse code is keyed. When the signal is off, there are no zero crossings. A 1000 Hz carrier with a roughly 50% duty cycle produces about 1000 zero crossings per second instead of 2000, which gives a rough frequency of ~500 Hz. The 483 Hz result is consistent with a 1000 Hz tone that is on for a little under half of the recording (483/1000 ≈ 48%).
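The arithmetic above can be checked on a synthetic signal. The sketch below illustrates the zero-crossing model only; it is not a reimplementation of SoX's stat effect, whose internal computation may differ. The function names, the 0.2 s keying period, and the 5 s length are assumptions chosen for the demo.

```python
import math


def gated_tone(freq_hz, rate_hz, seconds, on_fraction, period_s=0.2):
    """Sine tone keyed on for `on_fraction` of every `period_s` seconds."""
    samples = []
    for n in range(int(rate_hz * seconds)):
        t = n / rate_hz
        keyed_on = (t % period_s) < on_fraction * period_s
        samples.append(math.sin(2 * math.pi * freq_hz * t) if keyed_on else 0.0)
    return samples


def zero_crossing_frequency(samples, rate_hz):
    """Sign changes divided by two times the duration in seconds."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    duration_s = len(samples) / rate_hz
    return crossings / (2 * duration_s)


RATE = 11050  # same sample rate as robot.wav
steady = zero_crossing_frequency(gated_tone(1000, RATE, 5, 1.0), RATE)
keyed = zero_crossing_frequency(gated_tone(1000, RATE, 5, 0.5), RATE)
print(f"steady tone: {steady:.0f} Hz, keyed at 50%: {keyed:.0f} Hz")
```

The steady tone should come out near 1000 Hz and the keyed one near half of that, because silent samples contribute no sign changes.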

Two other stat values are consistent with this: an RMS amplitude of 0.389 (high, so this is a loud, clear tone rather than a subtle modulation) and a Mean norm of 0.226 (the average absolute amplitude is about 23% of full scale, plausible for a signal that is on for roughly half the time).

Confirming the carrier frequency with sox spectrogram

The stat output points to Morse code, but doesn’t confirm the carrier frequency. For that, generate a spectrogram:

$ sox robot.wav -n spectrogram -o robot_spec.png

The spectrogram for beep-boop-robot shows exactly one horizontal band — a sharp line at approximately 1000 Hz. It’s not a continuous line; it appears as a series of segments, some short (dots) and some longer (dashes). The background is clean: no harmonics, no frequency spread, no noise floor activity above 1100 Hz. This is a textbook single-tone Morse code signal.

Compare this to what spectrogram steganography looks like: hidden images show up as patterned texture across a wide frequency range, often at the top of the spectrogram in a band that wasn’t used for actual audio. If that’s what you’re looking for and the spectrogram is a single clean line, the answer is in the time domain, not the frequency domain.

The DTMF Rabbit Hole: why the name was misleading

“Beep boop” in CTF context is a strong signal for DTMF or dial-tone encoding. That association caused me to spend about an hour on a dead end before running sox stat. Here’s the diagnostic path I should have used from the start:

Run sox file.wav -n stat and check Rough frequency
If rough frequency is in the 700–1600 Hz range and RMS is high → possible DTMF (two simultaneous tones)
If rough frequency is much lower than expected carrier → on/off modulation → Morse code
If rough frequency is near expected carrier and RMS is low → possible LSB steganography (carrier intact, low-level modification)
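The decision path above can be written down as a small function. This is a sketch of the rules as stated, not a validated classifier: the 0.7 and 1.3 carrier ratios and the 0.2 RMS cutoff are hypothetical thresholds I picked for illustration, and the function name triage is my own.

```python
def triage(rough_hz, rms, expected_carrier_hz=None, loud_rms=0.2):
    """Map two `sox -n stat` numbers to a first hypothesis.

    rough_hz            -- the "Rough frequency" value
    rms                 -- the "RMS amplitude" value
    expected_carrier_hz -- carrier you believe is present, if any
    loud_rms            -- assumed cutoff between "low" and "high" RMS
    """
    if expected_carrier_hz is not None:
        ratio = rough_hz / expected_carrier_hz
        if ratio < 0.7:
            # Far fewer crossings than the carrier implies: on/off keying.
            return "morse"
        if ratio <= 1.3 and rms < loud_rms:
            # Carrier looks intact but quiet: low-level modification.
            return "lsb"
    if 700 <= rough_hz <= 1600 and rms >= loud_rms:
        # Loud signal inside the DTMF tone span.
        return "dtmf"
    return "unclear"
```

For the robot.wav numbers, triage(483, 0.388809, expected_carrier_hz=1000) lands on the Morse branch. Without a carrier guess the same numbers come back as "unclear", which is the cue to open the spectrogram.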

For DTMF specifically, the rough frequency figure is hard to interpret because two simultaneous tones create complex zero-crossing patterns. The reliable test for DTMF is to look at the frequency content and find the two characteristic peaks. The stat summary does not show them, but the spectrogram makes them immediately visible: DTMF has two horizontal bands, not one.
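The "two peaks" test can also be scripted. The sketch below uses the Goertzel recurrence, a standard way to measure energy at a handful of known frequencies, to ask whether a block of samples has real energy in both DTMF tone groups. The helper names and the 0.2 energy-share threshold are assumptions for illustration; this is not how SoX or any particular decoder does it.

```python
import math

DTMF_LOW_HZ = (697, 770, 852, 941)
DTMF_HIGH_HZ = (1209, 1336, 1477, 1633)


def goertzel_power(samples, rate_hz, freq_hz):
    """Squared magnitude of the block's spectrum at `freq_hz`."""
    coeff = 2 * math.cos(2 * math.pi * freq_hz / rate_hz)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def tone_share(samples, rate_hz, freq_hz):
    """Fraction of the block's energy at `freq_hz` (about 1.0 for a pure tone)."""
    energy = sum(x * x for x in samples)
    if energy == 0:
        return 0.0
    return 2 * goertzel_power(samples, rate_hz, freq_hz) / (len(samples) * energy)


def strongest_dtmf_pair(samples, rate_hz):
    """Return the strongest (low, high) DTMF tone candidates in the block."""
    low = max(DTMF_LOW_HZ, key=lambda f: goertzel_power(samples, rate_hz, f))
    high = max(DTMF_HIGH_HZ, key=lambda f: goertzel_power(samples, rate_hz, f))
    return low, high


def looks_like_dtmf(samples, rate_hz, min_share=0.2):
    """True when both tone groups hold a meaningful share of the energy."""
    low, high = strongest_dtmf_pair(samples, rate_hz)
    return (tone_share(samples, rate_hz, low) >= min_share
            and tone_share(samples, rate_hz, high) >= min_share)
```

Run it on a block of roughly 100 ms taken from a loud part of the file. A single 1000 Hz Morse carrier puts almost nothing at any DTMF frequency, so looks_like_dtmf should return False for it.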

Extracting timing data for Morse decoding

Once you’ve identified the file as Morse code, the next step is extracting the on/off timing. A Python decoder using 25 ms analysis windows recovered the message. The output began with “HI HOW ARE YOU DOING. THE FLAG IS TJCTF{…”. A conversational preamble before the flag is a common CTF touch, and seeing “TJCTF{” in the output confirmed I’d decoded the right pattern.

SoX can give a rough cross-check on the timing. The silence effect does not list the OFF periods; it strips them. Removing every silent stretch of at least 50 ms and passing the result to stat reports the length of what remains, which approximates the total keyed time (gaps shorter than 50 ms are kept):

$ sox robot.wav -n silence 1 0.05 1% -1 0.05 1% stat 2>&1

For more precise timing, I used Python’s wave module to compute amplitude over 25ms windows and extract the on/off pattern directly. The timing units in the file were:

Short burst (~50–75ms): dot
Long burst (~150–175ms): dash
Short gap (~25ms): between elements in the same letter
Medium gap (~100–125ms): between letters
Long gap (~375–400ms): between words
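The timing table above translates fairly directly into a decoder. The block below is not the script used during the CTF; it is a minimal sketch of the same windowed approach, assuming 8-bit unsigned mono PCM and using cut points halfway between the observed durations (112.5 ms for dot vs dash, 62.5 ms for a letter gap, 250 ms for a word gap). All function names and the 0.1 amplitude threshold are my own assumptions.

```python
import wave

# International Morse code for A-Z and 0-9.
MORSE = dict(zip(
    (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- .--. "
     "--.- .-. ... - ..- ...- .-- -..- -.-- --.. "
     "----- .---- ..--- ...-- ....- ..... -.... --... ---.. ----.").split(),
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"))


def read_unsigned_8bit_mono(path):
    """Load an 8-bit unsigned mono WAV as floats in [-1, 1) plus its rate."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 1 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 8-bit mono PCM")
        raw = wav.readframes(wav.getnframes())
        return [(b - 128) / 128 for b in raw], wav.getframerate()


def window_levels(samples, rate_hz, window_s=0.025):
    """Mean absolute amplitude of each full, consecutive window."""
    size = max(1, int(rate_hz * window_s))
    return [sum(abs(x) for x in samples[i:i + size]) / size
            for i in range(0, len(samples) - size + 1, size)]


def runs(levels, threshold):
    """Collapse window levels into [is_on, window_count] runs."""
    out = []
    for level in levels:
        on = level > threshold
        if out and out[-1][0] == on:
            out[-1][1] += 1
        else:
            out.append([on, 1])
    return out


def decode_morse(samples, rate_hz, window_s=0.025, threshold=0.1,
                 dash_s=0.1125, letter_gap_s=0.0625, word_gap_s=0.25):
    """Classify on/off run lengths and translate them to text."""
    text, symbol = [], ""
    for on, count in runs(window_levels(samples, rate_hz, window_s), threshold):
        length_s = count * window_s
        if on:
            symbol += "-" if length_s >= dash_s else "."
        elif symbol and length_s >= letter_gap_s:
            text.append(MORSE.get(symbol, "?"))  # "?" marks an unknown pattern
            symbol = ""
            if length_s >= word_gap_s:
                text.append(" ")
    if symbol:
        text.append(MORSE.get(symbol, "?"))
    return "".join(text).strip()
```

Typical use would be samples, rate = read_unsigned_8bit_mono("robot.wav") followed by print(decode_morse(samples, rate)). One caveat: the shortest gap in this file is about one window long, so on a real recording a smaller window_s may separate adjacent dots more reliably. The table only covers letters and digits, so punctuation such as braces comes back as "?".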

The message decoded as a phrase followed by the flag. Since this is a tool article rather than a full writeup, the important thing is the identification pipeline, not the specific content of what the Morse encoded.

SoX commands that matter in CTF: a practical reference

File information

# Detailed header info
$ soxi input.wav

# Statistics (rough frequency, RMS, amplitude range)
$ sox input.wav -n stat

# Same, with stderr merged into stdout (stat prints its report on stderr, so pipes and grep need this)
$ sox input.wav -n stat 2>&1
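Because the report arrives on stderr, a script that wants the numbers has to capture that stream rather than stdout. A minimal sketch, assuming sox is installed and on PATH; the helper names are hypothetical, and the parser simply splits each line at the colon.

```python
import subprocess


def sox_stat_command(path):
    """Argument list equivalent to `sox PATH -n stat`."""
    return ["sox", path, "-n", "stat"]


def run_sox_stat(path):
    """Run SoX and return the stat report, which it prints on stderr."""
    done = subprocess.run(sox_stat_command(path),
                          capture_output=True, text=True, check=True)
    return done.stderr


def parse_sox_stat(report):
    """Turn a stat report into a dict such as {"rough frequency": 483.0}."""
    stats = {}
    for line in report.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        key = " ".join(name.split()).lower()  # collapse column padding
        try:
            stats[key] = float(value.replace(",", ""))
        except ValueError:
            pass  # skip lines whose value is not a plain number
    return stats
```

With this in place, parse_sox_stat(run_sox_stat("input.wav"))["rough frequency"] gives the number to compare against the carrier you expect.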

Spectrogram generation

# Basic spectrogram (PNG)
$ sox input.wav -n spectrogram -o output.png

# Wider view for longer files
$ sox input.wav -n spectrogram -x 1200 -y 600 -o output.png

# Higher frequency resolution for spotting subtle patterns
# (-y is fastest when it is one more than a power of two)
$ sox input.wav -n spectrogram -w Kaiser -y 1025 -o output.png

Format conversion and manipulation

# Convert sample rate
$ sox input.wav -r 44100 output.wav

# Extract one channel from stereo (left = 1, right = 2)
$ sox input.wav output.wav remix 1
$ sox input.wav output.wav remix 2

# Reverse audio (for reversed-speech challenges)
$ sox input.wav output.wav reverse

# Speed up or slow down without pitch change
$ sox input.wav output.wav tempo 0.5   # half speed
$ sox input.wav output.wav tempo 2.0   # double speed

Silence detection

# Remove silence from start and end
$ sox input.wav output.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse

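# Split on silence (useful for separating Morse code segments)
# The second "1 0.1 1%" ends each segment at the next silence; output is segment001.wav, segment002.wav, ...
$ sox input.wav segment.wav silence 1 0.1 1% 1 0.1 1% : newfile : restart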

SoX vs Audacity vs FFmpeg: when to use which

Task	Best tool	Why not the others
Quick stat check (rough frequency, RMS)	SoX (`-n stat`)	Audacity requires loading the GUI; FFmpeg stat output is less readable for audio analysis
CLI spectrogram PNG generation	SoX (`-n spectrogram`)	Audacity is GUI-only; FFmpeg can generate spectrograms but requires more flags and separate display step
Visual spectrogram exploration (zoom, scroll)	Audacity	SoX generates static PNG; can’t interact with it
Detailed frequency analysis (identify exact Hz)	Audacity (frequency plot)	SoX rough frequency is approximate; FFmpeg plot requires additional rendering
Format conversion, batch processing	FFmpeg	More codec support than SoX; better for video/audio containers
Stereo channel extraction	SoX (`remix`)	Clean one-liner vs Audacity’s menu navigation
Reversing audio	SoX (`reverse`)	Fastest for CLI; Audacity is easier for selective reversal
LSB audio steganography	None of the three alone: inspect in Audacity, extract bits with Python	SoX and FFmpeg don’t decode LSB; need manual bit extraction

My workflow in CTF: soxi first for the header, then sox -n stat for the rough frequency and amplitude stats. If rough frequency is anomalously low or high, that’s a signal worth following before opening anything in a GUI. Audacity comes out only when I need to visually explore the spectrogram at a specific frequency range or time position — tasks that SoX’s static output can’t support.

Common audio CTF patterns and how SoX identifies them

Morse code

Rough frequency significantly below carrier frequency. Spectrogram shows single band with segmented horizontal pattern. Amplitude alternates between full-on and full-off. High RMS amplitude (signal is loud and clear). Confirmed in TJCTF 2024 beep-boop-robot.

DTMF encoding

Two frequency bands visible in the spectrogram simultaneously. Rough frequency typically lands somewhere in the 700–1600 Hz span of the DTMF tones, but it is unreliable for two-tone signals. The spectrogram shows column-like blobs (tones held for the digit duration, then silence between digits). multimon-ng -t wav -a DTMF file.wav or an online DTMF decoder handles the extraction.

Spectrogram steganography

The audio sounds normal, but the spectrogram shows unusual patterning at high frequencies (often above 15 kHz, where human hearing is weak). Rough frequency and RMS from sox stat will look normal, with no statistical anomaly. The tell is purely visual in the spectrogram. This is one situation where Audacity’s interactive spectrogram is better than SoX’s static PNG: you need to zoom into the high-frequency range and see if there’s an image hidden there.

Reversed audio

Audio that sounds like backwards speech or noise. No statistical anomaly in sox stat. Run sox input.wav output.wav reverse and listen to the result. Reversal is sometimes combined with speed changes: if the reversed audio still sounds garbled, try sox output.wav final.wav tempo 0.5 (slower, same pitch) or sox output.wav final.wav speed 0.5 (slower and lower in pitch).

Stereo channel split

Challenge delivers a stereo file where each channel contains different data. soxi will show Channels: 2. Extract and analyze each channel separately:

$ sox stereo.wav left.wav remix 1
$ sox stereo.wav right.wav remix 2
$ sox left.wav -n stat 2>&1
$ sox right.wav -n stat 2>&1
