binwalk in CTF: Spot False Positives Fast

binwalk has a reliable way of wasting your first twenty minutes in a CTF: it reports a TIFF image data hit at some offset deep inside a PNG, you spend time trying to extract it, and nothing usable comes out. The TIFF isn’t hidden content. It’s metadata embedded in the PNG that happens to begin with TIFF magic bytes. Once I understood this during picoCTF Matryoshka Doll, the challenge that had stalled me for nearly half an hour resolved in about five minutes.


The false positive that tripped me up: TIFF inside a PNG

Matryoshka Doll is a picoCTF Forensics Medium challenge. The download is called dolls.jpg. The first thing I always run is file before binwalk, and in this case that check immediately revealed something off:

$ file dolls.jpg
dolls.jpg: PNG image data, 594 x 1104, 8-bit/color RGBA, non-interlaced

It’s a PNG, not a JPEG. The extension is wrong — which in CTF almost always means something was intentionally obscured. Then I ran binwalk:

$ binwalk dolls.jpg

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             PNG image, 594 x 1104, 8-bit/color RGBA, non-interlaced
3226          0xC9A           TIFF image data, big-endian, offset of first image directory: 8
272492        0x4286C         Zip archive data, at least v2.0 to extract, compressed size: 378954, uncompressed size: 383940, name: base_images/2_c.jpg
651612        0x9F15C         End of Zip archive, footer length: 22

Four rows, three distinct findings: a PNG, a TIFF, and a ZIP with its end-of-archive record. The TIFF at offset 3226 (0xC9A) looked significant, and I immediately chased it. I extracted the bytes starting at that offset with dd, tried opening the result as a TIFF, and got nothing. This is the false positive that catches most beginners.

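What’s actually at 0xC9A? Metadata embedded in the PNG, not a hidden image. The bytes 4D 4D 00 2A are a big-endian TIFF header (“MM” for Motorola byte order, then 0x002A = 42, the TIFF magic number), and Exif metadata is stored in exactly that TIFF layout, so the most likely source is an Exif block (a PNG eXIf chunk) sitting just after the color profile. The ICC profile itself is an unlikely match: PNG stores it zlib-compressed in the iCCP chunk, and an ICC header carries its “acsp” signature at byte 36 rather than TIFF magic at byte 0. binwalk matches on raw byte signatures, so it reports “TIFF image data.” The bytes are genuinely there; they’re just not a separate image worth carving.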
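One way to settle what a hit belongs to is to walk the PNG's chunk list and see which chunk contains the offset. The sketch below assumes a well-formed PNG and does not verify CRCs; iter_png_chunks and chunk_containing are illustrative names, not part of binwalk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def iter_png_chunks(data):
    """Yield (chunk_type, start, end) per chunk; end is exclusive and includes the CRC."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        yield ctype.decode("latin-1"), pos, end
        if ctype == b"IEND":
            break
        pos = end

def chunk_containing(data, offset):
    """Return the type of the chunk that covers `offset`, or None if no chunk does."""
    for ctype, start, end in iter_png_chunks(data):
        if start <= offset < end:
            return ctype
    return None

# Usage (file name and offset taken from the binwalk output above):
# with open("dolls.jpg", "rb") as fh:
#     print(chunk_containing(fh.read(), 3226))
```

If the call names a metadata chunk rather than returning None (which would mean the offset lies outside every chunk), the hit is part of the PNG's own structure and not appended content.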

My reasoning at the time: two hits at 0x0 (the PNG) and 0x4286C (the ZIP) made sense. A third hit in the middle, TIFF image data, implied something else was also embedded. I assumed the challenge had three layers of content, not two, and that the TIFF was one of them. That assumption cost me twenty minutes. The tell: this same TIFF hit at offset 0xC9A appeared in every single layer of the challenge (four images total). Embedded color-profile and Exif metadata are common in PNG files from Apple systems. When you see the same hit at the same offset across multiple files, that’s a structural artifact, not embedded content.

The real hit was the Zip archive at 272492 (0x4286C), which contained the next image in the chain.
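The two ZIP rows also give the exact byte range to carve: the archive starts at 272492 and its end record at 651612 is 22 bytes long, so it ends at 651634 and spans 379142 bytes. A minimal Python equivalent of a dd carve might look like this (carve is a hypothetical helper, and layer1.zip is just an illustrative output name):

```python
def carve(src_path, dst_path, start, end=None):
    """Copy bytes [start, end) of src_path into dst_path; end=None means 'to EOF'."""
    with open(src_path, "rb") as src:
        src.seek(start)
        blob = src.read() if end is None else src.read(end - start)
    with open(dst_path, "wb") as dst:
        dst.write(blob)
    return len(blob)

ZIP_START = 272492       # offset of the "Zip archive data" row
ZIP_END = 651612 + 22    # "End of Zip archive" offset plus its footer length

# Usage, assuming dolls.jpg is in the current directory:
# carve("dolls.jpg", "layer1.zip", ZIP_START, ZIP_END)
```

Passing an explicit end keeps any bytes after the archive out of the carved file; leaving it as None mirrors the simpler "from the offset to the end of the file" extraction.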


Extracting the nested layers: what binwalk -e actually does

binwalk’s -e flag attempts automatic extraction. In theory it handles everything; in practice it depends on system tools being available. On my setup:

$ binwalk -e dolls.jpg

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
272492        0x4286C         Zip archive data, ...

WARNING: One or more files failed to extract: either no utility was found or it's unimplemented

The warning means binwalk found the ZIP but couldn’t unzip it automatically. The extracted directory _dolls.jpg.extracted/ still gets created and contains the raw ZIP file — you just have to unzip it manually. I’ve learned to do this by default rather than trusting -e to handle everything:

$ ls _dolls.jpg.extracted/
4286C.zip  base_images/

$ unzip _dolls.jpg.extracted/4286C.zip -d layer2/
Archive:  _dolls.jpg.extracted/4286C.zip
  inflating: layer2/base_images/2_c.jpg

Then the same cycle repeats: run binwalk -e on the new file so the carved ZIP lands in its _*.extracted/ directory, ignore the TIFF false positive at 0xC9A, and unzip the real ZIP.

Layer 2

$ binwalk -e layer2/base_images/2_c.jpg

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             PNG image, 526 x 1106, 8-bit/color RGBA, non-interlaced
3226          0xC9A           TIFF image data, big-endian, ...        ← same metadata false positive
187707        0x2DD3B         Zip archive data, ..., name: base_images/3_c.jpg
383807        0x5DB3F         End of Zip archive, footer length: 22

$ unzip layer2/base_images/_2_c.jpg.extracted/2DD3B.zip -d layer3/
  inflating: layer3/base_images/3_c.jpg

Layer 3

$ binwalk -e layer3/base_images/3_c.jpg

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             PNG image, 428 x 1104, 8-bit/color RGBA, non-interlaced
3226          0xC9A           TIFF image data, big-endian, ...        ← same metadata false positive
123606        0x1E2D6         Zip archive data, ..., name: base_images/4_c.jpg
201425        0x312D1         End of Zip archive, footer length: 22

$ unzip layer3/base_images/_3_c.jpg.extracted/1E2D6.zip -d layer4/
  inflating: layer4/base_images/4_c.jpg

Layer 4: the flag appears

$ binwalk -e layer4/base_images/4_c.jpg

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             PNG image, 320 x 768, 8-bit/color RGBA, non-interlaced
3226          0xC9A           TIFF image data, big-endian, ...        ← same metadata false positive
79578         0x136DA         Zip archive data, ..., name: flag.txt   ← this is different
79786         0x137AA         End of Zip archive, footer length: 22

$ unzip layer4/base_images/_4_c.jpg.extracted/136DA.zip -d layer5/
  inflating: layer5/flag.txt

$ cat layer5/flag.txt
picoCTF{e3f378fe6c1ea7f6bc5ac2c3d6801c1f}

Flag: picoCTF{e3f378fe6c1ea7f6bc5ac2c3d6801c1f}

The image widths shrank at each layer (594×1104 → 526×1106 → 428×1104 → 320×768), which in hindsight was a hint that each image was nested inside the previous one. The ZIP at layer 4 differed from the previous three: instead of name: base_images/X_c.jpg, it listed name: flag.txt. That pattern change is the signal that you’ve hit the final layer.


Reading the binwalk output: how to separate real hits from noise

binwalk works by scanning for file signatures — specific byte sequences that mark the start of a known file format. The problem is that these byte sequences appear naturally in binary data, not just at real file boundaries. Knowing which hits are real requires cross-referencing the offset and context.
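To see why chance matches happen, here is a deliberately naive signature scanner. It is a simplified sketch, not binwalk's implementation: the SIGNATURES table is a small hand-picked subset, and real binwalk applies extra validation per format.

```python
SIGNATURES = {
    "PNG image": b"\x89PNG\r\n\x1a\n",
    "TIFF image data, big-endian": b"MM\x00*",
    "TIFF image data, little-endian": b"II*\x00",
    "Zip archive data": b"PK\x03\x04",
    "End of Zip archive": b"PK\x05\x06",
}

def scan(data):
    """Return sorted (offset, description) pairs for every signature occurrence."""
    hits = []
    for desc, magic in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, desc))
            pos = data.find(magic, pos + 1)
    return sorted(hits)

# Usage, assuming dolls.jpg is in the current directory:
# with open("dolls.jpg", "rb") as fh:
#     for offset, desc in scan(fh.read()):
#         print(f"{offset:<10} 0x{offset:<10X} {desc}")
```

A four-byte pattern such as MM 00 2A will match anywhere those bytes occur, whether they start a standalone file or sit inside another file's metadata, which is why the offset and surrounding context matter more than the description.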

Signs a hit is likely real:

  • The offset makes sense structurally — it appears after the previous file’s end
  • binwalk also reports a corresponding “End of [archive]” entry
  • The described file size in the output is plausible (not 0 bytes, not 500MB for an 80KB file)
  • The filename listed inside the archive is recognizable (flag.txt, image.png)

Signs a hit is likely a false positive:

  • It appears at the same offset in every file you scan (the TIFF header in embedded metadata at 0xC9A)
  • It’s matched on a short signature with no size or filename context (raw TIFF image data with no supporting ZIP/End-of-archive entry)
  • It appears in the middle of an already-identified file (TIFF at 0xC9A is inside the PNG from offset 0)
  • DER certificate hits: the signature is short, so it matches by chance in all kinds of binary data

The DER certificate case is worth flagging separately. binwalk frequently reports Certificate in DER format hits scattered throughout a file. In network captures these can be real certificate bytes from TLS handshakes; in most other files they are chance matches, because the DER signature is only a few bytes long (a 30 82 sequence header). Unless the challenge is specifically about certificates, ignore every DER hit and focus on archive formats (ZIP, Gzip, 7-zip, RAR) and image formats that appear outside existing identified regions.
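The "same offset in every file" sign from the list above is easy to automate once you have several scans. The sketch below hard-codes the offsets from the four binwalk runs in this write-up; recurring_hits is a hypothetical helper, and it ignores offset 0 because that row is the container itself.

```python
def recurring_hits(scans):
    """Return (offset, short description) pairs present in every scan, skipping offset 0."""
    common = None
    for hits in scans.values():
        keys = {(offset, desc.split(",")[0]) for offset, desc in hits if offset != 0}
        common = keys if common is None else common & keys
    return sorted(common or [])

SCANS = {
    "dolls.jpg": [(0, "PNG image"), (3226, "TIFF image data, big-endian"),
                  (272492, "Zip archive data"), (651612, "End of Zip archive")],
    "2_c.jpg": [(0, "PNG image"), (3226, "TIFF image data, big-endian"),
                (187707, "Zip archive data"), (383807, "End of Zip archive")],
    "3_c.jpg": [(0, "PNG image"), (3226, "TIFF image data, big-endian"),
                (123606, "Zip archive data"), (201425, "End of Zip archive")],
    "4_c.jpg": [(0, "PNG image"), (3226, "TIFF image data, big-endian"),
                (79578, "Zip archive data"), (79786, "End of Zip archive")],
}

print(recurring_hits(SCANS))  # [(3226, 'TIFF image data')]
```

The ZIP rows drop out because their offsets differ per file, which is what you would expect from real payloads of different sizes.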


When binwalk is the right choice and when it isn’t

binwalk is specifically designed for file carving — finding embedded files inside a container. It is not the right tool for every forensics situation, and knowing when to reach for it versus something else saves time.

Situation | Reach for | Why not the other
PNG/JPG with something hidden inside | binwalk | steghide embeds via DCT modification (no signature); strings catches plaintext only
Passphrase-protected hidden data in JPEG | steghide + Stegseek | binwalk won’t see DCT-embedded content
Disk image, extract a specific partition | dd + fdisk | binwalk is slower on large disk images; dd gives you exact byte ranges
Finding flags/strings in any binary | strings first, then binwalk | strings is instant; binwalk is heavier
Nested files (Matryoshka patterns) | binwalk (recursive) | Nothing else handles recursive extraction as cleanly
Corrupt file header | pngcheck / hexedit | binwalk won’t detect broken headers it can’t parse

The decision tree I use in CTF: run file first (catches extension mismatches), then strings | grep -i flag (catches trivially embedded plaintext), then binwalk (catches embedded archives and images). steghide and zsteg come later, only if the challenge explicitly suggests image steganography.


binwalk -e versus manual extraction: which to use

The automatic extraction flag (-e) works when all required system utilities are installed. When they’re not, binwalk still creates the _filename.extracted/ directory and drops the raw bytes of each identified file there — but it doesn’t unpack archives. You get a 4286C.zip file instead of the contents.

I use manual extraction by default for two reasons:

  1. Control over directory structure. When you’re four layers deep, -e creates nested _extracted directories that become difficult to navigate. Manually naming each layer (layer2/, layer3/, etc.) keeps the extraction organized.
  2. Visibility. Running binwalk then unzip separately means you see each layer’s output explicitly. With -e, everything happens silently and you can miss the point where the inner content changes (X_c.jpg → flag.txt in this challenge).

For single-layer extractions where binwalk reports one clean hit and one end-of-archive marker, -e is fine. For anything recursive or with multiple candidate hits, go manual.


Common patterns in CTF challenges that need binwalk

Extension mismatch

Files delivered with wrong extensions (.jpg that’s actually a PNG, .dat that contains a ZIP) are the most common setup. Always run file before binwalk. If the file output disagrees with the extension, that disagreement is usually the first step in the solve.
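When there are many files to triage, the same check can be scripted. This is a rough stand-in for file, not a replacement: the MAGIC table covers only a handful of formats, and sniff and extension_mismatch are illustrative names.

```python
import os

MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),
    (b"%PDF-", "pdf"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]
EXT_ALIASES = {"jpg": "jpeg"}

def sniff(header):
    """Return a format name based on the leading bytes, or None if unrecognized."""
    for magic, kind in MAGIC:
        if header.startswith(magic):
            return kind
    return None

def extension_mismatch(path):
    """Return (claimed, actual) when extension and magic bytes disagree, else None."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    claimed = EXT_ALIASES.get(ext, ext)
    with open(path, "rb") as fh:
        actual = sniff(fh.read(16))
    if actual is not None and actual != claimed:
        return claimed, actual
    return None

# Usage: extension_mismatch("dolls.jpg") returns a tuple when the name lies about the type.
```

Files whose type the table doesn't recognize return None, so a None result means "no mismatch detected", not "verified".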

Trailer data after a valid file

A real PNG or JPEG ends with a fixed marker (the IEND chunk for PNG, the FF D9 end-of-image marker for JPEG). Anything after that is invisible to image viewers but visible to binwalk. A challenge might present a normal-looking image that has a ZIP archive appended after the image end marker. binwalk reports the archive offset; you extract from that point forward.
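For PNG specifically, the trailer can be found without binwalk, because the IEND chunk is always the same 12 bytes (zero length, the type IEND, and a fixed CRC). A minimal sketch, assuming the first occurrence of that sequence is the real end of the image; png_trailer is a hypothetical name:

```python
# Zero-length IEND chunk: length 00 00 00 00, type "IEND", CRC AE 42 60 82.
PNG_IEND = b"\x00\x00\x00\x00IEND\xaeB`\x82"

def png_trailer(data):
    """Return the bytes that follow the IEND chunk, or None if no IEND chunk is found."""
    idx = data.find(PNG_IEND)
    if idx == -1:
        return None
    return data[idx + len(PNG_IEND):]

# Usage, assuming dolls.jpg is in the current directory:
# with open("dolls.jpg", "rb") as fh:
#     extra = png_trailer(fh.read())
# print(len(extra), extra[:4])
```

A non-empty result that starts with PK 03 04 is a strong sign of an appended ZIP. JPEG is harder to handle this way because FF D9 can also appear inside embedded thumbnails.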

Polyglot files

A polyglot file is valid as two different formats simultaneously — for example, a file that is both a valid PNG and a valid PDF depending on which parser reads it. binwalk will report both format signatures, and your job is to figure out which reading the challenge intends. The picoCTF challenge “Secret of the Polyglot” is a clean example: binwalk flag2of2-final.pdf shows a PNG at offset 0 and a PDF starting at offset 914. Renaming the file to .png and opening it reveals a flag half; the PDF view shows the other half.

Recursive nesting (Matryoshka pattern)

Each file contains an archive containing the next file, which contains an archive, and so on. The challenge name “Matryoshka Doll” makes the pattern explicit; other challenges hide it. The signal is that binwalk reports a similar structure (archive → image) repeatedly. The termination condition is when the deepest archive contains something different from the pattern — in this case, flag.txt instead of another X_c.jpg.
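The loop can also be scripted with Python's standard zipfile module, which locates the archive from its end-of-archive record and so can open a ZIP appended to an image without carving it first. The sketch below is one possible approach, not what binwalk does internally; unpack_nested and the layerN directory names are illustrative, and it assumes each archive holds a single file, as in this challenge.

```python
import os
import zipfile

def unpack_nested(path, workdir, max_depth=20):
    """Follow file -> appended ZIP -> file ... and return the path of the last file reached."""
    current = path
    for layer in range(2, max_depth + 2):
        if not zipfile.is_zipfile(current):
            break  # termination condition: nothing left to unpack
        layer_dir = os.path.join(workdir, f"layer{layer}")
        with zipfile.ZipFile(current) as zf:
            names = [n for n in zf.namelist() if not n.endswith("/")]
            zf.extractall(layer_dir)
        print(f"layer{layer}: {names}")  # watch for the member name changing
        if not names:
            break
        current = os.path.join(layer_dir, names[0])
    return current

# Usage, assuming dolls.jpg is in the current directory:
# print(unpack_nested("dolls.jpg", "layers"))
```

Printing the member names per layer keeps the pattern change visible, and max_depth guards against archives crafted to nest indefinitely.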


Pre-analysis checklist before running binwalk

Running binwalk immediately on every file isn’t wrong, but these prior checks often point you to the answer faster:

  1. file <target> — confirm actual file type matches extension
  2. strings <target> | grep -i "pico\|flag\|ctf" — catch any plaintext flags before doing heavy analysis. For Matryoshka Doll, this returned nothing: the flag was buried inside nested, compressed ZIP layers, not written as plaintext in the outer file. That null result confirmed the data wasn’t trivially accessible and justified moving to binwalk.
  3. xxd <target> | head -4 — read the magic bytes manually to verify the file type
  4. exiftool <target> — check metadata for embedded text, comments, GPS data
  5. Then: binwalk <target>

For Matryoshka Doll, file at step 1 immediately flagged the PNG-vs-JPG mismatch. strings at step 2 returned nothing useful. binwalk at step 5 showed the archive structure. The TIFF false positive at 0xC9A looked important until I recognized it as embedded metadata, at which point the ZIP at 0x4286C was the only real hit worth following.
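Step 2 of the checklist can be approximated in Python when strings and grep aren't at hand. This is a sketch of the idea, not a reimplementation of strings: it treats runs of four or more printable ASCII bytes as strings and keeps those containing one of the keywords; keyword_strings is a hypothetical name.

```python
import re

PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")           # like strings' default minimum of 4
KEYWORDS = re.compile(rb"pico|flag|ctf", re.IGNORECASE)   # same terms as the grep pattern

def keyword_strings(data):
    """Return printable ASCII runs in `data` that contain one of the keywords."""
    return [
        m.group().decode("ascii")
        for m in PRINTABLE_RUN.finditer(data)
        if KEYWORDS.search(m.group())
    ]

# Usage, assuming dolls.jpg is in the current directory:
# with open("dolls.jpg", "rb") as fh:
#     print(keyword_strings(fh.read()))
```

An empty list here means the same thing as empty grep output: the flag isn't sitting in the file as plaintext, so carving is the next step.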


Further Reading

This challenge is part of the picoCTF Forensics category. If you’re building out your forensics toolkit, CTF Forensics Tools: The Ultimate Guide for Beginners covers which tools to reach for at each stage of a challenge.

When binwalk finds a PNG inside a file and the PNG appears corrupted, pngcheck in CTF is the next tool to run — it identifies exactly which chunk is broken and what the valid structure should look like.

If binwalk returns no useful hits and the file is an image, the data may be steganographically embedded rather than appended. steghide in CTF covers the DCT-based embedding approach that binwalk cannot detect.

For forensics challenges involving disk images rather than individual files, dd in CTF explains how to extract specific partitions by offset — the same technique that applies when binwalk gives you a precise byte address and you want to extract without relying on the automatic extraction flag.
