pdfdumper in CTF: Extracting PDF Content and Common Challenge Patterns


A Quiet Challenge with Just One PDF File

In the Forensics category of a regional CTF, all I got was a single file named challenge.pdf.

The problem statement only said “Find the hidden flag.” A typical “here’s a file, good luck” kind of challenge.

First, the basics.

bash

$ file challenge.pdf
challenge.pdf: PDF document, version 1.7

A normal PDF. No suspicious disguises.

bash

$ strings challenge.pdf | grep -i flag

Nothing.

Opened it in Adobe Reader. One blank page. Nothing visible.

At this point, I had four hypotheses:

  1. Embedded image — LSB steganography hidden in a white image
  2. Attachment — Another file attached to the PDF
  3. JavaScript — JS running invisibly somewhere
  4. Pure steganography — The PDF itself is a cover, and binwalk will reveal something

Memories of a past challenge where a ZIP file was disguised as a PNG came back, so I started with hypothesis #4.

The First Rabbit Hole — 15 Minutes Wasted on binwalk

bash

$ binwalk challenge.pdf

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             PDF document, version: "1.7"

…That’s it. Nothing else.

“Wait, maybe it’s just compressed,” I thought, and tried extraction with the -e option.

bash

$ binwalk -e challenge.pdf

A folder was created, but it was empty.

15 minutes gone.

Next thought: image extraction. Maybe there’s an image embedded in the PDF with steganography.

bash

$ pdfimages challenge.pdf output

No output. Confirmed: not a single image embedded.

That’s when I realized: I had been operating under the assumption that “PDF file = image container.”

I hadn’t actually looked inside the PDF structure. I had been making decisions based purely on file format, completely skipping structural analysis.

The Turning Point — Realizing “I Haven’t Actually Looked at the PDF Structure”

Taking a step back, I recalled what PDFs are actually made of:

  • Object numbering structure — Each element is managed in a format like 1 0 obj
  • Streams — Compressed data, images, or scripts
  • EmbeddedFile — Separate data can be embedded as attachments
  • /JS key — If JavaScript is embedded, this keyword appears

In other words, if I could see the PDF’s internal structure at a glance, I’d know where everything is.

Two tools came to mind:

  • pdf-parser.py — By Didier Stevens, detailed object-by-object analysis
  • pdfdumper — Dumps all objects at once and visualizes the structure

pdf-parser.py is precise, but it requires specifying object numbers and digging through them one by one, which takes time.

My current situation:

  • Single PDF file
  • No visual hints
  • Forensics category
  • 1.5 hours remaining

Because it can show the entire structure at once, I chose pdfdumper.

Running pdfdumper — Execution Log and My First Misunderstanding

bash

$ pdfdumper challenge.pdf

The output: a directory with a list of all objects and their contents dumped as individual files.

obj1.txt  — Catalog
obj2.txt  — Pages
obj12.bin — Fairly large file
obj22.js  — JS stream!

The presence of obj22.js made me think “this is it.”

But before that, obj12.bin caught my attention. Several dozen KB. Maybe some data is hidden there.

bash

$ file obj12.bin
obj12.bin: TrueType Font data

Font data.

Lesson learned: Large data in a PDF is not necessarily hidden data. Fonts are normally large.
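
The check that file performs here boils down to looking at the first few bytes of the data. As a minimal sketch of that idea (not how file itself is implemented), the Python below compares the start of a dumped object against a handful of well-known signatures. The table MAGIC_NUMBERS and the function guess_type are hypothetical names I made up for illustration, and the list is far from complete.

```python
# Leading bytes ("magic numbers") for a few formats that turn up in PDF dumps
MAGIC_NUMBERS = [
    (b"\x00\x01\x00\x00", "TrueType font"),
    (b"OTTO", "OpenType font (CFF)"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
]


def guess_type(data):
    """Return a rough type label based on the first bytes of a dumped object."""
    for magic, label in MAGIC_NUMBERS:
        if data.startswith(magic):
            return label
    return "unknown"
```

A large object that comes back as a font can be set aside quickly, while a ZIP or image signature inside a PDF would be worth a second look.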

Opened obj22.js.

javascript

var encoded = "ZmxhZ3tQREZfanNfc3RyZWFtX2hpZGRlbl9kYXRhfQ==";

Base64. No eval, no document.write. Just embedded data.

bash

$ echo "ZmxhZ3tQREZfanNfc3RyZWFtX2hpZGRlbl9kYXRhfQ==" | base64 -d
flag{PDF_js_stream_hidden_data}

Got it.
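
When a dumped script contains more than one suspicious string, decoding them by hand gets tedious. The following Python sketch pulls double-quoted string literals that look like Base64 out of JavaScript source and decodes the ones that turn into readable text. decode_base64_literals and the 16-character minimum are my own assumptions for illustration; single-quoted strings, concatenated fragments, and other encodings are not handled.

```python
import base64
import binascii
import re

# Double-quoted literals made only of Base64 characters, at least 16 long
STRING_LITERAL_RE = re.compile(r'"([A-Za-z0-9+/]{16,}={0,2})"')


def decode_base64_literals(js_source):
    """Return the decoded text of every Base64-looking string literal."""
    decoded = []
    for literal in STRING_LITERAL_RE.findall(js_source):
        try:
            raw = base64.b64decode(literal, validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not readable text
        decoded.append(text)
    return decoded
```

Run against the single line from obj22.js, this returns a one-element list containing the same flag that base64 -d printed.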

Why I Could Discard the Other Objects

Once pdfdumper dumped all objects as files, I was able to establish what to look at first.

Decision criteria:

  • Size — Too small means just metadata or reference info
  • Type — Keywords like /JS or /EmbeddedFile
  • Structural consistency — Fonts and page trees follow standard patterns
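
These criteria can be turned into a crude ranking. The Python below is only a sketch of that idea: it scores each dumped object by the keywords it contains and by its size, then sorts the objects so the most interesting come first. The weights, the 16-byte cutoff, and the names triage_score and rank_objects are arbitrary assumptions of mine, not something pdfdumper does.

```python
def triage_score(body):
    """Score one object's raw bytes; higher means look at it sooner."""
    score = 0
    if b"/JS" in body or b"/JavaScript" in body:
        score += 3  # scripts are a prime hiding place
    if b"/EmbeddedFile" in body:
        score += 3  # attachments are equally suspicious
    if b"/Font" in body or b"/Pages" in body:
        score -= 2  # standard structure, usually uninteresting
    if len(body) < 16:
        score -= 1  # too small to hold more than a reference
    return score


def rank_objects(objects):
    """Return object names ordered from most to least interesting."""
    return sorted(objects, key=lambda name: triage_score(objects[name]), reverse=True)
```

With a dictionary mapping file names such as obj22.js to their contents, rank_objects puts the JavaScript object ahead of the catalog and the font.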

I could set the font data aside quickly because the object’s Type information showed /Font.

This visibility came from pdfdumper outputting structural information alongside each object.

With strings alone, I couldn’t have made this judgment.

When NOT to Use pdfdumper

Conversely, there are cases where I wouldn’t use this tool.

Cases to Skip It

  1. PDF is just an image container
    If pdfimages -list shows that the content is mostly embedded images such as JPEGs, pdfimages alone is sufficient.
  2. Problem statement hints at “viewer exploit”
    Dynamic analysis is needed, so trace PDF.js or Adobe Reader behavior in a sandbox environment.
  3. Web category with upload functionality
    Prioritize looking for validation bugs in upload processing over PDF structure analysis.

When to Switch Tools

  • No JS found → Switch to pdf-parser.py for detailed analysis
  • Many streams → Format with qpdf --qdf first, then analyze
  • Encrypted objects → Confirm decryption method first
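
As a rough illustration of these three switch conditions, the Python sketch below checks the raw bytes of a PDF and returns matching suggestions. Everything in it is an assumption on my part: the function name suggest_next_steps, the threshold of 20 streams for "many," and the plain substring checks, which will miss keywords hidden inside compressed object streams.

```python
def suggest_next_steps(pdf_bytes, many_streams_threshold=20):
    """Map the three switch conditions to follow-up suggestions."""
    suggestions = []
    # No JavaScript keyword visible in the raw bytes
    if b"/JS" not in pdf_bytes and b"/JavaScript" not in pdf_bytes:
        suggestions.append("no JS found: inspect objects with pdf-parser.py")
    # Each stream ends with the "endstream" keyword, so count those
    if pdf_bytes.count(b"endstream") > many_streams_threshold:
        suggestions.append("many streams: normalize with qpdf --qdf first")
    # Encrypted PDFs carry an /Encrypt entry in the trailer
    if b"/Encrypt" in pdf_bytes:
        suggestions.append("encrypted: confirm the decryption method first")
    return suggestions
```

An empty list means none of the switch conditions fired and the current approach is still worth pursuing.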

CTF Decision Comparison Table

Condition | pdfdumper | pdf-parser.py | binwalk
Initial speed | ◎ One-shot dump | △ Analysis-focused |
Structure visualization | | | ×
Rabbit hole resistance | | | ×
Stego-focused problems | | |
JS analysis | | | ×

Go-to decision
Single PDF / structure unknown → pdfdumper only

Don’t use when
Suspected ZIP disguise → binwalk
Object-level scrutiny needed → pdf-parser

The Moment I Understood PDF Internal Structure

This challenge deepened my understanding of PDF structure a bit:

  • xref table — Manages object position information
  • indirect object — Independent data referenceable in 1 0 obj format
  • stream compression (FlateDecode) — Most data is compressed, so raw data isn’t readable
  • JavaScript embedding spec — Tied to actions via the /JS key
  • Difference between fonts and embedded data — Fonts are /Font type, data is /EmbeddedFile type
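
The FlateDecode point is a plausible reason why strings | grep found nothing: text inside a compressed stream is not stored as readable bytes. The Python sketch below, a rough illustration and not a replacement for a real PDF parser, cuts out everything between the stream and endstream keywords and tries to inflate it with zlib. inflate_streams and its regex are my own assumptions; the sketch ignores /Length, any filter other than Flate, and encryption.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed contents of every stream zlib can inflate."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        # decompressobj tolerates the trailing newline before "endstream"
        inflater = zlib.decompressobj()
        try:
            inflated.append(inflater.decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate-compressed, so skip it
    return inflated
```

Searching the returned byte strings for keywords such as flag or for Base64-looking text covers the ground that a search over the raw file misses.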

However, even knowing the theory, reading every dumped object in full wastes time.

In CTF, “triage over scrutiny” is key. pdfdumper was a tool that accelerated that triage process.

Shortest Action Plan for Next Time

If I encounter a similar PDF challenge again, I’ll follow this flow:

bash

# 1. File check
file challenge.pdf

# 2. Dump all objects with pdfdumper
pdfdumper challenge.pdf

# 3. Check by priority
# - Check for /JS
# - Check for /EmbeddedFile
# - Check for images

# 4. Switch if no results in 10 minutes
# - Detailed analysis with pdf-parser.py
# - Format with qpdf and read in text editor

You don’t need to make the perfect choice from the start. Cutting your losses and switching after 10 minutes is what I consider practical decision-making in CTF.


This challenge wasn’t so much about learning how to use a tool as about learning “when and what to use”: the decision criteria.

Next time I encounter a similar situation, I won’t hesitate.


Further Reading

If you’re looking to expand your CTF forensics toolkit beyond PDF analysis, I’ve written a comprehensive guide covering the essential tools you’ll need. Check out CTF Forensics Tools: The Ultimate Guide for Beginners for a complete overview of forensics techniques and tooling strategies.

Beyond pdfdumper, there are several other forensics scenarios you’ll commonly encounter in CTF competitions. When you’re dealing with password-protected ZIP files, my article on zip2john in CTF: Extracting ZIP Passwords and Common Challenge Patterns walks through the practical approach to cracking them. For challenges involving QR codes or barcodes hidden in images or documents, zbarimg in CTF: QR/Barcode Decoding Techniques and Common Challenge Patterns covers the detection and extraction workflow. And if you find yourself facing image-based steganography challenges, steghide in CTF: How to Hide and Extract Data from Files demonstrates the decision-making process behind using this classic stego tool.

You can find more CTF technique breakdowns and tool deep-dives at alsavaudomila.com.
