A Quiet Challenge with Just One PDF File
In the Forensics category of a regional CTF, all I got was a single file named challenge.pdf.
The problem statement only said “Find the hidden flag.” A typical “here’s a file, good luck” kind of challenge.
First, the basics.
bash
$ file challenge.pdf
challenge.pdf: PDF document, version 1.7
A normal PDF. No suspicious disguises.
bash
$ strings challenge.pdf | grep -i flag
Nothing.
Opened it in Adobe Reader. One blank page. Nothing visible.
At this point, I had four hypotheses:
- Embedded image — LSB steganography hidden in a white image
- Attachment — Another file attached to the PDF
- JavaScript — JS running invisibly somewhere
- Pure steganography — The PDF itself is a cover, and binwalk will reveal something
Memories of a past challenge where a ZIP file was disguised as a PNG came back, so I started with hypothesis #4.
The First Rabbit Hole — 15 Minutes Wasted on binwalk
bash
$ binwalk challenge.pdf
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             PDF document, version: "1.7"
…That’s it. Nothing else.
“Wait, maybe it’s just compressed,” I thought, and tried extraction with the -e option.
bash
$ binwalk -e challenge.pdf
A folder was created, but it was empty.
15 minutes gone.
Next thought: image extraction. Maybe there’s an image embedded in the PDF with steganography.
bash
$ pdfimages challenge.pdf output
No output. Confirmed: not a single image embedded.
That’s when I realized: I had been operating under the assumption that “PDF file = image container.”
I hadn’t actually looked inside the PDF structure. I had been making decisions based purely on file format, completely skipping structural analysis.
The Turning Point — Realizing “I Haven’t Actually Looked at the PDF Structure”
Taking a step back, I recalled what PDFs are actually made of:
- Object numbering structure: Each element is managed in a format like 1 0 obj
- Streams: Compressed data, images, or scripts
- EmbeddedFile: Separate data can be embedded as attachments
- /JS key: If JavaScript is embedded, this keyword appears
In other words, if I could see the PDF’s internal structure at a glance, I’d know where everything is.
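As a rough illustration of what "seeing the structure at a glance" can mean, here is a minimal Python sketch that counts object headers and a few telltale names in the raw bytes. The keyword list and the sample bytes are assumptions made for this example, not output from any of the tools discussed here.

```python
import re

# Names worth flagging during triage. This selection is an assumption for
# illustration, not an exhaustive list.
KEYWORDS = [b"/JS", b"/JavaScript", b"/EmbeddedFile", b"/OpenAction", b"/Font"]


def count_objects(pdf_bytes: bytes) -> int:
    """Count indirect object headers such as '1 0 obj'."""
    return len(re.findall(rb"\b\d+\s+\d+\s+obj\b", pdf_bytes))


def count_keywords(pdf_bytes: bytes) -> dict:
    """Count each name, rejecting longer names that merely start with it
    (so /Font does not also match /FontDescriptor)."""
    counts = {}
    for kw in KEYWORDS:
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts


# Hypothetical fragment shaped like the objects described in this write-up.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 22 0 R >>\nendobj\n"
    b"12 0 obj\n<< /Type /Font /Subtype /TrueType >>\nendobj\n"
    b"22 0 obj\n<< /S /JavaScript /JS 23 0 R >>\nendobj\n"
)

print(count_objects(SAMPLE))
print(count_keywords(SAMPLE))
```

A scan like this only sees names stored in plain text. Names inside compressed object streams, or written with #xx hex escapes, would slip past it, so treat an empty result as "keep looking" rather than proof of absence.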
Two tools came to mind:
- pdf-parser.py — By Didier Stevens, detailed object-by-object analysis
- pdfdumper — Dumps all objects at once and visualizes the structure
pdf-parser.py is precise, but it requires specifying object numbers and digging through them one by one, which takes time.
My current situation:
- Single PDF file
- No visual hints
- Forensics category
- 1.5 hours remaining
Because it could show the entire structure at once, I chose pdfdumper.
Running pdfdumper — Execution Log and My First Misunderstanding
bash
$ pdfdumper challenge.pdf
The output: a directory with a list of all objects and their contents dumped as individual files.
obj1.txt     # Catalog
obj2.txt     # Pages
obj12.bin    # Fairly large file
obj22.js     # JS stream!
The presence of obj22.js made me think “this is it.”
But before that, obj12.bin caught my attention. Several dozen KB. Maybe some data is hidden there.
bash
$ file obj12.bin
obj12.bin: TrueType Font data
Font data.
Lesson learned: Large data in a PDF is not necessarily hidden data. Embedded fonts are routinely large.
Opened obj22.js.
javascript
var encoded = "ZmxhZ3tQREZfanNfc3RyZWFtX2hpZGRlbl9kYXRhfQ==";
Base64. No eval, no document.write. Just embedded data.
bash
$ echo "ZmxhZ3tQREZfanNfc3RyZWFtX2hpZGRlbl9kYXRhfQ==" | base64 -d
flag{PDF_js_stream_hidden_data}
Got it.
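The same decode step can be scripted when there are many dumped files to check. The following is a small Python sketch that pulls Base64-looking runs out of a piece of text and keeps only those that decode to printable ASCII. The 16-character minimum and the function name are arbitrary choices for this example.

```python
import base64
import re

# Runs of the Base64 alphabet, at least 16 characters, with optional padding.
# The minimum length is an arbitrary threshold chosen for this sketch.
CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_candidates(text: str) -> list:
    """Return decoded strings for Base64-looking runs that yield printable ASCII."""
    results = []
    for match in CANDIDATE.findall(text):
        try:
            raw = base64.b64decode(match, validate=True)
        except ValueError:
            continue  # wrong length or padding, so not valid Base64
        if raw and all(32 <= b < 127 for b in raw):
            results.append(raw.decode("ascii"))
    return results


js_source = 'var encoded = "ZmxhZ3tQREZfanNfc3RyZWFtX2hpZGRlbl9kYXRhfQ==";'
print(decode_candidates(js_source))  # ['flag{PDF_js_stream_hidden_data}']
```

Filtering on printable output keeps long identifiers and hex blobs from flooding the results, at the cost of missing payloads that decode to binary data.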
Why I Could Discard the Other Objects
Once pdfdumper dumped all objects as files, I was able to establish what to look at first.
Decision criteria:
- Size: Too small means just metadata or reference info
- Type: Keywords like /JS or /EmbeddedFile
- Structural consistency: Fonts and page trees follow standard patterns
I didn’t spend more than a quick file check on the font data because the object’s Type information showed /Font.
This visibility came from pdfdumper outputting structural information alongside each object.
With strings alone, I couldn’t have made this judgment.
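The criteria above can be written down as a tiny ranking function. This is only a sketch of the idea in Python: the DumpedObject type, the priority weights, the 64-byte cutoff, and the sizes and dictionary strings in the example are all hypothetical and were not produced by any tool.

```python
from dataclasses import dataclass


@dataclass
class DumpedObject:
    name: str        # file name of the dumped object, e.g. "obj22.js"
    size: int        # size in bytes
    dictionary: str  # the object's dictionary text


def priority(obj: DumpedObject) -> int:
    """Lower number means look at it sooner. Weights are assumptions."""
    d = obj.dictionary
    if "/JS" in d or "/JavaScript" in d or "/EmbeddedFile" in d:
        return 0  # type keywords that usually carry a payload
    if "/Font" in d or "/Pages" in d or "/Catalog" in d:
        return 2  # standard structure; large size alone is not suspicious
    if obj.size < 64:
        return 3  # too small for more than metadata or references
    return 1      # unexplained and non-trivial, worth a look


def triage(objects: list) -> list:
    """Sort by priority first, then largest first within the same priority."""
    return sorted(objects, key=lambda o: (priority(o), -o.size))


# Hypothetical values modelled on the objects in this write-up.
dump = [
    DumpedObject("obj1.txt", 70, "<< /Type /Catalog /Pages 2 0 R >>"),
    DumpedObject("obj2.txt", 60, "<< /Type /Pages /Kids [3 0 R] /Count 1 >>"),
    DumpedObject("obj12.bin", 40000, "<< /Type /Font /Subtype /TrueType >>"),
    DumpedObject("obj22.js", 120, "<< /S /JavaScript /JS 23 0 R >>"),
]
print([o.name for o in triage(dump)])
```

With these made-up inputs the JavaScript object sorts to the front and the large font drops behind it, which mirrors the order I should have followed from the start.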
When NOT to Use pdfdumper
Conversely, there are cases where I wouldn’t use this tool.
Cases to Skip It
- PDF is just an image container
If file shows embedded data is primarily JPEG, pdfimages is sufficient.
- Problem statement hints at “viewer exploit”
Dynamic analysis is needed, so trace PDF.js or Adobe Reader behavior in a sandbox environment.
- Web category with upload functionality
Prioritize looking for validation bugs in upload processing over PDF structure analysis.
When to Switch Tools
- No JS found → Switch to pdf-parser.py for detailed analysis
- Many streams → Format with qpdf --qdf first, then analyze
- Encrypted objects → Confirm decryption method first
CTF Decision Comparison Table
| Condition | pdfdumper | pdf-parser.py | binwalk |
|---|---|---|---|
| Initial speed | ◎ One-shot dump | △ Analysis-focused | ○ |
| Structure visualization | ◎ | ◎ | × |
| Rabbit hole resistance | ○ | △ | × |
| Stego-focused problems | △ | △ | ◎ |
| JS analysis | ◎ | ◎ | × |
Legend: ◎ excellent, ○ good, △ limited, × not suitable.
Go-to decision
- Single PDF / structure unknown → pdfdumper only
Don’t use when
- Suspected ZIP disguise → binwalk
- Object-level scrutiny needed → pdf-parser.py
The Moment I Understood PDF Internal Structure
This challenge deepened my understanding of PDF structure a bit:
- xref table: Manages object position information
- indirect object: Independent data referenceable in 1 0 obj format
- stream compression (FlateDecode): Most data is compressed, so raw data isn’t readable
- JavaScript embedding spec: Tied to actions via the /JS key
- Difference between fonts and embedded data: Fonts are /Font type, data is /EmbeddedFile type
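To make the FlateDecode point concrete, here is a minimal Python sketch that finds stream bodies in raw PDF bytes and tries to inflate each one with zlib. It assumes plain Flate streams with no predictor, no chained filters, and no encryption. It also locates streams by keyword instead of honoring /Length, which a real parser would not do.

```python
import re
import zlib

# Body between "stream" and "endstream"; any trailing end-of-line stays in
# the capture and is ignored by the decompressor below.
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the decompressed body of every stream that zlib can inflate."""
    decoded = []
    for match in STREAM.finditer(pdf_bytes):
        try:
            # decompressobj tolerates bytes left over after the zlib data ends
            decoded.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data, or a filter this sketch does not handle
    return decoded


# Hypothetical object built for this example: a Flate-compressed JS stream.
payload = b'var encoded = "ZmxhZ3tQREZfanNfc3RyZWFtX2hpZGRlbl9kYXRhfQ==";'
body = zlib.compress(payload)
pdf = (
    b"23 0 obj\n<< /Filter /FlateDecode /Length "
    + str(len(body)).encode()
    + b" >>\nstream\n"
    + body
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(pdf))
```

Compression like this is one plausible reason a plain strings pass shows nothing useful, and it is the layer that tools such as qpdf --qdf remove before you read the file in an editor.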
However, even knowing the theory, reading every dumped object in full wastes time.
In CTF, “triage over scrutiny” is key. pdfdumper was a tool that accelerated that triage process.
Shortest Action Plan for Next Time
If I encounter a similar PDF challenge again, I’ll follow this flow:
bash
# 1. File check
file challenge.pdf
# 2. Dump all objects with pdfdumper
pdfdumper challenge.pdf
# 3. Check by priority
# - Check for /JS
# - Check for /EmbeddedFile
# - Check for images
# 4. Switch if no results in 10 minutes
# - Detailed analysis with pdf-parser.py
# - Format with qpdf and read in text editor
You don’t need to make the perfect choice from the start. Cutting your losses and switching after 10 minutes is what I consider practical decision-making in CTF.
This challenge wasn’t so much about learning how to use a tool as about learning “when and what to use”: the decision criteria.
Next time I encounter a similar situation, I won’t hesitate.
Further Reading
If you’re looking to expand your CTF forensics toolkit beyond PDF analysis, I’ve written a comprehensive guide covering the essential tools you’ll need. Check out CTF Forensics Tools: The Ultimate Guide for Beginners for a complete overview of forensics techniques and tooling strategies.
Beyond pdfdumper, there are several other forensics scenarios you’ll commonly encounter in CTF competitions. When you’re dealing with password-protected ZIP files, my article on zip2john in CTF: Extracting ZIP Passwords and Common Challenge Patterns walks through the practical approach to cracking them. For challenges involving QR codes or barcodes hidden in images or documents, zbarimg in CTF: QR/Barcode Decoding Techniques and Common Challenge Patterns covers the detection and extraction workflow. And if you find yourself facing image-based steganography challenges, steghide in CTF: How to Hide and Extract Data from Files demonstrates the decision-making process behind using this classic stego tool.
You can find more CTF technique breakdowns and tool deep-dives at alsavaudomila.com.