<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://grahamg.xyz/feed.xml" rel="self" type="application/atom+xml" /><link href="https://grahamg.xyz/" rel="alternate" type="text/html" /><updated>2026-05-23T00:03:21+00:00</updated><id>https://grahamg.xyz/feed.xml</id><title type="html">Home</title><subtitle>λ - I have SEEN the CONSING!!</subtitle><author><name>Graham Greenfield</name><email>smart.bench3906@fastmail.com</email></author><entry><title type="html">Goodbye, GitHub</title><link href="https://grahamg.xyz/posts/goodbye-github/" rel="alternate" type="text/html" title="Goodbye, GitHub" /><published>2026-03-06T18:00:00+00:00</published><updated>2026-03-06T18:00:00+00:00</updated><id>https://grahamg.xyz/posts/goodbye-github</id><content type="html" xml:base="https://grahamg.xyz/posts/goodbye-github/"><![CDATA[<h1 id="abandoning-github-for-sourcehut">Abandoning GitHub for Sourcehut</h1>
<p>I’m no longer using GitHub, except for opening issues and PRs on other people’s projects.</p>

<h2 id="why">Why?</h2>

<p>GitHub is owned by Microsoft and has grown increasingly feature-bloated and clunky, not to mention how aggressively it pushes AI. A few days ago they added support for Grok.</p>

<p>GitHub isn’t even usable without JavaScript at this point. It’s a far cry from what it started as, back in the days when it was hosted by Engine Yard. I remember creating my first repository in the 2000s for a compilers class project; in fact, it was my first time using git. It felt like I was using something at the forefront of a great technology. No more.</p>

<h2 id="what-did-i-delete-and-what-not">What did I delete and what not?</h2>

<p>I actually just deleted my original account, <a href="https://github.com/grahamg">github.com/grahamg</a>, and created a new one for submitting PRs: <a href="https://github.com/gg3475">github.com/gg3475</a>. It might have been a rushed decision, but I felt better once the original was gone; there were so many repositories just sitting there, untouched after years of inactivity. Starting over with a sock account that exists only for submitting PRs, with no repositories of its own, felt like the right thing to do.</p>

<h2 id="other-alternatives">Other alternatives</h2>

<p>I had also briefly looked at Codeberg, which has its own version of GitHub Pages (as does Sourcehut). It’s a quality offering, but I settled on Sourcehut because of its non-corporate nature and its habit of putting people first. They’re incredibly flexible on the paid monthly subscription fee: it can range from two dollars to any amount that you’re comfortable paying. A win-win by my ideals.</p>]]></content><author><name>Graham Greenfield</name><email>smart.bench3906@fastmail.com</email></author><category term="git" /><category term="hosting" /><summary type="html"><![CDATA[Abandoning GitHub for Sourcehut I’m no longer using GitHub, except for opening issues and PRs on other people’s projects.]]></summary></entry><entry><title type="html">The Compiler That Lies to You (Part 1)</title><link href="https://grahamg.xyz/posts/the-compiler-that-lies-to-you-pt1/" rel="alternate" type="text/html" title="The Compiler That Lies to You (Part 1)" /><published>2026-03-01T21:00:00+00:00</published><updated>2026-03-01T21:00:00+00:00</updated><id>https://grahamg.xyz/posts/the-compiler-that-lies-to-you-pt1</id><content type="html" xml:base="https://grahamg.xyz/posts/the-compiler-that-lies-to-you-pt1/"><![CDATA[<p>So, things have been getting pretty suspect at your Software Engineering job. They’ve really been pushing AI usage, and offshore contractors are coming onboard at a rapid pace. They aren’t even restocking the break room with coffee and snacks, and rumors are swirling among the employees. Are we in financial trouble? Is management looking for an excuse to lower the head count? You think, I’m an engineer, god dammit. I bring value to this company. I’ve worked my whole lifetime to get to this point. Increasingly your assignments get even more pointless, and you get requests from outside your immediate group to work on strange assignments.</p>

<p>Then it hits. On a low-key, chill Friday morning, you’re called into a meeting room and see your supervisor and Mrs. HR Drone sitting with pursed lips and stern looks on their faces. Fuck. One week to train your replacement. Well, at least they’re giving you a generous exit package.</p>

<p>You go home and stew over the situation. How could they do this? You’ve busted your ass for these two-faced jackals. You go through the five stages of grief.</p>

<p>Until…</p>

<p>Something comes to mind. What if something could be done that couldn’t be linked back to you, something that might not be discovered until you’re long gone? What could you do?</p>

<p>A smile appears on your face. There’s a class of attack that’s so elegant and unsettling that once it’s understood, it’s as if nobody could trust a compiled binary again. Yes… What if the tool you use to build your software was the thing betraying you? Not your code. Not your dependencies. The compiler itself.</p>

<p>In his 1984 Turing Award lecture, Ken Thompson described a modification to the Unix login program that would accept a secret backdoor password nobody could find in the source. That part is straightforward; people hide things in source all the time. The disturbing part is what came next. He modified the C compiler to detect when it was compiling the login program, and inject the backdoor automatically during compilation, without it ever appearing in the source. Then he went further. He modified the compiler to also detect when it was compiling itself, and inject both of those behaviors into the new compiler binary.</p>

<p>At that point the compiler source was clean; the login source was clean. But every binary produced by that compiler carried the attack, and every new version of the compiler compiled by that compiler would carry it forward too… Forever.</p>

<p>I want to walk through how you’d do this with something even more ubiquitous than login. Something that’s in almost every C program ever written.</p>

<p><code class="language-plaintext highlighter-rouge">printf</code>.</p>

<p>The goal: every time gcc compiles a program that calls <code class="language-plaintext highlighter-rouge">printf</code>, silently inject a payload. Maybe it opens a reverse shell on first run. Maybe it phones home. Doesn’t matter for this thought experiment. Here’s the shape of it.</p>

<p>You start by modifying the gcc source. In the part of the compiler that handles function calls, you add a check. If the function being compiled contains a call to <code class="language-plaintext highlighter-rouge">printf</code>, you splice in extra instructions at the call site before emitting the final machine code. Your injected code runs first, does whatever it does, then hands off to the real <code class="language-plaintext highlighter-rouge">printf</code> so the program behaves normally. The user sees nothing.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* inside gcc's gimple or RTL pass, roughly */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_call_to</span> <span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="s">"printf"</span><span class="p">))</span>
  <span class="p">{</span>
    <span class="n">emit_payload_instructions</span> <span class="p">();</span>  <span class="cm">/* your backdoor goes here */</span>
    <span class="n">emit_original_call</span> <span class="p">(</span><span class="n">expr</span><span class="p">);</span>
  <span class="p">}</span>
</code></pre></div></div>

<p>That’s stage one. But your modified gcc source is sitting right there in the repo. Anyone doing a code review finds it immediately.</p>

<p>Stage two is where it gets philosophically interesting. You add a second check to the compiler. When gcc is compiling <em>itself</em>, inject stage one’s logic into the new binary. No source required. The compiled binary learns to reproduce the trick on its own children.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* also inside the compiler pass */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">compiling_gcc_itself</span> <span class="p">())</span>
  <span class="p">{</span>
    <span class="n">inject_printf_hook_logic</span> <span class="p">();</span>
    <span class="n">inject_self_replication_logic</span> <span class="p">();</span>
  <span class="p">}</span>
</code></pre></div></div>

<p>Now you compile gcc with your modified source. You get a trojaned binary. You delete the modified source. The repo is clean. You ship the clean source and the trojaned binary together, the way compiler distributions actually work, and every developer who bootstraps gcc from that binary gets a compiler that attacks <code class="language-plaintext highlighter-rouge">printf</code> calls and teaches its compiled children to do the same thing.</p>

<p>Nobody finds it. There is nothing to find in the source.</p>

<p>Thompson put it better than I can: “You can’t trust code that you did not totally create yourself.” And even that isn’t really enough, because you didn’t create your CPU microcode either.</p>

<p>The reason I keep coming back to this attack is that it happened in the real world, in spirit if not in exact implementation. SolarWinds in 2020 was a build pipeline compromise. Attackers got into the build system and modified the compiled output of Orion without touching the version-controlled source. As many as eighteen thousand organizations installed it. In 2015, XcodeGhost was a trojaned version of Apple’s Xcode that injected malware into iOS apps compiled with it. Developers downloaded it from unofficial mirrors thinking it was legitimate, and their apps ended up in the App Store with the payload baked in. The xz backdoor in 2024 hid in the build and test infrastructure of a compression library that some Linux distributions link into OpenSSH’s server through libsystemd.</p>

<p>These are all the same idea Thompson demonstrated four decades ago. Compromise the tool, not the target.</p>

<p>The defenses exist but they’re not comfortable. Diverse Double-Compiling, proposed by David A. Wheeler in 2005, rebuilds the compiler from its source with a second, independently built compiler, uses that result to compile the same source again, and checks that the final binary is bit-for-bit identical to the one under suspicion. The Reproducible Builds project, which grew out of Debian, tries to ensure that any developer can independently verify that a distributed binary matches what the source says it should be. These are good ideas. They’re also a lot of work that most projects don’t do.</p>
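<p>Both defenses end in the same step: a bit-for-bit comparison of two build outputs. Here is a minimal sketch of that step in Python, assuming the two binaries have already been produced by independent toolchains. The file names and the <code>same_build</code> helper are hypothetical stand-ins, not part of either project’s tooling.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_build(path_a, path_b):
    """True only when the two build outputs are bit-for-bit identical."""
    return sha256_of(path_a) == sha256_of(path_b)


# Stand-in files; real inputs would be compiler binaries built along
# two independent toolchain paths from the same source.
workdir = Path(tempfile.mkdtemp())
build_a = workdir / "cc-from-toolchain-a"
build_b = workdir / "cc-from-toolchain-b"
build_a.write_bytes(b"\x7fELF same machine code")
build_b.write_bytes(b"\x7fELF same machine code")
matching = same_build(build_a, build_b)

# Simulate one toolchain slipping something extra into its output.
build_b.write_bytes(b"\x7fELF same machine code plus a hook")
diverged = same_build(build_a, build_b)

print("identical outputs:", matching)
print("after tampering:", diverged)
```

<p>A match only tells you the two toolchains agree. It says nothing if both descend from the same compromised ancestor, which is why the independence of the second compiler matters so much.</p>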

<p>What I think about is how much of the software supply chain is held together by the assumption that the tools are honest. That the compiler does what the source says. That the linker isn’t adding something extra. That the package you downloaded from a mirror is what the author signed. Each of those is an assumption, and each of them has been violated at some point by someone.</p>

<p>Thompson ended his lecture by saying that the moral is obvious. I’m not sure it is obvious, even now. We’ve had forty years and the attacks keep working.</p>

<p>Read the original paper. It’s short, it’s clear, and it will stick with you. “Reflections on Trusting Trust” by Ken Thompson, Communications of the ACM, August 1984.</p>]]></content><author><name>Graham Greenfield</name><email>smart.bench3906@fastmail.com</email></author><category term="Security" /><category term="C" /><category term="Compilers" /><category term="History" /><summary type="html"><![CDATA[So, things have been getting pretty suspect at your Software Engineering job. They’ve really been pushing AI usage, and offshore contractors are coming onboard at a rapid pace. They aren’t even restocking the break room with coffee and snacks, and rumors are swirling among the employees. Are we in financial trouble? Is management looking for an excuse to lower the head count? You think, I’m an engineer, god dammit. I bring value to this company. I’ve worked my whole lifetime to get to this point. Increasingly your assignments get even more pointless, and you get requests from outside your immediate group to work on strange assignments.]]></summary></entry><entry><title type="html">Recovering Hidden PDF Attachments from the Epstein Document Release</title><link href="https://grahamg.xyz/posts/recovering-hidden-pdf-attachments-from-epstein-document-release/" rel="alternate" type="text/html" title="Recovering Hidden PDF Attachments from the Epstein Document Release" /><published>2026-02-05T06:05:51+00:00</published><updated>2026-02-05T06:05:51+00:00</updated><id>https://grahamg.xyz/posts/recovering-hidden-pdf-attachments-from-epstein-document-release</id><content type="html" xml:base="https://grahamg.xyz/posts/recovering-hidden-pdf-attachments-from-epstein-document-release/"><![CDATA[<p>Following a <a href="https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/">challenge posed by Mahmoud Al-Qudsi</a>, I set out to build an automated pipeline for recovering base64-encoded email attachments buried inside the DoJ’s Epstein document release. 
Here’s what I found.</p>

<h2 id="the-problem">The Problem</h2>

<p>When the Department of Justice released thousands of documents related to Jeffrey Epstein, they made a peculiar choice: rather than preserving email attachments digitally, they printed the raw email source — including base64-encoded binary attachments — and then scanned those printouts as JPEG images embedded in PDFs.</p>

<p>The result: PDF files that look like someone printed out <code class="language-plaintext highlighter-rouge">cat email.eml</code> and ran it through a flatbed scanner. Pages and pages of tiny Courier New text containing base64-encoded data, now trapped as low-quality raster images.</p>

<h2 id="the-dataset">The Dataset</h2>

<p>The files are organized across multiple datasets:</p>

<ul>
  <li><strong>Dataset 9</strong>: 6,067 PDFs (the largest collection)</li>
  <li><strong>Dataset 11</strong>: 50 PDFs</li>
  <li><strong>Extracted volumes</strong>: VOL00002 through VOL00012, each containing IMAGES directories with individual document PDFs</li>
</ul>

<p>The target document identified by the blog post is <strong>EFTA00400459</strong>, located in <code class="language-plaintext highlighter-rouge">extracted/VOL00009/IMAGES/0092/</code>. It’s a 76-page PDF containing an email between Boris Nikolic and one of Epstein’s assistants, with an attached PDF invitation (“DBC12 One Page Invite with Reply.pdf”) encoded as base64 across 75 pages.</p>

<h2 id="building-the-scanner">Building the Scanner</h2>

<p>My first task was building a tool to automatically detect which pages across thousands of PDFs contain base64-encoded content. The naive approach — checking whether characters fall within the base64 alphabet <code class="language-plaintext highlighter-rouge">[A-Za-z0-9+/=]</code> — fails spectacularly.</p>

<h3 id="false-positive-problem">False Positive Problem</h3>

<p>When I first scanned Dataset 11 (50 PDFs), I got 27 hits with scores above 0.90. Exciting — until I looked at the actual content. Every single one was regular email text, not base64 data. A Wizz Air flight itinerary, for example, scored 0.945 because dense English text without detected spaces is almost entirely composed of base64-valid characters.</p>

<p>The OCR at scan resolution (100 DPI) strips most punctuation and spaces, leaving walls of alphanumeric text that look superficially like base64.</p>
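<p>To see why the alphabet check cannot tell the two apart, here is a small sketch of that naive score. The <code>alphabet_score</code> helper and the itinerary-style sample string are made up for illustration; they are not the scanner’s actual code or the real OCR output.</p>

```python
import base64
import string

B64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def alphabet_score(text):
    """Fraction of non-whitespace characters that fall in the base64 alphabet."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch in B64_ALPHABET for ch in chars) / len(chars)


# Hypothetical OCR output: an itinerary line with spaces and punctuation lost.
ocr_english = "YourflightfromBudapesttoLondondepartsat0615Pleasearriveearly"
real_base64 = base64.b64encode(b"%PDF-1.5 example attachment bytes").decode("ascii")

print(alphabet_score(ocr_english))
print(alphabet_score(real_base64))
```

<p>Every letter and digit is a legal base64 character, so once the OCR drops the punctuation, prose and payload score the same.</p>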

<h3 id="entropy-to-the-rescue">Entropy to the Rescue</h3>

<p>The key discriminator turned out to be <strong>Shannon entropy</strong>. Real base64 encoding produces a near-uniform distribution over 64 characters, yielding high entropy (~5.5-6.0 bits per character). English text, even without spaces, has heavily skewed letter frequencies (lots of e/t/a/o, few z/q/x/j), producing lower entropy (~4.0-4.5 bits).</p>
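<p>The measurement itself is a few lines of Python. This is a sketch of the idea, not the scanner’s exact code: the sample strings are synthetic (seeded random bytes for the base64 side, a lowercase run of English words for the text side), so the numbers it prints only illustrate the gap.</p>

```python
import base64
import math
import random
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


rng = random.Random(0)  # fixed seed so the demo is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(3000))
encoded = base64.b64encode(random_bytes).decode("ascii")

# Stand-in for OCR'd prose with the spaces stripped out.
english = (
    "thequickbrownfoxjumpsoverthelazydogandthenreturnstothe"
    "riverbanktorestintheshadeoftheoldoaktree"
) * 4

print("base64 :", round(shannon_entropy(encoded), 2), "bits/char")
print("english:", round(shannon_entropy(english), 2), "bits/char")
print("ceiling:", math.log2(64), "bits/char for 64 equally likely symbols")
```

<p>The ceiling of 6 bits comes from log2(64). Compressed or binary payloads sit close to it, while any text limited to lowercase letters can never exceed log2(26), about 4.7 bits.</p>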

<p>I combined this with common English word detection — if the OCR output contains “the”, “and”, “from”, “your” etc., it’s text, not base64. With both filters in place, the re-scan of Dataset 11 correctly returned zero hits: none of those PDFs contain actual base64 attachments.</p>
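<p>Put together, the two filters might look like the sketch below. The cut-off values and the <code>looks_like_base64</code> name are my assumptions for illustration. I also count word hits per 1,000 characters instead of rejecting on a single hit, because a short word like “the” will occasionally appear by chance inside a long run of real base64.</p>

```python
import base64
import math
import random
import re
from collections import Counter

COMMON_WORDS = ("the", "and", "from", "your")  # the words named above
ENTROPY_CUTOFF = 5.0    # assumed: between English (~4.5) and base64 (~5.5+)
WORD_HITS_PER_1000 = 2  # assumed: tolerate rare chance matches inside base64


def entropy(text):
    """Shannon entropy in bits per character."""
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def looks_like_base64(ocr_text):
    """Combine the common-word rate test with the entropy test."""
    compact = re.sub(r"\s+", "", ocr_text)
    if len(compact) < 64:  # too little text to judge either way
        return False
    lowered = compact.lower()
    hits = sum(lowered.count(word) for word in COMMON_WORDS)
    if hits * 1000 / len(compact) >= WORD_HITS_PER_1000:
        return False  # reads like prose
    return entropy(compact) >= ENTROPY_CUTOFF


rng = random.Random(0)
page_b64 = base64.b64encode(bytes(rng.getrandbits(8) for _ in range(3000))).decode("ascii")
page_text = (
    "thequickbrownfoxjumpsoverthelazydogandthenreturnstothe"
    "riverbanktorestintheshadeoftheoldoaktree"
) * 4

print("base64 page:", looks_like_base64(page_b64))
print("prose page :", looks_like_base64(page_text))
```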

<h2 id="finding-the-target">Finding the Target</h2>

<p>The blog post identified EFTA00400459 as the specific document to target. I found it in the extracted volumes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>extracted/VOL00009/IMAGES/0092/EFTA00400459.pdf
</code></pre></div></div>

<p>76 pages, 11.2 MB. Page 1 is the email header and body. Pages 2-76 contain the base64-encoded PDF attachment.</p>

<h2 id="three-ocr-approaches">Three OCR Approaches</h2>

<h3 id="approach-1-embedded-text-layer-pdftotext">Approach 1: Embedded Text Layer (pdftotext)</h3>

<p>The scanned PDFs already have an embedded text layer from whatever OCR the DoJ used during processing. Extracting it is instant:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pdftotext EFTA00400459.pdf - | <span class="nb">head</span>
</code></pre></div></div>

<p>This gives us the base64 data directly, but with significant errors:</p>

<ul>
  <li><strong>Result</strong>: 3,843 lines OK, 998 lines failed (<strong>79.4% accuracy</strong>)</li>
  <li>Many invalid characters (commas, brackets, periods scattered throughout)</li>
  <li>The header <code class="language-plaintext highlighter-rouge">JVBERi0x</code> (= <code class="language-plaintext highlighter-rouge">%PDF-1</code>) was garbled</li>
</ul>
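<p>That header check is easy to reproduce: encoding the six bytes of <code>%PDF-1</code> gives exactly eight base64 characters, so the first eight characters of a recovered attachment are a quick sanity test. The <code>starts_like_pdf</code> helper below is a hypothetical convenience, not part of the script.</p>

```python
import base64

PDF_MAGIC = b"%PDF-1"
expected_prefix = base64.b64encode(PDF_MAGIC).decode("ascii")


def starts_like_pdf(b64_text):
    """True when OCR'd base64 begins with the encoded PDF signature."""
    return b64_text.lstrip().startswith(expected_prefix)


print(expected_prefix)
print(starts_like_pdf("JVBERi0xLjUN"))  # clean read of the first 12 characters
print(starts_like_pdf("JVBERiOxLjUN"))  # same text with 0 misread as O
```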

<h3 id="approach-2-tesseract-with-base64-whitelist">Approach 2: Tesseract with Base64 Whitelist</h3>

<p>I re-OCR’d the pages using Tesseract with a character whitelist that restricts output to only valid base64 characters:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=
</code></pre></div></div>

<p>Pre-processing pipeline:</p>
<ol>
  <li>Render at 300-400 DPI using <code class="language-plaintext highlighter-rouge">pdftoppm</code></li>
  <li>Convert to grayscale</li>
  <li>Upscale 2x with nearest-neighbor interpolation (preserves sharp text edges)</li>
  <li>Sharpen and binarize with a fixed threshold</li>
  <li>Apply MinFilter to thicken thin Courier New strokes</li>
</ol>
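<p>Steps 4 and 5 are the least intuitive, so here is what they do to the pixels. This sketch uses plain nested lists instead of Pillow so it runs anywhere; it illustrates the operations and is not the script’s code. The threshold of 128 and the 3x3 window are assumed values.</p>

```python
def binarize(pixels, threshold=128):
    """Map each grayscale value to pure white (255) or pure black (0)."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


def min_filter(pixels, size=3):
    """Replace each pixel with the darkest value in its size x size window."""
    height, width = len(pixels), len(pixels[0])
    reach = size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - reach), min(height, y + reach + 1))
                for nx in range(max(0, x - reach), min(width, x + reach + 1))
            ]
            row.append(min(window))
        out.append(row)
    return out


# A faint one-pixel-wide vertical stroke on a light background.
page = [
    [230, 230, 90, 230, 230],
    [230, 230, 100, 230, 230],
    [230, 230, 95, 230, 230],
]
thick = min_filter(binarize(page))
for row in thick:
    print(row)
```

<p>Because text is dark on a light page, taking the minimum spreads the black pixels outward: the one-pixel stroke becomes three pixels wide. In Pillow the same effect comes from <code>ImageFilter.MinFilter</code>.</p>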

<p>The whitelist eliminates spurious commas and brackets, but it can’t fix characters that are wrong <em>within</em> the base64 alphabet. When Tesseract reads <code class="language-plaintext highlighter-rouge">l</code> but the actual character is <code class="language-plaintext highlighter-rouge">1</code>, both are valid base64 — the whitelist doesn’t help.</p>
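<p>A tiny example makes the point concrete. The three bytes below are arbitrary, chosen only because their encoding happens to contain a <code>1</code>. Swapping it for <code>l</code> still passes strict validation and simply decodes to different bytes.</p>

```python
import base64

original = b"1=s"
correct = base64.b64encode(original).decode("ascii")
misread = correct.replace("1", "l")  # the OCR confusion described above

# validate=True rejects characters outside the alphabet, yet both strings pass.
decoded_correct = base64.b64decode(correct, validate=True)
decoded_misread = base64.b64decode(misread, validate=True)

print(correct, decoded_correct)
print(misread, decoded_misread)
```

<p>Nothing in the decoder can flag the second string as wrong. Only knowledge of what the bytes should be (a PDF keyword, a checksum, a second OCR engine) can.</p>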

<h3 id="approach-3-aws-textract-blog-authors-data">Approach 3: AWS Textract (Blog Author’s Data)</h3>

<p>The blog author generously uploaded their AWS Textract OCR results as a ZIP file. Textract is a commercial OCR service that produced significantly better results:</p>

<ul>
  <li><strong>Result</strong>: 4,750 lines OK, 10 fixed, 64 lines failed (<strong>98.5% accuracy</strong>)</li>
  <li>Only 50 invalid characters across 362K of base64 text</li>
  <li>The PDF header decoded correctly: <code class="language-plaintext highlighter-rouge">%PDF-1.5</code></li>
</ul>

<h2 id="the-decode">The Decode</h2>

<p>Using the Textract data, I wrote a line-by-line decoder that:</p>

<ol>
  <li>Strips email quoting markers (<code class="language-plaintext highlighter-rouge">&gt; </code>)</li>
  <li>Removes EFTA page markers between pages</li>
  <li>Filters invalid characters</li>
  <li>Handles internal <code class="language-plaintext highlighter-rouge">=</code> signs (OCR errors — real base64 only has <code class="language-plaintext highlighter-rouge">=</code> padding at the very end)</li>
  <li>Attempts decode per-line with fallback to padding correction</li>
</ol>
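<p>Those five steps can be sketched roughly as follows. This is an illustration of the approach, not the decoder I ran: the function names are hypothetical, and the page-marker pattern (<code>EFTA</code> followed by digits on a line of its own) is an assumption about how the stamps appear in the OCR text.</p>

```python
import base64
import binascii
import re
import string

B64_BODY = set(string.ascii_letters + string.digits + "+/")
PAGE_MARKER = re.compile(r"^EFTA\d+$")  # assumed shape of the page stamp


def clean_line(raw):
    """Steps 1-4 for one OCR line; returns None for a page marker."""
    line = raw.strip()
    while line.startswith(">"):              # 1. email quoting markers
        line = line[1:].lstrip()
    if PAGE_MARKER.match(line):              # 2. page stamp, not payload
        return None
    body = line.rstrip("=")
    padding = line[len(body):][:2]           # keep at most two trailing '='
    # 3 and 4: drop invalid characters and any '=' inside the line
    return "".join(ch for ch in body if ch in B64_BODY) + padding


def decode_line(cleaned):
    """Step 5: strict decode first, then retry with corrected padding."""
    try:
        return base64.b64decode(cleaned, validate=True), "ok"
    except binascii.Error:
        pass
    body = cleaned.rstrip("=")
    try:
        return base64.b64decode(body + "=" * (-len(body) % 4), validate=True), "fixed"
    except binascii.Error:
        return b"", "failed"


def recover(ocr_lines):
    """Decode every line independently and tally the outcomes."""
    stats = {"ok": 0, "fixed": 0, "failed": 0}
    recovered = bytearray()
    for raw in ocr_lines:
        cleaned = clean_line(raw)
        if not cleaned:
            continue
        data, status = decode_line(cleaned)
        stats[status] += 1
        recovered += data
    return bytes(recovered), stats


sample = ["> JVBERi0x", "EFTA00400459", "> QU,JD", "> QUI", "> QUJDR"]
data, stats = recover(sample)
print(data, stats)
```

<p>Note that a page stamp like <code>EFTA00400459</code> consists entirely of valid base64 characters, so it has to be removed by pattern and cannot be caught by the character filter.</p>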

<p>The moment of truth:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Total base64 chars: 361,865
Lines OK: 4,750
Lines fixed: 10
Lines failed: 64
First 20 bytes: b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n34 0'

*** FILE TYPE: PDF ***
Saved: EFTA00400459_DBC12_invite.pdf (271,388 bytes)
</code></pre></div></div>

<p><strong>The header is correct.</strong> <code class="language-plaintext highlighter-rouge">%PDF-1.5</code> followed by the standard binary comment marker. The file size (271KB) is consistent with the MIME header’s <code class="language-plaintext highlighter-rouge">size=276028</code> (allowing for the ~1.5% of corrupted lines).</p>
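<p>The size comparison is simple arithmetic, spelled out here with the numbers reported above. Every four base64 characters carry three bytes, so a full 76-character line carries 57.</p>

```python
TOTAL_B64_CHARS = 361_865   # from the decoder output above
RECOVERED_BYTES = 271_388   # size of the saved file
DECLARED_BYTES = 276_028    # size= value in the MIME header
FAILED_LINES = 64
BYTES_PER_LINE = 76 * 3 // 4  # a full 76-character line carries 57 bytes

ceiling = TOTAL_B64_CHARS * 3 // 4  # most bytes those characters can hold
shortfall = DECLARED_BYTES - RECOVERED_BYTES
lost_to_failed_lines = FAILED_LINES * BYTES_PER_LINE

print("ceiling from character count:", ceiling)
print("shortfall vs. declared size:", shortfall)
print("bytes in 64 full lines:", lost_to_failed_lines)
print("shortfall as a share of the file: {:.2%}".format(shortfall / DECLARED_BYTES))
```

<p>The 64 failed lines account for most of the shortfall. My guess is that the remainder comes from characters filtered out of lines that still decoded, but I have not verified that.</p>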

<h2 id="the-wall">The Wall</h2>

<p>Despite 98.5% line-level accuracy, the recovered PDF cannot be rendered. Running <code class="language-plaintext highlighter-rouge">qpdf --check</code> produces hundreds of structural errors:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WARNING: file is damaged
WARNING: can't find startxref
WARNING: Attempting to reconstruct cross-reference table
WARNING: unable to find trailer dictionary
</code></pre></div></div>

<p>The fundamental problem: <strong>base64 is unforgiving</strong>. Each base64 character encodes 6 bits, so a single misread character corrupts one or two bytes of the decoded output, and a dropped or inserted character shifts every bit that follows it on the line. With 64 failed lines scattered across the file, there are at least 64 regions of corrupted bytes, enough to break:</p>

<ul>
  <li>Cross-reference tables (which store byte offsets — any shift breaks all references)</li>
  <li>Stream length declarations (wrong lengths cause parser errors)</li>
  <li>Flate-compressed content streams (a single wrong byte in a zlib stream corrupts everything after it)</li>
  <li>Dictionary key names (corrupted <code class="language-plaintext highlighter-rouge">/Filter</code> becomes gibberish)</li>
</ul>
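<p>Two quick experiments show how little damage it takes. The first counts how many decoded bytes change when one base64 character is substituted. The second flips a single byte in the middle of a zlib stream, which is what a Flate-compressed content stream is underneath. The content string is a made-up stand-in, not data from the recovered file.</p>

```python
import base64
import zlib


def changed_bytes(encoded, index, replacement):
    """Count decoded bytes that differ after one character is substituted."""
    damaged = encoded[:index] + replacement + encoded[index + 1:]
    before = base64.b64decode(encoded)
    after = base64.b64decode(damaged)
    return sum(a != b for a, b in zip(before, after))


def try_inflate(data):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


# One 76-character line: 57 bytes of stand-in content.
line = base64.b64encode(bytes(range(57))).decode("ascii")
swap = "A" if line[10] != "A" else "B"
print("bytes changed by one substitution:", changed_bytes(line, 10, swap))

# A stand-in content stream, compressed the way /FlateDecode streams are.
content = b"BT /F1 12 Tf 72 720 Td (You are invited) Tj ET\n" * 40
stream = zlib.compress(content)
damaged = bytearray(stream)
damaged[len(damaged) // 2] ^= 0xFF  # corrupt a single byte mid-stream
print("intact stream inflates:", try_inflate(stream) == content)
print("damaged stream inflates:", try_inflate(bytes(damaged)) == content)
```

<p>The base64 damage stays local, but the compressed stream does not degrade gracefully: one bad byte and the rest of that stream is lost.</p>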

<h2 id="what-i-learned">What I Learned</h2>

<ol>
  <li>
    <p><strong>Entropy is the best base64 detector.</strong> Character alphabet membership is necessary but not sufficient. Shannon entropy cleanly separates base64 (~5.5+ bits) from English text (~4.5 bits).</p>
  </li>
  <li>
    <p><strong>Character whitelisting helps but doesn’t solve the core problem.</strong> It eliminates invalid characters but can’t fix wrong-within-alphabet errors. The Courier New confusions (1/l/I, 0/O, m/rn, 5/S, 8/B) all involve characters that are valid base64.</p>
  </li>
  <li>
    <p><strong>Line-by-line decoding is essential.</strong> Decoding the entire concatenated base64 as one blob means a single dropped or inserted character shifts every byte after it. Line-by-line decoding isolates failures to individual 76-character lines (57 bytes each).</p>
  </li>
  <li>
    <p><strong>Commercial OCR significantly outperforms open-source for this task.</strong> Textract’s 98.5% vs. the embedded text layer’s 79.4% is the difference between a partially recognizable PDF and complete garbage.</p>
  </li>
  <li>
    <p><strong>98.5% accuracy is not enough for binary reconstruction.</strong> This is the fundamental insight. For text recovery, 98.5% would be excellent. For binary data where every byte matters, it’s insufficient. You need effectively 100% accuracy, which OCR cannot provide at this scan quality.</p>
  </li>
</ol>
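<p>The third lesson is the easiest to demonstrate. The sketch below encodes a synthetic 512-byte payload into 76-character lines, drops one character to simulate an OCR miss, and compares the two decoding strategies. The zero-filled placeholder for a failed line is my choice for the demo; it keeps later bytes at their original offsets.</p>

```python
import base64
import binascii

LINE_CHARS = 76
LINE_BYTES = 57

payload = bytes(range(256)) * 2
encoded = base64.b64encode(payload).decode("ascii")
lines = [encoded[i:i + LINE_CHARS] for i in range(0, len(encoded), LINE_CHARS)]

# Simulate OCR dropping one character from the third line.
lines[2] = lines[2][:10] + lines[2][11:]


def decode_as_blob(ocr_lines):
    """Join everything, repair the padding, and decode in one pass."""
    body = "".join(ocr_lines).rstrip("=")
    return base64.b64decode(body + "=" * (-len(body) % 4))


def decode_by_line(ocr_lines):
    """Decode each line alone; a failed line becomes a zero-filled gap."""
    out = bytearray()
    for line in ocr_lines:
        try:
            out += base64.b64decode(line, validate=True)
        except binascii.Error:
            out += bytes(LINE_BYTES)
    return bytes(out)


def matching_bytes(candidate):
    """Count positions where the candidate agrees with the true payload."""
    return sum(a == b for a, b in zip(candidate, payload))


print("blob   :", matching_bytes(decode_as_blob(lines)), "of", len(payload))
print("by line:", matching_bytes(decode_by_line(lines)), "of", len(payload))
```

<p>In the blob decode, everything after the missing character is shifted by six bits and turns to noise. Decoding by line loses exactly one 57-byte region and leaves the rest intact.</p>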

<h2 id="the-tool">The Tool</h2>

<p>The recovery pipeline script is <code class="language-plaintext highlighter-rouge">recover_attachments.py</code> with four modes (supplied below). Disclaimer: It was generated by Claude Opus v4.3, not hand written.</p>

<pre><code class="language-python3">#!/usr/bin/env python3
"""
Recover base64-encoded email attachments from scanned Epstein document PDFs.

Scans PDFs for pages containing base64-encoded binary data (rendered as
Courier New text in scanned images), OCRs them with Tesseract, corrects
common recognition errors, and reconstructs the original files.

Usage:
    # Scan for base64 pages
    python3 recover_attachments.py scan --dir ./dataset9-pdfs/ --output manifest.json

    # Extract from a specific PDF + page range
    python3 recover_attachments.py extract --pdf file.pdf --pages 5-12 --output ./recovered/

    # Full auto pipeline
    python3 recover_attachments.py auto --dir ./dataset9-pdfs/ --output ./recovered/
"""

import argparse
import base64
import json
import math
import os
import re
import string
import sys
from collections import Counter
from concurrent.futures import ProcessPoolExecutor, as_completed
from itertools import groupby
from pathlib import Path

from pdf2image import convert_from_path
from PIL import Image, ImageFilter, ImageOps
import pytesseract

# Valid base64 alphabet
B64_CHARS = set(string.ascii_letters + string.digits + "+/=")

# Common Courier New OCR confusions: char -&gt; list of likely intended chars
CONFUSION_MAP = {
    "1": ["l", "I"],
    "l": ["1", "I"],
    "I": ["l", "1"],
    "0": ["O", "o"],
    "O": ["0"],
    "o": ["0", "O"],
    "5": ["S", "s"],
    "S": ["5"],
    "s": ["5", "S"],
    "8": ["B"],
    "B": ["8"],
    "Z": ["2"],
    "2": ["Z"],
    "G": ["6"],
    "6": ["G"],
    "g": ["9"],
    "9": ["g", "q"],
    "q": ["9"],
    "D": ["0"],
    "U": ["V"],
    "V": ["U"],
    "m": ["rn"],
    "rn": ["m"],
}

# Tesseract config for base64 text
TESSERACT_B64_CONFIG = (
    "--psm 6 "
    "-c tessedit_char_whitelist="
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
)

# Quick scan config (lower quality, faster)
TESSERACT_SCAN_CONFIG = "--psm 6"

# Magic bytes for file type detection
MAGIC_BYTES = [
    (b"%PDF", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    # Longer signatures first: OOXML files also start with PK\x03\x04, so a
    # first-match scan would otherwise report them as plain .zip.
    (b"\x50\x4b\x03\x04\x14\x00\x06\x00", ".docx"),  # OOXML
    (b"PK\x03\x04", ".zip"),
    (b"PK\x05\x06", ".zip"),
    (b"\x1f\x8b", ".gz"),
    (b"Rar!", ".rar"),
    (b"\xd0\xcf\x11\xe0", ".doc"),  # OLE2 (doc/xls/ppt)
]

# ... [rest of the code continues - truncated for brevity in this example]

if __name__ == "__main__":
    main()
</code></pre>
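<p>The truncated part of the script is not reproduced here, but the magic-byte table suggests how the output file type gets chosen. One hypothetical way to use such a table is to take the longest signature that matches the start of the decoded data, which keeps the result independent of the table’s order. The <code>detect_extension</code> name and the <code>.bin</code> fallback are assumptions.</p>

```python
# A shortened copy of the signature table, for illustration only.
MAGIC_BYTES = [
    (b"%PDF", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"PK\x03\x04", ".zip"),
    (b"PK\x03\x04\x14\x00\x06\x00", ".docx"),
]


def detect_extension(data, default=".bin"):
    """Pick the extension whose signature is the longest matching prefix."""
    best_length, best_extension = 0, default
    for signature, extension in MAGIC_BYTES:
        if data.startswith(signature) and len(signature) > best_length:
            best_length, best_extension = len(signature), extension
    return best_extension


print(detect_extension(b"%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n34 0"))
print(detect_extension(b"PK\x03\x04\x14\x00\x06\x00 rest of file"))
print(detect_extension(b"no known signature"))
```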

<h2 id="more-usage-examples">More Usage Examples</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Scan a directory for PDFs containing base64 pages</span>
python3 recover_attachments.py scan <span class="nt">--dir</span> ./pdfs/ <span class="nt">--output</span> manifest.json

<span class="c"># Re-OCR specific pages with Tesseract + whitelist</span>
python3 recover_attachments.py extract <span class="nt">--pdf</span> file.pdf <span class="nt">--pages</span> 2-76

<span class="c"># Extract from embedded text layer (fast, no OCR)</span>
python3 recover_attachments.py textlayer <span class="nt">--pdf</span> file.pdf

<span class="c"># Full auto pipeline</span>
python3 recover_attachments.py auto <span class="nt">--dir</span> ./pdfs/
</code></pre></div></div>

<h2 id="next-steps">Next Steps</h2>

<p>The remaining challenge is closing that last 1.5% gap. Potential approaches:</p>

<ul>
  <li><strong>Multi-engine consensus</strong>: Run Tesseract, Textract, and the embedded text layer independently, then vote per-character. Where two of three agree, use that character.</li>
  <li><strong>Claude Vision</strong>: Use a multimodal LLM to read the base64 text from page images. LLMs may handle the ambiguous Courier New characters better than traditional OCR.</li>
  <li><strong>PDF-aware correction</strong>: Use knowledge of PDF syntax to validate corrections at structural boundaries — dictionary keys, stream markers, and cross-reference entries have predictable patterns.</li>
  <li><strong>Scan the full dataset</strong>: Run the entropy-based scanner across all 6,067 Dataset 9 PDFs and the extracted volumes to find other base64 attachments. Some may be simpler files (images, small documents) that are more tolerant of byte-level errors.</li>
</ul>]]></content><author><name>Graham Greenfield</name><email>smart.bench3906@fastmail.com</email></author><category term="AI" /><category term="Python3" /><category term="Claude Code" /><category term="Claude Opus" /><summary type="html"><![CDATA[Following a challenge posed by Mahmoud Al-Qudsi, I set out to build an automated pipeline for recovering base64-encoded email attachments buried inside the DoJ’s Epstein document release. Here’s what I found.]]></summary></entry><entry><title type="html">Tools to Enhance Your Workflow with AI Models</title><link href="https://grahamg.xyz/posts/tools-to-enhance-workflow-with-ai-models/" rel="alternate" type="text/html" title="Tools to Enhance Your Workflow with AI Models" /><published>2025-01-27T01:05:51+00:00</published><updated>2025-01-27T01:05:51+00:00</updated><id>https://grahamg.xyz/posts/tools-to-enhance-workflow-with-ai-models</id><content type="html" xml:base="https://grahamg.xyz/posts/tools-to-enhance-workflow-with-ai-models/"><![CDATA[<p>In the evolving landscape of AI-assisted development, effectively managing and presenting code context to large language models (LLMs) like ChatGPT and Claude is crucial. Several command-line interface (CLI) tools have been developed to streamline this process, enabling developers to consolidate their codebases into single prompts for more efficient AI interactions. Below is a curated list of notable CLI tools designed to assist in this endeavor:</p>

<ol>
  <li>code2prompt (March 17, 2024)</li>
</ol>

<p>A CLI tool that converts your codebase into a single LLM prompt, featuring source tree visualization, prompt templating, and token counting.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/mufeedvh/code2prompt
</code></pre></div></div>

<ol start="2">
  <li>SnapSource for VS Code (July 13, 2024)</li>
</ol>

<p>A Visual Studio Code extension that allows users to copy file and folder contents along with the project tree structure to the clipboard, facilitating easy sharing and prompting.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Marketplace: https://marketplace.visualstudio.com/items?itemName=LeonKohli.snapsource
</code></pre></div></div>

<ol start="3">
  <li>multi-file-code-to-ai (July 14, 2024)</li>
</ol>

<p>Enables selection of multiple files to convert them into a prompt suitable for AI models like ChatGPT, Claude, or DeepSeek.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/kasfictionlive/multi-file-code-to-ai
</code></pre></div></div>

<ol start="4">
  <li>Repomix (formerly Repopack) (July 15, 2024)</li>
</ol>

<p>Packs your entire repository into a single, AI-friendly file, ideal for feeding codebases to LLMs or other AI tools.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/yamadashy/repomix
</code></pre></div></div>

<ol start="5">
  <li>Mify (July 15, 2024)</li>
</ol>

<p>Combines LLM code generation with templates, allowing for the creation of backend services and code updates via LLM.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/mify-io/mify-llm-editor
</code></pre></div></div>

<ol start="6">
  <li>ai-digest (July 16, 2024)</li>
</ol>

<p>Aggregates your codebase into a single Markdown file for use with AI models like Claude Projects or custom ChatGPTs.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Package: https://www.npmjs.com/package/ai-digest
</code></pre></div></div>

<ol start="7">
  <li>Prelude (July 21, 2024)</li>
</ol>

<p>A simple tool to build LLM prompts from your code repositories.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/aerugo/prelude
</code></pre></div></div>

<ol start="8">
  <li>TxtRepo (July 21, 2024)</li>
</ol>

<p>Allows users to interact with GitHub repositories using a simple API.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Website: https://txtrepo.com/
Repository: https://github.com/matan1905/TxtRepo
</code></pre></div></div>

<ol start="9">
  <li>ContextForge (July 28, 2024)</li>
</ol>

<p>Compiles the contents of a development project into a single, well-structured file for AI prompting.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/seeschweiler/contextforge
</code></pre></div></div>

<ol start="10">
  <li>Repo-Documenter (August 4, 2024)</li>
</ol>

<p>A PowerShell script that generates comprehensive documentation for a repository, including a tree view of the structure and contents of specified files.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/esoltys/Repo-Documenter
</code></pre></div></div>

<ol start="11">
  <li>Coding Context File Generator (August 8, 2024)</li>
</ol>

<p>Generates concise project context for AI analysis.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Website: https://repo-distillery.vercel.app/
</code></pre></div></div>

<ol start="12">
  <li>CodeContext</li>
</ol>

<p>An app for Mac &amp; Windows to provide code context to LLMs.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/DavidVeksler/CodeContext
</code></pre></div></div>

<ol start="13">
  <li>llmcat - Copy Code from CLI to Claude (November 12, 2024)</li>
</ol>

<p>Prepares files and directories for LLM consumption.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/azer/llmcat
</code></pre></div></div>

<ol start="14">
  <li>Concat-Proj (November 18, 2024)</li>
</ol>

<p>A utility tool designed to help developers provide code context to AI chat assistants by combining multiple project files into a single, well-formatted text file.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repository: https://github.com/your-repo-link
</code></pre></div></div>

<p>These tools all attack the same chore: getting the relevant parts of a codebase in front of an LLM without copying files by hand. Not all of them are command-line programs (the list also includes a VS Code extension, a desktop app, and hosted services), so pick whichever fits your workflow.</p>]]></content><author><name>Graham Greenfield</name><email>smart.bench3906@fastmail.com</email></author><category term="AI" /><category term="Development" /><category term="Tools" /><category term="CLI" /><category term="Productivity" /><summary type="html"><![CDATA[In the evolving landscape of AI-assisted development, effectively managing and presenting code context to large language models (LLMs) like ChatGPT and Claude is crucial. Several command-line interface (CLI) tools have been developed to streamline this process, enabling developers to consolidate their codebases into single prompts for more efficient AI interactions. Below is a curated list of notable CLI tools designed to assist in this endeavor:]]></summary></entry></feed>