Faster Method to Enumerate Heaps on Windows

Faster Method to Enumerate Heaps on Windows - www.SecurityXploded.com

Introduction

This article explains a fast method of enumerating heaps on Windows. Traditional Windows heap enumeration functions are slower and takes lot of time while traversing large number of heap blocks. This article uncovers the reason behind it and shows you a new efficient way of enumerating process heaps based on reverse engineering of Windows heap API functions.

Reason Behind the Research

Some days back I was doing password strength related research on Yahoo Messenger. It used to store the password on the heap and I wrote an sample code using normal heap functions to locate and retrieve the password. The password was basically stored on one of the heap block which was near the end of 60,000th block. So I had to traverse all the 60,000 heap blocks to read the target block and the execution took more than 10 minutes..! I tried running the program on multiple machines but no change. Naturally I was getting irritated as I had to wait for so long just to see the results.

To find a way around this problem, I tried looking on the internet for answers but found nothing. Then I finally resort to finding the truth myself and started reverse engineering the Windows heap functions. Finally after few hours of work, I discovered the reason behind the delay and wrote my own tool which enumerated the complete heaps within few seconds.

Windows Heap Enumeration

Typical heap traversing process on Windows involves following steps

Initialization
CreateToolhelp32Snapshot function used to take snapshot of the specified process. This is the first function called before using any of the heap enumeration functions.

Listing heap nodes of process
After the initialization, Heap32ListFirst & Heap32ListNext functions are used to retrieve all heap nodes within the process.

Enumerating heap blocks within each node
Heap32First & Heap32Next functions lists all the heap blocks within each node

This enumeration method does not present any problem when listing small number of heap blocks. However when it comes to enumerating large number of heap blocks (more than 50,000) in each node then it takes more than 10 minutes. This delay cannot be justified in today's computing world.

Mystery Behind Slow Heap Functions

The reason lies in the poor implementation of Heap32First and Heap32Next functions. Code for both the functions are almost similar except some arithmetic routines used to move to right block. Here is the assembly code for Heap32Next function from kernel32.dll.Let us see what is happening behind the screen...

.text:7C863BF8 ; BOOL __stdcall Heap32Next(LPHEAPENTRY32 lphe)
.text:7C863BF8 public _Heap32Next@4
.text:7C863BF8 _Heap32Next@4 proc near
.text:7C863BF8
.text:7C863BF8 lphe = dword ptr 8
.text:7C863BF8
.text:7C863BF8 mov edi, edi
.text:7C863BFA push ebp
.text:7C863BFB mov ebp, esp
.text:7C863BFD push ebx
.text:7C863BFE push esi
.text:7C863BFF mov esi, [ebp+lphe]
.text:7C863C02 test esi, esi
.text:7C863C04 push edi
.text:7C863C05 jz loc_7C863D0C
.text:7C863C0B cmp dword ptr [esi], 24h
.text:7C863C0E jnz loc_7C863D0C
.text:7C863C14 push 0
.text:7C863C16 push 0
.text:7C863C18 call ds:__imp__RtlCreateQueryDebugBuffer@8
.text:7C863C1E mov ebx, eax
.text:7C863C20 test ebx, ebx
.text:7C863C22 jnz short loc_7C863C2E
.text:7C863C24 mov eax, 0C0000001h
.text:7C863C29 jmp loc_7C863D18

First it invokes RtlCreateQueryDebugBuffer(0, FALSE) which allocates buffer for storing heap data.On success, it returns pointer to allocated debug buffer which is stored in ebx. If the function fails to allocate buffer then it jumpts to the end of the function otherwise continues with the code below.

.text:7C863C2E
.text:7C863C2E loc_7C863C2E: ; CODE XREF: Heap32Next(x)+2Aj
.text:7C863C2E push ebx ; Debug Buffer pointer
.text:7C863C2F push 14h ; PDI_HEAPS | PDI_HEAP_BLOCKS
.text:7C863C31 push dword ptr [esi+1Ch] ; Process id
.text:7C863C34 call ds:__imp__RtlQueryProcessDebugInformation@12
.text:7C863C3A mov edi, eax
.text:7C863C3C test edi, edi
.text:7C863C3E jge short loc_7C863C4D ; continue next
.text:7C863C40 push ebx
.text:7C863C41 call ds:__imp__RtlDestroyQueryDebugBuffer@4
.text:7C863C47 push edi
.text:7C863C48 jmp loc_7C863D11 ; exit

Next RtlQueryProcessDebugInformation is called to load all heap blocks of the process. This function in fact loads entire heap nodes and corresponding heap blocks of the process.

.text:7C863C4D
.text:7C863C4D loc_7C863C4D: ; CODE XREF: Heap32Next(x)+46j
.text:7C863C4D mov ecx, [ebx+38h] ; ecx = heap information structure
.text:7C863C50 mov edx, [ecx] ;edx = count of heap nodes within process
.text:7C863C52 xor eax, eax
.text:7C863C54 test edx, edx
.text:7C863C56 jbe short loc_7C863C6A
.text:7C863C58 mov edi, [esi+20h]
.text:7C863C5B add ecx, 4 ;ecx now point to first heap node

Debug Buffer contains the pointer to heap information structure at offset 0x38. First parameter of this heap structure is node count and after that it contains array of DEBUG_HEAP_INFORMATION structure which represent each heap node. The heap information structure is shown below.

Struct HeapInfo
{
DWORD heapNodeCount;
DEBUG_HEAP_INFORMATION heapNode[heapNodeCount];
};

Once the node count is retrieved ecx is incremented by 4 to point to the first heap node.

.text:7C863C5E
.text:7C863C5E loc_7C863C5E: ; CODE XREF: Heap32Next(x)+70j
.text:7C863C5E cmp [ecx], edi
.text:7C863C60 jz short loc_7C863C6A
.text:7C863C62 inc eax
.text:7C863C63 add ecx, 40h ; Move to the next heap node
.text:7C863C66 cmp eax, edx ; Check if current count > total heap node count
.text:7C863C68 jb short loc_7C863C5E

This is simple loop which checks if we are at the right heap node. If not then ecx is incremented by size of DEBUG_HEAP_INFORMATION to move to the next heap node. Here is its structure information.

// It represents each heap node
typedef struct _DEBUG_HEAP_INFORMATION
{
      ULONG Base; // 0x00
      ULONG Flags; // 0x04
      USHORT Granularity; // 0x08
      USHORT Unknown; // 0x0A
      ULONG Allocated; // 0x0C
      ULONG Committed; // 0x10
      ULONG TagCount; // 0x14
      ULONG BlockCount; // 0x18
      ULONG Reserved[7]; // 0x1C
      PVOID Tags; // 0x38
      PVOID Blocks; // 0x3C Heap block pointer for this node.
} DEBUG_HEAP_INFORMATION, *PDEBUG_HEAP_INFORMATION;

.text:7C863C6A
.text:7C863C6A loc_7C863C6A: ; CODE XREF: Heap32Next(x)+5Ej
.text:7C863C6A ; Heap32Next(x)+68j
.text:7C863C6A cmp eax, edx
.text:7C863C6C jnb short loc_7C863CB8 ; did not find the right node, so exit
.text:7C863C6E inc dword ptr [esi+18h] ; HEAPENTRY32->dwResvd++
.text:7C863C71 mov ecx, [esi+18h]
.text:7C863C74 shl eax, 6
.text:7C863C77 mov edi, eax
.text:7C863C79 mov eax, [ebx+38h]
.text:7C863C7C lea edx, [edi+eax]
.text:7C863C7F cmp ecx, [edx+1Ch] ; Check if current count > total block count
.text:7C863C82 jnb short loc_7C863CB8 ; take the exit
.text:7C863C84 mov eax, ecx
.text:7C863C86 shl eax, 4
.text:7C863C89 add eax, [edx+40h] ; now eax points to current heap block
.text:7C863C8C test byte ptr [eax+4], 2
.text:7C863C90 jz short loc_7C863CC8

Once the right heap node is found, it uses the curHeapNode->Blocks pointer to traverse the heap blocks to find the current heap block. Here HEAPENTRY32-> dwResvd is internally used to keep track of block count.

.text:7C863C92
.text:7C863C92 loc_7C863C92: ; CODE XREF: Heap32Next(x)+BEj
.text:7C863C92 mov edx, [ebx+38h]
.text:7C863C95 movzx edx, word ptr [edi+edx+0Ch] ; edx = curHeapNode->Granularity
.text:7C863C9A add edx, [eax+0Ch] ; edx += curBlock->address
.text:7C863C9D mov [esi+8], edx ; HEAPENTRY32->dwAddress = edx;
.text:7C863CA0 mov edx, [ebx+38h]
.text:7C863CA3 cmp ecx, [edi+edx+1Ch] ; check block count
.text:7C863CA7 jnb short loc_7C863CB8 ; Go to exit
.text:7C863CA9 inc ecx
.text:7C863CAA add eax, 10h ; move to next block
.text:7C863CAD mov [esi+18h], ecx ; HEAPENTRY32->dwResvd = ecx
.text:7C863CB0 test byte ptr [eax+4], 2 ; curBlock->flag & 2
.text:7C863CB4 jz short loc_7C863CCE
.text:7C863CB6 jmp short loc_7C863C92 ; continue until valid block found

The above code looks for valid heap block based on certain flags. Each heap block has following structure

struct HeapBlock
{
      DWORD size;
      DWORD flag;
      DWORD unknown;
      DWORD address;
};

.text:7C863CB8
.text:7C863CB8 loc_7C863CB8: ; CODE XREF: Heap32Next(x)+74j
.text:7C863CB8 ; Heap32Next(x)+8Aj ...
.text:7C863CB8 push ebx
.text:7C863CB9 call ds:__imp__RtlDestroyQueryDebugBuffer@4
.text:7C863CBF push 12h ; dwErrCode
.text:7C863CC1 call _SetLastError@4 ; SetLastError(x)
.text:7C863CC6 jmp short loc_7C863D16

The above code is error handler which just destroys the buffer created at the beginning, sets the last error code and returns to the caller.

.text:7C863CC8
.text:7C863CC8 loc_7C863CC8: ; CODE XREF: Heap32Next(x)+98j
.text:7C863CC8 mov ecx, [esi+0Ch] ; ecx = HEAPENTRY32->dwBlockSize
.text:7C863CCB add [esi+8], ecx ; HEAPENTRY32->dwAddress += ecx

For the first block, if the value curBlock->flag & 2 != 2 then the current block address is computed by adding previos block address and block size.

.text:7C863CCE
.text:7C863CCE loc_7C863CCE: ; CODE XREF: Heap32Next(x)+BCj
.text:7C863CCE mov ecx, [eax]
.text:7C863CD0 xor edi, edi
.text:7C863CD2 mov [esi+0Ch], ecx ; HEAPENTRY32->dwBlockSize = ecx
.text:7C863CD5 mov ax, [eax+4]
.text:7C863CD9 inc edi
.text:7C863CDA test al, 0F1h
.text:7C863CDC jnz short loc_7C863CFE
.text:7C863CDE test ah, 2
.text:7C863CE1 jnz short loc_7C863CFE
.text:7C863CE3 test al, 20h
.text:7C863CE5 jz short loc_7C863CF0
.text:7C863CE7 mov dword ptr [esi+10h], 4 ; HEAPENTRY32->dwFlags=4
.text:7C863CEE jmp short loc_7C863D01
.text:7C863CF0
.text:7C863CF0 loc_7C863CF0: ; CODE XREF: Heap32Next(x)+EDj
.text:7C863CF0 test ah, 1
.text:7C863CF3 jz short loc_7C863D01
.text:7C863CF5 mov dword ptr [esi+10h], 2 ; HEAPENTRY32->dwFlags=2
.text:7C863CFC jmp short loc_7C863D01
.text:7C863CFE
.text:7C863CFE loc_7C863CFE: ; CODE XREF: Heap32Next(x)+E4j
.text:7C863CFE ; Heap32Next(x)+E9j
.text:7C863CFE mov [esi+10h], edi ; HEAPENTRY32->dwFlags=4

Once the right heap block is found, its type is determined which can be either LF32_FIXED, LF32_FREE or LF32_MOVEABLE. Using this information, HEAPENTRY32->dwFlags value is updated.

.text:7C863D01
.text:7C863D01 loc_7C863D01: ; CODE XREF: Heap32Next(x)+F6j
.text:7C863D01 ; Heap32Next(x)+FBj ...
.text:7C863D01 push ebx
.text:7C863D02 call ds:__imp__RtlDestroyQueryDebugBuffer@4
.text:7C863D08 mov eax, edi
.text:7C863D0A jmp short loc_7C863D18

At the end, the allocated buffer is destroyed by calling the function RtlDestroyQueryDebugBuffer.

.text:7C863D0C
.text:7C863D0C loc_7C863D0C: ; CODE XREF: Heap32Next(x)+Dj
.text:7C863D0C ; Heap32Next(x)+16j
.text:7C863D0C push 0C0000004h
.text:7C863D11
.text:7C863D11 loc_7C863D11: ; CODE XREF: Heap32Next(x)+50j
.text:7C863D11 call _BaseSetLastNTError@4 ; BaseSetLastNTError(x)
.text:7C863D16
.text:7C863D16 loc_7C863D16: ; CODE XREF: Heap32Next(x)+CEj
.text:7C863D16 xor eax, eax
.text:7C863D18
.text:7C863D18 loc_7C863D18: ; CODE XREF: Heap32Next(x)+31j
.text:7C863D18 ; Heap32Next(x)+112j
.text:7C863D18 pop edi
.text:7C863D19 pop esi
.text:7C863D1A pop ebx
.text:7C863D1B pop ebp
.text:7C863D1C retn 4
.text:7C863D1C Heap32Next@4 endp

If there is an error then function BaseSetLastNTError is called to set last error. Otherwise function returns normally.

To summarize, following action takes place during execution of Heap32Next function

First RtlCreateQueryDebugBuffer is invoked to allocate large buffer to store entire process heap data.
Next it calls RtlQueryProcessDebugInformation to load the all the heap data into the buffer.
After that it traverses the heap nodes to find the right heap node.
Then it traces through the all the heap blocks in that heap node to find the next valid heap block.
At the end RtlDestroyQueryDebugBuffer is used to free the allocated buffer.

This sequence is executed every time Heap32Next is called. Assume that you are enumerating large number of heap blocks (say 50,000) then for every heap block, buffer is created, entire process heap is loaded, traversed and finally the buffer is destroyed. This is the reason behind the long delay while enumerating thousands of heap blocks.

Clearly, its the poor design, one could have just done the allocation & storing heaps during initialization and whenever heap32next is called, it can just traverse the blocks to return the next valid heap block. Using the later approach any large number of heaps can be enumerated in few seconds.

Faster Heap Enumeration Method

Based on above information found during my reverse engineering of Heap32Next and several other heap functions, I wrote my own heap enumeration function which is much more faster and efficient than using Windows heap functions. Here are the complete details of the implementation

To keep the code simple & readable, I have removed all error checking and other unnecessary instructions.

Various Structures Used in the Code

//Debug Buffer used by RtlCreateQueryDebugBuffer
typedef struct _DEBUG_BUFFER
{
      HANDLE SectionHandle;
      PVOID SectionBase;
      PVOID RemoteSectionBase;
      ULONG SectionBaseDelta;
      HANDLE EventPairHandle;
      ULONG Unknown[2];
      HANDLE RemoteThreadHandle;
      ULONG InfoClassMask;
      ULONG SizeOfInfo;
      ULONG AllocatedSize;
      ULONG SectionSize;
      PVOID ModuleInformation;
      PVOID BackTraceInformation;
      PVOID HeapInformation;
      PVOID LockInformation;
      PVOID Reserved[8];
} DEBUG_BUFFER, *PDEBUG_BUFFER;

//This represents each heap node
typedef struct _DEBUG_HEAP_INFORMATION
{
      ULONG Base; // 0x00
      ULONG Flags; // 0x04
      USHORT Granularity; // 0x08
      USHORT Unknown; // 0x0A
      ULONG Allocated; // 0x0C
      ULONG Committed; // 0x10
      ULONG TagCount; // 0x14
      ULONG BlockCount; // 0x18
      ULONG Reserved[7]; // 0x1C
      PVOID Tags; // 0x38
      PVOID Blocks; // 0x3C
} DEBUG_HEAP_INFORMATION, *PDEBUG_HEAP_INFORMATION;

//Internal structure used to store heap block information.
struct HeapBlock
{
      PVOID dwAddress;
      DWORD dwSize;
      DWORD dwFlags;
      ULONG reserved;
};

Enumerating Process Heap Nodes

Here is the sample code demonstrating the traversing through all the heap nodes of a process. It starts with creating debug buffer and then loading entire process heaps using RtlQueryProcessDebugInformation function.Then it enumerates through all the heap nodes and displays the relevant information. At the end, internal buffer is destroyed.

void DisplayHeapNodes(DWORD pid)
{
// Create debug buffer
PDEBUG_BUFFER db = RtlCreateQueryDebugBuffer(0, FALSE);

// Get process heap data
RtlQueryProcessDebugInformation( pid, PDI_HEAPS | PDI_HEAP_BLOCKS, db);

ULONG heapNodeCount = db->HeapInformation ? *PULONG(db->HeapInformation):0;

PDEBUG_HEAP_INFORMATION heapInfo = PDEBUG_HEAP_INFORMATION(PULONG(db-> HeapInformation) + 1);

// Go through each of the heap nodes and dispaly the information
for (unsigned int i = 0; i < heapNodeCount; i++)
{
    printf("\n Base Address = 0x%.8x", heapInfo[i].Base);
    printf("\n Block count = %d", heapInfo[i].BlockCount);
    printf("\n Committed Size= 0x%.8x", heapInfo[i].Committed);
    printf("\n Allocated Size = 0x%.8x", heapInfo[i].Allocated);
    printf("\n Flags = 0x%.8x", heapInfo[i].Flags);
}

// Clean up the buffer
RtlDestroyQueryDebugBuffer( db );
}

Enumerating Heap Blocks within a Node

The code below illustrates the enumerating heap blocks within the specified node for the process.

void DisplayHeapBlocks(DWORD pid, DWORD nodeAddress)
{
HeapBlock hb ={0,0,0,0};

// Create debug buffer
PDEBUG_BUFFER db = RtlCreateQueryDebugBuffer(0, FALSE);

// Get process heap data
LONG ret = RtlQueryProcessDebugInformation( pid, PDI_HEAPS | PDI_HEAP_BLOCKS, db);

ULONG heapNodeCount = db->HeapInformation ? *PULONG(db->HeapInformation) : 0;

PDEBUG_HEAP_INFORMATION heapInfo = PDEBUG_HEAP_INFORMATION(PULONG(db->HeapInformation) + 1);

// Go through each of the heap nodes
for (unsigned int i = 0; i < heapNodeCount; i++)
{
    if(heapInfo[i].Base == nodeAddress)
    {
      // Now enumerate all blocks within this heap node...
      int c = 0;
      memset(&hb,0,sizeof(hb));

      if( GetFirstHeapBlock(&heapInfo[i] , &hb) )
      {
        do
        {
          printf("\n Block count = %d", c+1);
          printf("\n Block address = 0x%.8x", hb.dwAddress);
          printf("\n Block Size = 0x%.8x", hb.dwSize);

          if( hb.dwFlags == LF32_FREE )
             printf("\n Status = Free");
          else
             printf("\n Status = Allocated");

          c++;
        }
        while( GetNextHeapBlock( &heapInfo[i], &hb) );
      }
      break;
    }
}

// Clean up the buffer
RtlDestroyQueryDebugBuffer( db );
}

This function initialize heaps and then traverses through all the heap nodes to find the specified heap node. Next it enumerates all the heap blocks within this node using GetFirstHeapBlock & GetNextHeapBlock functions and displays the information.

BOOL GetFirstHeapBlock( PDEBUG_HEAP_INFORMATION curHeapNode, HeapBlock *hb)
{
int *block;

hb->reserved = 0;
hb->dwAddress = 0;
hb->dwFlags = 0;

block = (int*) curHeapNode->Blocks;

while( ( *(block+1) & 2 ) == 2 )
{
    hb->reserved++;
    hb->dwAddress = (void *) ( *(block+3) + curHeapNode->Granularity );
    block = block+4;
    hb->dwSize = *block;
}

// Update the flags...
USHORT flags = *(block+1);

if( ( flags & 0xF1 ) != 0 || ( flags & 0x0200 ) != 0 )
    hb->dwFlags = 1;
else if( (flags & 0x20) != 0 )
         hb->dwFlags = 4;
       else if( (flags & 0x0100) != 0 )
              hb->dwFlags = 2;

   return TRUE;
}

BOOL GetNextHeapBlock( PDEBUG_HEAP_INFORMATION curHeapNode, HeapBlock *hb)
{
int *block;

hb->reserved++;
block = (int*) curHeapNode->Blocks;

// Make it point to next block address entry
block = block + hb->reserved * 4;

if( ( *(block+1) & 2 ) == 2 )
{
    do
    {
      // new address = curBlockAddress + Granularity ;
      hb->dwAddress = (void *) ( *(block+3) + curHeapNode->Granularity );

      // If all the blocks have been enumerated....exit
      if( hb->reserved > curHeapNode->BlockCount)
        return FALSE;

      hb->reserved++;
      blockAddress = block + 4; //move to next block
      hb->dwSize = *block;
     }
     while( ( *(block+1) & 2 ) == 2 );
}
else
{
    // New Address = prev Address + prev block size ;
    hb->dwAddress = (void*) ( (int)hb->dwAddress + hb->dwSize );
    hb->dwSize = *block;
}

// Update the flags...
USHORT flags = *( block+1);

if( ( flags & 0xF1 ) != 0 || ( flags & 0x0200 ) != 0 )
    hb->dwFlags = 1;
else if( (flags & 0x20) != 0 )
         hb->dwFlags = 4;
       else if( (flags & 0x0100) != 0 )
              hb->dwFlags = 2;

return TRUE;
}

Much of the functionality for the above functions is derived from the assembly code of Heap32First and Heap32Next functions.

GetFirstHeapBlock returns the first valid heap block in the current heap node. GetNextHeapBlock does the similar task to find next valid heap block. Information about the previous heap block is stored in the HeapBlock structure and it should not be modified by the caller.

If you want to implement heap enumeration functionality in your project, then you can use the following template.

1. InitHeap()
This will call RtlCreateQueryDebugBuffer & RtlQueryProcessDebugInformation to initialize the process heap data.

2. GetHeapNode(int index)
This function traverses the list of nodes and returns the node at specified index.

3. GetFirstHeapBlock(PDEBUG_HEAP_INFORMATION curHeapNode, HeapBlock *hb)
Return the first valid heap block within the specified node.

4. GetNextHeapBlock(PDEBUG_HEAP_INFORMATION curHeapNode, HeapBlock *hb)
Return the next valid heap block within the specified node. HeapBlock will contain the information about the previous heap block.

5. DestroyHeap()
This function invokes RtlDestroyQueryDebugBuffer to free the internal buffer.

Conclusion

This article gives insight on implementation of Windows heap API functions and explains reason for their slower execution. Then it presents a new way to implement faster and efficient heap enumeration functionality.