Hunting on Sysmon events with Jupyter Notebooks (Part 2 - Process Execution)

Leonardo M. Falcon
14 min readMay 1, 2021

In our previous post, we introduced Sysmon. We also provided the steps to install a threat hunting environment that you can use to generate, store, and hunt through Sysmon logs using Jupyter notebooks. This article will discuss how to use Jupyter and Python and libraries like Pandas to analyze millions of Sysmon events efficiently. This time we will focus on the Sysmon events related to process/code execution.

Sysmon process execution events

Before we start hunting on Sysmon or any other log/data source, we need a good understanding of the data format and of which fields are useful to us as hunters. Not all the fields in a log source are equally valuable for hunting; some are more relevant than others. The PID field from process execution events, for example, is mostly useful during specific IR investigations.

The Sysmon documentation describes the following events, which are the most important ones for understanding process execution in a Windows environment.

Event ID 1: Process creation

The process creation event provides extended information about a newly created process. The full command line provides context on the process execution. The ProcessGUID field is a unique value for this process across a domain to make event correlation easier. The hash is a full hash of the file with the algorithms in the HashType field.

Event ID 6: Driver loaded

The driver loaded event provides information about drivers loaded on the system. The configured hashes are provided, as well as signature information. The signature is created asynchronously for performance reasons and indicates whether the file was removed after loading.

Event ID 7: Image loaded

The image loaded event logs when a module is loaded in a specific process. This event is disabled by default and needs to be configured with the -l option. It indicates the process in which the module is loaded, along with hashes and signature information. The signature is created asynchronously for performance reasons and indicates whether the file was removed after loading. This event should be configured carefully, as monitoring all image load events generates a large number of events.

Requirements

  • A Sysmon service installed and running on a Windows system
  • A Winlogbeat service installed and running on the same device as Sysmon
  • A fully installed and configured Logstash instance

Explore Sysmon event ID 1 with the event viewer

  • Open the Windows Event Viewer and navigate to “Applications and Services Logs → Microsoft → Windows → Sysmon → Operational”
  • Click on “Filter Current Log…” on the right menu and set the filter to show only events with ID 1
  • Select an event in the middle panel and double click it to display its details in a new window

As you can see, this Sysmon event type provides many interesting fields for hunting and IR investigations. Below you can find a brief description of the most relevant fields:

UtcTime: Time when the event was created on the device.

ParentProcessGuid/ProcessGuid: This is a unique ID for this process across a domain. This value greatly improves the correlation of the activity of a specific process across the same Windows domain.

ParentProcessId/ProcessId: A unique number allocated by the Windows kernel to each active process on the system. It enables process manipulations like adjusting the process priority, suspending it, or killing it. Note that the kernel reuses PIDs over time, which is one reason the ProcessGuid is better suited for correlation.

ParentImage/Image: Contains a string representing the full filesystem path to the process that was executed.

OriginalFileName: The original file name of the executable, as recorded in its version information (PE metadata). Since it travels with the binary, it is useful for spotting processes that were renamed on disk.

ParentCommandLine/CommandLine: The command line parameters that were used to execute the process.

CurrentDirectory: Current working directory of the executed process.

User: The Windows user that executed the process.

LogonGuid: In theory, you should be able to use this GUID to correlate logon events on this computer with corresponding authentication events on the domain controller. In practice, this is not always the case.

LogonID: A semi-unique number (unique between reboots) that identifies the logon session that was just initiated. Any events logged subsequently during this logon session will report the same Logon ID.

Hashes: List of all hash functions calculated for the process file.
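
Since several hash algorithms can be enabled at once, the Hashes field arrives as a single comma-separated string. As a minimal sketch of how to work with it (the sample value is taken from the event output shown later in this article), it can be split into a Python dictionary like this:

def parse_hashes(hashes_field):
    # Split a Sysmon 'Hashes' string like 'SHA1=...,MD5=...' into a dict
    result = {}
    for pair in hashes_field.split(','):
        algo, _, value = pair.partition('=')
        result[algo] = value
    return result

hashes = parse_hashes('SHA1=D3A77E94D08F2EB9A8276F32CA16F65D1CE8B524,'
                      'MD5=A1F58FFF448E4099297D6EE0641D4D0E')
print(hashes['SHA1'])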

You can repeat the same steps to get familiar with the events with IDs 6 and 7. More information about these and other event types can be found in the official Sysmon documentation.

About pandas

Pandas is a Python library for high-level data manipulation originally developed by Wes McKinney. It is built on the NumPy package, and its primary data structure is the DataFrame. DataFrames let you store and manipulate tabular data in rows of observations and columns of variables.

Pandas has many powerful features:

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging. Even create domain-specific time offsets and join time series without losing data;
  • Highly optimized for performance, with critical code paths written in Cython or C.

You can learn more about pandas' features on its website.

We will use pandas to manipulate very large datasets containing Sysmon events with many different columns. Thanks to pandas we will be able to do this quickly and effortlessly.
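
As a tiny illustration of the DataFrame model (the column names and values below are hypothetical), a table of observations becomes:

import pandas as pd

# Hypothetical sample data: each row is an observation, each column a variable
df_demo = pd.DataFrame({'process': ['cmd.exe', 'powershell.exe', 'cmd.exe'],
                        'user': ['alice', 'bob', 'alice']})
print(df_demo['process'].value_counts())  # counts occurrences of each value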

Slicing through Sysmon data in Jupyter

Before we can access the Sysmon data in our WSL-Ubuntu Jupyter environment, we need to do one more thing. At the moment, WSL cannot natively handle network shares, so it is necessary to mount them first using DrvFs. Follow the steps below to mount your Logstash network share in WSL.

  • Create a local folder to be used as a mount point

$ sudo mkdir /mnt/sysmon-logs

  • Mount the remote Logstash samba folder

Note: Before executing the command below, you must ensure that the samba share is also mounted on your Windows host

$ sudo mount -t drvfs '\\[LOGSTASH-IP]\sysmon-logs' /mnt/sysmon-logs

  • List the contents of the folder to verify that the files are accessible

$ ls /mnt/sysmon-logs

Sysmon process creation hunting playbook

  • Create a new empty Python3 playbook in Jupyter Lab
  • Load the Sysmon log files from the shared drive

First, we must read the names of the log files for a specific time frame. We can do this using the Python library “glob”. In the example below, we are loading all the log files created on a specific day.

import glob
files = []
for f in glob.glob("/mnt/sysmon-logs/winlogbeat-2020-08-24-*.json"):
    files.append(f)
    print(f)

Next, we need to read all the JSON events from the log files into a single Python list.

import json
events = []
for f in files:
    with open(f, 'r') as fin:
        for line in fin.readlines():
            event = json.loads(line.strip())
            events.append(event)

Afterward, we can filter this list and select only the Sysmon events with ID 1 (process creation). We take only the contents of the “winlog” section from the JSON record and create a new list for each event. This section contains all the relevant Sysmon fields we will need for our hunt.

evt_id1 = []
for evt in events:
    if evt['winlog']['provider_name'] == 'Microsoft-Windows-Sysmon':
        if evt['winlog']['event_id'] == 1:
            evt_id1.append(evt['winlog'])
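
As a quick optional sanity check, we can compare how many events we loaded with how many survived the filter:

print('Total events loaded:', len(events))
print('Sysmon process creation events:', len(evt_id1))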

Below you can see an example of the structure of the “winlog” dictionary section extracted from the original Sysmon event generated by Winlogbeat. This section contains all the fields that are interesting to us.

import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(evt_id1[0])
{   'api': 'wineventlog',
    'channel': 'Microsoft-Windows-Sysmon/Operational',
    'computer_name': 'WinTest01',
    'event_data': {   'CommandLine': '"C:\\Program Files '
                                     '(x86)\\Dropbox\\Update\\DropboxUpdate.exe" '
                                     '/ua /installsource scheduler',
                      'Company': 'Dropbox, Inc.',
                      'CurrentDirectory': 'C:\\WINDOWS\\system32\\',
                      'Description': 'Dropbox Update',
                      'FileVersion': '1.3.27.73',
                      'Hashes': 'SHA1=D3A77E94D08F2EB9A8276F32CA16F65D1CE8B524,MD5=A1F58FFF448E4099297D6EE0641D4D0E,SHA256=47839789332AAF8861F7731BF2D3FBB5E0991EA0D0B457BB4C8C1784F76C73DC,IMPHASH=907BD326A444DBC0E31CEF85B0646F45',
                      'Image': 'C:\\Program Files '
                               '(x86)\\Dropbox\\Update\\DropboxUpdate.exe',
                      'IntegrityLevel': 'System',
                      'LogonGuid': '{5a87d633-dc4c-5f34-e703-000000000000}',
                      'LogonId': '0x3e7',
                      'OriginalFileName': 'DropboxUpdate.exe',
                      'ParentCommandLine': 'C:\\WINDOWS\\system32\\svchost.exe '
                                           '-k netsvcs -p -s Schedule',
                      'ParentImage': 'C:\\Windows\\System32\\svchost.exe',
                      'ParentProcessGuid': '{5a87d633-dc4c-5f34-1b00-000000001c00}',
                      'ParentProcessId': '2044',
                      'ProcessGuid': '{5a87d633-16b8-5f3a-314e-000000001c00}',
                      'ProcessId': '30832',
                      'Product': 'Dropbox Update',
                      'RuleName': '-',
                      'TerminalSessionId': '0',
                      'User': 'NT AUTHORITY\\SYSTEM',
                      'UtcTime': '2020-08-17 05:33:44.304'},
    'event_id': 1,
    'opcode': 'Info',
    'process': {'pid': 6268, 'thread': {'id': 7576}},
    'provider_guid': '{5770385f-c22a-43e0-bf4c-06f5698ffbd9}',
    'provider_name': 'Microsoft-Windows-Sysmon',
    'record_id': 1201012,
    'task': 'Process Create (rule: ProcessCreate)',
    'user': {   'domain': 'NT AUTHORITY',
                'identifier': 'S-1-5-18',
                'name': 'SYSTEM',
                'type': 'User'},
    'version': 5}
  • Preparing events for pandas

Before we can create a pandas dataframe object, we need to reshape the events into a structure pandas understands: a list of lists in which each event becomes a row. We first create a header with the names of the columns we want to use. Afterward, we iterate through the list of process execution events and map the fields of each dictionary onto a new list following the header’s order. Note that in this example we keep only the SHA1 hash from the list of hashes calculated for the process.

header = ['timestamp', 'computer_name', 'process_path', 'parent_path',
          'command_line', 'parent_command_line', 'user', 'sha1',
          'company', 'description']
events_list = []
for evt in evt_id1:
    new_evt = []
    try:
        new_evt.append(evt['event_data']['UtcTime'])
        new_evt.append(evt['computer_name'])
        new_evt.append(evt['event_data']['Image'])
        new_evt.append(evt['event_data']['ParentImage'])
        new_evt.append(evt['event_data']['CommandLine'])
        new_evt.append(evt['event_data']['ParentCommandLine'])
        new_evt.append(evt['event_data']['User'])
        new_evt.append(evt['event_data']['Hashes'][5:45])  # slice out the SHA1 value only
        new_evt.append(evt['event_data']['Company'])
        new_evt.append(evt['event_data']['Description'])
        events_list.append(new_evt)
    except KeyError:
        pass
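
As an alternative sketch, pandas can also flatten these nested dictionaries directly with json_normalize; this skips the manual mapping at the cost of dot-separated column names (e.g., event_data.Image) and no hash slicing:

import pandas as pd

# Alternative: flatten the nested 'winlog' dictionaries in a single call
df_alt = pd.json_normalize(evt_id1)
df_alt[['computer_name', 'event_data.Image']].head(5)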
  • Generating a pandas dataframe

We can now generate the pandas dataframe object using the header and the list of lists object. Note that we are also converting the string values of the column ‘timestamp’ into datetime objects. We will need this later on for our hunts using time series.

import pandas as pd
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(events_list, columns=header)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S.%f')
df.head(5)

Baseline hunting

We are ready now to start having some fun :)

Let’s take all this Sysmon data and turn it into information that’s useful for our organization or our clients. A very important thing to do when we are starting to hunt in a new environment is baselining. Let’s see some examples of hunts that can help with this activity.

  • Top software vendors active in the environment

We can use the pandas value_counts function to count the total number of occurrences of each value within a data set. We can do this for just one column of our dataframe or for a combination of multiple columns. In the example below, we use the ‘company’ and ‘description’ columns to obtain the total count of each combination.

By visually reviewing this list, we first get familiar with the standard active software in our environment (baselining) and can immediately spot things that are potentially non-compliant with our organization’s Internet/computer usage policies. Some examples of software that could be forbidden are games, unapproved online file storage solutions, bitcoin mining software, etc. This list can also reveal other things, like known hacking or penetration testing tools that shouldn’t be present on the computers assigned to regular users. Remote Access Tools that are not approved for use could also be a concern (e.g., TeamViewer, TightVNC, LogMeIn, pcAnywhere, etc.).

top_procs = df[['company', 'description']]\
    .value_counts()\
    .rename_axis(['company', 'description'])\
    .reset_index(name='counts')
top_procs.head(10)
  • Find the top devices running unique processes

This hunt can help identify devices with an unusual count of unique processes within a specific time frame. Keep in mind that servers and user workstations usually have different activity profiles when it comes to process execution, and different types of server operating systems behave differently as well. Typically, the activity of user workstations running the same OS should be relatively homogeneous.

We can use a combination of the “groupby” and “nunique” Pandas functions to achieve this.

evd_procs = df.groupby(['computer_name'])\
    .sha1.nunique()\
    .sort_values(ascending=False)\
    .reset_index(name='counts')
evd_procs.head(10)
  • Find the top 10 executed processes across all devices
top_procs = df['process_path']\
    .value_counts()\
    .rename_axis('process_path')\
    .reset_index(name='counts')
top_procs.head(10)
  • Top 10 processes using the process path and hash in the aggregation
top_procs = df.groupby(['process_path', 'sha1'])\
    .size()\
    .sort_values(ascending=False)\
    .reset_index(name='counts')
top_procs.head(10)
  • Find processes with the same hash but executed from multiple different paths

With this, we can detect legitimate Windows tools like PowerShell that have been renamed/moved to a different location to evade detection and potentially used for evil purposes by an attacker.

evd_procs = df.groupby(['sha1'])\
    .process_path.nunique()\
    .sort_values(ascending=False)\
    .reset_index(name='counts')
evd_procs.head(10)

It seems we have found a hash that was executed from 46 unique paths. Let’s investigate further and display all the unique paths for the process with hash “F95ED0E286AA68B4DF779D7E782363EDB5B9FF04”.

For this, we first create a boolean mask “has_hash” that contains the condition we will use to filter the dataframe. Applying it to the main dataframe gives us a new dataframe with only the events matching the filter.

has_hash = df['sha1'] == 'F95ED0E286AA68B4DF779D7E782363EDB5B9FF04'
procs_with_hash = df[has_hash]
procs_with_hash[['process_path', 'command_line']].head(10)

We know that DismHost.exe is the name of a legitimate Windows process (Dism Host Servicing Process). This hash was scanned in the past by VirusTotal and wasn’t flagged by any antivirus engine:

https://www.virustotal.com/gui/file/21baef2bb5ab2df3aa4d95c8333aadadda61dee65e61ad2dbe5f3dbaddb163c7/detection

The file is also signed by Microsoft, and the signature was valid. This appears to be normal activity, and it can be added to the hunting baseline to decrease the number of outliers in the future.
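
A complementary sketch for the renamed-binaries scenario uses the description and company columns we already extracted: it lists events whose version-information description mentions PowerShell but whose on-disk path does not. Note that the exact description string is an assumption and may vary between Windows builds.

# Hypothetical variant: spot PowerShell binaries renamed/moved on disk
renamed_ps = df[df['description'].str.contains('PowerShell', na=False) &
                ~df['process_path'].str.lower().str.contains('powershell', na=False)]
renamed_ps[['process_path', 'sha1', 'command_line']].head(10)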

  • Find all PowerShell instances started by a different process than cmd.exe and explorer.exe

Typically, IT administrators execute PowerShell from a command prompt console or via the Explorer shell. PowerShell execution events where the parent process is neither cmd.exe nor explorer.exe can help identify other processes executing PowerShell, and potentially malware.

We can achieve this using the pandas “query” function. The syntax is quite different from SQL, but there are good references and examples online to get started, including guides that introduce the translation of common SQL queries to pandas.

ps_hunt = df.query('process_path.str.contains("powershell.exe") & '
                   '~parent_path.str.contains("cmd.exe") & '
                   '~parent_path.str.contains("explorer.exe") & '
                   '~parent_path.str.contains("Program Files")',
                   engine='python')  # str methods inside query() generally require the python engine
ps_hunt.head(5)

Advanced hunting

Let’s step up our hunt to the next level. We can calculate additional properties that extend our original dataset: for example, the length of the process_path and command_line values, and the entropy of the command_line.

Calculating new properties from a single column with simple functions is straightforward in pandas. We first compute the lengths of process_path and command_line by applying the “str.len” function to all the values of a column, storing the results in new columns.

df['proc_path_len'] = df['process_path'].str.len()
df['com_line_len'] = df['command_line'].str.len()

We can calculate the Shannon entropy of a string using a custom function. We will then use the pandas “apply” function to apply the entropy function to each value of the “command_line” column and store the results in a new column.

import math

def get_entropy(row):
    # Remove quotes so they don't skew the character distribution
    cline = str(row['command_line']).replace('"', '')
    if not cline:
        return 0.0  # guard against empty command lines
    prob = [float(cline.count(c)) / len(cline) for c in dict.fromkeys(list(cline))]
    # Shannon entropy in bits
    entropy = -sum([p * math.log(p) / math.log(2.0) for p in prob])
    return entropy

df['cl_entropy'] = df.apply(get_entropy, axis=1)
df.head(5)
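
As a quick sanity check, the helper can also be called on a plain dictionary; a short, hypothetical command line should typically score lower than a base64-looking one:

# Typically lower entropy (hypothetical benign command line)
print(get_entropy({'command_line': 'cmd.exe /c dir'}))
# Typically higher entropy (hypothetical encoded command)
print(get_entropy({'command_line': 'powershell.exe -enc SQBFAFgAIAAoAE4AZQB3AC0ATwBiAGoA'}))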
  • Using scatter charts to visualize outliers

Charts are very useful tools for threat hunting. In this example, we will use a scatter plot to map the new properties calculated previously and try to detect outliers visually. Scatter plots can help us spot anomalies even across multiple properties, as in this case. We can visualize pandas data using the Plotly library.

import plotly.graph_objects as go
import plotly

fig = go.Figure(data=go.Scatter(x=df['proc_path_len'],
                                y=df['com_line_len'],
                                mode='markers',
                                marker_color=df['cl_entropy'],
                                text=df['process_path']))
fig.update_layout(title='Length vs entropy scatter chart',
                  xaxis_title="process_path length",
                  yaxis_title="log of the command_line length",
                  yaxis_type="log")
fig.show()

Some outliers can be seen clearly. As hunters, we should look into these events and determine whether they are malicious or otherwise significant for the security of the company's devices.

  • Using histogram charts to discover behavioral anomalies

We can use histogram charts to represent the Sysmon process creation data and try to spot anomalies. For example, specific hosts within the environment may execute an unusually high number of processes if an attacker is performing recon on the host or towards the rest of the network. Such an anomaly would show up as a spike in the time series chart. Let’s do it!

import plotly.express as px

fig = px.histogram(df, x="timestamp", color="computer_name", nbins=200)
fig.show()
  • Using unsupervised machine learning to explore the data set

In this example, we will use the DBSCAN clustering Machine Learning algorithm to explore our data set. This unsupervised ML algorithm applied to the new properties we calculated previously can help us identify abnormal event clusters or outliers within our data set that we couldn’t detect visually.

The central concept of the DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density. You can learn more about the DBSCAN algorithm in its Wikipedia article.

Before we start, we need to verify that the columns with numeric properties don’t have null values (NaN). The presence of null values would complicate the ML analysis. If there are null values, we need to fix those before we can proceed further.

df.info()
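
If df.info() reports null values in the numeric columns, one simple sketch is to drop the affected rows before clustering (filling with a neutral value is an alternative, depending on the hunt):

# Drop rows where any of the numeric properties is missing
df = df.dropna(subset=['proc_path_len', 'com_line_len', 'cl_entropy'])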

We can also calculate some initial statistical measures of our numeric variables in the data set.

df.describe()

From the above output, you can derive several important measures like standard deviation, mean, and max of each variable. We can also see that all the variables are pretty much continuous. This is good because it’s complicated to obtain “sound” results with data sets also containing categorical data using distance-based ML algorithms. If discrete variables are present, then they should be transformed to produce meaningful interpretations. More information on unsupervised machine learning using mixed data can be found in this article.

Let’s import our new dependencies first.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
import numpy as np

We need to extract the columns we will use for the ML analysis into a new dataframe.

df_dbscan = df[['proc_path_len', 'com_line_len', 'cl_entropy']]

Then we scale our dataset. Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look like standard normally distributed data.

scaler = StandardScaler() 
df_scaled = scaler.fit_transform(df_dbscan.to_numpy())
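
After scaling, each feature should have approximately zero mean and unit variance, which we can verify directly on the resulting NumPy array:

print(df_scaled.mean(axis=0))  # close to 0 for each feature
print(df_scaled.std(axis=0))   # close to 1 for each feature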

Finally, we can build our DBSCAN clustering model.

db = DBSCAN(eps=0.3, min_samples=10).fit(df_scaled)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
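
DBSCAN labels outliers with the special cluster -1, so a quick summary of the model's output (following the standard scikit-learn recipe) looks like this:

# Number of clusters found, ignoring the noise label (-1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print('Estimated number of clusters:', n_clusters)
print('Estimated number of noise points:', n_noise)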

Let’s visualize the DBSCAN clusters using the average proc_path_len, com_line_len, and cl_entropy per cluster. This helps the image render faster and makes the clusters easier to see.

import plotly.express as px

# Map the cluster labels back to the original dataframe
df['clusters'] = labels
df_grouped = df.groupby(['clusters']).agg({'proc_path_len': ['mean'],
                                           'com_line_len': ['mean'],
                                           'cl_entropy': ['mean']})
df_grouped.columns = df_grouped.columns.droplevel(-1)
df_grouped.reset_index(inplace=True)
fig = px.scatter_3d(df_grouped, x='proc_path_len', y='com_line_len', z='cl_entropy',
                    color='clusters')
fig.show()
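
Because the noise cluster (-1) is where DBSCAN places the outliers, those events are a natural starting point for a manual review:

# Inspect the raw events that DBSCAN flagged as outliers
outliers = df[df['clusters'] == -1]
outliers[['process_path', 'command_line', 'cl_entropy']].head(10)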

In the link below you can find a short animation of the 3D scatter visualization used in this example:

https://video.wixstatic.com/video/0cf544_e3813a9fb8094e52b01d4dcaadde233e/720p/mp4/file.mp4

Next steps

In our next post, we will hunt on other Sysmon event types. It is particularly interesting to explore Sysmon network connection and DNS events, which can help us discover processes within the environment being used for command and control communications, lateral movement, data exfiltration, or attacking other targets over the Internet.

You can follow our work in the Cyber Threat Hunting space on our company website. You can also request more information about our services using our online contact form or write us at sales@falconguard.cz.


Leonardo M. Falcon

Leonardo is a recognized expert and leader in the field of cybersecurity, entrepreneur, and founder at Falcon Guard (https://falconguard.cz)