Hunting on Sysmon events with Jupyter Notebooks (Part 2 - Process Execution)

Sysmon process execution events

Before we start hunting on Sysmon or any other log/data source, we need a good understanding of the format of the data and which fields are useful to us as hunters. Not all the fields in a log source are equally valuable for hunting; some are more relevant than others. For example, the PID field from process execution events is mainly useful during specific incident investigations rather than broad hunts.


Prerequisites

  • Installed and running Sysmon service in a Windows system
  • Installed and running Winlogbeat service on the same device as Sysmon
  • Fully installed and configured Logstash

Explore Sysmon event ID 1 with the event viewer

  • Open the Windows Event Viewer and navigate to “Applications and Services Logs → Microsoft → Windows → Sysmon → Operational”
  • Click on “Filter Current Log…” on the right menu and set the filter to show only events with ID 1
  • Select an event in the middle panel and double click it to display its details in a new window

About pandas

Pandas is a Python library for high-level data manipulation originally developed by Wes McKinney. It is built on top of the NumPy package, and its central data structure is the DataFrame, which stores tabular data in rows of observations and columns of variables. According to the pandas project, its main features include:

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging. Even create domain-specific time offsets and join time series without losing data;
  • Highly optimized for performance, with critical code paths written in Cython or C.
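As a minimal illustration of the DataFrame object (using a couple of made-up process events, not real Sysmon data), the tabular fields we will extract later map naturally onto rows and columns:

```python
import pandas as pd

# A toy DataFrame with two hypothetical process events
df = pd.DataFrame({
    'process_path': ['C:\\Windows\\System32\\svchost.exe',
                     'C:\\Windows\\System32\\cmd.exe'],
    'computer_name': ['WinTest01', 'WinTest01'],
})
print(df.shape)                                       # (2, 2)
print(df['process_path'].str.contains('cmd').sum())   # 1
```

Vectorized string operations like `str.contains` are what make pandas convenient for slicing through process execution logs.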

Slicing through Sysmon data in Jupyter

Before we can access the Sysmon data in our WSL-Ubuntu Jupyter environment, we need to do one more thing. At the moment, WSL cannot natively handle network shares, so we first need to mount them using DrvFs. Follow the steps below to mount your Logstash network share in WSL.

  • Create a local folder to be used as a mount point
  • Mount the remote Logstash samba folder
  • List the contents of the folder to verify that the files are accessible
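The steps above can be sketched as shell commands. The host name `logstash01`, the share name `sysmon-logs`, and the mount point are placeholders; adjust them to your own environment:

```shell
# Create a local folder to use as the mount point
sudo mkdir -p /mnt/sysmon-logs

# Mount the remote Logstash Samba share using the DrvFs file system
# (logstash01 and sysmon-logs are hypothetical names for your setup)
sudo mount -t drvfs '\\logstash01\sysmon-logs' /mnt/sysmon-logs

# Verify that the exported files are accessible
ls -lh /mnt/sysmon-logs
```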

Sysmon process creation hunting playbook

  • Create a new empty Python3 playbook in Jupyter Lab
  • Load the Sysmon log files from the shared drive
import glob
import json

# Collect the Winlogbeat JSON files exported through Logstash
files = []
for f in glob.glob("/mnt/sysmon-logs/winlogbeat-2020-08-24-*.json"):
    files.append(f)

# Each line in the files is a separate JSON event
events = []
for f in files:
    fin = open(f, 'r')
    for line in fin.readlines():
        event = json.loads(line.strip())
        events.append(event)
    fin.close()

# Keep only the Sysmon event ID 1 (process creation) events
evt_id1 = []
for evt in events:
    if evt['winlog']['provider_name'] == 'Microsoft-Windows-Sysmon':
        if evt['winlog']['event_id'] == 1:
            evt_id1.append(evt['winlog'])

# Pretty-print the first event to inspect its structure
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(evt_id1[0])
{   'api': 'wineventlog',
    'channel': 'Microsoft-Windows-Sysmon/Operational',
    'computer_name': 'WinTest01',
    'event_data': {   'CommandLine': '"C:\\Program Files '
                                     '(x86)\\Dropbox\\Update\\DropboxUpdate.exe" '
                                     '/ua /installsource scheduler',
                      'Company': 'Dropbox, Inc.',
                      'CurrentDirectory': 'C:\\WINDOWS\\system32\\',
                      'Description': 'Dropbox Update',
                      'FileVersion': '',
                      'Hashes': 'SHA1=D3A77E94D08F2EB9A8276F32CA16F65D1CE8B524,MD5=A1F58FFF448E4099297D6EE0641D4D0E,SHA256=47839789332AAF8861F7731BF2D3FBB5E0991EA0D0B457BB4C8C1784F76C73DC,IMPHASH=907BD326A444DBC0E31CEF85B0646F45',
                      'Image': 'C:\\Program Files '
                      'IntegrityLevel': 'System',
                      'LogonGuid': '{5a87d633-dc4c-5f34-e703-000000000000}',
                      'LogonId': '0x3e7',
                      'OriginalFileName': 'DropboxUpdate.exe',
                      'ParentCommandLine': 'C:\\WINDOWS\\system32\\svchost.exe '
                                           '-k netsvcs -p -s Schedule',
                      'ParentImage': 'C:\\Windows\\System32\\svchost.exe',
                      'ParentProcessGuid': '{5a87d633-dc4c-5f34-1b00-000000001c00}',
                      'ParentProcessId': '2044',
                      'ProcessGuid': '{5a87d633-16b8-5f3a-314e-000000001c00}',
                      'ProcessId': '30832',
                      'Product': 'Dropbox Update',
                      'RuleName': '-',
                      'TerminalSessionId': '0',
                      'UtcTime': '2020-08-17 05:33:44.304'},
    'event_id': 1,
    'opcode': 'Info',
    'process': {'pid': 6268, 'thread': {'id': 7576}},
    'provider_guid': '{5770385f-c22a-43e0-bf4c-06f5698ffbd9}',
    'provider_name': 'Microsoft-Windows-Sysmon',
    'record_id': 1201012,
    'task': 'Process Create (rule: ProcessCreate)',
    'user': {   'domain': 'NT AUTHORITY',
                'identifier': 'S-1-5-18',
                'name': 'SYSTEM',
                'type': 'User'},
    'version': 5}
  • Preparing events for pandas
header = ['timestamp', 'computer_name', 'process_path', 'parent_path', 'command_line', 'parent_command_line', 'user', 'sha1', 'company', 'description']
events_list = []
for evt in evt_id1:
    try:
        d = evt['event_data']
        # The Hashes field looks like 'SHA1=...,MD5=...,SHA256=...,IMPHASH=...'
        sha1 = d['Hashes'].split('SHA1=')[1].split(',')[0]
        new_evt = [d['UtcTime'], evt['computer_name'], d['Image'], d['ParentImage'],
                   d['CommandLine'], d['ParentCommandLine'], evt['user']['name'],
                   sha1, d['Company'], d['Description']]
        events_list.append(new_evt)
    except KeyError:
        # Skip events that are missing one of the expected fields
        continue
  • Generating a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(events_list, columns=header)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S.%f')
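As a quick sanity check with a couple of made-up rows, the `to_datetime` call above converts the Sysmon `UtcTime` strings into native `datetime64` values, which is what enables time-based filtering and resampling later on:

```python
import pandas as pd

# Two hypothetical events in the same string format Sysmon uses
df_demo = pd.DataFrame([['2020-08-17 05:33:44.304', 'WinTest01'],
                        ['2020-08-17 06:10:02.118', 'WinTest01']],
                       columns=['timestamp', 'computer_name'])
df_demo['timestamp'] = pd.to_datetime(df_demo['timestamp'],
                                      format='%Y-%m-%d %H:%M:%S.%f')
print(df_demo['timestamp'].dtype)  # datetime64[ns]
```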

Baseline hunting

We are now ready to start having some fun :)

  • Top software vendors active in the environment
# Count events per (company, description) pair and sort descending
top_procs = df[['company', 'description']]\
    .groupby(['company', 'description']).size()\
    .rename_axis(['company', 'description'])\
    .reset_index(name='count')\
    .sort_values(by='count', ascending=False)
top_procs.head(10)
  • Find the top devices running unique processes
# Count the number of distinct process paths per device
evd_procs = df.groupby(['computer_name'])\
    .agg({'process_path': 'nunique'})\
    .reset_index()\
    .sort_values(by='process_path', ascending=False)
evd_procs.head(10)
  • Find the top 10 executed processes across all devices
top_procs = df['process_path']\
    .value_counts()\
    .head(10)
top_procs
  • Top 10 processes using the process path and hash in the aggregation
top_procs = df.groupby(['process_path', 'sha1'])\
    .size()\
    .reset_index(name='count')\
    .sort_values(by='count', ascending=False)
top_procs.head(10)
  • Find processes with the same hash but executed from multiple different paths
# A hash seen under several different paths is worth a closer look
evd_procs = df.groupby(['sha1'])\
    .agg({'process_path': 'nunique'})\
    .reset_index()\
    .sort_values(by='process_path', ascending=False)
evd_procs.head(10)
has_hash = df['sha1'] == 'F95ED0E286AA68B4DF779D7E782363EDB5B9FF04'
procs_with_hash = df[has_hash]
procs_with_hash[['process_path', 'command_line']].head(10)
  • Find all PowerShell instances started by a different process than cmd.exe and explorer.exe
# engine='python' is required because query() uses .str accessor methods
ps_hunt = df.query('process_path.str.contains("powershell.exe") & ~parent_path.str.contains("cmd.exe") & ~parent_path.str.contains("explorer.exe") & ~parent_path.str.contains("Program Files")', engine='python')
ps_hunt[['process_path', 'parent_path', 'command_line']].head(10)

Advanced hunting

Let's take our hunt to the next level. We can compute additional properties that extend our original dataset: for example, the length of the process_path and command_line columns, and the entropy of the command_line.

df['proc_path_len'] = df['process_path'].str.len()
df['com_line_len'] = df['command_line'].str.len()
import math

def get_entropy(row):
    # Shannon entropy (in bits) of the command line characters
    cline = str(row['command_line']).replace('"', '')
    prob = [float(cline.count(c)) / len(cline) for c in dict.fromkeys(list(cline))]
    entropy = -sum([p * math.log(p) / math.log(2.0) for p in prob])
    return entropy

df['cl_entropy'] = df.apply(get_entropy, axis=1)
df.head(5)
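To build intuition for why entropy is a useful hunting feature: repetitive strings score near zero, while strings whose characters are close to uniformly distributed (such as encoded PowerShell payloads) score high. A minimal standalone check of the same Shannon formula:

```python
import math

def shannon_entropy(s):
    # Shannon entropy in bits over the character distribution of s
    prob = [s.count(c) / len(s) for c in dict.fromkeys(s)]
    return sum(-p * math.log(p, 2) for p in prob)

print(shannon_entropy('aaaa'))  # 0.0 — a single repeated character
print(shannon_entropy('abab'))  # 1.0 — two equally likely characters
```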
  • Using scatter charts to visualize outliers
import plotly.graph_objects as go
import numpy as np

fig = go.Figure(data=go.Scatter(x=df['proc_path_len'],
                                y=np.log(df['com_line_len']),
                                mode='markers',
                                # color the markers by command-line entropy
                                marker=dict(color=df['cl_entropy'], showscale=True),
                                text=df['command_line']))
fig.update_layout(title='Length vs entropy scatter chart',
                  xaxis_title="process_path length",
                  yaxis_title="log of the command_line length")
fig.show()
  • Using histogram charts to discover behavioral anomalies
import plotly.express as px

fig = px.histogram(df, x="timestamp", color="computer_name", nbins=200)
fig.show()
  • Using unsupervised machine learning to explore the data set
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
import numpy as np
df_dbscan = df[['proc_path_len', 'com_line_len', 'cl_entropy']]
scaler = StandardScaler() 
df_scaled = scaler.fit_transform(df_dbscan.to_numpy())
db = DBSCAN(eps = 0.3, min_samples = 10).fit(df_scaled) 
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
import plotly.express as px

# Map the cluster labels back to the original dataframe
df['clusters'] = labels
df_grouped = df.groupby(['clusters']).agg({'proc_path_len': ['mean'], 'com_line_len': ['mean'], 'cl_entropy': ['mean']})
df_grouped.columns = df_grouped.columns.droplevel(-1)
df_grouped.reset_index(inplace = True)
fig = px.scatter_3d(df_grouped, x='proc_path_len', y='com_line_len', z='cl_entropy',
                    color='clusters')
fig.show()
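DBSCAN is convenient for this kind of exploration because it does not require the number of clusters up front, and it labels low-density outliers as `-1`, which is exactly what we want to surface as hunting leads. A tiny synthetic sketch of that behavior:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic clusters plus one far-away outlier
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [50.0, 50.0]])
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # the isolated point is labeled -1 (noise)
```

In the Sysmon dataset, the rows labeled `-1` are the unusual process_path/command_line combinations worth investigating first.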

Next steps

In our next post, we will hunt on other Sysmon event types. Sysmon network connection and DNS events are particularly interesting to explore: they can help us discover processes in the environment that are used for command-and-control communications, lateral movement, data exfiltration, or attacks against other targets over the Internet.



Leonardo M. Falcon

Leonardo is a recognized expert and leader in the field of cybersecurity, entrepreneur, and founder at Falcon Guard.