Expanding a pandas list column to rows
I quite frequently get pandas DataFrames with columns that contain lists of items, usually the result of data queries converted to a DataFrame. These are a bit painful to process. Imagine a DataFrame like the one below, where multiple IP addresses are packed into a single column.
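To make this concrete, here is a small mock-up of that kind of input (the names and addresses are invented for illustration; I'll refer to this DataFrame as df2 below):

import pandas as pd

# Hypothetical input: each cell in IPAddresses holds a *list* of addresses
df2 = pd.DataFrame(
    {
        "Computer": ["MSTICAlertsWin1", "MSTICAlertsWin2"],
        "IPAddresses": [["10.0.3.4", "10.0.3.5"], ["10.4.5.6"]],
    }
)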
Extracting individual IP addresses to do something like Threat Intelligence or geolocation lookups is tricky, especially in a nice, pandas-like way. In particular, I wanted a solution that I could put into a pandas pipeline without loose bits of Python code cluttering up the place. An example of a pandas pipeline is something like this:
result_df = (
    input_df
    # Do a WhoIs lookup for IPs in the IpAddress column
    .mp_pivot.run(IpAddress.whois, column='IpAddress', join='inner')
    # Display a sample of the alerts
    .mp_pivot.display(
        title='Alerts',
        query='Computer.str.startswith("MSTICAlerts")',
        cols=['Computer', 'Account'],
        head=10
    )
    # Save the current state of the DataFrame to 'var_df' variable
    .mp_pivot.tee(var_name='var_df', clobber=True)
    # Create an event timeline plot but don't change the data
    .mp_pivot.tee_exec(
        'mp_timeline.plot',
        source_columns=['Computer', 'Account']
    )
)
Each of the .mp_pivot functions essentially takes a DataFrame as input from the previous line and outputs a processed DataFrame to be consumed by the next line. You can build up complex processing functions in a single, fluent expression. We use pipelines in the MSTICPy Python package to automate multiple operations on DataFrames. You can read more about how we use pipelines in the MSTICPy documentation on pivot functions.
If we tried to use this pipeline on an input DataFrame containing IP address lists instead of individual values, the WhoIs lookup (the first processing step in the pipeline above) would not work and the pipeline would fail.
A good solution
I've seen a couple of answers to splitting lists into rows, but nothing completely satisfactory. The best is from this post by Bartosz Mikulski: How to split a list inside a Dataframe cell into rows in Pandas. My solution is a modification of his, and most of the credit goes to him. His solution is shown here:
data.ingredients.apply(pd.Series) \
.merge(data, right_index = True, left_index = True) \
.drop(["ingredients"], axis = 1) \
.melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
.drop("variable", axis = 1) \
.dropna()
This works well, using the pandas melt function to unpivot the column values into rows. However, I was a bit wary of using .apply(pd.Series), since its performance can be an issue with large datasets.
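If you want to verify this on your own data, a quick, informal comparison in a Jupyter notebook might look like the sketch below (the benchmark DataFrame here is made up):

import pandas as pd

bench_df = pd.DataFrame({"ingredients": [["a", "b", "c"]] * 10_000})

# In separate notebook cells:
# %timeit bench_df.ingredients.apply(pd.Series)         # builds a Series per row
# %timeit pd.DataFrame(bench_df.ingredients.to_list())  # single DataFrame build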
A slightly better solution
Replacing .apply(pd.Series) with to_list gives us much better performance, but now we have a couple of issues. If the input data has rows with different numbers of list elements, we end up with Python None objects all over the place. We need to get rid of these, but pandas doesn't have any clever way of dropping them efficiently (as it can with NaN values). The pandas replace function can sort this out for us. I've added back Bartosz's merge and melt lines, since these already work perfectly.
import numpy as np
import pandas as pd

orig_cols = df2.columns
(
    pd.DataFrame(df2.IPAddresses.to_list())
    .replace([None], np.nan)
    .merge(df2, right_index=True, left_index=True)
    .melt(id_vars=orig_cols, value_name="IPAddress")
)
One tweak I've added is to replace the explicitly listed column names passed to the id_vars parameter with the original columns from our input data, since we normally want to preserve all columns other than our list column. Finally, to get rid of those annoying NaN rows we can use dropna, but we want to limit its scope to only the newly added column (we don't want columns that happen to have randomly scattered NaNs to disappear on us!):
(
    pd.DataFrame(df2.IPAddresses.to_list())
    .replace([None], np.nan)
    .merge(df2, right_index=True, left_index=True)
    .melt(id_vars=orig_cols, value_name="IPAddress")
    .dropna(subset=["IPAddress"])
    .drop(["IPAddresses"], axis=1)
)
This gives us our expanded and cleaned data.
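With the mock-up df2 from the start of the post, the output would look roughly like this (note that the "variable" column produced by melt is still present at this stage):

          Computer variable IPAddress
0  MSTICAlertsWin1        0  10.0.3.4
1  MSTICAlertsWin2        0  10.4.5.6
2  MSTICAlertsWin1        1  10.0.3.5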
Making the function generic
I said earlier that I wanted to use this functionality in pipelines, where the function will have no idea what the specific column names in the input data are going to be. First, here is a generic, stand-alone version of the function. Next we'll look at turning this into a pandas accessor.
from typing import Iterable, Union

import numpy as np
import pandas as pd


def list_to_rows(data, cols: Union[str, Iterable[str]]):
    """Expand a list column to individual rows."""
    orig_cols = data.columns
    if isinstance(cols, str):
        cols = [cols]
    for col in cols:
        item_col = f"{col}_list_item$$"
        ren_col = {item_col: col}
        data = (
            pd.DataFrame(data[col].to_list())
            .replace([None], np.nan)  # convert any Nones to NaN
            .merge(data, right_index=True, left_index=True)
            .melt(id_vars=orig_cols, value_name=item_col)
            .dropna(subset=[item_col])  # get rid of rows with NaNs
            # drop our unneeded src column plus the melt "variable"
            .drop([col, "variable"], axis=1)
            .rename(columns=ren_col)
        )
    return data
It supports specifying a single column or multiple columns to expand; if multiple columns are given, it just re-applies the logic iteratively to each column. If a column turns out not to contain lists, to_list simply returns the scalar values and pd.DataFrame wraps them in a single column. This results in a bit of unnecessary processing but doesn't break anything.
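As a quick usage sketch with the mock-up df2 from earlier (the 'Accounts' column in the second call is hypothetical, just to show the multi-column form):

# Single column
df_ips = list_to_rows(df2, cols="IPAddresses")

# Multiple columns - assumes the DataFrame also had an 'Accounts' list column
# df_expanded = list_to_rows(df2, cols=["IPAddresses", "Accounts"])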
However, if the column contains a mix of lists and strings, you'll have a problem. In this case, I'd suggest pre-processing the DataFrame to split it into one set of rows with list items and another with strings:
is_list_flt = df2.IPAddresses.apply(lambda x: isinstance(x, list))
df_with_lists = df2[is_list_flt]
df_with_strs = df2[~is_list_flt]
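Having split the data, a minimal sketch of expanding one half and stitching the results back together (assuming the list_to_rows function defined above) could be:

# Expand only the rows holding lists, then recombine with the string rows
df_combined = pd.concat(
    [list_to_rows(df_with_lists, cols="IPAddresses"), df_with_strs],
    ignore_index=True,
)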
To do this kind of logic in a single pandas expression you might consider using the pandas where and mask functions, but a) that is more thinking than I care to do at the moment, and b) it is overkill for my needs.
As a pandas accessor method
To use this function in a pipeline, I need to turn it into a pandas accessor function; you can read more about these in the pandas documentation here. Note that there are alternatives to using accessors (see the pandas pipe function, for example), but since we already have pandas accessor classes in our library, this is the most convenient option.
def list_to_rows(self, cols: Union[str, Iterable[str]]):
    """Expand a list column to individual rows."""
    orig_cols = self._df.columns
    data = self._df
    if isinstance(cols, str):
        cols = [cols]
    for col in cols:
        item_col = f"{col}_list_item$$"
        ren_col = {item_col: col}
        data = (
            pd.DataFrame(data[col].to_list())
            .replace([None], np.nan)  # convert any Nones to NaN
            .merge(data, right_index=True, left_index=True)
            .melt(id_vars=orig_cols, value_name=item_col)
            .dropna(subset=[item_col])  # get rid of rows with NaNs
            # drop our unneeded src column plus the melt "variable"
            .drop([col, "variable"], axis=1)
            .rename(columns=ren_col)
        )
    return data
This method is part of a pandas accessor class — the class stores the input DataFrame in the self._df variable.
To use it as an accessor, add it as a method to a class like the one below. You need to apply the pandas accessor API class decorator and give the accessor a name:
@pd.api.extensions.register_dataframe_accessor("my_accessor")
class PivotAccessor:
    """Pandas api extension for Pivot functions."""

    def __init__(self, pandas_obj):
        """Instantiate pivot extension class."""
        self._df = pandas_obj

    def list_to_rows(self, cols: Union[str, Iterable[str]]):
        ...
If you import this file, pandas will magically add a "my_accessor" attribute to all DataFrames, with the accessor functions exposed as methods of that attribute. Now you can call list_to_rows on any pandas DataFrame:
df.my_accessor.list_to_rows(cols="IPAddresses")
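And, coming back to the original motivation, the accessor can now slot into a pipeline ahead of the lookup step. A hypothetical sketch, reusing the pipeline from the start of the post:

result_df = (
    input_df
    # Expand the list column so each row has a single address
    .my_accessor.list_to_rows(cols="IpAddress")
    # Do a WhoIs lookup for IPs in the IpAddress column
    .mp_pivot.run(IpAddress.whois, column='IpAddress', join='inner')
)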
Conclusion
I hope you took something useful from my first ever individual blog post. I’ll try to keep adding items about Python, pandas and other technologies that I use every day. If you’re interested in the use of Python, Jupyter notebooks and pandas for CyberSecurity, please check out MSTICPy.
Follow me on Twitter: ianhellen