Expanding a pandas list column to rows

I quite frequently get pandas DataFrames that are lists of items — usually these come from data queries converted to a DataFrame. These are a bit painful to process — imagine a DataFrame like this one where multiple IP addresses are packed into a single column.

Extracting individual IP address to do something like Threat Intelligence or geo-location lookups is tricky to do — especially in a nice, pandas-like way. In particular, I wanted a solution that I could put into a pandas pipeline without loose bits of Python code cluttering up the place. An example of a pandas pipeline is something like this.

result_df = (
input_df
# Do a WhoIs lookup for IPs in the IpAddress column
.mp_pivot.run(IpAddress.whois, column='IpAddress', join='inner')
# Display a sample of the alerts
.mp_pivot.display(
title='Alerts',
query='Computer.str.startswith('MSTICAlerts')',
cols=['Computer', 'Account'],
head=10
)
# Save the current state of the DataFrame to 'var_df' variable
.mp_pivot.tee(var_name='var_df', clobber=True)
# Create an event timeline plot but don't change the data
.mp_pivot.tee_exec(
'mp_timeline.plot',
source_columns=['Computer', 'Account']
)
)

Each of the .mp_pivot functions essentially takes a DataFrame as input from the previous line and outputs a processed DataFrame to be consumed by the next line. You can build up complex processing functions in a single, fluent expression. We use pipelines in the MSTICPy Python package to automate multiple operations on DataFrames. You can read more about how we use pipelines in the MSTICPy documentation on pivot functions.

If we were to try to use this pipeline with an input DataFrame with IP address lists instead of individual values, the WhoIs lookup (second line of the pipeline above) would not work and our pipeline would fail.

I’ve seen a couple of answers to splitting lists into rows but nothing completely satisfactory. The best answer I’ve seen is from this post from Bartosz MikulskiHow to split a list inside a Dataframe cell into rows in Pandas. My solution is modification of his and most credit goes to him. His solution is shown here:

data.ingredients.apply(pd.Series) \
.merge(data, right_index = True, left_index = True) \
.drop(["ingredients"], axis = 1) \
.melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
.drop("variable", axis = 1) \
.dropna()

This works well using the pandas melt function to extract out column values to rows. However, I was a bit wary of using .apply(pd.Series) since the performance of this can be an issue with large datasets.

Replacing .apply(pd.Series) with to_list gives us much better performance but now have we a couple of issues.

If the input data has rows with different numbers of list elements, we end up with Python None objects all over the place. We need to get rid of these but pandas doesn’t have any clever way of dropping them efficiently (as it can with NaN values). The pandas replace function can sort this out for us. I’ve added back Bartosz’ merge and melt lines since already these work perfectly.

orig_cols = df2.columns
(
pd.DataFrame(df2.IPAddresses.to_list())
.replace([None], np.nan)
.merge(df2, right_index=True, left_index=True)
.melt(id_vars=orig_cols, value_name="IPAddress")
)

One tweak I’ve added is to replace the explicitly listed column names passed to the id_vars parameter with the original columns from our input data, since we normally want to preserve all columns other than our list column.

Finally, to get rid of those annoying NaN rows we can use dropna — but we want to limit its scope to only the newly added column (we don’t want columns that happen to have randomly scattered NaNs to disappear on us!)

(
pd.DataFrame(df2.IPAddresses.to_list())
.replace([None], np.nan)
.merge(df2, right_index = True, left_index = True)
.melt(id_vars=orig_cols, value_name="IPAddress")
.dropna(subset=["IPAddress"])
.drop(["IPAddresses"], axis=1)
)

This gives us our expanded and cleaned data.

I said earlier that I wanted to use this functionality in pipelines, where the function will have no idea of what the specific column names in the input data are going to be. First here is a generic, stand-alone version of the function. Next we’ll look at turning this into a pandas accessor.

def list_to_rows(data, cols: Union[str, Iterable[str]]):
"""Expand a list column to individual rows."""
orig_cols = data.columns
if isinstance(cols, str):
cols = [cols]
for col in cols:
item_col = f"{col}_list_item$$"
ren_col = {item_col: col}
data = (
pd.DataFrame(data[col].to_list())
.replace([None], np.nan) # convert any Nones to NaN
.merge(data, right_index=True, left_index=True)
.melt(id_vars=orig_cols, value_name=item_col)
.dropna(subset=[item_col]) # get rid of rows with NaNs
# drop our unneeded src column plus the melt "variable"
.drop([col, "variable"], axis=1)
.rename(columns=ren_col)
)
return data

It supports specifying multiple or single columns to expand — if multiple columns are given, it just re-applies the logic iteratively to each column. If a column turns out not to be a list, to_list produces a single element list. This results a bit of unnecessary processing but doesn’t break anything.

However, if the column contains a mix of lists and strings, you’ll have a problem — in this case, I’d suggest pre-processing the DataFrame to split it into rows with list items and another with strings.

is_list_flt = df2.IPAddresses.apply(lambda x: isinstance(x, list))
df_with_lists = df2[is_list_flt]
df_with_strs = df2[~is_list_flt]

To do use this kind of logic in a single pandas expression you might consider using pandas where and mask functions but a) is more thinking than I care to do at the moment and b) is overkill for my needs.

To use this function in a pipeline, I need to turn it into a pandas accessor function — you can read more about these in the pandas documentation here. Note, there are alternatives to using accessors— see the pandas pipe function, for example — but since we already have pandas accessor classes in our library, this is the most convenient option.

def list_to_rows(self, cols: Union[str, Iterable[str]]):
"""Expand a list column to individual rows."""
orig_cols = self._df.columns
data = self._df
if isinstance(cols, str):
cols = [cols]
for col in cols:
item_col = f"{col}_list_item$$"
ren_col = {item_col: col}
data = (
pd.DataFrame(data[col].to_list())
.replace([None], np.nan) # convert any Nones to NaN
.merge(data, right_index=True, left_index=True)
.melt(id_vars=orig_cols, value_name=item_col)
.dropna(subset=[item_col]) # get rid of rows with NaNs
# drop our unneeded src column plus the melt "variable"
.drop([col, "variable"], axis=1)
.rename(columns=ren_col)
)
return data

This method is part of a pandas accessor class — the class stores the input DataFrame in the self._df variable.

To use as an accessor, add it as method to a class like this. You need to apply the pandas accessor API class decorator and give the accessor a name.

@pd.api.extensions.register_dataframe_accessor("my_accessor")
class PivotAccessor:
"""Pandas api extension for Pivot functions."""
def __init__(self, pandas_obj):
"""Instantiate pivot extension class."""
self._df = pandas_obj
def list_to_rows(self, cols: Union[str, Iterable[str]]):
...

If you import this file, pandas will magically add a “my_accessor” attribute to all DataFrames, with the accessor functions exposed as methods of that attribute. Now you can use “list_to_rows” on any pandas DataFrame.

df.my_accessor.list_to_rows(cols="IPAddresses")

I hope you took something useful from my first ever individual blog post. I’ll try to keep adding items about Python, pandas and other technologies that I use every day. If you’re interested in the use of Python, Jupyter notebooks and pandas for CyberSecurity, please check out MSTICPy.

Follow me on twitter — ianhellen

Principal security engineer @ Microsoft. Python and Jupyter notebooks in Cybersecurity.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store