Lessons from the Enron E-Mail dataset
2024-08-19 | tags: thoughts
A wall of numbered mailboxes.  Photo by Joel Dunn on Unsplash.

Someone recently brought up the Enron E-Mail dataset in conversation. It is a corpus of ~600k (!) e-mails gathered as part of the Enron investigation in the early 2000s.

Working through the Enron dataset way back when was solid practice in data munging, NLP, network analysis, the whole lot.

For me, the greatest and most lasting lesson was:

Real-World Data Is Messy

You can't get to the "fun" analysis till you've sorted that out.

From that we get two important sub-lessons:

1/ it's important to get to know a dataset (actually sift through it) before you try to use it

2/ thank your data engineers
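
To make the first sub-lesson concrete, here is a minimal sketch of what "sifting through it" might look like before any real analysis. It uses Python's standard `email` module; the `sift` function, the headers it checks, and the two sample messages are all made up for illustration and are not taken from the Enron corpus or from any particular workflow.

```python
from collections import Counter
from email import message_from_string


def sift(raw_messages):
    """Tally basic facts about raw e-mail text before doing any analysis."""
    stats = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        stats["total"] += 1
        # Count missing or empty headers instead of assuming they exist.
        for header in ("From", "To", "Subject", "Date"):
            if not msg.get(header):
                stats[f"missing_{header.lower()}"] += 1
        # For a simple, non-multipart message the payload is a string.
        body = msg.get_payload()
        if isinstance(body, str) and not body.strip():
            stats["empty_body"] += 1
    return stats


# Two invented messages (hypothetical data, not from the real dataset).
samples = [
    "From: a@example.com\nTo: b@example.com\nSubject: Hi\n\nHello there.\n",
    "From: c@example.com\n\n",
]
report = sift(samples)
for key, count in sorted(report.items()):
    print(key, count)
```

Even a tally this crude tends to surface surprises (missing recipients, blank bodies, absent dates) that would otherwise show up halfway through the "fun" part.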

For more details on the dataset, you can check out its Wikipedia entry and a Technology Review article from 2013.
