Table of <filename messageID>, table of <messageID personNames>, and table of harshness scores generated from the Enron email corpus

Table of harshness scores of Enron emails generated by OASYS: harshness.enron (36MB).
Each row has a filename (followed by a colon), a score computed by the DWTF algorithm, a score by the Template_No_Topic algorithm, a score by the TF_No_Topic algorithm, and a hybrid score computed from the preceding three scores. Personally, I doubt the accuracy of the hybrid scores, but based on my spot checks I more or less trust the TF_No_Topic scores. (Reference: Cesarano, Bonnie Dorr, Antonio Picariello, Diego Reforgiato, Amelia Sagoff, V.S. Subrahmanian (2006), OASYS: An Opinion Analysis System. AAAI-CAAW 2006, Palo Alto, CA.)
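A minimal sketch of reading one row of the harshness table, in case it helps. It assumes the four scores after the colon are separated by whitespace and parse as numbers; the actual delimiter is not stated here, so check it against the file. The names HarshnessRow and parse_harshness_line are hypothetical, and the sample values in the usage note are made up.

```python
from typing import NamedTuple


class HarshnessRow(NamedTuple):
    """One row of the harshness table (field names are illustrative)."""
    filename: str
    dwtf: float
    template_no_topic: float
    tf_no_topic: float
    hybrid: float


def parse_harshness_line(line: str) -> HarshnessRow:
    # Split at the LAST colon, so a filename containing a colon still works.
    filename, sep, rest = line.rpartition(":")
    if not sep:
        raise ValueError(f"no colon found in row: {line!r}")
    # Assumption: the four scores are whitespace-separated numbers.
    scores = [float(token) for token in rest.split()]
    if len(scores) != 4:
        raise ValueError(f"expected 4 scores, got {len(scores)}: {line!r}")
    return HarshnessRow(filename.strip(), *scores)
```

For example, parse_harshness_line("/allen-p/inbox/1.: 0.1 0.2 0.3 0.4") yields a row whose tf_no_topic field is 0.3 (made-up numbers, not taken from the table).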

Table of <filename messageID> and table of <messageID personNames> generated from the Enron email corpus.

Table of <messageID personNames>: enronMsgIDPersonNames.txt (52M). There are duplicated messageIDs in the table because multiple emails may share the same messageID. That is, messageIDs are not unique identifiers of email messages; filenames are.
Here is the list of messageIDs that have duplicates and their corresponding filenames: duplicateMsgIDTable.txt
Here is the table of <filename messageID> with duplicated messageIDs: enronFilenameMsgIDTable.txt(35M).
Here is the table of <filename messageID> without duplicated messageIDs: enronFilenameMsgIDTableNoDup.txt (35M). Duplicated messageIDs were removed according to the order in which the system read the Enron corpus's "maildir" directory and processed its subdirectories. That is, if a messageID has already been used by a message (identified by the system as a filename, such as /allen-p/inbox/1.), any file with the same messageID that the system reads later is ignored. Note that the system does not necessarily read the "allen-p" subdirectory before the "martin-t" subdirectory.
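The first-seen rule described above can be sketched as follows. This is an illustration only, not the code that produced enronFilenameMsgIDTableNoDup.txt; the function name drop_duplicate_message_ids is hypothetical, and the input is assumed to be (filename, messageID) pairs already listed in the order the system read them.

```python
from typing import Iterable, List, Tuple


def drop_duplicate_message_ids(
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Keep only the first filename seen for each messageID."""
    seen = set()   # messageIDs already claimed by an earlier file
    kept = []      # surviving (filename, messageID) pairs, in read order
    for filename, message_id in pairs:
        if message_id in seen:
            continue  # a later file with the same messageID is ignored
        seen.add(message_id)
        kept.append((filename, message_id))
    return kept
```

Because the result depends on read order, running this over the same files listed in a different directory order can keep a different filename for a duplicated messageID.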

Lists mapping from email addresses to mentioned names. These are extracted from NameSearchAddress.out, filtered down ONLY to cases in which there is a single unique email address for the mentioned name (i.e., not zero, and not more than one).

mapping from Enron email address (i.e., @enron.com) to names: NameSearchAddr1To1Enron.sort
mapping from non-Enron email address to names: NameSearchAddr1To1NonEnron.sort

One to one mapping from Enron email address (i.e., @enron.com) to names: fromNameSearchAddr1To1Enron.filtered
One to one mapping from non-Enron email address to names: fromNameSearchAddr1To1NonEnron.filtered
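As a rough sketch of the filtering rule stated above (keep a mentioned name only when exactly one unique email address is found for it), assuming the input is available as (name, address) pairs. The layout of NameSearchAddress.out is not described here, so reading that file is left out. The function name split_single_address_names is hypothetical, and comparing addresses case-insensitively is an assumption.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def split_single_address_names(
    pairs: Iterable[Pair],
) -> Tuple[List[Pair], List[Pair]]:
    """Return (enron, non_enron) lists of (address, name) pairs."""
    addresses_by_name = defaultdict(set)
    for name, address in pairs:
        # Assumption: addresses differing only in case are the same address.
        addresses_by_name[name].add(address.lower())

    enron, non_enron = [], []
    for name, addresses in sorted(addresses_by_name.items()):
        if len(addresses) != 1:
            continue  # more than one address for this name: drop it
        (address,) = addresses
        target = enron if address.endswith("@enron.com") else non_enron
        target.append((address, name))
    return enron, non_enron
```

Names with zero addresses never appear as pairs in this sketch, so only the "more than one" case needs an explicit check.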

Enron email corpus annotated by LingPipe: Annotated Enron corpus (496M).

Yejun Wu (wuyj AT glue DOT umd DOT edu)
3/31/06