Email Messages Corpus Parsed from W3C Lists (for TRECENT 2005)
The W3C lists were parsed in three steps, so my parsed corpus of email messages consists of three parts (see below).
To get an integrated parsed corpus, please unzip and extract all three parts into the same folder. Each email
message is stored as plain text. Each parsed email message has a "header" section like this:
received="Wed May 13 22:14:07 1998"
sent="Wed, 13 May 1998 21:16:46 -0500"
subject="modular HTML DTD [was: My future of HTML position paper]"
For part 1 and part 2, the first 12 lines were taken directly from the "comment" section of the original
html/xml file, whereas the "To:" and "CC:" lines were parsed from the corresponding "To" and "CC" elements of that file.
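As a rough illustration of how the name="value" header lines shown above could be read back, here is a minimal Python sketch. It is not part of the original tooling. The function name parse_header is hypothetical, and the sketch assumes that a blank line ends the header; lines that do not have the name="value" form (for example the "To:" and "CC:" lines) are simply skipped.

```python
import re

# One header line of the form name="value",
# e.g. sent="Wed, 13 May 1998 21:16:46 -0500".
HEADER_LINE = re.compile(r'^([A-Za-z][\w-]*)="(.*)"\s*$')


def parse_header(text):
    """Return a dict of the name="value" lines found before the first blank line.

    Assumption (not stated in the corpus description): a blank line
    separates the header from the message body.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            break  # assumed end of the header section
        match = HEADER_LINE.match(line)
        if match is not None:
            # The greedy (.*) keeps any double quotes inside the value.
            fields[match.group(1)] = match.group(2)
    return fields


sample = '''received="Wed May 13 22:14:07 1998"
sent="Wed, 13 May 1998 21:16:46 -0500"
subject="modular HTML DTD [was: My future of HTML position paper]"

Body text starts here.'''

header = parse_header(sample)
print(header["subject"])
```

If the real files do not separate header and body with a blank line, the loop could instead be limited to the first 12 lines mentioned above.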
The total number of emails is 174,311.
- part 1 has 161,645 email messages. This part was generated using a Java
SAXParser. A parsed file may contain strings of question marks (e.g., ??????), which typically stand in for
Chinese-Japanese-Korean (CJK) characters, since my parser does not recognize CJK characters.
Please remove lists-106-9062592.txt~ and add the following to this part (see Paul Thomas's email at:
- part 2 has 9,841 email messages. This part was generated with lynx and a
Perl script because my SAXParser failed to parse these messages. An original html/xml file of this part
typically has the aforementioned "header" section. The bottom of a parsed email message
may occasionally contain an unwanted line (typically starting with an asterisk) that is external to the email message.
- part 3 has 2,813 email messages. This part was also generated with lynx and
a Perl script. The original html files present X-Mail messages, so they have a different header structure from those of
part 1 and part 2. The header of a parsed email message has been normalized as far as possible to resemble
that of part 1 and part 2, but it may occasionally contain one or two lines of X-Mail metadata. The bottom of
a parsed email message may occasionally contain an unwanted line (typically starting with asterisks) that is
external to the email message. Several email messages with an empty value for the "X-From" field have been removed.
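The two artifacts described in the parts above (runs of question marks and a stray trailing line starting with an asterisk) can be handled with a small post-processing step. The following Python sketch is only one possible approach and is not part of the original tooling. The function names are hypothetical, and the threshold of three consecutive question marks is an assumption.

```python
import re

# Three or more consecutive '?' are treated as a likely unrecognized
# (e.g. CJK) sequence. The threshold of three is an assumption.
QUESTION_RUN = re.compile(r"\?{3,}")


def has_unrecognized_run(text):
    """Return True if the text contains a run such as '??????'."""
    return QUESTION_RUN.search(text) is not None


def _drop_trailing_blank_lines(lines):
    while lines and not lines[-1].strip():
        lines.pop()


def strip_trailing_asterisk_line(text):
    """Remove the last non-blank line if it starts with an asterisk."""
    lines = text.split("\n")
    _drop_trailing_blank_lines(lines)
    if lines and lines[-1].lstrip().startswith("*"):
        lines.pop()
        _drop_trailing_blank_lines(lines)
    return "\n".join(lines)


message = "Hello,\nsee you at the meeting.\n\n* stray line from the archive page\n"
print(strip_trailing_asterisk_line(message))
print(has_unrecognized_run("Re: ?????? question"))
```

Note that this check cannot tell an archive artifact from a genuine last line that happens to start with an asterisk (for example a bullet item), so it is safer to log the removed lines and review them than to delete them blindly.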
If you have an alternative implementation for parsing the W3C corpus (e.g., using NekoHtml), I would be very interested in it.
Questions and Comments? Please contact wuyj AT glue DOT umd DOT edu, and wuyj AT umiacs DOT umd DOT edu. Thanks.
College of Information Studies,
University of Maryland Institute of Advanced Computer Studies,
University of Maryland, College Park