Parsed W3C Lists Corpus (for TRECENT 2005)

Email Messages Corpus Parsed from W3C Lists (for TRECENT 2005)

Three steps were taken to parse the W3C lists, so my parsed corpus of email messages consists of 3 parts (see below). To get an integrated parsed corpus, please unzip the parts and extract them into the same folder. Each email message is stored as plain text. Each parsed email message has a "header" section like this:

    docno="lists-000-0346833"
    received="Wed May 13 22:14:07 1998"
    isoreceived="19980514021407"
    sent="Wed, 13 May 1998 21:16:46 -0500"
    isosent="19980514021646"
    name="Dan Connolly"
    email="connolly@w3.org"
    subject="modular HTML DTD [was: My future of HTML position paper]"
    id="355A540E.25C9@w3.org"
    charset="us-ascii"
    inreplyto="88256604.000781D5.00@d53mta01.boulder.ibm.com"
    expires="-1"
 
    To:singer@almaden.ibm.com
    CC:html-future@w3.org

For part 1 and part 2, the first 12 lines were directly taken from the "comment" section of an original html/xml file, but the "To:" and "CC:" lines were parsed from its corresponding "To" and "CC" elements.
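
The header layout above can be read with a few lines of Python. This is only a minimal sketch, not the parser used to build the corpus: the function name `parse_message` is hypothetical, and it assumes that the key="value" lines come first, that the "To:" and "CC:" lines follow after a blank line, and that the message body comes after those. It does not handle the extra X-Mail metadata lines that may appear in part 3.

```python
import re

# One header line of the form key="value", e.g. docno="lists-000-0346833".
HEADER_LINE = re.compile(r'^(\w+)="(.*)"$')


def parse_message(text):
    """Split a parsed message into (fields, recipients, body).

    Assumed layout (not confirmed by the corpus author): key="value" lines,
    a blank line, optional To:/CC: lines, then the body.
    """
    lines = text.splitlines()
    fields, recipients = {}, {}
    i = 0

    # Collect the leading key="value" lines.
    while i < len(lines):
        match = HEADER_LINE.match(lines[i].strip())
        if not match:
            break
        fields[match.group(1)] = match.group(2)
        i += 1

    # Skip the blank (or whitespace-only) separator lines.
    while i < len(lines) and not lines[i].strip():
        i += 1

    # Collect the To: and CC: lines, if present.
    while i < len(lines):
        key, sep, value = lines[i].strip().partition(":")
        if not sep or key not in ("To", "CC"):
            break
        recipients[key] = value.strip()
        i += 1

    # Everything that remains is treated as the message body.
    body = "\n".join(lines[i:]).strip()
    return fields, recipients, body
```

For the sample header shown above, `fields["docno"]` would be "lists-000-0346833" and `recipients["CC"]` would be "html-future@w3.org".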

Part 1 has 161,645 email messages. This part was generated using a Java SAXParser. A parsed file may contain strings of question marks (e.g., ??????), which typically stand in for Chinese, Japanese, or Korean (CJK) characters, since my parser does not recognize CJK characters.
Please remove lists-106-9062592.txt~ and add the following messages to this part (see Paul Thomas's email at http://cio.nist.gov/esd/emaildir/lists/trec-ent/msg00041.html):
lists-092-2385725
lists-092-2387937
lists-092-3565216
lists-092-4721189
lists-100-7183374
lists-108-11336283
lists-108-11658647
lists-110-4324215
lists-110-4707268
lists-111-16545247
lists-111-5981159
lists-111-5985916

Part 2 has 9,841 email messages. This part was generated with lynx and a perl script because my SAXParser failed to parse these messages. An original html/xml file in this part typically has the "header" section described above. The bottom of a parsed email message may occasionally contain an unwanted line (typically starting with an asterisk) that is external to the email message.

Part 3 has 2,813 email messages. This part was also generated with lynx and a perl script. The original html files present X-Mail messages, so their header structure differs from that of part 1 and part 2. The header of each parsed email message has been normalized as far as possible to resemble that of part 1 and part 2, but it may occasionally contain one or two lines of X-Mail metadata. The bottom of a parsed email message may occasionally contain an unwanted line (typically starting with asterisks) that is external to the email message. Several email messages with an empty value for the "X-From" field have been removed.
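
The two artifacts mentioned above (runs of question marks in part 1, and a stray trailing line starting with an asterisk in part 2 and part 3) can be flagged or cleaned with simple heuristics. The following is a rough sketch under my own assumptions: both function names are hypothetical, the threshold of three consecutive question marks is arbitrary, and the trailing-line rule will also drop a legitimate last line that happens to start with an asterisk, so results should be checked by hand.

```python
import re

# Three or more consecutive question marks; the threshold is an assumption.
QUESTION_RUN = re.compile(r"\?{3,}")


def has_question_mark_run(text):
    """Return True if the text contains a run like '??????'."""
    return QUESTION_RUN.search(text) is not None


def strip_trailing_asterisk_line(text):
    """Drop the last non-blank line if it starts with an asterisk.

    Heuristic only: a genuine final line starting with '*' is removed too.
    """
    lines = text.rstrip().splitlines()
    if lines and lines[-1].lstrip().startswith("*"):
        lines.pop()
    return "\n".join(lines).rstrip() + "\n"
```

A message ending in a line such as "* some external link" would lose that line, while a message without such a line is returned unchanged apart from normalized trailing whitespace.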

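The total number of emails is 174,311. This equals the three part counts (161,645 + 9,841 + 2,813 = 174,299) plus the 12 messages added to part 1.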
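
After extracting all parts into one folder, the total can be checked with a short script. This is a sketch under assumptions that are mine, not stated above: that every message is a file named like lists-092-2385725.txt (inferred from the backup file name lists-106-9062592.txt~), that all files sit directly in one flat folder, and that the stated total includes the 12 added messages, which the arithmetic suggests. The name `count_messages` is hypothetical.

```python
from pathlib import Path

# Counts as stated in this document.
PART_COUNTS = {"part 1": 161_645, "part 2": 9_841, "part 3": 2_813}
ADDED_TO_PART_1 = 12
EXPECTED_TOTAL = 174_311


def count_messages(folder):
    """Count files named lists-*.txt directly inside the folder.

    Backup files such as lists-106-9062592.txt~ do not match the pattern
    and are therefore not counted.
    """
    return sum(1 for path in Path(folder).glob("lists-*.txt") if path.is_file())


# The part counts plus the added messages reproduce the stated total.
assert sum(PART_COUNTS.values()) + ADDED_TO_PART_1 == EXPECTED_TOTAL
```

Comparing `count_messages("path/to/corpus")` with `EXPECTED_TOTAL` gives a quick indication of whether any part is missing or the backup file is still present under a different name.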

If you have an alternative implementation for parsing the W3C corpus (e.g., one based on NekoHtml), I would be very interested in it.

Questions or comments? Please contact wuyj AT glue DOT umd DOT edu or wuyj AT umiacs DOT umd DOT edu. Thanks.

Yejun Wu
College of Information Studies,
University of Maryland Institute for Advanced Computer Studies,
University of Maryland, College Park
6/27/05