Email Messages Corpus Parsed from W3C Lists (for TRECENT 2005)

The W3C lists were parsed in three steps, so my parsed corpus of email messages consists of three parts (see below). To get the integrated parsed corpus, please unzip all three parts and extract them into the same folder. Each email message is stored as plain text and begins with a "header" section like this:

    docno="lists-000-0346833"
    received="Wed May 13 22:14:07 1998"
    isoreceived="19980514021407"
    sent="Wed, 13 May 1998 21:16:46 -0500"
    isosent="19980514021646"
    name="Dan Connolly"
    email="connolly@w3.org"
    subject="modular HTML DTD [was: My future of HTML position paper]"
    id="355A540E.25C9@w3.org"
    charset="us-ascii"
    inreplyto="88256604.000781D5.00@d53mta01.boulder.ibm.com"
    expires="-1"
 
    To:singer@almaden.ibm.com
    CC:html-future@w3.org

For parts 1 and 2, the first 12 lines were taken directly from the "comment" section of the original HTML/XML file, while the "To:" and "CC:" lines were parsed from its corresponding "To" and "CC" elements. The total number of email messages is 174,311.
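
The header layout described above can be read back with a few lines of code. The following is only a minimal sketch in Python, not the tool used to build the corpus. It assumes each message starts with key="value" lines, followed by an optional blank line, optional "To:" and "CC:" lines, and then the message text. The function name parse_message and the sample body text are hypothetical.

```python
import re

# One key="value" header line, e.g. docno="lists-000-0346833".
# The greedy (.*) keeps any quotes that appear inside the value.
FIELD_RE = re.compile(r'^\s*(\w+)="(.*)"\s*$')
# A recipient line, e.g. To:singer@almaden.ibm.com
RECIPIENT_RE = re.compile(r'^\s*(To|CC):(.*)$')


def parse_message(text):
    """Split one parsed message into header fields, recipients and body.

    Assumed layout (taken from the sample header only): key="value"
    lines, a blank line, "To:"/"CC:" lines, then the message text.
    """
    lines = text.splitlines()
    fields = {}
    recipients = {"To": "", "CC": ""}
    i = 0

    # 1. Leading key="value" lines.
    while i < len(lines):
        match = FIELD_RE.match(lines[i])
        if not match:
            break
        fields[match.group(1)] = match.group(2)
        i += 1

    # 2. Skip blank (or whitespace-only) separator lines.
    while i < len(lines) and not lines[i].strip():
        i += 1

    # 3. "To:" and "CC:" lines, if present.
    while i < len(lines):
        match = RECIPIENT_RE.match(lines[i])
        if not match:
            break
        recipients[match.group(1)] = match.group(2).strip()
        i += 1

    # 4. Whatever remains is treated as the message body.
    body = "\n".join(lines[i:]).strip()
    return fields, recipients, body


# Sample header from this README; the body line is made up.
SAMPLE = '''docno="lists-000-0346833"
received="Wed May 13 22:14:07 1998"
isoreceived="19980514021407"
sent="Wed, 13 May 1998 21:16:46 -0500"
isosent="19980514021646"
name="Dan Connolly"
email="connolly@w3.org"
subject="modular HTML DTD [was: My future of HTML position paper]"
id="355A540E.25C9@w3.org"
charset="us-ascii"
inreplyto="88256604.000781D5.00@d53mta01.boulder.ibm.com"
expires="-1"
 
To:singer@almaden.ibm.com
CC:html-future@w3.org

Hypothetical body text.
'''

fields, recipients, body = parse_message(SAMPLE)
print(fields["docno"], recipients["To"], recipients["CC"])
```

The 12 header fields come back as a dictionary keyed by field name, so the inreplyto and id values can be used to link a reply to its parent message.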

If you have an alternative implementation for parsing the W3C corpus (e.g., one using NekoHtml), I would be very interested in it.

Questions or comments? Please contact wuyj AT glue DOT umd DOT edu or wuyj AT umiacs DOT umd DOT edu. Thanks.

Yejun Wu
College of Information Studies,
University of Maryland Institute for Advanced Computer Studies,
University of Maryland, College Park
6/27/05