Abstract—Because emails contain private information, it is hard to acquire enough authentic data to build an email corpus. Email headers, however, do not include email bodies and are therefore less privacy-sensitive. An email header contains the recipient, the sender, and other key information about the email sending process, which makes it highly valuable for related research. This paper proposes, for the first time, a method for building a corpus of email headers. The idea is to obscure sensitive data by hashing it with the Secure Hash Algorithm (SHA) when collecting key fields from email headers. Volunteers can examine all of their own corpus data to confirm that no private information remains. For ease of use, each entry in the corpus is labeled with the number of recipients, the sending and receiving geographical locations, and the user's social attributes such as country, language, job, and profession, some of which are obtained through questionnaires. The corpus can be applied to research fields such as community discovery, user relationship analysis, email classification, and spam email recognition. Moreover, the proposed method can also be applied to other corpus data collection work where protection of users' privacy is necessary.
Index Terms—Corpus, mail header, SHA, privacy protection, corpus labeling, social network, spam.
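The hashing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names only the Secure Hash Algorithm, so the choice of SHA-256, the set of fields treated as sensitive (`From`, `To`, `Cc`), and the names `hash_value` and `anonymize_header` are all assumptions made for illustration.

```python
import hashlib
from email import message_from_string
from email.utils import getaddresses

# Assumption: these are the header fields treated as sensitive.
SENSITIVE_FIELDS = ("From", "To", "Cc")


def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a normalized address (assumed variant)."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def anonymize_header(raw_header: str) -> dict:
    """Build one corpus record in which every address is replaced by its hash."""
    msg = message_from_string(raw_header)
    record = {}
    for field in SENSITIVE_FIELDS:
        # getaddresses splits "Name <addr>, addr2" lists into (name, addr) pairs.
        pairs = getaddresses(msg.get_all(field, []))
        record[field.lower()] = [hash_value(addr) for _name, addr in pairs if addr]
    # Label the record with the number of recipients (To plus Cc).
    record["recipient_count"] = len(record["to"]) + len(record["cc"])
    return record


if __name__ == "__main__":
    raw = (
        "From: Alice <alice@example.com>\n"
        "To: bob@example.com, Carol <carol@example.com>\n"
        "\n"
    )
    print(anonymize_header(raw))
```

Because the same address always maps to the same digest, sender and recipient relationships remain analyzable in the resulting records even though the addresses themselves are not stored.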
Yongchao Wang is with the Faculty of Information Science and Technology, Aichi Prefectural University, Nagakute 480-1198, Aichi, Japan, and the School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, Shaanxi, China (e-mail: wyc@xaut.edu.cn).
Xiao Zhao is with the Artificial Intelligence Institute, College of Electrical and Information Engineering, Shaanxi University of Science and Technology, Xi’an 710021, China (e-mail: zhaoxiao@sust.edu.cn).
Feihang Ge is with the Faculty of Information Science and Technology, Aichi Prefectural University, Nagakute 480-1198, Aichi, Japan, and with the Zhejiang College of Construction, China (e-mail: gfhang@163.com).
Yuyan Chao is with the Graduate School of Environment Management, Nagoya Sangyo University, Owariasahi 488-8711, Japan (e-mail: chao@nagoya-su.ac.jp).
Lifeng He is with the Graduate School of Information Science and Technology, Aichi Prefectural University, Nagakute 480-1198, Japan, and the Artificial Intelligence Institute, College of Electrical and Information Engineering, Shaanxi University of Science and Technology, Xi’an 710021, Shaanxi, China (e-mail: helifeng@ist.aichi-pu.ac.jp).
Cite: Yongchao Wang, Xiao Zhao, Feihang Ge, Yuyan Chao, and Lifeng He, "A Corpus of Email Headers with Personal Privacy Protection," Journal of Advances in Computer Networks, vol. 5, no. 2, pp. 53-58, 2017.