JACN 2017 Vol.5(2): 59-64 ISSN: 1793-8244
DOI: 10.18178/JACN.2017.5.2.241

A Comparison between Text, Parquet, and PCAP Formats for Use in Distributed Network Flow Analysis on Hadoop

Miguel Zenon Nicanor L. Saavedra and William Emmanuel S. Yu
Abstract—Hadoop's popularity as a distributed computing platform continues to grow as more and more data is generated each year. As a fault-tolerant and horizontally scalable ecosystem, it is a suitable platform for the analysis of big network data. While most network data is currently analyzed on vertically scaled machines, Hadoop provides an alternative, allowing large datasets to be analyzed in one horizontally scaled cluster. This study benchmarks and profiles the currently known methods for performing network analysis on Hadoop. After comparing three data storage formats for use in Hadoop (plain text, Parquet, and raw PCAP files), the study determined that the Parquet and text formats greatly outperform raw PCAP files read through the hadoop-pcap library, which fails to complete tests with high volumes of data. This advantage comes at a cost, however: considerable data is lost because a well-defined schema must be created for processing, and time is needed to convert the data to the new format. Between the two converted formats, Parquet outperforms text by an average of approximately 30% in the scan and aggregate queries, and by 70% and 40% respectively in the join and aggregate-join queries, while showing an 8%-10% increase in performance in aggregate-join queries on more than 60 minutes’ worth of PCAP data.

Index Terms—Big data, Apache Hadoop, Apache Hive, network analysis.
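
The four query types named in the abstract (scan, aggregate, join, and aggregate-join) can be illustrated with a minimal sketch. The abstract does not give the study's schema or queries, so everything below is an assumption for illustration: the `flows` and `hosts` tables, their columns, and the sample rows are hypothetical, and Python's built-in SQLite stands in for Hive only to show the shape of each query, not its performance on Hadoop.

```python
import sqlite3

# In-memory database; SQLite is a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical fixed schema for flow records. Anything not captured by
# these columns (e.g. packet payload) would be lost on conversion.
conn.executescript("""
CREATE TABLE flows (src_ip TEXT, dst_ip TEXT, dst_port INTEGER, bytes INTEGER);
CREATE TABLE hosts (ip TEXT PRIMARY KEY, label TEXT);
""")
conn.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?)",
    [
        ("10.0.0.1", "10.0.0.9", 80, 1200),
        ("10.0.0.2", "10.0.0.9", 443, 800),
        ("10.0.0.1", "10.0.0.8", 53, 100),
    ],
)
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?)",
    [("10.0.0.1", "workstation"), ("10.0.0.2", "server")],
)

# Scan: filter rows on a column value.
scan = conn.execute(
    "SELECT src_ip, dst_ip FROM flows WHERE dst_port = 80"
).fetchall()

# Aggregate: group rows and sum a column.
aggregate = conn.execute(
    "SELECT src_ip, SUM(bytes) FROM flows GROUP BY src_ip ORDER BY src_ip"
).fetchall()

# Join: combine flow records with a second table.
join = conn.execute(
    "SELECT f.src_ip, h.label, f.bytes FROM flows f "
    "JOIN hosts h ON f.src_ip = h.ip ORDER BY f.bytes"
).fetchall()

# Aggregate-join: join first, then group and sum.
aggregate_join = conn.execute(
    "SELECT h.label, SUM(f.bytes) FROM flows f "
    "JOIN hosts h ON f.src_ip = h.ip GROUP BY h.label ORDER BY h.label"
).fetchall()

print(scan, aggregate, join, aggregate_join, sep="\n")
```

A columnar format such as Parquet stores each column separately, so queries like the aggregate above need to read only the columns they reference; this is one commonly cited reason for its advantage over row-oriented text, though the abstract itself does not state the cause.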

M. L. Saavedra is with the Department of Information Systems and Computer Science (DISCS) of the Ateneo de Manila University, Philippines (email: miguel.saavedra@obf.ateneo.edu). W. S. Yu is with the faculty of the DISCS of the Ateneo de Manila University, Philippines (email: wyu@ateneo.edu).


Cite: Miguel Zenon Nicanor L. Saavedra and William Emmanuel S. Yu, "A Comparison between Text, Parquet, and PCAP Formats for Use in Distributed Network Flow Analysis on Hadoop," Journal of Advances in Computer Networks, vol. 5, no. 2, pp. 59-64, 2017.
