
JACN 2017 Vol.5(2): 59-64 ISSN: 1793-8244
DOI: 10.18178/JACN.2017.5.2.241

A Comparison between Text, Parquet, and PCAP Formats for Use in Distributed Network Flow Analysis on Hadoop

Miguel Zenon Nicanor L. Saavedra and William Emmanuel S. Yu

Abstract—Hadoop's popularity as a distributed computing platform continues to grow as more data is generated each year. As a fault-tolerant and horizontally scalable ecosystem, it is a suitable platform for the analysis of big network data. While most network data is currently analyzed on vertically scaled machines, Hadoop provides an alternative, allowing large datasets to be analyzed in one horizontally scaled cluster. This study benchmarks and profiles the currently known methods for performing network analysis on Hadoop. After comparing three data storage formats for use in Hadoop (plain text, Parquet, and raw PCAP files), the study determined that the Parquet and text formats greatly outperform raw PCAP files read through the hadoop-pcap library, which fails to complete tests with high volumes of data. This advantage comes at a cost: large data loss, because a well-defined schema must be created for processing, and the conversion time needed to shift to a different format. Between the two converted formats, Parquet outperforms text by an average of approximately 30% in the scan and aggregate queries, and by 70% and 40% respectively in the join and aggregate-join queries, while showing an 8%-10% increase in performance in aggregate-join queries on more than 60 minutes’ worth of PCAP data.

Index Terms—Big data, Apache Hadoop, Apache Hive, network analysis.
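
The four query categories named in the abstract (scan, aggregate, join, and aggregate-join) can be illustrated with a minimal in-memory sketch. This is not the authors' benchmark, which ran on Hadoop; the record fields, the sample values, and the `services` lookup table are all hypothetical and chosen only to show what each category computes over flow records that have already been reduced to a fixed schema.

```python
from collections import defaultdict

# Hypothetical flow records reduced to a fixed schema (field names are assumptions).
flows = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "dport": 80, "bytes": 1200},
    {"src": "10.0.0.1", "dst": "10.0.0.3", "dport": 443, "bytes": 800},
    {"src": "10.0.0.2", "dst": "10.0.0.1", "dport": 53, "bytes": 100},
]

# Hypothetical lookup table used as the second side of the join.
services = {80: "http", 443: "https", 53: "dns"}


def scan(rows, port):
    """Scan: filter rows on a single column."""
    return [r for r in rows if r["dport"] == port]


def aggregate(rows):
    """Aggregate: total bytes grouped by source address."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["src"]] += r["bytes"]
    return dict(totals)


def join(rows, lookup):
    """Join: attach a service name to each flow by destination port."""
    return [{**r, "service": lookup[r["dport"]]} for r in rows if r["dport"] in lookup]


def aggregate_join(rows, lookup):
    """Aggregate-join: total bytes grouped by the joined service name."""
    totals = defaultdict(int)
    for r in join(rows, lookup):
        totals[r["service"]] += r["bytes"]
    return dict(totals)
```

Because the records keep only the columns in the schema, anything else in the original packets (payload, for instance) is no longer available to these queries, which is the data-loss trade-off the abstract describes.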

M. L. Saavedra is with the Department of Information Systems and Computer Science (DISCS) of the Ateneo de Manila University, Philippines (email: miguel.saavedra@obf.ateneo.edu). W. S. Yu is with the faculty of the DISCS of the Ateneo de Manila University, Philippines (email: wyu@ateneo.edu).


Cite: Miguel Zenon Nicanor L. Saavedra and William Emmanuel S. Yu, "A Comparison between Text, Parquet, and PCAP Formats for Use in Distributed Network Flow Analysis on Hadoop," Journal of Advances in Computer Networks, vol. 5, no. 2, pp. 59-64, 2017.
