Abstract—Hadoop's popularity as a distributed computing platform continues to grow as more data is generated each year. As a fault-tolerant, horizontally scalable ecosystem, it is a suitable platform for the analysis of big network data. While most network data are currently analyzed on vertically scaled machines, Hadoop offers an alternative, allowing large datasets to be analyzed on a single horizontally scaled cluster. This study benchmarks and profiles the currently known methods for performing network analysis on Hadoop. After comparing three data storage formats (plain text, Parquet, and raw PCAP files) for use in Hadoop, the study determined that the Parquet and text formats greatly outperform raw PCAP files read through the hadoop-pcap library, which fails to complete tests with high volumes of data. This comes at the expense, however, of substantial data loss, owing to the need to impose a well-defined schema for processing, as well as the conversion time required to shift to a different format. Parquet in turn outperforms the text format by an average of approximately 30% in the scan and aggregate queries, and by 70% and 40% in the join and aggregate-join queries respectively, while showing an 8%-10% performance gain in aggregate-join queries over more than 60 minutes' worth of PCAP data.
Index Terms—Big data, Apache Hadoop, Apache Hive, network analysis.
M. L. Saavedra is with the Department of Information Systems and Computer Science (DISCS) of the Ateneo de Manila University, Philippines (email: miguel.saavedra@obf.ateneo.edu).
W. S. Yu is with the faculty of the DISCS of the Ateneo de Manila University, Philippines (email: wyu@ateneo.edu).
Cite: Miguel Zenon Nicanor L. Saavedra and William Emmanuel S. Yu, "A Comparison between Text, Parquet, and PCAP Formats for Use in Distributed Network Flow Analysis on Hadoop," Journal of Advances in Computer Networks, vol. 5, no. 2, pp. 59-64, 2017.