At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
44. 44
Yes, we’re hiring!
info@svds.com
THANK YOU
Stephen O’Sullivan
stephen@svds.com
@steveos
Demo code is here:
github.com/silicon-valley-data-
science/stampedecon-2015
Notes de l'éditeur
Description
You have your Hadoop cluster, and you are ready to fill it up with data, but wait: Which format should you use to store your data? Should you store it in Plain Text, Sequence File, Avro, or Parquet? (And should you compress it?) This talk will take a closer look at some of the trade-offs, and will cover the How, Why, and When of choosing one format over another.
Do not support block compression
Once they are compressed they are not splittable anymore increasing read performance cost
Each data file contains the values for a set of rows
Within a data file, the values from each column are organized so that they are adjacent, enabling good compression values
No results query 1 (which is count no conditions). This is because stinger is has meta data about the amount of data in the table (only when it’s an internal table).