Upcoming SlideShare
Loading in …5
×
Upcoming SlideShare
The Importance of Fast, Scalable Storage for Today’s HPC
Next
Download to read offline and view in fullscreen.

Share

Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapReduce Applications

In this deck from the LAD'14 Conference in Reims, Rekha Singhal from Tata Consultancy Services presents: Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapReduce Application.

Learn more: http://insidehpc.com/lad14-video-gallery/

Watch the video presentation: http://inside-bigdata.com/2014/09/29/performance-comparison-intel-enterprise-edition-lustre-hdfs-mapreduce/

1. Performance Comparison of Intel® Enterprise Edition for Lustre* software and HDFS for MapReduce Applications. Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar. (* Other names and brands may be claimed as the property of others.)
2. Hadoop Introduction
   • Open source MapReduce framework for data-intensive computing
   • Simple programming model: two functions, Map and Reduce
   • Map: transforms input into a list of key-value pairs
     – Map(D) → List[Ki, Vi]
   • Reduce: given a key and all associated values, produces a result in the form of a list of values
     – Reduce(Ki, List[Vi]) → List[Vo]
   • Parallelism hidden by the framework
     – Highly scalable: can be applied to large datasets (Big Data) and run on commodity clusters
   • Comes with its own user-space distributed file system (HDFS) based on the local storage of cluster nodes
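To make those two signatures concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (illustrative only; this is not code from the deck):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map(D) -> List[Ki, Vi]: emit one (word, 1) pair per token in the split
      public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE);
          }
        }
      }

      // Reduce(Ki, List[Vi]) -> List[Vo]: sum all counts seen for a word
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          ctx.write(word, new IntWritable(sum));
        }
      }
    }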
3. Hadoop Introduction (cont.)
   • Framework handles most of the execution
   • Splits input logically and feeds mappers
   • Partitions and sorts map outputs (Collect)
   • Transports map outputs to reducers (Shuffle)
   • Merges output obtained from each mapper (Merge)
   [Diagram: input splits feed Map tasks, whose outputs are shuffled to Reduce tasks that write the final output]
4. MapReduce Application Processing
5. Intel® Enterprise Edition for Lustre* software vs. Hadoop Dist. File System. [Architecture diagram: the HDFS side shows compute nodes with local disks linked by a network switch; the Lustre side shows compute nodes connected through a network switch to OSS servers and their OSTs]
6. Intel® Enterprise Edition for Lustre* software:
   • Clustered, distributed computing and storage
   • No data replication
   • No local storage
   • Widely used for HPC applications
   Hadoop Dist. File System:
   • Data moves to the computation
   • Data replication
   • Local storage
   • Widely used for MR applications
7. Motivation
   • Could HPC and MR co-exist?
   • Need to evaluate the use of Lustre software for MR application processing
8. Using Intel® Enterprise Edition for Lustre* software with Hadoop: HADOOP 'ADAPTER' FOR LUSTRE
9. Hadoop over Intel EE for Lustre*: Implementation
   • Hadoop uses pluggable extensions to work with different file system types
   • Lustre is POSIX compliant: org.apache.hadoop.fs
     – Use Hadoop's built-in LocalFileSystem class
     – Uses native file system support in Java
   • Extend and override default behavior: LustreFileSystem
     – Defines a new URL scheme for Lustre: lustre:///
     – Controls Lustre striping info
     – Resolves absolute paths to a user-defined directory
     – Leaves room for future enhancements
   • Allow Hadoop to find it in config files
   [Class diagram: FileSystem → RawLocalFileSystem → LustreFileSystem]
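As a rough sketch of that pluggable-extension pattern, the class below reuses Hadoop's POSIX-backed RawLocalFileSystem and overrides just enough to claim a lustre:/// scheme. This is not Intel's adapter code; the fs.lustre.mount property and its default path are assumptions made for illustration.

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RawLocalFileSystem;

    // Illustrative subclass following the FileSystem -> RawLocalFileSystem ->
    // LustreFileSystem hierarchy shown on the slide.
    public class LustreFileSystem extends RawLocalFileSystem {
      private static final URI NAME = URI.create("lustre:///");

      @Override
      public URI getUri() {
        return NAME; // advertise the lustre:/// scheme instead of file:///
      }

      @Override
      public void initialize(URI uri, Configuration conf) throws IOException {
        super.initialize(uri, conf);
        // Hypothetical property: resolve absolute paths against a
        // user-defined directory on the Lustre mount.
        setWorkingDirectory(new Path(conf.get("fs.lustre.mount", "/mnt/lustre/hadoop")));
      }
    }

Hadoop locates such a class through its standard scheme lookup, so "allow Hadoop to find it in config files" presumably means setting fs.lustre.impl to the fully qualified class name in core-site.xml.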
10. MR Processing in Intel® EE for Lustre* and HDFS. [Data-flow diagram: with Lustre, input, map outputs, spill/sort/merge files and the final output all live in the global namespace, with spill file : reducer = N : 1; with HDFS, map outputs spill to local disk and go through repeated sort/merge passes before the shuffle, while job input and output live in HDFS]
11. Conclusions from Existing Evaluations
   • TestDFSIO: 100% better throughput
   • TeraSort: 10-15% better performance
   • A high-speed connecting network is needed
   • With the same BOM, HDFS is better for the WordCount and BigMapOutput applications
   • A large number of compute nodes may challenge Enterprise Edition for Lustre* software performance
12. Problem Definition
   Performance comparison of the Lustre and HDFS file systems for an MR implementation of an FSI workload, using an HPDD cluster hosted in the Intel BigData Lab in Swindon (UK) running Intel® Enterprise Edition for Lustre* software.
   The Audit Trail System, part of the FINRA security specifications (publicly available), is used as the representative application.
13. EXPERIMENTAL SETUP
14. Hadoop + HDFS Setup
   • 1 cluster manager, 1 Name node (NN), 8 Data nodes (DN) including the NN
   • 8 nodes, each with an Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz; 320GB cluster RAM
   • 27 TB of cluster storage
   • 10 GbE network among compute nodes
   • Red Hat 6.5, CDH 5.0.2 and HDFS
15. Hadoop + Intel EE for Lustre* software Setup
   • 1 Resource manager (RM), 1 History server (HS), 8 Node managers (NM) including the RM and HS
   • 8 nodes, each with an Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz; 320GB cluster RAM
   • 165TB of usable Lustre storage
   • 10 GbE network among compute nodes
   • Red Hat 6.5, CDH 5.0.2, Intel® Enterprise Edition for Lustre* software 2.0
16. Intel® Enterprise Edition for Lustre* 2.0 Setup
   • Four OSS, one MDS, 16 OSTs, 1 MDT
   • OSS node:
     – CPU: Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz; memory: 128GB DDR3 1600MHz
     – Disk subsystem: 4x LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05) controllers; 4x 4TB SATA drives per controller, in a RAID 5 configuration per RAID set
     – 4 OSTs per OSS node
17. Cluster Parameters
   • Number of compute nodes = 8
   • Map slots = 24
   • Reduce slots = 7
   • Remaining parameters (shuffle percent, merge percent, sort buffer) kept at their defaults
   • HDFS: replication factor = 3
   • Intel® EE for Lustre* software: stripe count = 1, 4, 16; stripe size = 4MB
18. Job Configuration Parameters
   • Map split size = 1GB
   • Block size = 128MB
   • Input data is NOT compressed
   • Output data is NOT compressed
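For reference, those job parameters can be expressed through standard Hadoop 2 / CDH 5 configuration properties, roughly as follows. This is a sketch, assuming FileInputFormat's split sizing from the minimum-split-size setting; the deck does not show its configuration code and the job name is made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobSetup {
      public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // 1 GB map splits: raise the minimum split size above the 128 MB block size
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1024L * 1024 * 1024);
        // 128 MB file system block size
        conf.set("dfs.blocksize", "128m");
        // Job output is not compressed (input is plain uncompressed text)
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
        return Job.getInstance(conf, "cat-date-range-query"); // name is illustrative
      }
    }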
19. Workload
   • Consolidated Audit Trail System (part of the FINRA application) DB schema
     – Single table with 12 columns related to share orders
   • Data consolidation query: print share order details for share orders during a date range
       SELECT issue_symbol, orf_order_id, orf_order_received_ts
       FROM default.rt_query_extract
       WHERE issue_symbol like 'XLP'
         AND from_unixtime(cast((orf_order_received_ts/1000) as BIGINT),'yyyy-MM-ddhh:ii:ss') >= "2014-06-26 23:00:00"
         AND from_unixtime(cast((orf_order_received_ts/1000) as BIGINT),'yyyy-MM-ddhh:ii:ss') <= "2014-06-27 11:00:00";
20. Workload Implementation
   • DB is a flat file with columns separated using a token
   • Data generator to generate data for the DB
   • Tool to run queries concurrently
   • Query is implemented as Map and Reduce functions (see the sketch below)
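One plausible shape for that Map function, assuming a tab-separated flat file and the column positions used below (both assumptions; the deck shows no code), is a map-side filter that emits only the projected columns of matching rows:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: the WHERE clause runs in the mapper; rows matching the symbol
    // and the timestamp range are emitted, projected to the selected columns.
    public class QueryMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
      private static final String SEP = "\t";                       // assumed separator token
      private static final int SYMBOL = 0, ORDER_ID = 1, RECEIVED_TS = 2; // assumed positions
      private static final long FROM_MS = 1403823600000L; // 2014-06-26 23:00:00 UTC (illustrative)
      private static final long TO_MS   = 1403866800000L; // 2014-06-27 11:00:00 UTC (illustrative)

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] cols = line.toString().split(SEP);
        if (cols.length <= RECEIVED_TS) return;   // skip malformed rows
        long ts = Long.parseLong(cols[RECEIVED_TS]);
        if ("XLP".equals(cols[SYMBOL]) && ts >= FROM_MS && ts <= TO_MS) {
          ctx.write(NullWritable.get(),
              new Text(cols[SYMBOL] + SEP + cols[ORDER_ID] + SEP + cols[RECEIVED_TS]));
        }
      }
    }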
21. Workload Size
   • Concurrency tests (see the driver sketch below):
     – Query in isolation, concurrency = 1
     – Query in a concurrent workload, concurrency = 5
     – Think time = 10% of the query execution time in isolation
   • Data sizes: 100GB, 500GB, 1TB and 7TB
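The concurrent-query tool itself is not shown in the deck; a minimal driver honoring those parameters could look like this (all names, and the runQuery hook, are hypothetical):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical driver: N workers each run the query in a loop, sleeping a
    // think time of 10% of the isolated execution time between submissions.
    public class ConcurrencyDriver {
      public static void run(int concurrency, long isolatedMillis, int iterations,
                             Runnable runQuery) throws InterruptedException {
        long thinkTime = isolatedMillis / 10; // think time = 10% of isolated runtime
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (int w = 0; w < concurrency; w++) {
          pool.submit(() -> {
            try {
              for (int i = 0; i < iterations; i++) {
                runQuery.run();          // submit one MR query job and wait
                Thread.sleep(thinkTime); // pause before the next submission
              }
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
      }
    }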
22. Performance Metrics
   • MR job execution time in isolation
   • MR job average execution time in a concurrent workload
   • CPU, disk and memory utilization of the cluster
23. Performance Measurement
   • SAR data is collected from all nodes in the cluster
   • MapReduce job log files are used for performance analysis
   • Intel® EE for Lustre* software node performance data is collected using Intel Manager
   • Hadoop performance data is collected using Intel Manager
24. Benchmarking Steps
   For each data size and each concurrency level:
   1. Generate data of the given size
   2. Copy the data to HDFS
   3. Start the MR job for the query on the Name node with the given concurrency
   4. On completion of the job, collect logs and performance data
25. RESULT ANALYSIS
26. Degree of Concurrency = 1: Intel® EE for Lustre* performs better at large stripe counts.
27. Degree of Concurrency = 1: Intel® EE for Lustre* delivered 3X HDFS performance at the optimal stripe count (SC) settings.
28. Degree of Concurrency = 1: Intel® EE for Lustre* at the optimal SC gives a 70% improvement over HDFS.
29. Same-BOM comparison:
   – Hadoop + HDFS setup: nodes = 8; performance linearly extrapolated to nodes = 13
   – Hadoop + Intel® EE for Lustre* software setup: nodes = 8 + 5 = 13
30. Number of Compute Servers = 13: Intel® EE for Lustre* is 2X better than HDFS for the same BOM.
31. Degree of Concurrency = 5: Intel® EE for Lustre* was 5.5 times better than HDFS at the 7 TB data size.
32. Degree of Concurrency = 5: Intel® EE for Lustre* was 5.5 times better than HDFS at the 7 TB data size.
33. Concurrent Job Average Execution Time / Single Job Execution Time = 2.5
34. Concurrent Job Average Execution Time / Single Job Execution Time = 4.5
35. Intel® EE for Lustre* software > HDFS for concurrency.
36. Conclusion
   • An increase in stripe count improves Enterprise Edition for Lustre* software performance
   • Intel® EE for Lustre* shows better performance for concurrent workloads
   • Intel® EE for Lustre* software = 3X HDFS for a single job
   • Intel® EE for Lustre* software = 5.5X HDFS for a concurrent workload
   • Future work: impact of a large number of compute nodes (i.e., OSSs <<<< nodes)
37. Thank You. rekha.singhal@tcs.com