Overview

Amazon EMR (Elastic MapReduce) is a cloud big data platform for processing vast amounts of data using open source tools such as Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. It provisions the resources needed to run big data workloads on Hadoop clusters, using EC2 instances to carry the workload. Because you choose and pay for those instances, EMR running on EC2 is not a serverless service.

EMR creates Hadoop clusters to analyze and process vast amounts of data
- Clusters can be made of hundreds of EC2 instances
EMR comes bundled with:
- Apache Spark
- HBase
- Presto
- Flink
EMR takes care of all the provisioning and configuration
EMR supports auto scaling and integrates with Spot Instances
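
To make "EMR takes care of the provisioning" concrete, here is a minimal sketch of the top-level fields a cluster request carries when using boto3's EMR `run_job_flow` call. The helper name `build_cluster_request`, the cluster name, and the release label are placeholders assumed for this example, not values from these notes. The sketch only builds the request dictionary and does not contact AWS.

```python
def build_cluster_request(name, release_label, applications):
    """Hypothetical helper: build the top-level part of a run_job_flow request."""
    return {
        "Name": name,
        # The release label selects the EMR version and its bundled tool versions.
        "ReleaseLabel": release_label,
        # EMR installs and configures each listed application on the cluster.
        "Applications": [{"Name": app} for app in applications],
    }


# Placeholder values; pick a release label that is current in your region.
request = build_cluster_request(
    "notes-demo", "emr-6.15.0", ["Spark", "HBase", "Presto", "Flink"]
)
```

A real request also needs an `Instances` section and IAM roles (`JobFlowRole`, `ServiceRole`) before it can be passed to `boto3.client("emr").run_job_flow(**request)`.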

Use Cases

Data processing
Machine learning
Web indexing
Big data

Node types and purchasing

Master node (long-running)
- Manages the cluster
- Coordinates the distribution of data and tasks among the other nodes
- Monitors cluster health
Core node (long-running)
- Runs tasks
- Stores data
Task node (optional, usually Spot Instances)
- Only runs tasks, stores no data
Purchasing options
- On-Demand
  - Reliable
  - Predictable
  - Won't be terminated
- Reserved
  - EMR automatically uses Reserved Instances if available
- Spot Instances
  - Cheaper
  - Less reliable
  - Can be terminated
  - Best used for task nodes, which hold no data
Deployment
- Long-running clusters
- Transient (temporary) clusters
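
The node roles, purchasing options, and deployment styles above map onto the `Instances` section of a boto3 `run_job_flow` request. The following is a minimal sketch under stated assumptions: the helper names `instance_group` and `build_instances`, the instance type, and the node counts are hypothetical choices for illustration. It only builds the dictionary and does not call AWS.

```python
def instance_group(role, instance_type, count, market):
    """Hypothetical helper: describe one instance group for an EMR cluster."""
    return {
        "Name": f"{role.lower()}-nodes",
        "InstanceRole": role,        # "MASTER", "CORE", or "TASK"
        "Market": market,            # "ON_DEMAND" or "SPOT"
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def build_instances(task_nodes=0, transient=True, instance_type="m5.xlarge"):
    """Hypothetical helper: build the Instances section of a run_job_flow request."""
    groups = [
        # Master and core nodes are long-running, so keep them On-Demand.
        instance_group("MASTER", instance_type, 1, "ON_DEMAND"),
        instance_group("CORE", instance_type, 2, "ON_DEMAND"),
    ]
    if task_nodes > 0:
        # Task nodes store no data, so losing a Spot instance is tolerable.
        groups.append(instance_group("TASK", instance_type, task_nodes, "SPOT"))
    return {
        "InstanceGroups": groups,
        # A transient cluster shuts down once its steps finish;
        # a long-running cluster stays alive with no steps queued.
        "KeepJobFlowAliveWhenNoSteps": not transient,
    }


instances = build_instances(task_nodes=4, transient=True)
```

The result would be passed as `Instances=instances` to `boto3.client("emr").run_job_flow(...)` together with a name, release label, and IAM roles. Reserved Instances need no entry here, since they apply as a billing discount to matching On-Demand usage.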
