What is spark-submit?

Spark job์„ ๋ฐฐ์น˜ ํ˜น์€ ์ŠคํŠธ๋ฆฌ๋ฐ ํ˜•ํƒœ๋กœ ์‹คํ–‰ํ•˜๊ธฐ ์œ„ํ•œ ๊ณต์‹ CLI ๋Ÿฐ์ฒ˜๋กœ notebook์ด๋‚˜ spark-shell๊ณผ ํ•จ๊ป˜ Spark job์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜
Spark์—์„œ ์ง€์›ํ•˜๋Š” ๋ชจ๋“  cluster manager๋ฅผ ๋™์ผํ•œ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์–ด์„œ ์œ ์šฉ

๋™์ž‘์›๋ฆฌ
  • ์‚ฌ์šฉ์ž๊ฐ€ ๋ช…๋ น์–ด ์ž…๋ ฅ โ†’ spark-submit ์Šคํฌ๋ฆฝํŠธ๊ฐ€ ํ™˜๊ฒฝ ๋ณ€์ˆ˜ยทํด๋ž˜์ŠคํŒจ์Šค ์„ธํŒ… ํ›„ JVM ๋ถ€ํŠธ์ŠคํŠธ๋žฉ
  • SparkSubmit ํด๋ž˜์Šค๊ฐ€ ์ „๋‹ฌ ํŒŒ๋ผ๋ฏธํ„ฐ ํŒŒ์‹ฑ โ†’ ๋ฐฐํฌ ๋ชจ๋“œ(client/cluster)ยทํด๋Ÿฌ์Šคํ„ฐ ๋งค๋‹ˆ์ €์— ๋งž๋Š” LauncherBackend ์„ ํƒ
  • Driver JVM ๊ธฐ๋™ โ†’ Cluster Manager ์™€ ํ†ต์‹ ํ•ด Executor ๋„์šฐ๊ณ  ํƒœ์Šคํฌ ์Šค์ผ€์ค„
  • Job ์ข…๋ฃŒ ์‹œ ExitCode ๋ฐ˜ํ™˜ โ†’ ์…ธ ์Šคํฌ๋ฆฝํŠธ๊ฐ€ ๊ทธ๋Œ€๋กœ ์ „๋‹ฌํ•ด ํŒŒ์ดํ”„๋ผ์ธ ์‹คํŒจ ๊ฐ์ง€ ๊ฐ€๋Šฅ
notebook ยท shell ๋Œ€๋น„ ํŠน์ง•
  • notebook / shell โ†’ REPL ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ ์„ธ์…˜, ์‚ฌ์šฉ์ž ์ž…๋ ฅ์ด ๋Š๊ธฐ๋ฉด ์ข…๋ฃŒ
  • spark-submit โ†’ ์Šคํฌ๋ฆฝํŠธยท์›Œํฌํ”Œ๋กœ ์—”์ง„(airflow, azkaban ๋“ฑ)์—์„œ ํ˜ธ์ถœํ•ด ๋น„๋Œ€ํ™”์‹ ์‹คํ–‰
  • ๋ฆฌ์†Œ์Šคยท์˜ต์…˜ ์ผ๊ด€์„ฑ ํ™•๋ณด, CI/CD ํฌํ•จ ์ž๋™ํ™”์— ์œ ๋ฆฌ

๊ธฐ๋ณธ ๊ตฌ์กฐ

spark-submit ๋ช…๋ น์–ด
spark-submit \
  --master <url|mode> \
  --deploy-mode <client|cluster> \
  --name <app-name> \
  --class <main-class> \
  --conf k=v            (repeatable) \
  --packages g:a:v      (repeatable) \
  --jars a.jar,b.jar    (comma-separated) \
  --py-files deps.zip   (PySpark) \
  --files config.json   (distributed to all nodes) \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 5 \
  --driver-cores 4 \
  --driver-memory 2g \
  <application file>    (.py file, JAR, or R file) \
  [application args]
๋Œ€ํ‘œ์ ์ธ ์˜ต์…˜
  • --class : JAR ํŒŒ์ผ ์•ˆ์—์„œ ์–ด๋–ค Main ํด๋ž˜์Šค (main ๋ฉ”์„œ๋“œ ๊ฐ€์ง„ ์—”ํŠธ๋ฆฌํฌ์ธํŠธ)๋ฅผ ์‹คํ–‰ํ• ์ง€ spark-submit ์—๊ฒŒ ์•Œ๋ ค์ฃผ๋Š” ์Šค์œ„์น˜
    • ํ•˜๋‚˜์˜ JAR ์•ˆ์— ์—ฌ๋Ÿฌ main ํด๋ž˜์Šค ํฌํ•จ๋˜์–ด ์žˆ๋Š” ๋“ฑ ์–ด๋А ๊ฒƒ ์‹คํ–‰ํ•ด์•ผํ• ์ง€ ์• ๋งคํ•œ ๊ฒฝ์šฐ ํ•„์š”
    • Java, Scala ๊ธฐ๋ฐ˜ Spark ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—๋งŒ ์˜๋ฏธ์žˆ์Œ
  • --master : cluster์˜ master URL
    • local : ๋‹จ์ผ worker thread ์‚ฌ์šฉ (no parallelism)
    • local[K] : K worker threads
    • local[K, F]: K worker threads์™€ F๋ฒˆ์˜ maxFailure1
    • local[*] : ์ตœ๋Œ€ํ•œ ๋งŽ์€ worker threads
    • local[*, F] : ์ตœ๋Œ€ํ•œ ๋งŽ์€ worker threads์™€ F๋ฒˆ์˜ maxFailure
    • local-cluster[N, C, M] : ๋‹จ์ผ JVM์—์„œ N๊ฐœ์˜ worker, C cores per worker, M MiB ๋ฉ”๋ชจ๋ฆฌ per worker. unit test์šฉ local cluster mode
    • spark://HOST:PORT : Standalone cluster master์— ์—ฐ๊ฒฐ. 7077์ด default
    • spark://HOST1:PORT1,HOST2:PORT2 : Zookeeper ์‚ฌ์šฉํ•ด์„œ ๋Œ€๊ธฐ master๊ฐ€ ์žˆ๋Š” standalone cluster์— ์—ฐ๊ฒฐ
    • yarn : client ํ˜น์€ cluster ๋ชจ๋“œ๋กœ YARN ํด๋Ÿฌ์Šคํ„ฐ์— ์—ฐ๊ฒฐ. cluster location ํ™˜๊ฒฝ๋ณ€์ˆ˜๋กœ ์ฃผ์ž…ํ•„์š”
    • k8s://HOST:PORT : client ํ˜น์€ cluster ๋ชจ๋“œ๋กœ k8s cluster์— ์—ฐ๊ฒฐ
  • --deploy-mode : Spark Driver๋ฅผ worker node์— ๋ฐฐํฌํ• ์ง€ (cluster) ์•„๋‹˜ ์™ธ๋ถ€ client์— ๋ฐฐํฌํ• ์ง€ (client, ๊ธฐ๋ณธ๊ฐ’)
  • --conf : key=value ํ˜•ํƒœ์˜ spark configuration. ์—ฌ๋Ÿฌ๊ฐœ๋ฅผ ๋ฐ˜๋ณตํ•ด์„œ ์ž…๋ ฅ ๊ฐ€๋Šฅ
    • spark.app.name=name : ์•ฑ ์ด๋ฆ„ ์˜ค๋ฒ„๋ผ์ด๋“œ
    • spark.sql.shuffle.partitions=200ย : ์…”ํ”Œ ํŒŒํ‹ฐ์…˜
    • spark.hadoop.fs.s3a.access.key=โ€ฆย : S3ย ์ž๊ฒฉ ์ฃผ์ž…
    • spark.driver.extraJavaOptions=-Duser.name=airflowย : ์ปจํ…Œ์ด๋„ˆ HOMEย ๋ฌธ์ œ ํ•ด๊ฒฐ
    • spark.ui.port=4041ย : UIย ํฌํŠธ ์ถฉ๋Œ ์‹œ ๋ณ€๊ฒฝ
  • application-jar : application๊ณผ ๋ชจ๋“  dependency๋ฅผ ํฌํ•จํ•œ bundled jar์˜ path (์˜ˆ์‹œ : hdfs://path or file:// path)
  • application-arguments : ์‹คํ–‰ ์Šคํฌ๋ฆฝํŠธ์— ํ•„์š”ํ•œ ์ธ์ž
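As a concrete illustration of repeating --conf, a hypothetical invocation (the app name, values, and script path are placeholders):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.app.name=daily-etl \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.ui.port=4041 \
  etl_job.py
```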
๊ธฐํƒ€ ์˜ต์…˜
  • --executor-memoryย : Executorย JVMย Heap (์˜ˆโ€ฏ4g,ย 512m)
  • --executor-coresย : Executorโ€ฏ๋‹น CPUย ์ฝ”์–ด ์ˆ˜
  • --num-executorsย : Executor ๊ฐœ์ˆ˜ (Standโ€‘aloneยทYARN)
  • --driver-memoryย : Driverย Heap
  • --total-executor-coresย : Mesos ์ „์šฉ ์ด ์ฝ”์–ด ์ˆ˜
  • --jarsย ย : ์ถ”๊ฐ€ JARย classpath
  • --packagesย : Mavenย Centralย Ivy ๋‹ค์šด๋กœ๋“œ
  • --repositoriesย : ์‚ฌ์„ค Mavenย ์ €์žฅ์†Œ URL
  • --py-filesย : PySpark ์˜์กด zip/egg
  • --filesย : ๋ชจ๋“  ๋…ธ๋“œ์— ํŒŒ์ผ ๋ฐฐํฌ ํ›„ย SparkFiles.get()ย ๋กœ ์ ‘๊ทผ
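The --files / --py-files pair is most often used from PySpark; a hypothetical submission (file names are placeholders), with the retrieval side shown as comments:

```shell
spark-submit \
  --master yarn \
  --files config.json \
  --py-files deps.zip \
  app.py
# inside app.py, the shipped copy is located with:
#   from pyspark import SparkFiles
#   path = SparkFiles.get("config.json")
```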
์˜ˆ์‹œ ์Šคํฌ๋ฆฝํŠธ
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[8]" \
  /path/to/examples.jar \
  100
 
# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
 
# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
 
# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
 
# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
 
# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://xx.yy.zz.ww:443 \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  http://path/to/examples.jar \
  1000

Footnotes

  1. Number of consecutive failures of any particular task before giving up on the job