Scale Data Science with Spark and R

About

sparklyr is a modern, open-source interface for scaling data science and machine learning workflows with Apache Spark™, R, and a rich extension ecosystem.

It makes Apache Spark easy to use from R by providing access to core functionality, such as installing, connecting to, and managing Spark, and by exposing Spark's MLlib, Structured Streaming, and Pipelines to R.

It supports well-known R packages such as dplyr, DBI, and broom, reducing the cognitive overhead of having to re-learn libraries.

It also enables a rich ecosystem of extensions for Spark and R, including XGBoost, MLeap, GraphFrames, and H2O, and it can optionally use Apache Arrow to significantly improve data-transfer performance.

Through Spark, sparklyr lets you scale your data science workflows on Hadoop YARN, Mesos, and Kubernetes, or through Apache Livy.
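As an example of the Arrow integration mentioned above, sparklyr transfers data through Arrow when the arrow package is available; a minimal sketch, assuming Spark and the arrow package are installed locally:

```r
# Install once: install.packages(c("sparklyr", "arrow"))
library(sparklyr)
library(arrow)   # with arrow loaded, sparklyr serializes data through Arrow

sc <- spark_connect(master = "local")

# Transfers such as copy_to() and collect() now go through Arrow,
# which can be significantly faster for larger data frames
cars <- copy_to(sc, mtcars)

spark_disconnect(sc)
```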

Architecture

Get Started

To connect to a local cluster, install R and Java 8, then run:

# Run once
install.packages("sparklyr")
sparklyr::spark_install()

# Connect to Spark local
library(sparklyr)
sc <- spark_connect(master = "local")

# Disconnect from Spark
spark_disconnect(sc)
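Once connected, dplyr verbs can be used directly against Spark DataFrames; a short sketch using the built-in mtcars dataset for illustration:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a local data frame into Spark
cars <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in the cluster;
# collect() brings the result back into R
cars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```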

To connect to any other Spark cluster:

# Connect to Hadoop YARN
sc <- spark_connect(master = "yarn")

# Connect to Mesos
sc <- spark_connect(master = "mesos://host:port")

# Connect to Kubernetes
sc <- spark_connect(master = "k8s://https://server")

# Connect to Apache Livy
sc <- spark_connect(master = "http://server/livy", method = "livy")
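For any of these cluster types, connection settings can be tuned through spark_config() before connecting; a sketch, where the memory and core values are illustrative rather than recommended defaults:

```r
library(sparklyr)

conf <- spark_config()
conf$spark.executor.memory <- "4g"  # illustrative values; tune for your cluster
conf$spark.executor.cores  <- 2

sc <- spark_connect(master = "yarn", config = conf)
```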

To connect through specific distributions, cloud providers, and tools, use the following resources:

Learning

Sponsors

Sponsors of current sparklyr committers.

Users

Many other organizations use sparklyr to scale their data science and machine learning workflows. To add yours, contact us.

Community