GraphFrames Overview

GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. It offers high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality that takes advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

What are GraphFrames?

GraphX is to RDDs as GraphFrames are to DataFrames.

GraphFrames represent graphs: vertices (e.g., users) and edges (e.g., relationships between users). If you are familiar with GraphX, then GraphFrames will be easy to learn. The key difference is that GraphFrames are based upon Spark DataFrames, rather than RDDs.
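As a rough illustration of the vertex and edge model described above, the following Python sketch builds a small GraphFrame from two DataFrames. The toy rows, the column names other than `id`, `src`, and `dst`, and the helper name `build_graph` are assumptions made for this example; it also assumes an active `SparkSession` (Spark 2.0+) and an installed GraphFrames package.

```python
# Hypothetical toy data. A vertex table needs an "id" column; an edge table
# needs "src" and "dst" columns whose values refer to vertex ids.
VERTICES = [("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)]
EDGES = [("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow")]


def build_graph(spark):
    """Build a GraphFrame from the toy rows above, given an active SparkSession."""
    # Imported lazily so the data above can be inspected without GraphFrames installed.
    from graphframes import GraphFrame

    # Vertices and edges are ordinary DataFrames; extra columns are free-form attributes.
    v = spark.createDataFrame(VERTICES, ["id", "name", "age"])
    e = spark.createDataFrame(EDGES, ["src", "dst", "relationship"])
    return GraphFrame(v, e)
```

In a PySpark shell started with GraphFrames on the classpath, `g = build_graph(spark)` would give a graph whose `g.vertices` and `g.edges` are the two DataFrames passed in.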

GraphFrames also provide powerful tools for running queries and standard graph algorithms. With GraphFrames, you can easily search for patterns within graphs, find important vertices, and more. Refer to the User Guide for a full list of queries and algorithms.
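To make the kinds of queries mentioned above more concrete, here is a minimal Python sketch that runs a motif search, a degree query, and PageRank on an existing GraphFrame. The helper name `explore` and the parameter values are illustrative assumptions, not recommendations; the User Guide remains the authoritative list of queries and algorithms.

```python
def explore(g):
    """Run a few example queries on a GraphFrame `g` and return the resulting DataFrames."""
    # Motif finding: pairs of vertices connected by edges in both directions.
    mutual = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")

    # In-degree of each vertex, one simple notion of importance.
    in_degrees = g.inDegrees

    # PageRank returns a new GraphFrame whose vertices carry a "pagerank" column.
    # The reset probability and iteration count here are arbitrary example values.
    ranked = g.pageRank(resetProbability=0.15, maxIter=10)

    return mutual, in_degrees, ranked.vertices.select("id", "pagerank")
```

Each returned value is a regular DataFrame, so it can be filtered, joined, or displayed with the usual DataFrame operations such as `show()`.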

Will GraphFrames be part of Apache Spark?

The GraphX component of Apache Spark has no DataFrame- or Dataset-based equivalent, so it is natural to ask this question. The current plan is to keep GraphFrames separate from core Apache Spark for the time being.

That being said, GraphFrames follows the same code quality standards as Spark, and it is cross-compiled and published for a large number of Spark versions. It is easy for users to depend on it.

Downloading

Get GraphFrames from the Spark Packages website. This documentation is for GraphFrames version 0.4.0. GraphFrames depends on Apache Spark, which is available for download from the Apache Spark website.

GraphFrames should be compatible with any platform that runs Spark. Refer to the Apache Spark documentation for more information.

GraphFrames is compatible with Spark 1.6+. However, later versions of Spark include major improvements to DataFrames, so GraphFrames may be more efficient when running on more recent Spark versions.

GraphFrames is tested with Java 7 and with Python 2 and 3, running against Spark 1.6+ (Scala 2.10) and Spark 2.0+ (Scala 2.11).

Applications, the Apache Spark shell, and clusters

See the Apache Spark User Guide for more information about submitting Spark jobs to clusters, running the Spark shell, and launching Spark clusters. The GraphFrames Quick-Start guide also shows how to run the Spark shell with GraphFrames supplied as a package.
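One way to supply GraphFrames as a package from Python is to set Spark's `spark.jars.packages` property before the session starts, which mirrors passing `--packages` to the Spark shell. The sketch below is only an illustration: the helper names are hypothetical, and the coordinate naming scheme and default versions are assumptions that should be checked against the Spark Packages listing for your Spark and Scala versions.

```python
def graphframes_coordinate(version="0.4.0", spark="2.1", scala="2.11"):
    """Return a Spark Packages coordinate string for GraphFrames.

    The naming scheme and the default values are assumptions; confirm the
    exact coordinate on the Spark Packages website before relying on it.
    """
    return "graphframes:graphframes:{0}-spark{1}-s_{2}".format(version, spark, scala)


def session_with_graphframes(coordinate):
    """Start a SparkSession that resolves the given package at launch (Spark 2.0+)."""
    from pyspark.sql import SparkSession  # imported lazily; requires PySpark

    # The property only takes effect if no session is running yet in this process.
    return (
        SparkSession.builder
        .appName("graphframes-example")
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )
```

For example, `session_with_graphframes(graphframes_coordinate())` would request the package matching the defaults above. Depending on the setup, the `graphframes` Python module may still need to be made importable separately.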

Where to Go from Here

User Guides:

API Docs:

External Resources: