GraphFrames Overview
GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. It offers high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality that takes advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
What are GraphFrames?
GraphFrames are to DataFrames as GraphX is to RDDs.
GraphFrames represent graphs: vertices (e.g., users) and edges (e.g., relationships between users). If you are familiar with GraphX, then GraphFrames will be easy to learn. The key difference is that GraphFrames are based upon Spark DataFrames, rather than RDDs.
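As a minimal sketch of the idea above, a graph can be assembled from two DataFrames: one holding vertices and one holding edges. The sample rows, the column names other than `id`, `src`, and `dst`, and the helper name `build_graph` are illustrative assumptions; the function expects an existing `SparkSession` with the GraphFrames package available.

```python
# Vertex rows: the first field becomes the required "id" column.
VERTICES = [
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
]

# Edge rows: the first two fields become "src" and "dst" and must refer to vertex ids.
EDGES = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]


def build_graph(spark):
    """Build a GraphFrame from the sample rows using an existing SparkSession."""
    # Imported inside the function so the sample data can be inspected
    # even when the GraphFrames package is not installed.
    from graphframes import GraphFrame

    vertices = spark.createDataFrame(VERTICES, ["id", "name", "age"])
    edges = spark.createDataFrame(EDGES, ["src", "dst", "relationship"])
    return GraphFrame(vertices, edges)
```

Because both inputs are ordinary DataFrames, the vertices and edges of the resulting graph can be filtered, joined, and saved like any other DataFrame.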
GraphFrames also provide powerful tools for running queries and standard graph algorithms. With GraphFrames, you can easily search for patterns within graphs, find important vertices, and more. Refer to the User Guide for a full list of queries and algorithms.
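The kinds of queries mentioned above might look like the following sketch. It assumes `graph` is an existing GraphFrame; the helper name `explore`, the motif chosen, and the PageRank parameter values are illustrative rather than recommended settings.

```python
# Motif: pairs of vertices connected by edges in both directions.
MUTUAL_MOTIF = "(a)-[e]->(b); (b)-[e2]->(a)"


def explore(graph):
    """Run a few example queries on a GraphFrame and return the resulting DataFrames."""
    # Pattern search: one row per match of the motif.
    mutual = graph.find(MUTUAL_MOTIF)

    # Simple structural query: number of incoming edges per vertex.
    in_degrees = graph.inDegrees

    # Standard algorithm: PageRank adds a "pagerank" column to the vertices.
    ranked = graph.pageRank(resetProbability=0.15, maxIter=10)
    ranks = ranked.vertices.select("id", "pagerank")

    return mutual, in_degrees, ranks
```

Each result is a regular DataFrame, so follow-up steps such as `ranks.orderBy("pagerank", ascending=False)` use the usual DataFrame API.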
Will GraphFrames be part of Apache Spark?
The GraphX component of Apache Spark has no DataFrame- or Dataset-based equivalent, so it is natural to ask this question. The current plan is to keep GraphFrames separate from core Apache Spark for the time being:
- We are still considering making small adjustments to the API. The GraphFrames project will be considered for inclusion into Spark once we are confident that the current API addresses current and future needs.
- Some important features present in GraphX, such as partitioning, are missing. We would like to offer some equivalent operations before considering merging with the Spark project.
- GraphFrames is used as a testbed for advanced, graph-specific optimizations for Spark’s Catalyst engine. Keeping them in a separate project accelerates the development cycle.
That being said, GraphFrames follows the same code quality standards as Spark, and it is cross-compiled and published for a large number of Spark versions. It is easy for users to depend on it.
Downloading
Get GraphFrames from the Spark Packages website. This documentation is for GraphFrames version 0.8.0. GraphFrames depends on Apache Spark, which is available for download from the Apache Spark website.
GraphFrames should be compatible with any platform which runs Spark. Refer to the Apache Spark documentation for more information.
GraphFrames is compatible with Spark 1.6+. However, later versions of Spark include major improvements to DataFrames, so GraphFrames may be more efficient when running on more recent Spark versions.
GraphFrames is tested with Java 8 and with Python 2 and 3, running against Spark 2.2+ (Scala 2.11).
Applications, the Apache Spark shell, and clusters
See the Apache Spark User Guide for more information about submitting Spark jobs to clusters, running the Spark shell, and launching Spark clusters. The GraphFrames Quick Start guide also shows how to run the Spark shell with GraphFrames supplied as a package.
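For applications that create their own session, one possible approach is to request the package through the `spark.jars.packages` setting. The sketch below assumes the coordinate follows the pattern `graphframes:graphframes:<version>-spark<X.Y>-s_<scala>`; the default Spark and Scala suffixes, the application name, and both helper names are assumptions, so check the Spark Packages listing for the build that matches your cluster.

```python
def graphframes_coordinate(version="0.8.0", spark="2.4", scala="2.11"):
    """Format a package coordinate for GraphFrames (naming pattern assumed)."""
    return "graphframes:graphframes:{}-spark{}-s_{}".format(version, spark, scala)


def create_session(app_name="graphframes-demo"):
    """Create (or reuse) a SparkSession that requests the GraphFrames package."""
    # Imported inside the function so the coordinate helper works without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # The setting only takes effect when a new session is started.
        .config("spark.jars.packages", graphframes_coordinate())
        .getOrCreate()
    )
```

The same coordinate string can be passed to the `--packages` option of the Spark shell or `spark-submit`.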
Where to Go from Here
User Guides:
- Quick Start: a quick introduction to the GraphFrames API; start here!
- GraphFrames User Guide: detailed overview of GraphFrames in all supported languages (Scala, Java, Python)
API Docs:
External Resources:
- Apache Spark Homepage
- Apache Spark Wiki
- Mailing Lists: Ask questions about Spark here