GraphFrames Quick-Start Guide

This quick-start guide shows how to get started using GraphFrames. After you work through this guide, move on to the User Guide to learn more about the many queries and algorithms supported by GraphFrames.

Getting started with Apache Spark and Spark packages

If you are new to using Apache Spark, refer to the Apache Spark Documentation and its Quick-Start Guide for more information.

If you are new to using Spark packages, you can find more information in the Spark User Guide on using the interactive shell. You just need to make sure your Spark shell session has the package as a dependency.

The following example shows how to run the Spark shell with the GraphFrames package. We use the --packages argument to download the graphframes package and any dependencies automatically.

$ ./bin/spark-shell --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
$ ./bin/pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

The examples above use a specific version of the GraphFrames package. In 0.6.0-spark2.3-s_2.11, the version string names the GraphFrames release (0.6.0), the Spark version it targets (2.3), and the Scala version (2.11). To use a different version, change the last part of the --packages argument; for example, to run with version 0.1.0-spark1.6, pass the argument --packages graphframes:graphframes:0.1.0-spark1.6.
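The coordinate passed to --packages has the form group:artifact:version. As a minimal sketch, the helper functions below assemble that string from its parts and build the matching launch command. Both functions are hypothetical and not part of GraphFrames or Spark; the version layout is an assumption inferred from the two example coordinates in this guide, with the Scala suffix treated as optional.

```python
def graphframes_coordinate(gf_version, spark_version, scala_version=None):
    """Build a --packages coordinate for GraphFrames.

    Hypothetical helper for illustration only. The layout
    <release>-spark<spark>[-s_<scala>] is inferred from the example
    coordinates shown in this guide.
    """
    version = f"{gf_version}-spark{spark_version}"
    if scala_version is not None:
        version += f"-s_{scala_version}"
    return f"graphframes:graphframes:{version}"


def shell_command(launcher, coordinate):
    """Return the argument list that launches a Spark shell with the package."""
    return [launcher, "--packages", coordinate]


# Coordinate used in the examples above.
coord = graphframes_coordinate("0.6.0", "2.3", "2.11")
print(" ".join(shell_command("./bin/pyspark", coord)))

# Older coordinate without a Scala suffix.
old_coord = graphframes_coordinate("0.1.0", "1.6")
print(old_coord)
```

Building the coordinate from named parts makes it harder to mismatch the Spark or Scala version against the cluster when switching releases.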

Start using GraphFrames

The following example, shown first in Scala and then in Python, creates a GraphFrame, queries it, and runs the PageRank algorithm.

Scala:

// Import the graphframes package
import org.graphframes._
// Create a Vertex DataFrame with unique ID column "id"
val v = spark.createDataFrame(List(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30)
)).toDF("id", "name", "age")
// Create an Edge DataFrame with "src" and "dst" columns
val e = spark.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow")
)).toDF("src", "dst", "relationship")
// Create a GraphFrame (GraphFrame is already in scope via org.graphframes._)
val g = GraphFrame(v, e)

// Query: Get in-degree of each vertex.
g.inDegrees.show()

// Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

// Run PageRank algorithm, and show results.
val results = g.pageRank.resetProbability(0.01).maxIter(20).run()
results.vertices.select("id", "pagerank").show()

Python:

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
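
To make the three queries above concrete, the following plain-Python sketch recomputes them on the same three-vertex graph without Spark. It is not how GraphFrames implements them: the function names are hypothetical, and the PageRank loop is a textbook power iteration whose starting values and scaling are assumptions, so its numbers need not match the pagerank column that GraphFrames reports.

```python
from collections import Counter

# Same data as the example graph: (src, dst, relationship)
vertices = ["a", "b", "c"]
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]


def in_degrees(edges):
    """Count incoming edges per vertex; vertices with none are left out."""
    return dict(Counter(dst for _, dst, _ in edges))


def count_relationship(edges, relationship):
    """Count edges whose relationship attribute equals the given value."""
    return sum(1 for _, _, rel in edges if rel == relationship)


def pagerank(vertices, edges, reset_probability=0.01, max_iter=20):
    """Textbook PageRank by power iteration (illustrative assumption).

    Assumes every vertex has at least one outgoing edge, which holds
    for the example graph, so no dangling-vertex handling is needed.
    """
    n = len(vertices)
    out_degree = Counter(src for src, _, _ in edges)
    rank = {v: 1.0 / n for v in vertices}  # uniform start
    for _ in range(max_iter):
        # Each vertex receives the reset share, then contributions from in-edges.
        new_rank = {v: reset_probability / n for v in vertices}
        for src, dst, _ in edges:
            new_rank[dst] += (1 - reset_probability) * rank[src] / out_degree[src]
        rank = new_rank
    return rank


degrees = in_degrees(edges)
follow_count = count_relationship(edges, "follow")
ranks = pagerank(vertices, edges)

print(degrees)       # {'b': 2, 'c': 1}
print(follow_count)  # 2
print(ranks)
```

In this sketch, reset_probability is the share of rank redistributed evenly to all vertices at each step, and max_iter fixes the number of update rounds. Vertex "a" has no incoming edges, so it only ever receives the reset share.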