graphframes package

Contents

class graphframes.GraphFrame(v, e)[source]

Represents a graph with vertices and edges stored as DataFrames.

Parameters:
  • vDataFrame holding vertex information. Must contain a column named “id” that stores unique vertex IDs.
  • eDataFrame holding edge information. Must contain two columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.
>>> localVertices = [(1,"A"), (2,"B"), (3, "C")]
>>> localEdges = [(1,2,"love"), (2,1,"hate"), (2,3,"follow")]
>>> v = sqlContext.createDataFrame(localVertices, ["id", "name"])
>>> e = sqlContext.createDataFrame(localEdges, ["src", "dst", "action"])
>>> g = GraphFrame(v, e)
aggregateMessages(aggCol, sendToSrc=None, sendToDst=None)[source]

Aggregates messages from the neighbours.

When specifying the messages and aggregation function, the user may reference columns using the static methods in graphframes.lib.AggregateMessages.

See Scala documentation for more details.

Parameters:
  • aggCol – the requested aggregation output either as pyspark.sql.Column or SQL expression string
  • sendToSrc – message sent to the source vertex of each triplet either as pyspark.sql.Column or SQL expression string (default: None)
  • sendToDst – message sent to the destination vertex of each triplet either as pyspark.sql.Column or SQL expression string (default: None)
Returns:

DataFrame with columns for the vertex ID and the resulting aggregated message

bfs(fromExpr, toExpr, edgeFilter=None, maxPathLength=10)[source]

Breadth-first search (BFS).

See Scala documentation for more details.

Returns:DataFrame with one Row for each shortest path between matching vertices.
cache()[source]

Persist the dataframe representation of vertices and edges of the graph with the default storage level.

connectedComponents(algorithm='graphframes', checkpointInterval=2, broadcastThreshold=1000000)[source]

Computes the connected components of the graph.

See Scala documentation for more details.

Parameters:
  • algorithm – connected components algorithm to use (default: “graphframes”) Supported algorithms are “graphframes” and “graphx”.
  • checkpointInterval – checkpoint interval in terms of number of iterations (default: 2)
  • broadcastThreshold – broadcast threshold in propagating component assignments (default: 1000000)
Returns:

DataFrame with new vertices column “component”

degrees
The degree of each vertex in the graph, returned as a DataFrame with two columns:
  • “id”: the ID of the vertex
  • ‘degree’ (integer) the degree of the vertex

Note that vertices with 0 edges are not returned in the result.

Returns:DataFrame with new vertices column “degree”
edges

DataFrame holding edge information, with unique columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.

find(pattern)[source]

Motif finding.

See Scala documentation for more details.

Parameters:pattern – String describing the motif to search for.
Returns:DataFrame with one Row for each instance of the motif found
inDegrees
The in-degree of each vertex in the graph, returned as a DataFame with two columns:
  • “id”: the ID of the vertex
  • “inDegree” (int) storing the in-degree of the vertex

Note that vertices with 0 in-edges are not returned in the result.

Returns:DataFrame with new vertices column “inDegree”
labelPropagation(maxIter)[source]

Runs static label propagation for detecting communities in networks.

See Scala documentation for more details.

Parameters:maxIter – the number of iterations to be performed
Returns:DataFrame with new vertices column “label”
outDegrees
The out-degree of each vertex in the graph, returned as a DataFrame with two columns:
  • “id”: the ID of the vertex
  • “outDegree” (integer) storing the out-degree of the vertex

Note that vertices with 0 out-edges are not returned in the result.

Returns:DataFrame with new vertices column “outDegree”
pageRank(resetProbability=0.15, sourceId=None, maxIter=None, tol=None)[source]

Runs the PageRank algorithm on the graph. Note: Exactly one of fixed_num_iter or tolerance must be set.

See Scala documentation for more details.

Parameters:
  • resetProbability – Probability of resetting to a random vertex.
  • sourceId – (optional) the source vertex for a personalized PageRank.
  • maxIter – If set, the algorithm is run for a fixed number of iterations. This may not be set if the tol parameter is set.
  • tol – If set, the algorithm is run until the given tolerance. This may not be set if the numIter parameter is set.
Returns:

GraphFrame with new vertices column “pagerank” and new edges column “weight”

persist(storageLevel=StorageLevel(False, True, False, False, 1))[source]

Persist the dataframe representation of vertices and edges of the graph with the given storage level.

shortestPaths(landmarks)[source]

Runs the shortest path algorithm from a set of landmark vertices in the graph.

See Scala documentation for more details.

Parameters:landmarks – a set of one or more landmarks
Returns:DataFrame with new vertices column “distances”
stronglyConnectedComponents(maxIter)[source]

Runs the strongly connected components algorithm on this graph.

See Scala documentation for more details.

Parameters:maxIter – the number of iterations to run
Returns:DataFrame with new vertex column “component”
svdPlusPlus(rank=10, maxIter=2, minValue=0.0, maxValue=5.0, gamma1=0.007, gamma2=0.007, gamma6=0.005, gamma7=0.015)[source]

Runs the SVD++ algorithm.

See Scala documentation for more details.

Returns:Tuple of DataFrame with new vertex columns storing learned model, and loss value
triangleCount()[source]

Counts the number of triangles passing through each vertex in this graph.

See Scala documentation for more details.

Returns:DataFrame with new vertex column “count”
triplets

The triplets (source vertex)-[edge]->(destination vertex) for all edges in the graph.

Returned as a DataFrame with three columns:
  • “src”: source vertex with schema matching ‘vertices’
  • “edge”: edge with schema matching ‘edges’
  • ‘dst’: destination vertex with schema matching ‘vertices’
Returns:DataFrame with columns ‘src’, ‘edge’, and ‘dst’
unpersist(blocking=False)[source]

Mark the dataframe representation of vertices and edges of the graph as non-persistent, and remove all blocks for it from memory and disk.

vertices

DataFrame holding vertex information, with unique column “id” for vertex IDs.