graphframes package¶
Subpackages¶
Contents¶
-
class
graphframes.
GraphFrame
(v, e)[source]¶ Represents a graph with vertices and edges stored as DataFrames.
Parameters: - v –
DataFrame
holding vertex information. Must contain a column named “id” that stores unique vertex IDs. - e –
DataFrame
holding edge information. Must contain two columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.
>>> localVertices = [(1,"A"), (2,"B"), (3, "C")] >>> localEdges = [(1,2,"love"), (2,1,"hate"), (2,3,"follow")] >>> v = sqlContext.createDataFrame(localVertices, ["id", "name"]) >>> e = sqlContext.createDataFrame(localEdges, ["src", "dst", "action"]) >>> g = GraphFrame(v, e)
-
aggregateMessages
(aggCol, sendToSrc=None, sendToDst=None)[source]¶ Aggregates messages from the neighbours.
When specifying the messages and aggregation function, the user may reference columns using the static methods in
graphframes.lib.AggregateMessages
.See Scala documentation for more details.
Parameters: - aggCol – the requested aggregation output either as
pyspark.sql.Column
or SQL expression string - sendToSrc – message sent to the source vertex of each triplet either as
pyspark.sql.Column
or SQL expression string (default: None) - sendToDst – message sent to the destination vertex of each triplet either as
pyspark.sql.Column
or SQL expression string (default: None)
Returns: DataFrame with columns for the vertex ID and the resulting aggregated message
- aggCol – the requested aggregation output either as
-
bfs
(fromExpr, toExpr, edgeFilter=None, maxPathLength=10)[source]¶ Breadth-first search (BFS).
See Scala documentation for more details.
Returns: DataFrame with one Row for each shortest path between matching vertices.
-
cache
()[source]¶ Persist the dataframe representation of vertices and edges of the graph with the default storage level.
-
connectedComponents
(algorithm='graphframes', checkpointInterval=2, broadcastThreshold=1000000)[source]¶ Computes the connected components of the graph.
See Scala documentation for more details.
Parameters: - algorithm – connected components algorithm to use (default: “graphframes”) Supported algorithms are “graphframes” and “graphx”.
- checkpointInterval – checkpoint interval in terms of number of iterations (default: 2)
- broadcastThreshold – broadcast threshold in propagating component assignments (default: 1000000)
Returns: DataFrame with new vertices column “component”
-
degrees
¶ - The degree of each vertex in the graph, returned as a DataFrame with two columns:
- “id”: the ID of the vertex
- ‘degree’ (integer) the degree of the vertex
Note that vertices with 0 edges are not returned in the result.
Returns: DataFrame with new vertices column “degree”
-
dropIsolatedVertices
()[source]¶ Drops isolated vertices, vertices are not contained in any edges.
Returns: GraphFrame with filtered vertices.
-
edges
¶ DataFrame
holding edge information, with unique columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.
-
filterEdges
(condition)[source]¶ Filters the edges based on expression, keep all vertices.
Parameters: condition – String or Column describing the condition expression for filtering. Returns: GraphFrame with filtered edges.
-
filterVertices
(condition)[source]¶ Filters the vertices based on expression, remove edges containing any dropped vertices.
Parameters: condition – String or Column describing the condition expression for filtering. Returns: GraphFrame with filtered vertices and edges.
-
find
(pattern)[source]¶ Motif finding.
See Scala documentation for more details.
Parameters: pattern – String describing the motif to search for. Returns: DataFrame with one Row for each instance of the motif found
-
inDegrees
¶ - The in-degree of each vertex in the graph, returned as a DataFame with two columns:
- “id”: the ID of the vertex
- “inDegree” (int) storing the in-degree of the vertex
Note that vertices with 0 in-edges are not returned in the result.
Returns: DataFrame with new vertices column “inDegree”
-
labelPropagation
(maxIter)[source]¶ Runs static label propagation for detecting communities in networks.
See Scala documentation for more details.
Parameters: maxIter – the number of iterations to be performed Returns: DataFrame with new vertices column “label”
-
outDegrees
¶ - The out-degree of each vertex in the graph, returned as a DataFrame with two columns:
- “id”: the ID of the vertex
- “outDegree” (integer) storing the out-degree of the vertex
Note that vertices with 0 out-edges are not returned in the result.
Returns: DataFrame with new vertices column “outDegree”
-
pageRank
(resetProbability=0.15, sourceId=None, maxIter=None, tol=None)[source]¶ Runs the PageRank algorithm on the graph. Note: Exactly one of fixed_num_iter or tolerance must be set.
See Scala documentation for more details.
Parameters: - resetProbability – Probability of resetting to a random vertex.
- sourceId – (optional) the source vertex for a personalized PageRank.
- maxIter – If set, the algorithm is run for a fixed number of iterations. This may not be set if the tol parameter is set.
- tol – If set, the algorithm is run until the given tolerance. This may not be set if the numIter parameter is set.
Returns: GraphFrame with new vertices column “pagerank” and new edges column “weight”
-
parallelPersonalizedPageRank
(resetProbability=0.15, sourceIds=None, maxIter=None)[source]¶ Run the personalized PageRank algorithm on the graph, from the provided list of sources in parallel for a fixed number of iterations.
See Scala documentation for more details.
Parameters: - resetProbability – Probability of resetting to a random vertex
- sourceIds – the source vertices for a personalized PageRank
- maxIter – the fixed number of iterations this algorithm runs
Returns: GraphFrame with new vertices column “pageranks” and new edges column “weight”
-
persist
(storageLevel=StorageLevel(False, True, False, False, 1))[source]¶ Persist the dataframe representation of vertices and edges of the graph with the given storage level.
-
pregel
¶ Get the
graphframes.lib.Pregel
object for running pregel.See
graphframes.lib.Pregel
for more details.
-
shortestPaths
(landmarks)[source]¶ Runs the shortest path algorithm from a set of landmark vertices in the graph.
See Scala documentation for more details.
Parameters: landmarks – a set of one or more landmarks Returns: DataFrame with new vertices column “distances”
-
stronglyConnectedComponents
(maxIter)[source]¶ Runs the strongly connected components algorithm on this graph.
See Scala documentation for more details.
Parameters: maxIter – the number of iterations to run Returns: DataFrame with new vertex column “component”
-
svdPlusPlus
(rank=10, maxIter=2, minValue=0.0, maxValue=5.0, gamma1=0.007, gamma2=0.007, gamma6=0.005, gamma7=0.015)[source]¶ Runs the SVD++ algorithm.
See Scala documentation for more details.
Returns: Tuple of DataFrame with new vertex columns storing learned model, and loss value
-
triangleCount
()[source]¶ Counts the number of triangles passing through each vertex in this graph.
See Scala documentation for more details.
Returns: DataFrame with new vertex column “count”
-
triplets
¶ The triplets (source vertex)-[edge]->(destination vertex) for all edges in the graph.
- Returned as a
DataFrame
with three columns: - “src”: source vertex with schema matching ‘vertices’
- “edge”: edge with schema matching ‘edges’
- ‘dst’: destination vertex with schema matching ‘vertices’
Returns: DataFrame with columns ‘src’, ‘edge’, and ‘dst’ - Returned as a
-
unpersist
(blocking=False)[source]¶ Mark the dataframe representation of vertices and edges of the graph as non-persistent, and remove all blocks for it from memory and disk.
-
vertices
¶ DataFrame
holding vertex information, with unique column “id” for vertex IDs.
- v –