Predicting Node Features with Node2Vec in Graph and Network Analysis with Python
In data science, deep learning and graph analysis come together in powerful techniques for extracting knowledge from complex structures. One particularly effective approach is Node2Vec, a node embedding technique that captures relationships and similarities between the nodes of a graph. In this post, we explore how to apply Node2Vec to predict node features based on the position each node occupies in the graph.
Environment Setup
First, install the required libraries:
networkx, node2vec, matplotlib, gensim and scikit-learn
!pip install networkx node2vec matplotlib gensim scikit-learn
Downloading the Data
In this example we use Zachary's Karate Club social network, which ships with networkx, to demonstrate node feature prediction:
import networkx as nx
from node2vec import Node2Vec
G = nx.karate_club_graph()
Applying Node2Vec
Now we apply Node2Vec to obtain meaningful node embeddings:
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)
model = node2vec.fit(window=10, min_count=1, batch_words=4)
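To make the embedding step less of a black box, here is a minimal sketch of the biased second-order random walk that the Node2Vec algorithm uses to turn a graph into "sentences" of nodes. It is written in plain Python for illustration and is not the node2vec package's own code; the function name biased_walk, the toy adjacency dictionary, and the default values of the return parameter p and the in-out parameter q are assumptions made for this example.

```python
import random

def biased_walk(adj, start, walk_length, p=1.0, q=1.0, rng=None):
    """Generate one Node2Vec-style walk over an adjacency dict {node: set(neighbors)}."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])  # sorted for reproducibility
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: no previous node yet, so choose uniformly
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Hypothetical toy graph, used only for this illustration
toy_adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
walk = biased_walk(toy_adj, "a", walk_length=6, rng=random.Random(0))
print(walk)
```

With p = q = 1 the walk is an ordinary uniform random walk; a small q pushes the walk outward, while a small p keeps it near its starting point.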
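Predicting Node Features
Suppose we want to predict which club each node (member) belongs to. First, we need a labeled dataset. Random labels would carry no relation to the graph structure, so we use the club node attribute that networkx stores for every member:
for node in G.nodes():
    # 'club' is either 'Mr. Hi' or 'Officer' in networkx's karate club graph
    G.nodes[node]['label'] = int(G.nodes[node]['club'] == 'Officer')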
Next, we define a function that predicts participation from the node embeddings:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
def predict_participation(graph, model):
    # node2vec stores nodes as strings, so look each one up by str(node)
    X = [model.wv[str(node)] for node in graph.nodes()]
    y = [graph.nodes[node]['label'] for node in graph.nodes()]
    # Hold out 20% of the nodes for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
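Before reading too much into the accuracy this function returns, it helps to see how small the test set is. The sketch below works through the arithmetic for the 34-node karate club graph, assuming scikit-learn's rule of rounding a fractional test_size up to a whole number of samples; the variable names are illustrative.

```python
import math

n_nodes = 34      # number of nodes in Zachary's Karate Club graph
test_size = 0.2   # same fraction used in predict_participation

# A fractional test_size is rounded up to a whole number of test samples
n_test = math.ceil(test_size * n_nodes)
n_train = n_nodes - n_test

# With so few test nodes, accuracy can only take a handful of values
possible_accuracies = [round(k / n_test, 3) for k in range(n_test + 1)]
print(n_train, n_test)
print(possible_accuracies)
```

A single misclassified node therefore moves the score by roughly 14 percentage points, so results on this graph are best treated as a demonstration of the pipeline rather than a reliable estimate.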
Visualizing the Results
Finally, we can visualize the results with a scatter plot of the node embeddings. Note that the plot uses only the first two of the 64 embedding dimensions:
import matplotlib.pyplot as plt
embeddings = [model.wv[str(node)] for node in G.nodes()]
X = [emb[0] for emb in embeddings]
Y = [emb[1] for emb in embeddings]
plt.figure(figsize=(10, 8))
plt.scatter(X, Y)
for i, txt in enumerate(G.nodes()):
    plt.annotate(txt, (X[i], Y[i]), fontsize=8, alpha=0.5)
plt.title("Node Embeddings with Node2Vec")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()
Conclusions
In this post, we explored how to use Node2Vec to predict node features in a graph. Node embeddings combined with a standard machine learning classifier give a simple pipeline for predicting club membership, and plotting the embeddings gives a rough picture of how the nodes are distributed in a low-dimensional space.
This example shows how node embedding techniques such as Node2Vec can be applied in graph analysis to make predictions and to better understand the relationships in complex datasets.
# Install the libraries
!pip install networkx node2vec matplotlib scikit-learn netwulf pandas seaborn
import os
import networkx as nx
from node2vec import Node2Vec
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# Download the Cora dataset
os.system("curl -O https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz")
os.system("tar -xvzf cora.tgz")
x cora/
x cora/README
x cora/cora.cites
x cora/cora.content
0
from urllib.request import urlretrieve
import tarfile
import seaborn as sns
import numpy as np
# Download and extract the Cora dataset
url = 'https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz'
file_path, _ = urlretrieve(url)
with tarfile.open(file_path, 'r:gz') as tar:
    tar.extractall()
# idx_ran = np.random.choice(G.nodes(), 1000, replace=False)
# Load the citation graph from the Cora dataset (node IDs are read as strings)
G = nx.read_edgelist('cora/cora.cites')
G.number_of_edges(), G.number_of_nodes()
(5278, 2708)
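The sketch below mimics, in plain Python, what loading a two-column edge list into an undirected graph amounts to. It is meant to highlight two details that matter later: node IDs come out as strings, and a pair of papers listed in both directions collapses into a single undirected edge. The three sample lines are hypothetical and only imitate the tab-separated layout of cora.cites.

```python
import io
from collections import defaultdict

# Hypothetical sample in a two-column, tab-separated layout like cora.cites
sample = "35\t1033\n35\t103482\n1033\t35\n"

adjacency = defaultdict(set)
for line in io.StringIO(sample):
    left, right = line.split()
    # Undirected graph: store the link in both directions
    adjacency[left].add(right)
    adjacency[right].add(left)

n_nodes = len(adjacency)
n_edges = sum(len(nbrs) for nbrs in adjacency.values()) // 2
print(n_nodes, n_edges)
print(all(isinstance(node, str) for node in adjacency))
```

Because the IDs are strings here but are parsed as integers when the feature table is read with pandas, any join between the two needs an explicit conversion such as str(paper_id).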
# Set of random nodes
# G = G.subgraph(idx_ran)
# Load the features and labels from cora.content (column 0 is the paper ID)
features_df = pd.read_csv('cora/cora.content', sep='\t', header=None)
# cora.cites holds the citation edges, not the labels
edges_df = pd.read_csv('cora/cora.cites', sep='\t', header=None)
# Split the DataFrame into features (X) and labels (y)
X = features_df.iloc[:, 1:-1]
y = features_df.iloc[:, -1]
y
0 Neural_Networks
1 Rule_Learning
2 Reinforcement_Learning
3 Reinforcement_Learning
4 Probabilistic_Methods
...
2703 Genetic_Algorithms
2704 Genetic_Algorithms
2705 Genetic_Algorithms
2706 Case_Based
2707 Neural_Networks
Name: 1434, Length: 2708, dtype: object
# Compute the node positions for drawing the graph
pos = nx.spring_layout(G, seed=42)
# The graph can also be explored interactively with netwulf
import netwulf as nf
nf.visualize(G)
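# Plot the graph, coloring each node by its class
# (nodes in G are string paper IDs, so map each ID to its class code)
class_codes = dict(zip(features_df[0].astype(str), y.astype('category').cat.codes))
colors = [class_codes[node] for node in G.nodes()]
plt.figure(figsize=(20,20))
nx.draw(G, node_size=5, pos=pos, node_color=colors, cmap='Set1')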
- dimensions: size of the embedding vector
- walk_length: number of nodes in each random walk
- num_walks: number of random walks started from each node
- workers: number of parallel workers used
- window: size of the context window for the Word2Vec algorithm
- min_count: minimum number of times a node (a "word" in Word2Vec terms) must appear in the walks to be kept
- batch_words: number of words per batch passed to the Word2Vec workers
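To make the window parameter concrete, the following sketch lists the (center, context) pairs that a skip-gram model could draw from a single walk. It enumerates the maximal set of pairs for a given window; actual Word2Vec implementations may subsample this set, so treat it as an illustration rather than a description of gensim's internals. The helper name skipgram_pairs and the toy walk are assumptions for this example.

```python
def skipgram_pairs(walk, window):
    """Return every (center, context) pair within `window` steps of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# Hypothetical walk over four nodes
toy_walk = ["a", "b", "c", "d"]
pairs = skipgram_pairs(toy_walk, window=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

When window is at least as large as walk_length, every node in a walk is a potential context for every other node in that walk, so the walk length effectively bounds the context size.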
# Initialize Node2Vec (precomputes transition probabilities and generates the walks)
node2vec = Node2Vec(G, dimensions=30, walk_length=10, num_walks=10, workers=4)
# Train the model to compute the node embeddings
embeddings = node2vec.fit(window=10, min_count=1, batch_words=4)
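# embeddings.wv.vectors follows the model's vocabulary order, not the row order
# of features_df, so look up each paper's embedding by its (string) ID
nodes_embs = np.array([embeddings.wv[str(paper_id)] for paper_id in features_df[0]])
# Combine the features and the embeddings into a single DataFrame
combined_df = pd.concat([X, pd.DataFrame(nodes_embs)], axis=1)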
combined_df
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.795933 | 0.415507 | 0.386307 | -0.002911 | 0.348737 | -0.176543 | -0.572042 | 0.311663 | 0.001379 | -0.537796 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.708579 | -0.030970 | 0.032896 | 0.395219 | 0.258655 | 0.610313 | 0.104422 | 0.323234 | 0.029849 | -0.391190 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.314460 | -0.329790 | 0.277298 | -0.220116 | -0.535589 | 0.677442 | 0.126622 | 0.001399 | -0.307883 | -1.106844 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.371517 | -0.356573 | -0.512346 | 0.324187 | -0.277084 | 0.295629 | -0.101541 | 0.213723 | 0.791657 | -0.732938 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.434078 | -0.329435 | -0.676453 | 0.075276 | -0.032991 | 0.129072 | -0.591269 | 0.333925 | -0.214008 | -0.650508 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2703 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.040460 | 0.586297 | 0.837810 | 0.549309 | -0.047041 | 0.253388 | -0.153505 | 0.024145 | -0.864390 | -0.555191 |
| 2704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.010375 | -0.255602 | 1.187112 | 0.567876 | -1.027066 | 0.093529 | -1.075994 | -0.029149 | 0.516387 | -0.182631 |
| 2705 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.799883 | -0.932348 | -0.027764 | 0.010531 | 0.526442 | 0.202738 | -0.725372 | 0.014941 | 0.030126 | -1.108884 |
| 2706 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | -0.500194 | -0.201946 | -0.158183 | 0.342811 | 0.210704 | 0.389000 | -0.003850 | 0.426401 | 0.113739 | -0.203314 |
| 2707 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.852381 | -0.081238 | 0.889547 | 0.125725 | -0.578516 | -0.028220 | -1.426841 | 0.500267 | 0.780594 | -0.824779 |
2708 rows × 1463 columns
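Concatenating two tables side by side only makes sense if row i of each table describes the same paper. One way to guarantee that is to look up each embedding by node ID rather than by position, as in this toy sketch. All IDs and values below are hypothetical, and plain lists and dictionaries stand in for the DataFrame and the trained model.

```python
# Hypothetical feature table: (paper ID as int, bag-of-words features)
feature_rows = [
    (31336, [0, 1, 0]),
    (1061127, [1, 0, 0]),
    (1106406, [0, 0, 1]),
]

# Hypothetical embeddings keyed by *string* node ID, stored in a different order
embedding_by_node = {
    "1106406": [0.3, -0.1],
    "31336": [0.8, 0.4],
    "1061127": [-0.7, 0.0],
}

# Join on the ID so each row gets its own paper's embedding
combined = [
    features + embedding_by_node[str(paper_id)]
    for paper_id, features in feature_rows
]
for row in combined:
    print(row)
```

A positional concatenation would have attached the embedding of paper 1106406 to the features of paper 31336 here, which is the kind of silent misalignment that can make extra features look useless.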
# Visualize the embeddings with t-SNE (2 components, since the plot is 2D)
# n_iter matches scikit-learn 1.3 used here; newer releases call it max_iter
tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
X_tsne = tsne.fit_transform(nodes_embs)
plt.figure(figsize=(10, 8))
# Color each point by its class label
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y, palette='viridis')
plt.title('t-SNE Visualization of the Embeddings')
plt.show()
# Baseline: split the content features only (no embeddings) into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the RandomForest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predict on the test set
y_pred = rf_model.predict(X_test)
# Compute the accuracy of the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f'RandomForest model accuracy: {accuracy}')
RandomForest model accuracy: 0.7601476014760148
combined_df.shape
(2708, 1463)
# Split the combined data (content features + embeddings) into training and test sets
X_train, X_test, y_train, y_test = train_test_split(combined_df, y, test_size=0.2, random_state=42)
# Initialize and train the RandomForest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predict on the test set
y_pred = rf_model.predict(X_test)
# Compute the accuracy of the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f'RandomForest model accuracy: {accuracy}')
RandomForest model accuracy: 0.7195571955719557
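To put the two accuracy figures in perspective, the sketch below converts them back into counts of correctly classified test papers. It assumes scikit-learn's rule of rounding a fractional test_size up to a whole number of samples; the variable names are illustrative.

```python
import math

n_papers = 2708
n_test = math.ceil(0.2 * n_papers)  # size of the held-out test set

acc_content_only = 0.7601476014760148  # content features only
acc_combined = 0.7195571955719557      # content features + embeddings

# Accuracy is correct / n_test, so multiply back to recover the counts
correct_content_only = round(acc_content_only * n_test)
correct_combined = round(acc_combined * n_test)
print(n_test, correct_content_only, correct_combined)
print(correct_content_only - correct_combined)
```

Both numbers come from a single train/test split, so a comparison across several splits (for example with cross-validation) would give a more reliable picture of whether the embeddings help.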