**DDPG**, or **Deep Deterministic Policy Gradient**, is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. It combines the actor-critic approach with insights from [DQNs](https://paperswithcode.com/method/dqn): in particular, the insights that 1) the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and 2) the network is trained with a target Q network to give consistent targets during temporal difference backups. DDPG makes use of the same ideas along with [batch normalization](https://paperswithcode.com/method/batch-normalization).
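
The two DQN-style ingredients above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the toy buffer contents, the linear `target_actor` and `target_critic`, and the names `sample_td_targets` and `gamma` are all assumptions introduced here. It draws an uncorrelated minibatch from a replay buffer and builds the temporal difference target from the (frozen) target networks.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, capacity, batch_size = 3, 2, 1000, 32

# Replay buffer of (s, a, r, s', done) transitions; filled with toy data here.
buffer = {
    "s": rng.normal(size=(capacity, state_dim)),
    "a": rng.uniform(-1.0, 1.0, size=(capacity, action_dim)),
    "r": rng.normal(size=capacity),
    "s2": rng.normal(size=(capacity, state_dim)),
    "done": (rng.random(capacity) < 0.05).astype(float),
}

# Hypothetical linear target networks standing in for deep ones.
W_actor = rng.normal(size=(state_dim, action_dim))
w_critic = rng.normal(size=state_dim + action_dim)

def target_actor(s):
    # Deterministic policy: one continuous action per state.
    return np.tanh(s @ W_actor)

def target_critic(s, a):
    # Q-value estimate for each (state, action) pair.
    return np.concatenate([s, a], axis=1) @ w_critic

def sample_td_targets(gamma=0.99):
    # Uniform sampling breaks the temporal correlation between consecutive steps.
    idx = rng.integers(0, capacity, size=batch_size)
    s2, r, done = buffer["s2"][idx], buffer["r"][idx], buffer["done"][idx]
    # Targets come from the target networks, so they stay consistent across updates.
    q_next = target_critic(s2, target_actor(s2))
    return idx, r + gamma * (1.0 - done) * q_next

idx, y = sample_td_targets()
```

The critic would then be regressed towards `y` on the sampled batch; how often the target networks are refreshed is left out of this sketch.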

**CondConv**, or **Conditionally Parameterized Convolutions**, are a type of [convolution](https://paperswithcode.com/method/convolution) which learn specialized convolutional kernels for each example. In particular, we parameterize the convolutional kernels in a CondConv layer as a linear combination of $n$ experts $(\alpha_1 W_1 + \ldots + \alpha_n W_n) * x$, where $\alpha_1, \ldots, \alpha_n$ are functions of the input learned through gradient descent. To efficiently increase the capacity of a CondConv layer, developers can increase the number of experts. This can be more computationally efficient than increasing the size of the convolutional kernel itself, because the convolutional kernel is applied at many different positions within the input, while the experts are combined only once per input.
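
The efficiency argument rests on the linearity of convolution in its kernel: mixing the expert kernels once and convolving once gives the same result as convolving with every expert and mixing the outputs. The sketch below checks this on a 1-D signal. The routing function (a sigmoid of the mean-pooled input times a vector `r`) and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, kernel_size, length = 4, 3, 16

experts = rng.normal(size=(n_experts, kernel_size))  # W_1, ..., W_n
r = rng.normal(size=n_experts)                       # hypothetical routing weights
x = rng.normal(size=length)                          # one input example

def routing(x):
    # Example-dependent mixing coefficients alpha_1, ..., alpha_n.
    return 1.0 / (1.0 + np.exp(-x.mean() * r))

def condconv(x):
    # Combine the experts once per input, then run a single convolution.
    kernel = routing(x) @ experts
    return np.convolve(x, kernel, mode="valid")

def mixture_of_experts(x):
    # Equivalent but costlier: one convolution per expert, then mix the outputs.
    outputs = np.stack([np.convolve(x, w, mode="valid") for w in experts])
    return routing(x) @ outputs

y_fast = condconv(x)
y_slow = mixture_of_experts(x)
```

Both paths return the same output, but `condconv` performs one convolution regardless of the number of experts.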

CondConv

CondConv: Conditionally Parameterized Convolutions for Efficient Inference

DDPG

Continuous control with deep reinforcement learning

**TL;DR: CT-Layer is a GNN layer that rewires a graph in an inductive and parameter-free way according to the commute times distance (or effective resistance). We do this by learning the CT-embedding of the graph in a differentiable way.**

### Summary

**CT-Layer** learns the *commute times distance* between nodes (i.e. the *effective resistance distance*) in a **differentiable** way, instead of through the common spectral computation, and in a **parameter-free** manner, which is not the case for the heat kernel. This approach lets us pose the computation as an optimization problem inside a GNN, yielding a new layer that learns how to rewire a given graph in an optimal and **inductive** way.

In addition, **CT-Layer** learns *commute times embeddings* and can then compute them for any graph in an inductive way. The commute times embedding is closely related to the *eigenvalues* and *eigenvectors* of the graph Laplacian, because the CT embedding is just the scaled eigenvectors. The embedding must therefore satisfy orthogonality and normality, and CT-Layer effectively learns how to compute the spectrum of the Laplacian in a differentiable way.

Finally, recent work has found connections between the commute times distance and **curvature** (which is non-differentiable), establishing equivalences between them. **CT-Layer** can therefore also be seen as a differentiable version of curvature-based rewiring.

**What follows is a quick overview of the layer; I suggest going to the paper for a detailed explanation.**

### Spectral CT-Embedding downsides
In the literature (until the proposal of this method), the CT-embedding $\mathbf{Z}$ is either computed spectrally or approximated using the heat kernel, which depends heavily on the hyperparameter $t$. Neither route lets us build differentiable methods on top of this measure:
$$
\mathbf{Z}=\sqrt{vol(G)}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{F}^T \textrm{ given } \mathbf{L}=\mathbf{F}\mathbf{\Lambda}\mathbf{F}^T
$$

Then, the CT-distance is given by the squared Euclidean distance between the embeddings, $CT_{uv} = ||\mathbf{z_u}-\mathbf{z_v}||^2$. The spectral form is:

$$
\frac{CT_{uv}}{vol(G)} = \sum_{i=2}^n \frac{1}{\lambda_i} (\mathbf{f}_i(u)-\mathbf{f}_i(v))^2 
$$
where $\mathbf{f}_i$ is the $i$-th eigenvector of the graph Laplacian and $\lambda_i$ its eigenvalue.
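
As a concrete check of the spectral route, the sketch below computes the CT-embedding of a small graph with NumPy and compares the resulting distances against the pseudoinverse formula $CT_{uv} = vol(G)(L^+_{uu} + L^+_{vv} - 2L^+_{uv})$. The toy graph and variable names are assumptions. The eigenvectors are scaled by $1/\sqrt{\lambda_i}$, matching the $1/\lambda_i$ weights in the spectral form, and the trivial eigenpair ($\lambda_1 = 0$) is dropped, which assumes a connected graph.

```python
import numpy as np

# Toy connected graph: triangle 0-1-2 with a pendant node 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
L = np.diag(deg) - A
vol = deg.sum()

# eigh returns eigenvalues in ascending order; columns of F are eigenvectors.
lam, F = np.linalg.eigh(L)
lam, F = lam[1:], F[:, 1:]            # drop lambda_1 = 0 (connected graph assumed)

# Row u of Z is the embedding z_u (the transpose of the layout used in the text).
Z = np.sqrt(vol) * F / np.sqrt(lam)

# Pairwise squared Euclidean distances between embeddings give CT_uv.
diff = Z[:, None, :] - Z[None, :, :]
CT = (diff ** 2).sum(axis=-1)

# Cross-check against the Laplacian pseudoinverse.
Lp = np.linalg.pinv(L)
d = np.diag(Lp)
CT_pinv = vol * (d[:, None] + d[None, :] - 2.0 * Lp)
```

Edge (2, 3) is a bridge, so its effective resistance is 1 and `CT[2, 3]` equals `vol`. Note that the eigendecomposition is the step that does not fit naturally into gradient-based training.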

This embedding and its distances capture desirable properties of the graph, such as an understanding of its structure, or an embedding based on the spectrum that minimizes Dirichlet energies. However, **the spectral computation is not differentiable**.

### CT-Layer as an optimization problem: Differentiable, learnable and inductive CT-Layer
Given that $\mathbf{Z}$ minimizes Dirichlet energies subject to being orthogonal and normalized, we can formulate this problem as constraining neighboring nodes to have similar embeddings s.t. $\mathbf{Z}^T\mathbf{Z}=\mathbf{I}$.

$$
\mathbf{Z} = \arg\min_{\mathbf{Z}^T\mathbf{Z}=\mathbf{I}} \frac{\sum\_{u,v} ||\mathbf{z_u}-\mathbf{z_v}||^2\mathbf{A}\_{uv}}{\sum\_{u,v} \mathbf{Z}^2\_{uv} d_u}=\frac{Tr[\mathbf{Z}^T\mathbf{L}\mathbf{Z}]}{Tr[\mathbf{Z}^T\mathbf{D}\mathbf{Z}]}
$$

With the above elements we have a definition of **CT-Layer**, our rewiring layer: 
Given the matrix $\mathbf{X}\_{n\times F}$ encoding the features of the nodes after any message passing (MP) layer, $\mathbf{Z}\_{n\times O(n)}=\tanh(\textrm{MLP}(\mathbf{X}))$ learns the association $\mathbf{X}\rightarrow \mathbf{Z}$ while $\mathbf{Z}$ is optimized according to the loss 
$$
L\_{CT} = \frac{Tr[\mathbf{Z}^T\mathbf{L}\mathbf{Z}]}{Tr[\mathbf{Z}^T\mathbf{D}\mathbf{Z}]} + \left\|\frac{\mathbf{Z}^T\mathbf{Z}}{\|\mathbf{Z}^T\mathbf{Z}\|\_F} - \mathbf{I}\_n\right\|\_F
$$
This results in the following **resistance diffusion** $\mathbf{T}^{CT} = \mathbf{R}(\mathbf{Z})\odot \mathbf{A}$ (the Hadamard product between the resistance distance and the adjacency), which gives the subsequent MP layer a learnt convolution matrix as input.

As explained before, $\mathbf{Z}$ is the **commute times embedding matrix**, and the pairwise Euclidean distances between the learned embeddings are the **commute times distances** or resistance distances. **Therefore, once this layer is trained, it can compute the commute times embedding for a new graph and rewire that new, unseen graph in a principled way based on the commute times distance.**
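
A forward pass of the layer can be sketched as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation: the two-layer MLP with random weights `W1` and `W2`, the output width `n` standing in for $O(n)$, the identity sized to match $\mathbf{Z}^T\mathbf{Z}$, and the use of unnormalized squared distances for the resistance term are all choices made here. In practice the loss would be minimized with an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n, num_feats, hidden = A.shape[0], 5, 8
X = rng.normal(size=(n, num_feats))        # node features after an MP layer
W1 = rng.normal(size=(num_feats, hidden))  # hypothetical MLP weights
W2 = rng.normal(size=(hidden, n))

def ct_layer(X, A):
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Z = tanh(MLP(X)); row u is the embedding z_u.
    Z = np.tanh(np.maximum(X @ W1, 0.0) @ W2)
    gram = Z.T @ Z
    # Dirichlet term: Tr[Z^T L Z] / Tr[Z^T D Z].
    dirichlet = np.trace(Z.T @ L @ Z) / np.trace(Z.T @ D @ Z)
    # Orthogonality term: Frobenius distance of the normalized Gram matrix to I.
    ortho = np.linalg.norm(gram / np.linalg.norm(gram) - np.eye(gram.shape[0]))
    # Pairwise squared distances between embeddings (clipped against round-off).
    sq = (Z ** 2).sum(axis=1)
    R = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    # Resistance diffusion: keep distances only on existing edges.
    T = R * A
    return Z, dirichlet + ortho, T

Z, loss, T = ct_layer(X, A)
```

`T` has the same sparsity pattern as `A`, so the rewiring here reweights existing edges rather than adding new ones.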

### Preservation of Structure
Does this rewiring preserve the original structure? Let $G' = \textrm{Sparsify}(G, q)$ be a sampling algorithm of graph $G = (V, E)$, where edges $e \in E$ are sampled with probability $q\propto R_e$ (**proportional to the effective resistance, i.e. commute times**).
Then, for $n = |V|$ sufficiently large and $1/\sqrt{n}< \epsilon\le 1$, we need $O(n\log n/\epsilon^2)$ samples to satisfy:

$$
\forall \mathbf{x}\in\mathbb{R}^n:\; (1-\epsilon)\mathbf{x}^T\mathbf{L}\_G\mathbf{x}\le\mathbf{x}^T\mathbf{L}\_{G'}\mathbf{x}\le (1+\epsilon)\mathbf{x}^T\mathbf{L}\_G\mathbf{x}
$$

The intuition is that Dirichlet energies in $G'$ are within a factor of $(1\pm \epsilon)$ of the Dirichlet energies of the original graph $G$.
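
The sampling scheme can be illustrated with a short NumPy sketch. It assumes an unweighted graph, sampling with replacement, and the usual unbiased reweighting of each sampled edge by $1/(\text{num\_samples} \cdot p_e)$; the helper names `laplacian` and `sparsify`, the complete-graph example, and the sample count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian(n, edges, weights):
    # Weighted graph Laplacian assembled edge by edge.
    L = np.zeros((n, n))
    for (u, v), w in zip(edges, weights):
        L[u, u] += w
        L[v, v] += w
        L[u, v] -= w
        L[v, u] -= w
    return L

n = 30
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]  # complete graph
L_G = laplacian(n, edges, np.ones(len(edges)))

# Effective resistance of each edge from the Laplacian pseudoinverse.
Lp = np.linalg.pinv(L_G)
R = np.array([Lp[u, u] + Lp[v, v] - 2.0 * Lp[u, v] for u, v in edges])
p = R / R.sum()  # sampling probability proportional to R_e

def sparsify(num_samples):
    # Draw edges with replacement and reweight so that E[L_H] = L_G.
    counts = rng.multinomial(num_samples, p)
    return laplacian(n, edges, counts / (num_samples * p))

L_H = sparsify(num_samples=2000)

# Ratio of Dirichlet energies for a random test vector.
x = rng.normal(size=n)
ratio = (x @ L_H @ x) / (x @ L_G @ x)
```

For a good sparsifier `ratio` stays close to 1 for every `x`; the sketch only evaluates it on one random vector and makes no claim about the achieved $\epsilon$.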

Source	Continuous control with deep reinforcement learning
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com