
What is: Temporal Distribution Matching?

Source: AdaRNN: Adaptive Learning and Forecasting of Time Series
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Temporal Distribution Matching, or TDM, is a module in the AdaRNN architecture that matches the distributions of the discovered time periods in order to build a time-series prediction model $\mathcal{M}$. Given the learned time periods, the TDM module learns the common knowledge shared by different periods by matching their distributions. The learned model $\mathcal{M}$ is therefore expected to generalize better on unseen test data than methods that rely only on local or statistical information.

Within the context of AdaRNN, Temporal Distribution Matching aims to adaptively match the distributions between the RNN cells of two periods while capturing the temporal dependencies. TDM introduces an importance vector $\boldsymbol{\alpha} \in \mathbb{R}^{V}$ to learn the relative importance of the $V$ hidden states inside the RNN, where all the hidden states are weighted by the normalized $\boldsymbol{\alpha}$. Note that there is one $\boldsymbol{\alpha}$ for each pair of periods; we omit the subscripts when there is no confusion. In this way, the cross-period distribution divergence can be reduced dynamically.
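As a rough illustration (not the paper's exact procedure), such a pair-wise importance vector can be represented as $V$ learnable logits normalized with a softmax. The class and parameter names below are hypothetical; AdaRNN itself additionally refines these weights with a boosting-based importance evaluation rather than plain gradient descent.

```python
import torch
import torch.nn as nn

class ImportanceVector(nn.Module):
    """Hypothetical sketch: importance weights over the V hidden states of one period pair."""

    def __init__(self, num_states: int):
        super().__init__()
        # One raw logit per hidden state t = 1..V.
        self.logits = nn.Parameter(torch.zeros(num_states))

    def forward(self) -> torch.Tensor:
        # Normalized importance vector alpha in R^V (non-negative, sums to 1).
        return torch.softmax(self.logits, dim=0)
```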

Given a period pair $(\mathcal{D}_{i}, \mathcal{D}_{j})$, the loss of temporal distribution matching is formulated as:

$$\mathcal{L}_{tdm}\left(\mathcal{D}_{i}, \mathcal{D}_{j} ; \theta\right)=\sum_{t=1}^{V} \alpha_{i, j}^{t}\, d\left(\mathbf{h}_{i}^{t}, \mathbf{h}_{j}^{t} ; \theta\right)$$

where $\alpha_{i, j}^{t}$ denotes the distribution importance between the periods $\mathcal{D}_{i}$ and $\mathcal{D}_{j}$ at state $t$, and $d(\cdot, \cdot ; \theta)$ is a distribution distance (e.g., MMD, CORAL, or an adversarial discrepancy).
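A minimal PyTorch sketch of this pair-wise loss, assuming per-step hidden states of shape (batch, V, hidden) and using a simple mean-difference distance as a stand-in for $d$ (the helper names `linear_mmd` and `tdm_pair_loss` are illustrative, not from the paper):

```python
import torch

def linear_mmd(h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
    """Illustrative stand-in for d(.,.): squared distance between batch means.

    h_i, h_j: (batch, hidden) features of periods D_i and D_j at one time step.
    AdaRNN allows other distribution distances here (e.g., kernel MMD, CORAL,
    adversarial discrepancy).
    """
    return (h_i.mean(dim=0) - h_j.mean(dim=0)).pow(2).sum()

def tdm_pair_loss(states_i: torch.Tensor,
                  states_j: torch.Tensor,
                  alpha: torch.Tensor) -> torch.Tensor:
    """L_tdm(D_i, D_j) = sum_t alpha_t * d(h_i^t, h_j^t).

    states_i, states_j: (batch, V, hidden) hidden states of the two periods.
    alpha: (V,) normalized importance weights for this pair.
    """
    loss = states_i.new_zeros(())
    for t in range(states_i.shape[1]):
        loss = loss + alpha[t] * linear_mmd(states_i[:, t], states_j[:, t])
    return loss
```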

All the hidden states of the RNN can be computed by following the standard RNN forward pass. Denote by $\delta(\cdot)$ the computation of the next hidden state from the current input and the previous hidden state. The state computation can be formulated as

$$\mathbf{h}_{i}^{t}=\delta\left(\mathbf{x}_{i}^{t}, \mathbf{h}_{i}^{t-1}\right)$$
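For instance, with a standard PyTorch GRU the per-step hidden states $\mathbf{h}_{i}^{1}, \ldots, \mathbf{h}_{i}^{V}$ of a period are simply the sequence output of the forward pass; the dimensions below are made up for illustration:

```python
import torch
import torch.nn as nn

# Made-up dimensions for illustration.
batch, V, input_dim, hidden_dim = 32, 24, 6, 64

rnn = nn.GRU(input_size=input_dim, hidden_size=hidden_dim, batch_first=True)

x_i = torch.randn(batch, V, input_dim)  # one mini-batch of segments from period D_i
states_i, _ = rnn(x_i)                  # (batch, V, hidden): h_i^1, ..., h_i^V

# states_i[:, t] holds h_i^{t+1} = delta(x_i^{t+1}, h_i^t), with h_i^0 initialized to zeros.
```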

The final objective of temporal distribution matching (one RNN layer) is:

$$\mathcal{L}(\theta, \boldsymbol{\alpha})=\mathcal{L}_{\text{pred}}(\theta)+\lambda \frac{2}{K(K-1)} \sum_{i, j}^{i \neq j} \mathcal{L}_{tdm}\left(\mathcal{D}_{i}, \mathcal{D}_{j} ; \theta, \boldsymbol{\alpha}\right)$$

where $\lambda$ is a trade-off hyper-parameter and $K$ is the number of periods. Note that the second term averages the distribution distances over all pairwise periods. For computation, we take a mini-batch from each of $\mathcal{D}_{i}$ and $\mathcal{D}_{j}$, perform the forward pass through the RNN layers, and concatenate all hidden features. Then TDM can be performed using the equation above.
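Putting the pieces together, a hedged sketch of this objective could look like the following. It reuses the hypothetical `tdm_pair_loss` helper from above, assumes an MSE prediction loss for $\mathcal{L}_{\text{pred}}$, and uses invented function and argument names; it is an illustration under those assumptions, not the paper's implementation.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

def adarnn_objective(rnn: nn.Module,
                     pred_head: nn.Module,
                     period_batches: list,   # K tensors, each (batch, V, input_dim)
                     period_targets: list,   # K tensors, each (batch,)
                     alphas: list,           # one (V,) importance vector per period pair
                     lam: float = 1.0) -> torch.Tensor:
    """Sketch of L = L_pred + lambda * 2/(K(K-1)) * sum_{i != j} L_tdm(D_i, D_j).

    Assumes the `tdm_pair_loss` helper from the earlier sketch; `rnn` returns
    per-step hidden states and `pred_head` maps the last state to a prediction.
    """
    K = len(period_batches)

    # Forward pass on a mini-batch of every period: hidden states + prediction loss.
    states, pred_loss = [], 0.0
    for x, y in zip(period_batches, period_targets):
        h, _ = rnn(x)                                     # (batch, V, hidden)
        states.append(h)
        pred_loss = pred_loss + F.mse_loss(pred_head(h[:, -1]).squeeze(-1), y)

    # Average the TDM loss over all K(K-1)/2 period pairs.
    tdm_loss = 0.0
    for a, (i, j) in zip(alphas, itertools.combinations(range(K), 2)):
        tdm_loss = tdm_loss + tdm_pair_loss(states[i], states[j], a)
    tdm_loss = tdm_loss * 2.0 / (K * (K - 1))

    return pred_loss + lam * tdm_loss
```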