Movatterモバイル変換

サーベイ論文：Deep Learningを用いた経路予測の研究動向箕浦大晃平川翼山下隆義藤吉弘亘PRMU研究会October 9-10, 2020中部大学機械知覚&ロボティクスグループ

予測対象の未来の経路を予測する技術経路予測2応用先自動運転・事故防止・自律走行・ナビゲーションロボット

経路予測のカテゴリ3Bayesian-based Deep Learning-based Planning-based内部状態観測状態Update Updateex. model : Kalman FilterPast FuturePredictionmodelInputOutputex. model : LSTM，CNN ex. Model : IRL, RRT*StartGoalXOX観測状態にノイズを付与した値から未来の内部状態を更新し予測値を逐次推定予測対象の過去の軌跡から未来の行動を学習スタートからゴールまでの報酬値を最適化本サーベイの対象

Deep Learningによる経路予測の必要な要素4一人称視点車載カメラ視点鳥瞰視点View pointand the yellow-orange heatmaps aren ones are ground truth multi-futureSingle-Future Multi-Future18.51 / 35.84 166.1 / 329.528.68 / 49.87 184.5 / 363.2PIE JAADMethod MSE CMSE CFMSE MSE CMSE CFM0.5s 1s 1.5s 1.5s 1.5s 0.5s 1s 1.5s 1.5s 1.5Linear 123 477 1365 950 3983 223 857 2303 1565 611LSTM 172 330 911 837 3352 289 569 1558 1473 576B-LSTM[5] 101 296 855 811 3259 159 539 1535 1447 561PIEtraj 58 200 636 596 2477 110 399 1248 1183 478Table 3: Location (bounding box) prediction errors over varying future time steps. MSE in pixels is capredicted time steps, CMSE and CFMSE are the MSEs calculated over the center of the bounding boxpredicted sequence and only the last time step respectively.MSEMethod 0.5s 1s 1.5sLinear 0.87 2.28 4.27LSTM 1.50 1.91 3.00PIEspeed 0.63 1.44 2.65Long Short-Term MemoryConvolutional Neural NetworkGated Recurrent UnitTemporal Convolutional NetworkModel!"!#!$!%!&!'t=t input layert=t-1 hidden layerInput Gatet=t output layert=t+1 hidden layerMemory CellForget GateOutput Gate対象クラス対象間のインタラクション静的環境情報Context

Deep Learningによる経路予測の必要な要素5一人称視点車載カメラ視点鳥瞰視点View point対象クラス対象間のインタラクション静的環境情報ContextLong Short-Term MemoryConvolutional Neural NetworkGated Recurrent UnitTemporal Convolutional NetworkModeland the yellow-orange heatmaps aren ones are ground truth multi-futureSingle-Future Multi-Future18.51 / 35.84 166.1 / 329.528.68 / 49.87 184.5 / 363.2PIE JAADMethod MSE CMSE CFMSE MSE CMSE CFM0.5s 1s 1.5s 1.5s 1.5s 0.5s 1s 1.5s 1.5s 1.5Linear 123 477 1365 950 3983 223 857 2303 1565 611LSTM 172 330 911 837 3352 289 569 1558 1473 576B-LSTM[5] 101 296 855 811 3259 159 539 1535 1447 561PIEtraj 58 200 636 596 2477 110 399 1248 1183 478Table 3: Location (bounding box) prediction errors over varying future time steps. MSE in pixels is capredicted time steps, CMSE and CFMSE are the MSEs calculated over the center of the bounding boxpredicted sequence and only the last time step respectively.MSEMethod 0.5s 1s 1.5sLinear 0.87 2.28 4.27LSTM 1.50 1.91 3.00PIEspeed 0.63 1.44 2.65!"!#!$!%!&!'t=t input layert=t-1 hidden layerInput Gatet=t output layert=t+1 hidden layerMemory CellForget GateOutput Gate

移動対象同士の衝突を避ける経路の予測●移動対象間の距離値や方向からインタラクション情報を求める経路予測におけるインタラクションとは6

Deep Learningを用いた予測手法の傾向と分類72016interactionother2020Social-LSTM[A. Alahi+, CVPR, 2016]DESIRE[N. Lee+, CVPR, 2017]Conv.Social-Pooling[N. Deo+, CVPRW, 2018]SoPhie[A. Sadeghian+, CVPR, 2019]Social-BiGAT[V. Kosaraju+, NeurIPS, 2019]Social-STGCNN[A. Mohamedl+, CVPR, 2020]Social-GAN[A. Gupta+, CVPR, 2018]Next[J. Liang+, CVPR, 2019]STGAT[Y. Huang+, ICCV, 2019]Trajectron[B. Ivanovic+, ICCV, 2019]Social-Attention[A. Vemula+, ICRA, 2018]Multi-Agent Tensor Fusion[T. Zhao+, CVPR, 2019]MX-LSTM[I. Hasan+, CVPR, 2018]CIDNN[Y. Xu+, CVPR, 2018]SR-LSTM[P. Zhang+, CVPR, 2019]Group-LSTM[N. Bisagno+, CVPR, 2018]Reciprocal Network[S. Hao+, CVPR, 2020]PECNet[K. Mangalam+, ECCV, 2020]RSBG[J. SUN+, CVPR, 2020]STAR[C. Yu+, ECCV, 2020]Behavior CNN[S. Yi+, ECCV, 2016]Future localization in ﬁrst-person videos[T. Yagi+, CVPR, 2018]Fast and Furious[W. Luo+, CVPR, 2018]OPPU[A. Bhattacharyya+, CVPR, 2018]Object Attributes and Semantic Segmentation[H. Minoura+, VISAPP, 2019]Rule of the Road[J. Hong+, CVPR, 2019]Multiverse[J. Liang+, CVPR, 2020]Trajectron++[T. Salzmann+, ECCV, 2020]AttentionモデルPoolingモデル近年ではAttentionモデルと複数経路を予測する手法が主流multimodal paths

インタラクションを用いた経路予測モデル8予測対象と他対象間の位置情報を共にPoolingすることで，衝突を避ける経路予測が可能他対象についてのAttentionを求めることで，予測対象が誰にどの程度着目してったかを視覚的に捉えることが可能Pooling モデル Attention モデル

●各カテゴリに属する経路予測手法毎の特徴をまとめる- インタラクションあり• Poolingモデル• Attentionモデル- インタラクションなし (Other)●定量的評価のためのデータセット，評価指標も紹介●代表的モデルを使用して，各モデルの精度と予測結果について議論本サーベイの目的9Deep Learningを用いた経路予測手法の動向調査

本サーベイの目的10Deep Learningを用いた経路予測手法の動向調査●各カテゴリに属する経路予測手法毎の特徴をまとめる- インタラクションあり• Poolingモデル• Attentionモデル- インタラクションなし (Other)●定量的評価のためのデータセット，評価指標も紹介●代表的モデルを使用して，各モデルの精度と予測結果について議論

複数の歩行者の移動経路を同時に予測●歩行者同士の衝突を避けるためにSocial-Pooling layer (S-Pooling)を提案- 予測対象周辺の他対象の位置と中間層出力を入力- 次時刻のLSTMの内部状態に歩行者同士の空間的関係が保持- 衝突を避ける経路予測が可能Social LSTM [A. Alahi+, CVPR, 2016]11Linear!Social-LSTM!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!Linear!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!Linear!Social-LSTM!GT!SF [73]!A. Alahi, et al., “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” CVPR, 2016.

対象間のインタラクションに加え周囲の環境情報を考慮●交差点や道沿い端などの障害物領域を避ける経路予測を実現●CVAEでエンコードすることで複数の経路を予測可能Ranking & Reﬁnement Moduleで予測経路にランキング付け●経路を反復的に改善することで予測精度向上を図るDESIRE [N. Lee+, CVPR, 2017]12InputKLD Lossfc+softmaxr1 rtr2fcYSample Generation Module Ranking & Re nement ModuleRNN Encoder1GRU GRU GRURNN Encoder2GRU GRU GRURNN Decoder1GRU GRU GRURNN Decoder2GRU GRU GRUCVAEfcfczXYRegressionScoringfc fc fcYYReconLossCNNSCF SCF SCFFeaturePooling(I)Iterative FeedbackconcatmaskadditionFigure 2. The overview of proposed prediction framework DESIRE. First, DESIRE generates multiple plausible prediction samples ˆY via aCVAE-based RNN encoder-decoder (Sample Generation Module). Then the following module assigns a reward to the prediction samplesat each time-step sequentially as IOC frameworks and learns displacements vector ∆ ˆY to regress the prediction hypotheses (RankingDESIRE-S Top1DESIRE-S Top10DESIRE-SI Top1DESIRE-SI Top10LinearRNN EDRNN ED-SIXYMethodLinearRNN EDRNN ED-SICVAE 1CVAE 10%DESIRE-S-ITDESIRE-S-ITDESIRE-S-ITDESIRE-S-ITDESIRE-SI-IDESIRE-SI-IDESIRE-SI-IDESIRE-SI-ILinearRNN EDRNN ED-SICVAE 1CVAE 10%DESIRE-S-ITLinearRNN EDRNN ED-SIXYXY(a) GTFigure 6. KITTI resulDESIRE-S Top1DESIRE-S Top10DESIRE-SI Top1DESIRE-SI Top10LinearRNN EDRNN ED-SIXY真値予測値インタラクションありTop1Top10インタラクションなしTop1Top10N. Lee, et al., “DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents,” CVPR, 2017.

高速道路上で隣接する自動車同士のインタラクションを考慮した予測手法●インタラクション情報に空間的意味合いを持たせるConvolution Social Poolingを提案- LSTM Encoderで得た軌跡特徴量を固定サイズのSocial Tensorに格納- CNNでインタラクションの特徴量を求める- 予測車の特徴量と連結し，LSTM Decoderで経路を予測Convolutional Social-Pooling [N. Deo+, CVPRW, 2018]13Figure 3. Proposed Model: The encoder is an LSTM with shared weights that learns vehicle dynamics based on track histories. Theconvolutional social pooling layers learn the spatial interdependencies of of the tracks. Finally, the maneuver based decoder outputs amulti-modal predictive distribution for the future motion of the vehicle being predictedConvolutional Social Pooling for Vehicle Trajectory PredictionNachiket Deo Mohan M. TrivediUniversity of California, San DiegoLa Jolla, 92093ndeo@ucsd.edu mtrivedi@ucsd.eduAbstractForecasting the motion of surrounding vehicles is a crit-ical ability for an autonomous vehicle deployed in complextraffic. Motion of all vehicles in a scene is governed by thetraffic context, i.e., the motion and relative spatial config-uration of neighboring vehicles. In this paper we proposean LSTM encoder-decoder model that uses convolutionalsocial pooling as an improvement to social pooling lay-ers for robustly learning interdependencies in vehicle mo-tion. Additionally, our model outputs a multi-modal predic-tive distribution over future trajectories based on maneuverclasses. We evaluate our model using the publicly availableNGSIM US-101 and I-80 datasets. Our results show im-provement over the state of the art in terms of RMS valuesof prediction error and negative log-likelihoods of true fu-Figure 1. Imagine the blue vehicle is an autonomous vehicle inthe traffic scenario shown. Our proposed model allows it to makemulti-modal predictions of future motion of it’s surrounding ve-hicles, along with prediction uncertainty shown here for the redv1[cs.CV]15May2018N. Deo, et al., “Convolutional Social Pooling for Vehicle Trajectory Prediction,” CVPRW, 2018.

歩行者の視線情報を活用した経路予測手法●頭部を中心とした視野角内の他対象のみPooling処理- 予測対象の頭部方向，他対象との距離値からPooling処理する対象を選択●軌跡，頭部方向，インタラクション情報をLSTMへ入力- 視野角内にいる他対象との衝突を避ける経路予測を実現- 視線情報を任意に変更することで，任意方向に向かった経路予測が可能MX-LSTM [I. Hasan+, CVPR, 2018]143. Our approachIn this section we present the MX-LSTM, capable ofjointly forecasting positions and head orientations of an in-dividual thanks to the presence of two information streams:Tracklets and vislets.3.1. Tracklets and visletsGiven a subject i, a tracklet (see Fig. 1a) ) is formedby consecutive (x, y) positions on the ground plane,{x(i)t }t=1,...,T , x(i)t = (x, y) ∈ R2, while a vislet is formedby anchor points {a(i)t }t=1,...,T , with a(i)t = (ax, ay) ∈ R2indicating a reference point at a ﬁxed distance r from thecorresponding x(i)t , towards which the face is oriented1. Inb)da))(ita)(itx)(itr)(1itxc)1txtatxttt(i) (i)e(x,i)t = φ x(i)t , Wxe(a,i)t = φ a(i)t , Wawhere the embedding function φ consists in ajection through the embedding weigths Wx andD-dimensional vector, multiplied by a RELU nwhere D is the dimension of the hidden space.3.2. VFOA social poolingThe social pooling introduced in [3] is an effto let the LSTM capture how people move inscene avoiding collisions. This work considers ainterest area around the single pedestrian, in whden states of the the neighbors are consideredthose which are behind the pedestrian. In our caprove this module using the vislet informationing which individuals to consider, by building atum of attention (VFOA), that is a triangle originx(i)t , aligned with a(i)t , and with an aperture givengle γ and a depth d; these parameters have beencross-validation on the training partition of the Tdataset (see Sec. 5).Our view-frustum social pooling is a No × Nsor, in which the space around the pedestrian is dFigure 3. Qualitative results: a) MX-LSTM b) Ablation qualitative study on Individual MX-LSTM (better in color).I. Hasan, et al., “MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses,” CVPR, 2018.

グループに関するインタラクションを考慮した経路予測手法●運動傾向が類似する歩行者同士をグループとみなす●予測対象が属するグループ以外の個人の情報をPooling- 異なるグループとの衝突を避ける経路を予測Group-LSTM [N. Bisagno+, ECCVW, 2018]15Group LSTM 7Fig. 3. Representation of the Social hidden-state tensor Hit . The black dot represents the pedes-trian of interest pedi. Other pedestrians pedj (∀j = i) are shown in different color codes, namelygreen for pedestrians belonging to the same set, and red for pedestrians belonging to a differentGroup LSTM 9ing to the studies in interpersonal distances [15, 10], socially correlated people tendto stay closer in their personal space and walk together in crowded environments ascompared to pacing with unknown pedestrians. Pooling only unrelated pedestrians willfocus more on macroscopic inter-group interactions rather than intra-group dynamics,thus allowing the LSTM network to improve the trajectory prediction performance.Collision avoidance influences the future motion of pedestrians in a similar manner iftwo pedestrians are walking together as in a group.In Tables 2, 3 and Fig. 4, we display some demos of predicted trajectories whichhighlight how our Group-LSTM is able to predict pedestrian trajectories with betterprecision, showing how the prediction is improved when we pool in the social tensor ofeach pedestrian only pedestrians not belonging to his group.In Table 2, we show how the prediction of two pedestrians walking together in thecrowd improves when they are not pooled in each other’s pooling layer. When the twopedestrians are pooled together, the network applies on them the typical repulsion forceto avoid colliding with each other. Since they are in the same group, they allow the otherpedestrian to stay closer in they personal space.In Fig. 4 we display the sequences of two groups walking toward each other. InTable 3, we show how the prediction for the two groups is improved with respect to theSocial LSTM. While both prediction are not very accurate, our Group LSTM performbetter because it is able to forecast how pedestrian belonging to the same group willstay together when navigating the environment.Name Scene Our Group-LSTM Social-LSTMETHUnivFrame2425Table 2. ETH dataset: the prediction is improved when pooling in the social tensor of each pedes-trian only pedestrians not belonging to his group. The green dots represent the ground truth tra-jectories; the blue crosses represent the predicted paths.5 ConclusionIn this work, we tackle the problem of pedestrian trajectory prediction in crowdedscenes. We propose a novel approach, which combines the coherent filtering algorithmwith the LSTM networks. The coherent filtering is used to identify pedestrians walkingtogether in a crowd, while the LSTM network is used to predict the future trajectories10 Niccoló Bisagno, Bo Zhang and Nicola Conci(a) (b) (c) (d)Fig. 4. Sequences taken from the UCY dataset. It displays an interaction example between twogroups, which will be further analyzed in Table 3.Name Scene Our Group-LSTM Social-LSTMUCYUnivFrame1025Table 3. We display how the prediction is improved for two groups walking in opposite direc-tions. The green dots represent the ground truth trajectories, while the blue crosses represent thepredicted paths.N. Bisagno, et al., “Group LSTM: Group Trajectory Prediction in Crowded Scenarios,” ECCVW, 2018.

GANを用いて複数経路を予測する手法●Generator：複数の予測経路をサンプリング- LSTM Encoderの特徴量を用いて，Pooling Moduleでインタラクション情報を出力- 各出力とノイズベクトルを連結し，LSTM Decoderで未来の複数の予測経路を出力●Discriminator：予測経路と実際の経路を判別- 敵対的に学習させることで，実際の経路と騙す予測経路を生成することを期待Social-GAN [A. Gupta+, CVPR, 2018]16Figure 2: System overview. Our model consists of three key components: Generator (G), Pooling Module, and Discriminator(D). G takes as input past trajectories Xi and encodes the history of the person i as Hti . The pooling module takes as inputall Htobsi and outputs a pooled vector Pi for each person. The decoder generates the future trajectory conditioned on HtobsiFigure 5: Comparison between our model wavoidance scenarios: two people meeting (1meeting at an angle (4). For each exampleto pooling, SGAN-P predicts socially accepFigure 5: Comparison between our model without pooling (SGAN, top) and with pooli予測分布A. Gupta, et al., “Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks,” CVPR, 2018.

歩行者や自動車等の異なる移動対象とのインタラクションを考慮した予測手法●インタラクションに加えシーンコンテキストを共同でモデル化- 動的と静的の2つの物体との衝突を避ける経路を予測可能●Multi-Agent Tensor Fusion- CNNでシーンに関するコンテキスト情報を抽出- 移動対象毎の位置情報から空間的グリッドにLSTMの出力を格納- コンテキスト情報と空間的グリッドをチャネル方向に連結し，CNNでFusion- Fusionした特徴量からLSTM Decoderで経路を予測Multi-Agent Tensor Fusion [T. Zhao+, CVPR, 2019]17Figure 5: Ablative results on Stanford Drone dataset. From left to right are results from MATF Multi Agent Scene, MAT入力値真値予測値T. Zhao, et al., “Multi-Agent Tensor Fusion for Contextual Trajectory Prediction,” CVPR, 2019.

2つのネットワークを結合した相互学習による経路予測手法●Forward Prediction Network：一般的な軌道予測手法 (観測 → 予測)●Backward Prediction Network：一般的な軌道予測手法の逆 (予測 → 観測)相互制約に基づいてAdversarial Attackの概念に基づくモデルを構築●入力軌跡をiterativeに変更●モデルの出力と一致させることで，新しい概念(相互攻撃)と呼ぶモデルを開発Reciprocal Network [S. Hao+, CVPR, 2020]18orks for Human Trajectory Predictioniqun Zhao, and Zhihai Heersity of Missouri,hezhi}@mail.missouri.eduwardwardormsy dif-prop-earn- Figure 1. Illustration of our idea of reciprocal learning for human3. The generator is constructed by a decoder LSTM. Sim-ilar to the conditional GAN [24], a white noise vector Zis sampled from a multivariate normal distribution. Then, amerge layer is used in our proposed network which concate-nates all encoded features mentioned above with the noisevector Z. We take this as the input to the LSTM decoderto generate the candidate future paths for each human. Thediscriminator is built with an LSTM encoder which takesthe input as randomly chosen trajectory from either groundtruth or predicted trajectories and classiﬁes them as “real”or “fake”. Generally speaking, the discriminator classiﬁesthe trajectories which are not accurate as “fake” and forcesthe generator to generate more realistic and feasible trajec-tories.Within the framework of our reciprocal learning for hu-man trajectory prediction, let Gθ: X → Y and Gφ: Y →Figure 4. Illustration of the proposed attack method.S. Hao, et al., “Reciprocal Learning Networks for Human Trajectory Prediction,” CVPR, 2020.

Predicted Endpoint Conditioned Network (PECNet)●予測最終地点 (エンドポイント)を重視した学習を行う経路予測手法- でエンドポイントを予測し，Past Encodingの出力と連結 (concat encoding)- 連結した特徴量からSocial Pooling内の各パラメタ特徴量を取得- 歩行者 x 歩行者のSocial Maskで歩行者間のインタラクションを求める- concat encodingとインタラクション情報からで経路を予測PECNet [K. Mangalam+, ECCV, 2020]196 K. Mangalam, H. Girase, S. Agarwal, K. Lee, E. Adeli, J. Malik, A. GaidonDlatentPECNet: Pedestrian Endpoint Conditioned Trajectory Prediction Network 13lower prediction error than way-points in the middle! This in a nutshell, con-firms the motivation of this work.E↵ect of Number of samples (K): All the previous works use K = 20 sam-ples (except DESIRE which uses K = 5) to evaluate the multi-modal predictionsfor metrics ADE & FDE. Referring to Figure 5, we see the expected decreas-ing trend in ADE & FDE with time as K increases. Further, we observe thatour proposed method achieves the same error as the previous works with muchsmaller K. Previous state-of-the-art achieves 12.58 [39] ADE using K = 20 sam-ples which is matched by PECNet at half the number of samples, K = 10. Thisfurther lends support to our hypothesis that conditioning on the inferred way-point significantly reduces the modeling complexity for multi-modal trajectoryforecasting, providing a better estimate of the ground truth.Lastly, as K grows large (K ! 1) we observe that the FDE slowly gets closerto 0 with more number of samples, as the ground truth Gc is eventually found.However, the ADE error is still large (6.49) because of the errors in the rest ofthe predicted trajectory. This is in accordance with the observed ADE (8.24) forthe oracle conditioned on the last observed point (i.e. 0 FDE error) in Fig. 4.Design choice for VAE: We also evaluate our design choice of using the in-ferred future way-points ˆGc for training subsequent modeules (social pooling &prediction) instead of using the ground truth Gc. As mentioned in Section 3.2,this is also a valid choice for training PECNet end to end. Empirically, we findFig. 6. Visualizing Multimodality: We show visualizations for some multi-modal入力値真値予測値K. Mangalam, et al., “It is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction,” ECCV, 2020.Pfuture

グラフ構造を時空間方向に拡張した経路予測手法●Node：対象の位置情報●Edge：対象間の空間情報，時間方向へ伝播する対象自身の情報●NodeとEdgeからAttentionを求め，注目対象を導出- 注目対象を回避する経路予測が可能- Attentionを求めることで視覚的な説明が可能Social-Attention [A. Vemula+, ICRA, 2018]21ponding factor graphel(a)(b)(a)(b)Fig. 3. Architecture of EdgeRNN (left), Attention module (middle) and NodeRNN (right)and Interacting Gaussian Processes [4]. Hence, we choseSocial LSTM as the baseline to compare the performanceof our method.C. Quantitative ResultsThe prediction errors for all the methods on the 5 crowdA. Vemula, et al., “Social Attention: Modeling Attention in Human Crowds,” ICRA, 2018.

移動対象の行動による危険度をAttentionで推定し，行動の特徴に重み付け●Motion Encoder Moduleで対象毎の行動をエンコード●Location Encoder Moduleで対象毎の位置情報をエンコード- 予測対象と全他対象の内積を求め，Softmaxで他対象の特徴に重み付け●2つのModuleを連結し，次時刻以降の経路を予測CIDNN [Y. Xu+, CVPR, 2018]22Nth1fc 2fc 3fc1tS1fc 2fc 3fcitS1fc 2fc 3fcNtS1thithÖÖ,1ita,i ita,i NtaÖÖ1 1 11 2, ,..., tS S SLSTMLSTM1tz1 2, ,...,i i itS S SLSTMLSTMitz1 2, ,...,N N NtS S SLSTMLSTMNtzÖÖÖÖDisplacement Prediction Module1itSd +itcfc#ƒƒƒLocation Encoder Module Motion Encoder ModuleCrowdInteractionƒ #InnerproductScalarmultiplicationSumÖÖÖÖFigure 2. The architecture of crowd interaction deep neural network (CIDNN).Successful CasesFigure 3. Qualitative results: history trajSuccessful CasesFigure 3. Qualitative results: history trajectory (red), ground truth (blueSuccessful CasesFigure 3. Qualitative results: history trajectory (red), ground truth (blue), and predicted trajectories from ou入力値真値予測値Y. Xu, et al., “Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction,” CVPR, 2018.

現時刻のインタラクション情報から予測対象の未来の予測経路を更新●States refinement module内の2つの機構で高精度な経路予測を実現- 他対象との衝突を防ぐPedestrian-aware attention (PA)- 他対象の動きから，予測対象自身が経路を選択するMotion gate (MG)●MGで衝突を起こしそうな対象の動きから経路を選択●PAで予測対象近隣の他対象に着目SR-LSTM [P. Zhang+, CVPR, 2019]235,36,39]. Vemulafrom the hiddengives an impor-et al. [33] utilizeght the importantpairwise velocitywho are in simi-ms to selects mo-strian during thelly aware neigh-d in previous ap-ramework. Thislution NetworksLSTMLSTMLSTMLSTMSRt t+1LSTMLSTMStates refinement moduleLSTM statesInput thelocation toLSTMOuput theprediction...selects the features, where each row is related to a certaindimension of hidden feature.In Fig.6, the first column shows the trajectory patternscaptured by hidden features started from origin and endedat the dots, which are extracted in similar way as Fig.2(a).The motion gate for a feature considers pairwise input tra-jectories with similar configurations. Some examples forhigh response of the gate are shown in the other columns ofFig.6. In these pairwise trajectory samples, the red and blueones are respectively the trajectories of pedestrian i and j,and the time step we calculate the motion gate are shownwith dots (where the trajectory ends). These pairwise sam-ples are extracted by searching from database with highestactivation for the motion gate neuron. High response of gatemeans that the corresponding feature is selected.Figure 6. Selected feature patterns by motion gate. Each row isrelated to a hidden neuron (feature) of LSTM. Column 1: Activa-tion trajectory pattern of the hidden feature. Column 2-6: Pairwisetrajectory examples (end with solid dots) having high activation to3) Row 3: Thiconsiders moretion. 4) Row 4hidden featureattention on thwalk towards hPedestrian-ples of the pedLSTM in Fig.7to the close netention, 2) theoften largely forefinement tenbors with groulonger time ranFigure 7. Illustrmagenta represethe dashed circlement. Larger cirrepresents the taones are his/hertheir walking dir5. Conclusioselects the features, where each row is related to a certaindimension of hidden feature.In Fig.6, the first column shows the trajectory patternscaptured by hidden features started from origin and endedat the dots, which are extracted in similar way as Fig.2(a).The motion gate for a feature considers pairwise input tra-jectories with similar configurations. Some examples forhigh response of the gate are shown in the other columns ofFig.6. In these pairwise trajectory samples, the red and blueones are respectively the trajectories of pedestrian i and j,and the time step we calculate the motion gate are shownwith dots (where the trajectory ends). These pairwise sam-ples are extracted by searching from database with highestactivation for the motion gate neuron. High response of gatemeans that the corresponding feature is selected.3) Row 3: This case is similar to row 2. This gate elementconsiders more distant neighbor walking in opposite direc-tion. 4) Row 4: The neighbor in blue is static, the selectedhidden feature shows that pedestrian i in red potentially payattention on this stationary neighbor in case he is about towalk towards him/her.Pedestrian-wise attention. We illustrate some exam-ples of the pedestrian-wise attention expected by our SR-LSTM in Fig.7. It shows that 1) dominant attention is paidto the close neighbors, while the others also take slight at-tention, 2) the attention given by the first refinement layeroften largely focuses on the close neighbors, and the secondrefinement tends to strengthen the effect of farther neigh-bors with group behavior or may influence the pedestrian inlonger time range.Pedestrian-aware attentionMotion gate 予測対象予測対象P. Zhang, et al., “SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction,” CVPR, 2019.

将来の経路と行動を同時に予測するモデルを提案●Person Behavior Module：歩行者の外見情報と骨格情報をエンコード●Person Interaction Module：周辺の静的環境情報と自動車等の物体情報をエンコード●Visual Feature Tensor Q：上記2つの特徴と過去の軌跡情報をエンコード●Trajectory Generator：将来の経路を予測●Activity Prediction：予測最終時刻の行動を予測Next [J. Liang+, CVPR, 2019]24Figure 2. Overview of our model. Given a sequence of frames containing the person for prediction, our model utilizes person behaviormodule and person interaction module to encode rich visual semantics into a feature tensor. We propose novel person interaction modulethat takes into account both person-scene and person-object relations for joint activities and locations prediction.3. Approach RoIAlignCNNFigure 6. (Better viewed in color.) Qualitative comparison between our method and the baselines. Yellow path is the observable trajectoryand Green path is the ground truth trajectory during the prediction period. Predictions are shown as Blue heatmaps. Our model also predictsthe future activity, which is shown in the text and with the person pose template.Figure 7. (Better viewed in color.) Qualitative analysis of oMethod ETH HOTEL UNModelLinear 1.33 / 2.94 0.39 / 0.72 0.82LSTM 1.09 / 2.41 0.86 / 1.91 0.61J. Liang, et al., “Peeking into the Future: Predicting Future Person Activities and Locations in Videos,” CVPR, 2019.

歩行者同士のインタラクションに加え，静的環境情報を考慮した予測手法●Physical Attention：静的環境に関するAttentionを推定●Social Attention：動的物体に関するAttentionを推定各AttentionとLSTM Encoderの出力から将来の経路を予測SoPhie [A. Sadeghian+, CVPR, 2019]25Physical Attention:!""#$Social Attention:!""%&GeneratorAttention Module for i-th personGAN Moduleconcat.DiscriminatorLSTMLSTMLSTMLSTMLSTMLSTMconcat.concat.concat.zzzdecoder1st agenti-th agentN-th agentAttention Module for 1st personAttention Module for n-th personCNNFeature Extractor Modulei-th agentcalc. relativerelativerelativerelativeLSTMLSTMLSTMencoder(a) (b) (c)N-th agent1st agentFigure 2. An overview of SoPhie architecture. Sophie consists of three key modules including: (a) A feature extractor module, (b) Anattention module, and (c) An LSTM based GAN module.3.3. Feature extractors where πj is the index of the other agents sorted according totheir distances to the target agent i. In this framework, eachNexus 6 LiFigure 3. Using the generator to sample trajectories and the discriminmaps for SDD scenes. Maps are presented in red, and generated onlyGround Truth Social LSTM Social GAN Sophie (Ours)Figure 4. Comparison of Sophie’s predictions against the groundtruth trajectories and two baselines. Each pedestrian is displayedwith a different color, where dashed lines are observed trajecto-A. Sadeghian, et al., “SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints,” CVPR, 2019.

インタラクションを時間方向へ伝搬した予測手法インタラクションを考慮するためにGraph Attention Network (GAT)を適用●GAT：グラフ構造を取り入れたAttentionに基づくGraph Convolutional Networks- シーン全体にいる他対象の関係の重要度をAttention機構で学習●GATで求めた特徴を時間方向に伝播することで時空間のインタラクションを考慮- 衝突の可能性がある対象の情報を過去の経路から導出可能STGAT [Y. Huang+, ICCV, 2019]26GAT GAT GAT GATcccEncoder State DecoderGAT Graph Attention NetworkConcat Noisec· · ·<latexit sha1_base64="RyHhXBLV/0cOSbkT6YVaYO8lbL8=">AAAB7XicbVBNS8NAEJ34WetX1aOXYBE8lUQFPXgoePFYwX5AG8pms2nXbnbD7kQoof/BiwdFvPp/vPlv3LY5aOuDgcd7M8zMC1PBDXret7Oyura+sVnaKm/v7O7tVw4OW0ZlmrImVULpTkgME1yyJnIUrJNqRpJQsHY4up367SemDVfyAccpCxIykDzmlKCVWj0aKTT9StWreTO4y8QvSBUKNPqVr16kaJYwiVQQY7q+l2KQE42cCjYp9zLDUkJHZMC6lkqSMBPks2sn7qlVIjdW2pZEd6b+nshJYsw4CW1nQnBoFr2p+J/XzTC+DnIu0wyZpPNFcSZcVO70dTfimlEUY0sI1dze6tIh0YSiDahsQ/AXX14mrfOaf1Hz7i+r9ZsijhIcwwmcgQ9XUIc7aEATKDzCM7zCm6OcF+fd+Zi3rjjFzBH8gfP5A60zjys=</latexit>tobs<latexit sha1_base64="vcpPldxy0fXoWuPkuol+dvCEp1Q=">AAAB7nicbVDLSgNBEOyNrxhfUY9eBoPgKeyqYI4BLx4jmAckS5idTJIhszPLTK8QlnyEFw+KePV7vPk3TpI9aGJBQ1HVTXdXlEhh0fe/vcLG5tb2TnG3tLd/cHhUPj5pWZ0axptMS206EbVcCsWbKFDyTmI4jSPJ29Hkbu63n7ixQqtHnCY8jOlIiaFgFJ3Uxn6mIzvrlyt+1V+ArJMgJxXI0eiXv3oDzdKYK2SSWtsN/ATDjBoUTPJZqZdanlA2oSPedVTRmNswW5w7IxdOGZChNq4UkoX6eyKjsbXTOHKdMcWxXfXm4n9eN8VhLcyESlLkii0XDVNJUJP572QgDGcop45QZoS7lbAxNZShS6jkQghWX14nratqcF31H24q9VoeRxHO4BwuIYBbqMM9NKAJDCbwDK/w5iXei/fufSxbC14+cwp/4H3+ALeBj8c=</latexit>t2<latexit sha1_base64="7ItLwn8Q8RXHU3NW/LT2SA7MEc4=">AAAB7HicbVBNS8NAEJ34WetX1aOXxSJ4KkkV7LHgxWMF0xbaUDbbTbt0swm7E6GE/gYvHhTx6g/y5r9x2+agrQ8GHu/NMDMvTKUw6Lrfzsbm1vbObmmvvH9weHRcOTltmyTTjPsskYnuhtRwKRT3UaDk3VRzGoeSd8LJ3dzvPHFtRKIecZryIKYjJSLBKFrJx0Fenw0qVbfmLkDWiVeQKhRoDSpf/WHCspgrZJIa0/PcFIOcahRM8lm5nxmeUjahI96zVNGYmyBfHDsjl1YZkijRthSShfp7IqexMdM4tJ0xxbFZ9ebif14vw6gR5EKlGXLFlouiTBJMyPxzMhSaM5RTSyjTwt5K2JhqytDmU7YheKsvr5N2veZd19yHm2qzUcRRgnO4gCvw4BaacA8t8IGBgGd4hTdHOS/Ou/OxbN1wipkz+APn8wfJU46h</latexit>t1<latexit sha1_base64="Bdkp2HUnpwjOv5mbz6SwzTMJ8ag=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lUsMeCF48VTFtoQ9lsN+3SzSbsToQS+hu8eFDEqz/Im//GbZuDtj4YeLw3w8y8MJXCoOt+O6WNza3tnfJuZW//4PCoenzSNkmmGfdZIhPdDanhUijuo0DJu6nmNA4l74STu7nfeeLaiEQ94jTlQUxHSkSCUbSSj4Pcmw2qNbfuLkDWiVeQGhRoDapf/WHCspgrZJIa0/PcFIOcahRM8lmlnxmeUjahI96zVNGYmyBfHDsjF1YZkijRthSShfp7IqexMdM4tJ0xxbFZ9ebif14vw6gR5EKlGXLFlouiTBJMyPxzMhSaM5RTSyjTwt5K2JhqytDmU7EheKsvr5P2Vd27rrsPN7Vmo4ijDGdwDpfgwS004R5a4AMDAc/wCm+Ocl6cd+dj2VpyiplT+APn8wfHzo6g</latexit>t3<latexit sha1_base64="OcT9ss7O545agx6P8OIS9udeh5k=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lUsMeCF48VTFtoQ9lsN+3SzSbsToQS+hu8eFDEqz/Im//GbZuDtj4YeLw3w8y8MJXCoOt+O6WNza3tnfJuZW//4PCoenzSNkmmGfdZIhPdDanhUijuo0DJu6nmNA4l74STu7nfeeLaiEQ94jTlQUxHSkSCUbSSj4P8ejao1ty6uwBZJ15BalCgNah+9YcJy2KukElqTM9zUwxyqlEwyWeVfmZ4StmEjnjPUkVjboJ8ceyMXFhlSKJE21JIFurviZzGxkzj0HbGFMdm1ZuL/3m9DKNGkAuVZsgVWy6KMkkwIfPPyVBozlBOLaFMC3srYWOqKUObT8WG4K2+vE7aV3Xvuu4+3NSajSKOMpzBOVyCB7fQhHtogQ8MBDzDK7w5ynlx3p2PZWvJKWZO4Q+czx/K2I6i</latexit>zM-LSTM G-LSTMFigure 2. The architecture of our proposed STGAT model. The framework is based on seq2seq model and consists of 3 parts: EncodIntermediate State and Decoder. The Encoder module includes three components: 2 types of LSTMs and GAT. The Intermediate Stencapsulates the spatial and temporal information of all observed trajectories. The Decoder module generates the future trajectories basedestrians in a scene are considered as nodes on theaph at every time-step. The edges on the graph repre-st of human-human interactions.~h1<latexit sha1_base64="16Yxx8+YpYlp2tGJX9qHHqgdbbY=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69BIvgqSQq6LHoxWMF+wFtKJvtpl262Q27k0IJ+RlePCji1V/jzX/jts1Bqw8GHu/NMDMvTAQ36HlfTmltfWNzq7xd2dnd2z+oHh61jUo1ZS2qhNLdkBgmuGQt5ChYN9GMxKFgnXByN/c7U6YNV/IRZwkLYjKSPOKUoJV6/Smj2TgfZH4+qNa8ureA+5f4BalBgeag+tkfKprGTCIVxJie7yUYZEQjp4LllX5qWELohIxYz1JJYmaCbHFy7p5ZZehGStuS6C7UnxMZiY2ZxaHtjAmOzao3F//zeilGN0HGZZIik3S5KEqFi8qd/+8OuWYUxcwSQjW3t7p0TDShaFOq2BD81Zf/kvZF3b+sew9XtcZtEUcZTuAUzsGHa2jAPTShBRQUPMELvDroPDtvzvuyteQUM8fwC87HN46CkWw=</latexit>~h5<latexit sha1_base64="R1g6gAHeZA/axOfzhGwObt3Ay0g=">AAAB8nicbVDLSsNAFJ3UV62vqks3g0VwVRIf6LLoxmUF+4A0lMn0ph06mQkzk0IJ+Qw3LhRx69e482+ctllo64ELh3Pu5d57woQzbVz32ymtrW9sbpW3Kzu7e/sH1cOjtpapotCikkvVDYkGzgS0DDMcuokCEoccOuH4fuZ3JqA0k+LJTBMIYjIULGKUGCv5vQnQbJT3s+u8X625dXcOvEq8gtRQgWa/+tUbSJrGIAzlRGvfcxMTZEQZRjnklV6qISF0TIbgWypIDDrI5ifn+MwqAxxJZUsYPFd/T2Qk1noah7YzJmakl72Z+J/npya6DTImktSAoItFUcqxkXj2Px4wBdTwqSWEKmZvxXREFKHGplSxIXjLL6+S9kXdu6y7j1e1xl0RRxmdoFN0jjx0gxroATVRC1Ek0TN6RW+OcV6cd+dj0Vpyiplj9AfO5w+UlpFw</latexit>~h4<latexit sha1_base64="zHvzoSStxfsE4+HkOUWqlEWKApg=">AAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0m0oMeiF48V7AekoWy2k3bpZjfsbgol5Gd48aCIV3+NN/+N2zYHbX0w8Hhvhpl5YcKZNq777ZQ2Nre2d8q7lb39g8Oj6vFJR8tUUWhTyaXqhUQDZwLahhkOvUQBiUMO3XByP/e7U1CaSfFkZgkEMRkJFjFKjJX8/hRoNs4HWSMfVGtu3V0ArxOvIDVUoDWofvWHkqYxCEM50dr33MQEGVGGUQ55pZ9qSAidkBH4lgoSgw6yxck5vrDKEEdS2RIGL9TfExmJtZ7Foe2MiRnrVW8u/uf5qYlug4yJJDUg6HJRlHJsJJ7/j4dMATV8ZgmhitlbMR0TRaixKVVsCN7qy+ukc1X3ruvuY6PWvCviKKMzdI4ukYduUBM9oBZqI4okekav6M0xzovz7nwsW0tOMXOK/sD5/AGTEZFv</latexit>~h3<latexit sha1_base64="4MzhxHH1VdXzxV7PaR8PPHgLBTc=">AAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0msoMeiF48V7AekoWy2m3bpZjfsTgol5Gd48aCIV3+NN/+N2zYHbX0w8Hhvhpl5YSK4Adf9dkobm1vbO+Xdyt7+weFR9fikY1SqKWtTJZTuhcQwwSVrAwfBeolmJA4F64aT+7nfnTJtuJJPMEtYEJOR5BGnBKzk96eMZuN8kDXyQbXm1t0F8DrxClJDBVqD6ld/qGgaMwlUEGN8z00gyIgGTgXLK/3UsITQCRkx31JJYmaCbHFyji+sMsSR0rYk4IX6eyIjsTGzOLSdMYGxWfXm4n+en0J0G2RcJikwSZeLolRgUHj+Px5yzSiImSWEam5vxXRMNKFgU6rYELzVl9dJ56ruNeru43WteVfEUUZn6BxdIg/doCZ6QC3URhQp9Ixe0ZsDzovz7nwsW0tOMXOK/sD5/AGRjJFu</latexit>~h2<latexit sha1_base64="hAcukdL3sj/GIPn2AtYuEsqEXSA=">AAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0mqoMeiF48V7AekoWy2m3bpZjfsTgol5Gd48aCIV3+NN/+N2zYHbX0w8Hhvhpl5YSK4Adf9dkobm1vbO+Xdyt7+weFR9fikY1SqKWtTJZTuhcQwwSVrAwfBeolmJA4F64aT+7nfnTJtuJJPMEtYEJOR5BGnBKzk96eMZuN8kDXyQbXm1t0F8DrxClJDBVqD6ld/qGgaMwlUEGN8z00gyIgGTgXLK/3UsITQCRkx31JJYmaCbHFyji+sMsSR0rYk4IX6eyIjsTGzOLSdMYGxWfXm4n+en0J0G2RcJikwSZeLolRgUHj+Px5yzSiImSWEam5vxXRMNKFgU6rYELzVl9dJp1H3ruru43WteVfEUUZn6BxdIg/doCZ6QC3URhQp9Ixe0ZsDzovz7nwsW0tOMXOK/sD5/AGQB5Ft</latexit>~h6<latexit sha1_base64="Kuq84mWaskTChJNeNnyTi5aZ6oc=">AAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0lU1GPRi8cK9gPSUDbbSbt0sxt2N4US8jO8eFDEq7/Gm//GbZuDtj4YeLw3w8y8MOFMG9f9dkpr6xubW+Xtys7u3v5B9fCorWWqKLSo5FJ1Q6KBMwEtwwyHbqKAxCGHTji+n/mdCSjNpHgy0wSCmAwFixglxkp+bwI0G+X97DrvV2tu3Z0DrxKvIDVUoNmvfvUGkqYxCEM50dr33MQEGVGGUQ55pZdqSAgdkyH4lgoSgw6y+ck5PrPKAEdS2RIGz9XfExmJtZ7Goe2MiRnpZW8m/uf5qYlug4yJJDUg6GJRlHJsJJ79jwdMATV8agmhitlbMR0RRaixKVVsCN7yy6ukfVH3Luvu41WtcVfEUUYn6BSdIw/doAZ6QE3UQhRJ9Ixe0ZtjnBfn3flYtJacYuYY/YHz+QOWG5Fx</latexit>~h1<latexit sha1_base64="KoV/fuxRJ+UPzkEixIjJIx41qvU=">AAAB+HicbVBNS8NAEJ34WetHox69LBbRU0lU0GPRi8cK9gPaGDbbTbt0swm7m0IN+SVePCji1Z/izX/jts1BWx8MPN6bYWZekHCmtON8Wyura+sbm6Wt8vbO7l7F3j9oqTiVhDZJzGPZCbCinAna1Exz2kkkxVHAaTsY3U799phKxWLxoCcJ9SI8ECxkBGsj+XalN6YkG+aP2WnuZ27u21Wn5syAlolbkCoUaPj2V68fkzSiQhOOleq6TqK9DEvNCKd5uZcqmmAywgPaNVTgiCovmx2eoxOj9FEYS1NCo5n6eyLDkVKTKDCdEdZDtehNxf+8bqrDay9jIkk1FWS+KEw50jGapoD6TFKi+cQQTCQztyIyxBITbbIqmxDcxZeXSeu85l7UnPvLav2miKMER3AMZ+DCFdThDhrQBAIpPMMrvFlP1ov1bn3MW1esYuYQ/sD6/AHv15NC</latexit>~12<latexit sha1_base64="7wH2WVdWISdwYnqsMTVyR5rAKM0=">AAAB+nicbVBNS8NAEJ34WetXqkcvi0XwVJIq6LHoxWMF+wFtCJvtpl262YTdTaXE/BQvHhTx6i/x5r9x2+agrQ8GHu/NMDMvSDhT2nG+rbX1jc2t7dJOeXdv/+DQrhy1VZxKQlsk5rHsBlhRzgRtaaY57SaS4ijgtBOMb2d+Z0KlYrF40NOEehEeChYygrWRfLvSn1CS9TFPRjj3M7ee+3bVqTlzoFXiFqQKBZq+/dUfxCSNqNCEY6V6rpNoL8NSM8JpXu6niiaYjPGQ9gwVOKLKy+an5+jMKAMUxtKU0Giu/p7IcKTUNApMZ4T1SC17M/E/r5fq8NrLmEhSTQVZLApTjnSMZjmgAZOUaD41BBPJzK2IjLDERJu0yiYEd/nlVdKu19yLmnN/WW3cFHGU4ARO4RxcuIIG3EETWkDgEZ7hFd6sJ+vFerc+Fq1rVjFzDH9gff4AT7aUBQ==</latexit>~13<latexit sha1_base64="Oj5OOfl5UD0Mrh/Gd9yVwPeeGOY=">AAAB+nicbVBNS8NAEJ3Ur1q/Uj16WSyCp5KooMeiF48V7Ae0IWy2m3bpZhN2N5US81O8eFDEq7/Em//GbZuDtj4YeLw3w8y8IOFMacf5tkpr6xubW+Xtys7u3v6BXT1sqziVhLZIzGPZDbCinAna0kxz2k0kxVHAaScY3878zoRKxWLxoKcJ9SI8FCxkBGsj+Xa1P6Ek62OejHDuZ+5F7ts1p+7MgVaJW5AaFGj69ld/EJM0okITjpXquU6ivQxLzQineaWfKppgMsZD2jNU4IgqL5ufnqNTowxQGEtTQqO5+nsiw5FS0ygwnRHWI7XszcT/vF6qw2svYyJJNRVksShMOdIxmuWABkxSovnUEEwkM7ciMsISE23SqpgQ3OWXV0n7vO5e1J37y1rjpoijDMdwAmfgwhU04A6a0AICj/AMr/BmPVkv1rv1sWgtWcXMEfyB9fkDUTuUBg==</latexit>~11<latexit sha1_base64="Z1DxO/2cTeAPqRO0wzlGCoK0+XY=">AAAB+nicbVBNS8NAEJ3Ur1q/Uj16WSyCp5KooMeiF48V7Ae0IWy2m3bpZhN2N5US81O8eFDEq7/Em//GbZuDtj4YeLw3w8y8IOFMacf5tkpr6xubW+Xtys7u3v6BXT1sqziVhLZIzGPZDbCinAna0kxz2k0kxVHAaScY3878zoRKxWLxoKcJ9SI8FCxkBGsj+Xa1P6Ek62OejHDuZ66b+3bNqTtzoFXiFqQGBZq+/dUfxCSNqNCEY6V6rpNoL8NSM8JpXumniiaYjPGQ9gwVOKLKy+an5+jUKAMUxtKU0Giu/p7IcKTUNApMZ4T1SC17M/E/r5fq8NrLmEhSTQVZLApTjnSMZjmgAZOUaD41BBPJzK2IjLDERJu0KiYEd/nlVdI+r7sXdef+sta4KeIowzGcwBm4cAUNuIMmtIDAIzzDK7xZT9aL9W59LFpLVjFzBH9gff4ATjGUBA==</latexit>~14<latexit sha1_base64="bUiB7GM/EueyHNLIaRxhSyx8/6A=">AAAB+nicbVBNS8NAEN34WetXqkcvi0XwVBIt6LHoxWMF+wFtCJPtpl262YTdTaXE/BQvHhTx6i/x5r9x2+agrQ8GHu/NMDMvSDhT2nG+rbX1jc2t7dJOeXdv/+DQrhy1VZxKQlsk5rHsBqAoZ4K2NNOcdhNJIQo47QTj25nfmVCpWCwe9DShXgRDwUJGQBvJtyv9CSVZH3gygtzP3Hru21Wn5syBV4lbkCoq0PTtr/4gJmlEhSYclOq5TqK9DKRmhNO83E8VTYCMYUh7hgqIqPKy+ek5PjPKAIexNCU0nqu/JzKIlJpGgemMQI/UsjcT//N6qQ6vvYyJJNVUkMWiMOVYx3iWAx4wSYnmU0OASGZuxWQEEog2aZVNCO7yy6ukfVFzL2vOfb3auCniKKETdIrOkYuuUAPdoSZqIYIe0TN6RW/Wk/VivVsfi9Y1q5g5Rn9gff4AUsCUBw==</latexit>~15<latexit sha1_base64="cC09I5yAhT8haElJRiWSguUZ4i0=">AAAB+nicbVDLSsNAFJ34rPWV6tLNYBFclcQHuiy6cVnBPqAJYTK9bYdOJmFmUikxn+LGhSJu/RJ3/o3TNgttPXDhcM693HtPmHCmtON8Wyura+sbm6Wt8vbO7t6+XTloqTiVFJo05rHshEQBZwKammkOnUQCiUIO7XB0O/XbY5CKxeJBTxLwIzIQrM8o0UYK7Io3Bpp5hCdDkgeZe5kHdtWpOTPgZeIWpIoKNAL7y+vFNI1AaMqJUl3XSbSfEakZ5ZCXvVRBQuiIDKBrqCARKD+bnZ7jE6P0cD+WpoTGM/X3REYipSZRaDojoodq0ZuK/3ndVPev/YyJJNUg6HxRP+VYx3iaA+4xCVTziSGESmZuxXRIJKHapFU2IbiLLy+T1lnNPa859xfV+k0RRwkdoWN0ilx0heroDjVQE1H0iJ7RK3qznqwX6936mLeuWMXMIfoD6/MHVEWUCA==</latexit>~16<latexit sha1_base64="Th2k0L1quCqnOCL5jErebK97YEU=">AAAB+nicbVBNS8NAEN34WetXqkcvi0XwVBIV9Vj04rGC/YAmhM122i7dbMLuplJifooXD4p49Zd489+4bXPQ1gcDj/dmmJkXJpwp7Tjf1srq2vrGZmmrvL2zu7dvVw5aKk4lhSaNeSw7IVHAmYCmZppDJ5FAopBDOxzdTv32GKRisXjQkwT8iAwE6zNKtJECu+KNgWYe4cmQ5EHmXuaBXXVqzgx4mbgFqaICjcD+8noxTSMQmnKiVNd1Eu1nRGpGOeRlL1WQEDoiA+gaKkgEys9mp+f4xCg93I+lKaHxTP09kZFIqUkUms6I6KFa9Kbif1431f1rP2MiSTUIOl/UTznWMZ7mgHtMAtV8YgihkplbMR0SSag2aZVNCO7iy8ukdVZzz2vO/UW1flPEUUJH6BidIhddoTq6Qw3URBQ9omf0it6sJ+vFerc+5q0rVjFziP7A+vwBVcqUCQ==</latexit>n illustration of graph attention layer. It allows a nodefferent importance to different nodes within a neigh-propose to use another LSTM to model the templations between interactions explicitly. We term tas G-LSTM:gti = G-LSTM(gt 1i , ˆmti;Wg)where ˆmti is from Eq. 5. Wg is the G-LSTM weshared among all the sequences.In Encoder component, two LSTMs (M-LLSTM) are used to model the motion pattern of etrian, and the temporal correlations of interactiontively. We combine these two parts to accomplishof spatial and temporal information. At time-stepare two hidden variables (mTobsi , gTobsi ) from twoeach pedestrian. In our implementation, these two予測対象他対象他対象他対象他対象他対象Y. Huang, et al., “STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction,” ICCV, 2019.

複数対象を動的なグラフ構造で効率的にモデル化●NHE：観測時刻のNode特徴をLSTMへ入力●NFE：学習時にNodeの未来の真の軌跡をエンコードするためにBiLSTMを適用●EE：特定範囲内の全対象からAttentionを求める- 重要度の高いEdge情報を取得- 時刻毎にEdge情報は変動●各特徴からDecoderで経路を予測- 内部のCVAEでマルチモーダルな経路を予測- Gaussian Mixture Modelで予測経路を洗練環境情報を追加したTrajectron++ [T. Salzmann+, ECCV, 2020]が提案Trajectron [B. Ivanovic+, CVPR, 2019]27Overall, we chose to make our model part of the “graphas architecture” methods, as a result of their stateful graphrepresentation (leading to efﬁcient iterative predictions on-line) and modularity (enabling model reuse and extensiveparameter sharing).3. Problem FormulationIn this work, we are interested in jointly reasoning andgenerating a distribution of future trajectories for each agentin a scene simultaneously. We assume that each scene ispreprocessed to track and classify agents as well as obtaintheir spatial coordinates at each timestep. As a result, eachagent i has a classiﬁcation type Ci (e.g. “Pedestrian”). LetXti = (xti, yti ) represent the position of the ithagent at timet and let Xt1,...,N represent the same quantity for all agentsin a scene. Further, let X(t1:t2)i = (Xt1i , Xt1+1i , . . . , Xt2i )denote a sequence of values for time steps t 2 [t1, t2].As in previous works [1, 16, 49], we take as input theprevious trajectories of all agents in a scene X(1:tobs)1,...,N andaim to produce predictions bX(tobs+1:tobs+T )1,...,N that match thetrue future trajectories X(tobs+1:tobs+T )1,...,N . Note that we havenot assumed N to be static, i.e. we can have N = f(t).!"/$!%/&!'/(!"-!% !'-!%LegendModeled Node!"/$ Node $ is of type !"!"-!% Edge is of type !"-!%Edge being createdNormal Edge!"/)!"-!%Attention!(#$%)!(#$')!(#)!(#(')!(#(%)!(#())FCFCEENHENFEEncoderℎ+ℎ,-(.|0)1(.|0, 3)FC.!(#)., ℎ+4!(#(')! #('., ℎ+4!(#(%)! #(%., ℎ+4!(#(')GMM GMM GMM4!(#(%)4!(#())Decoder 5)5)5)6(#$%)+8(#$%)6(#$')+8(#$')6(#)+8(#)5'-5)9(#$%)9(#$')9(#)5%-5)LegendLSTM CellModulating FunctionFC Fully-Connected LayerProjection to a GMMConcatenationRandomly sampledTrain time onlyPredict time onlyTrain and PredictGMMMMMMM MM+Figure 2. Top: An example graph with four nodes. a is our mod-T. Salzmann, et al., “Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data,” ECCV, 2020.B. Ivanovic, et al., “The Trajectron: Probabilistic Multi-Agent Trajectory Modeling with Dynamic Spatiotemporal Graphs,” ICCV, 2019.

単純にノイズベクトルを付与すると，高い分散を持つ経路を予測してしまう●既存研究は真にマルチモーダルな分布を学習できていない予測経路とノイズベクトル間の潜在的表現を学習●ノイズベクトルから生成した予測経路をLSTM Encoderへ入力●元のノイズベクトルと類似するようにマッピング●真にマルチモーダルな経路を生成可能Social-BiGAT [V. Kosaraju+, NeurIPS, 2019]28Figure 2: Architecture for the proposal Social-BiGAT model. The model consists of a single generator, twoFigure 4: Generated trajectories visualized for the S-GAN-P, Sophie, and Social-BiGAT models across fourmain scenes. Observed trajectories are shown as solid lines, ground truth future movements are shown as dashedlines, and generated samples are shown as contour maps. Different colors correspond to different pedestrians.V. Kosaraju, et al., “Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks,” NeurIPS, 2019.

Spatial-Temporal Graphを用いてモデル化●Graph Convolution Network (GCN)でインタラクションに関する特徴抽出- 隣接行列からインタラクション情報を求める●GCNで得た特徴からTemporal Convolutional Network (TCN)で予測分布を出力- LSTMは予測経路を逐次出力するが，TCNは予測経路を並列に出力- 推論速度を大幅に改善Social-STGCNN [A. Mohamed+, CVPR, 2020]29Figure 2. The Social-STGCNN Model. Given T frames, we construct the spatio-temporal graph representing G = (V, A). Then G isforwarded through the Spatio-Temporal Graph Convolution Neural Networks (ST-GCNNs) creating a spatio-temporal embedding. Followingthis, the TXP-CNNs predicts future trajectories. P is the dimension of pedestrian position, N is the number of pedestrians, T is the numberˆA. Mohamed, et al., “Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction,” CVPR, 2020.

歩行者間の関係を調査するグループベースのインタラクションをモデル化●グループ：同じ目的地に行く歩行者，遠方の同方向へ進む歩行者， etc.●グループの判別のために，人がグループ情報をアノテーションRelational Social Representationでグループのインタラクションを求める●周囲の静的環境情報や過去の軌跡情報と連結し，経路を予測RSBG [J. Sun+, CVPR, 2020]30BiLSTMCNNRSBGGeneratorCoordinatesImage patchCoordinates GCNLSTMIndividual Representation DecoderRelational Social RepresentationFeaturesRSBGJ. Sun, et al., “Recursive Social Behavior Graph for Trajectory Prediction,” CVPR, 2020.

経路予測で用いるLSTMは2点の問題がある●LSTMは複雑な時間依存性のモデル化が困難●Attentionモデルの予測手法がインタラクションを完全にモデル化できないTransformerを時空間的Attentionへ拡張し，経路予測タスクに応用●Temporal Transformerで軌跡特徴をエンコード●Spatial Transformerで時刻毎に独立したインタラクションを抽出●2つのTransformerを用いることで，LSTMを用いた予測手法の予測精度を大幅に改善STAR [C. Yu+, ECCV, 2020]316 Yu C., Ma X., Ren J., Zhao H., Yi S.(a) Temporal Transformer (b) Spatial TransformerFig. 3. STAR has two main components, Temporal Transformer and Spatial Trformer. (a) Temporal Transformer treats each pedestrians independently and extrthe temporal dependencies by Transformer model (h is the embedding of pedestYu C., Ma X., Ren J., Zhao H., Yi S.Temporal Transformer Spatial TransformerC. Yu, et al., “Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction,” ECCV, 2020.

CNNを用いた予測手法●過去の軌跡情報をエンコードし，スパースなボクセルに格納●ConvolutionとMax-Poolingを複数回行い，Deconvolutionで予測経路を出力●Location bias mapにより特定シーンの物体情報と潜在的特徴表現で要素積をとる- 特定シーンによって変化する歩行者の振る舞いを考慮Behavior-CNN [S. Yi+, ECCV, 2016]33output of CNN, as they are of variable lengths and observed in different periods.3 Pedestrian Behavior Modeling and PredictionThe overall framework is shown in Fig. 2. The input to our system is pedestrianwalking paths in previous frames (colored curves in Fig. 2(a)). They could beobtained by simple trackers such as KLT [41]. They are then encoded into adisplacement volume (Fig. 2(b)) with the proposed walking behavior encodingscheme. Behavior-CNN in Fig. 2(c) takes the encoded displacement volume asFig. 2. System flowchart. (a) Pedestrian walking paths in previous frames. Three exam-ples are shown in different colors. Rectangles indicate current locations of pedestrians.(b) The displacement volume encoded from pedestrians’ past walking paths in (a).(c) Behavior-CNN. (d) The predicted displacement volume by Behavior-CNN. (e) Pre-dicted future pedestrian walking paths decoded from (d).Three bottom convolution layers, conv1, conv2, and conv3, are to be con-volved with input data of size X × Y × 2M. conv1 contains 64 filters of size3 × 3 × 2M, while both conv2 and conv3 contain 64 filters of size 3 × 3 × 64.Zeros are padded to each convolution input in order to guarantee feature mapsof these layers be of the same spatial size with the input. The three bottomconvolution layers are followed by max pooling layers max-pool with stride 2.The output size of max-pool is X/2 × Y/2 × 64. In this way, the receptive fieldof the network can be doubled. Large receptive field is necessary for the taskof pedestrian walking behavior modeling because each individual’s behavior aresignificantly influenced by his/her neighbors. A learnable location bias map ofsize X/2×Y/2 is channel-wisely added to each of the pooled feature maps. Everyspatial location has one independent bias value shared across channels. With thelocation bias map, location information of the scene can be automatically learnedby the proposed Behavior-CNN. As for the three top convolution layers, conv4and conv5 contain 64 filters of size 3 × 3 × 64, while conv6 contains 2M∗filtersof size 3 × 3 × 64 to output the predicted displacement volume. Zeros are alsoS. Yi, et al., “Pedestrian Behavior Understanding and Prediction with Deep Neural Networks,” ECCV, 2016.

1人称視点における対面の歩行者のための位置予測●1人称視点特有の手掛かりを位置予測に利用1. 対面の歩行者の位置に影響するエゴモーション2. 対面の歩行者のスケール3. 対面の歩行者の姿勢●上記3つの情報を用いたマルチストリームモデルで将来の位置予測Future localization in first-person videos [T. Yagi+, CVPR, 2018]34Figure 2. Future Person Localization in First-Person Videos. Given a) Tprev-frames observations as input, we b) predict future locationsof a target person in the subsequent Tfuture frames. Our approach makes use of c-1) locations and c-2) scales of target persons, d) ego-motion of camera wearers and e) poses of the target persons as a salient cue for the prediction.Channel-wiseConcatenationInputOutputLocation-Scale Stream Ego-Motion Stream Pose StreamFigure 3. Proposed Network Architecture. Blue blocks corre-tain direction at a constant speed, our best guess based ononly previous locations would be to expect them to keep go-ing in that direction in subsequent future frames too. How-ever, visual distances in first-person videos can correspondto different physical distances depending on where peopleare observed in the frame.In order to take into account this perspective effect, wepropose to learn both locations and scales of target peo-ple jointly. Given a simple assumption that heights ofpeople do not differ too much, scales of observed peo-ple can make a rough estimate of how large movementsthey made in the actual physical world. Formally, letLin = (lt0−Tprev+1, . . . , lt0 ) be a history of previous tar-get locations. Then, we extend each location lt ∈ R2+ ofa) Inputb) Predictione) Posec-1) Locationc-2) Scaled) Ego-motionFigure 2. Future Person Localization in First-Person Videos. Given a) Tprev-frames observations as input, we b) predict future locationsof a target person in the subsequent Tfuture frames. Our approach makes use of c-1) locations and c-2) scales of target persons, d) ego-motion of camera wearers and e) poses of the target persons as a salient cue for the prediction.Channel-wiseConcatenationInputOutputLocation-Scale Stream Ego-Motion Stream Pose StreamFigure 3. Proposed Network Architecture. Blue blocks corre-spond to convolution/deconvolution layers while gray blocks de-scribe intermediate deep features.tain direction at a constant speed, our best guess based ononly previous locations would be to expect them to keep go-ing in that direction in subsequent future frames too. How-ever, visual distances in first-person videos can correspondto different physical distances depending on where peopleare observed in the frame.In order to take into account this perspective effect, wepropose to learn both locations and scales of target peo-ple jointly. Given a simple assumption that heights ofpeople do not differ too much, scales of observed peo-ple can make a rough estimate of how large movementsthey made in the actual physical world. Formally, letLin = (lt0−Tprev+1, . . . , lt0 ) be a history of previous tar-get locations. Then, we extend each location lt ∈ R2+ ofa target person by adding the scale information of that per-son st ∈ R+, i.e., xt = (lt , st) . Then, the ‘location-scale’ input stream in Figure 3 learns time evolution inXin = (xt0−Tprev+1, . . . , xt0 ), and the output stream gen-erates Xout = (xt0+1 − xt0 , . . . , xt0+Tfuture− xt0 ).Ground Truth OursSocial LSTMInput NNeighborPast observations Predictions(a)(b)(c)(d)(e)Figure 5. Visual Examples of Future Person Localization. Using locations (shown with solid blue lines), scales and poses of targetpeople (highlighted in pink, left column) as well as ego-motion of camera wearers in the past observations highlighted in blue, we predictlocations of that target (the ground-truth shown with red crosses with dotted red lines) in the future frames highlighted in red. We comparedT. Yagi, et al., “Future Person Localization in First-Person Videos,” CVPR, 2018.

車載カメラ映像に映る歩行者の将来の位置を予測する手法●歩行者の矩形領域，自車の移動量，車載カメラ画像を入力●歩行者の将来の矩形領域を出力OPPU [A. Bhattacharyya+, CVPR, 2018]35Figure 2: Two stream architecture for prediction of future pedestrian bounding boxes.ion of Pedestrian Trajectories sequence ˆv (containing information about past pedestrianLast Observation: t Prediction: t + 5 Prediction: t + 10 Prediction: t + 15Figure 4: Rows 1-3: Point estimates. Blue: Ground-truth, Red: Kalman Filter (Table 1 row 1), Yellow: One-stream model(Table 1 row 4), Green: Two-stream model (mean of predictive distribution, Table 4 row 3). Rows 4-6: Predictive distributionsof our two-stream model as heat maps. (Link to video results in the Appendix).sequences have low error (note, log(530) ≈ 6.22 the MSE that, the predicted uncertainty upper bounds the error of theA. Bhattacharyya, et al., “Long-Term On-Board Prediction of People in Trafﬁc Scenes under Uncertainty,” CVPR, 2018.

自動車検出，追跡，経路予測を同時に推論するモデルを提案●3D点群データを入力に使用- 3次元空間でスパースな特徴表現- ネットワークの計算コストを抑制- リアルタイムで3つのタスクを同時に計算可能Fast and Furious [W. Luo+, CVPR, 2018]36Real Time End-to-End 3D Detection, Tracking and Motionorecasting with a Single Convolutional NetWenjie Luo, Bin Yang and Raquel UrtasunUber Advanced Technologies GroupUniversity of Toronto{wenjie, byang10, urtasun}@uber.comracta novel deep neural networkabout 3D detection, track-iven data captured by a 3Dabout these tasks, our holis-occlusion as well as sparseh performs 3D convolutionsbird’s eye view representa-is very efﬁcient in terms ofn. Our experiments on a newed in several north americanW. Luo, et al., “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net,” CVPR, 2018.

異なる移動物体を属性とみなし，属性毎の特徴的な経路を予測●異なる移動物体が保有する潜在的特徴を考慮- 歩行者：歩道や車道を歩く- 自動車：車道を走る●属性をone-hot vectorで表現するため，属性毎にモデルを作成する必要がない- 計算コストの抑制●属性毎の特徴的な経路を予測するために，シーンラベルを利用Object Attributes and Semantic Environment [H. Minoura+, VISAPP, 2019]37H. Minoura, et al., “Path predictions using object attributes and semantic environment,” VISAPP, 2019.

一般道における自動車の経路予測手法●異なるシーンコンテキストを考慮するCNNとGRUによるモデルを提案- 予測車・他車の位置，コンテキスト，道路情報をチャネル方向に連結- 連結したテンソルを時刻毎にCNNでエンコード- エンコードした特徴をGRUで時間方向へ伝播し，将来の経路を予測Rules of the Road [J. Hong+, CVPR, 2019]38Figure 2: Entity and world context representation. For an example scene (visualized left-most), the world is represented withthe tensors shown, as described in the text.ontext representation. For an example scene (visualized left-most), the world is represented withbed in the text.One-shot (b) with RNN decoder(a) Gaussian Regression (b) GMM-CVAEFigure 4: Examples of Gaussian Regression and GMM-CVAE methods. Ellipses represent a standard deviation of uncertainty, aonly drawn for the top trajectory; only trajectories with probability > 0.05 are shown, with cyan the most probable.We see that uncerellipses are larger when turning than straight, and often follow the direction of velocity. In the GMM-CVAE example, different saシーン例予測車の位置他車の位置コンテキスト道路情報J. Hong, et al., “Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic Interactions,” CVPR, 2019.

複数の尤もらしい経路を予測するMultiverseを提案●過去のセマンティックラベルと固定グリッドをHistory Encoder (HE)へ入力- Convolution Recurrent Neural Networkで時空間特徴をエンコード- セマンティックラベルを入力することで，ドメインシフトに対して強固になる●HEの出力と観測最終のセマンティックラベルをCoarse Location Decoder (CLD)へ入力- GATでグリッドの各格子に重み付けし，重みマップを生成●Fine Location Decoder (FLD)でグリッドの各格子に距離ベクトルを格納- CLD同様，GATで各格子に重み付け●FLDとCLDから複数の経路を予測Multiverse [J. Liang+, CVPR, 2020]39Figure 2: Overview of our model. The input to the model is the ground truth location history, and a set of video frames,which are preprocessed by a semantic segmentation model. This is encoded by the “History Encoder” convolutional RNN.J. Liang, et al., “The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction,” CVPR, 2020.

経路予測で最も使用されるデータセット41ETH Dataset UCY Dataset Stanford Drone Dataset• サンプル数：786市街地の歩行者を撮影したデータセット市街地の歩行者を撮影したデータセット• シーン数：2• 対象種類- pedestrian• サンプル数：750• シーン数：3• 対象種類- pedestrianスタンフォード大学構内を撮影したデータセット• サンプル数：10,300• シーン数：8• 対象種類- pedestrian, car, cyclist, bus, skater, cartR. Alexandre, et al., “Learning Social Etiquette: Human Trajectory Understanding in Crowded Scenes,” ECCV, 2016.A. Lerner, et al., “Crowds by Example,” CGF, 2007.S. Pellegrini, et al., “You’ll Never Walk Alone: Modeling Social Behavior for Multi-target Tracking,” ICCV, 2009.

固定カメラやドローンで撮影●豊富なデータを取得できるため，非常に大規模なデータセット俯瞰視点のデータセット42lts from NN+map(prior) m-aseline. The orange trajectoryRed represents ground truth forn represents the multiple fore-3 s. Top left: The car starts toArgoverse DatasetThe inD Dataset: A Drone Dataset of NaturalisticRoad User Trajectories at German IntersectionsJulian Bock1, Robert Krajewski1, Tobias Moers2, Steffen Runde1, Lennart Vater1 and Lutz Eckstein1Fig. 1: Exemplary result of road user trajectories in the inD dataset. The position and speed of each road user is measuredaccurately over time and shown by bounding boxes and tracks. For privacy reasons, the buildings were made unrecognizable.Abstract—Automated vehicles rely heavily on data-drivenmethods, especially for complex urban environments. Largedatasets of real world measurement data in the form of roaduser trajectories are crucial for several tasks like road userprediction models or scenario-based safety validation. So far,though, this demand is unmet as no public dataset of urbanroad user trajectories is available in an appropriate size, qualityand variety. By contrast, the highway drone dataset (highD) hasrecently shown that drones are an efficient method for acquiringnaturalistic road user trajectories. Compared to driving studiesor ground-level infrastructure sensors, one major advantage ofusing a drone is the possibility to record naturalistic behavior,as road users do not notice measurements taking place. Due tothe ideal viewing angle, an entire intersection scenario can bemeasured with significantly less occlusion than with sensors atground level. Both the class and the trajectory of each roaduser can be extracted from the video recordings with highprecision using state-of-the-art deep neural networks. Therefore,we propose the creation of a comprehensive, large-scale urbanintersection dataset with naturalistic road user behavior usingcamera-equipped drones as successor of the highD dataset. Theresulting dataset contains more than 11500 road users includingvehicles, bicyclists and pedestrians at intersections in Germanyand is called inD. The dataset consists of 10 hours of measurementdata from four intersections and is available online for non-commercial research at: http://www.inD-dataset.com1The authors are with the Automated Driving Department, Institute forAutomotive Engineering RWTH Aachen University (Aachen, Germany).(E-mail: {bock, krajewski, steffen.runde, vater, eckstein}@ika.rwth-aachen.de).2The author is with the Automated Driving Department, fka GmbH(Aachen, Germany). (E-mail: tobias.moers@fka.de).Index Terms—Dataset, Trajectories, Road Users, MachineLearningI. INTRODUCTIONAutomated driving is expected to reduce the number andseverity of accidents significantly [13]. However, intersectionsare challenging for automated driving due to the large com-plexity and variety of scenarios [15]. Scientists and companiesare researching how to technically handle those scenarios byan automated driving function and how to proof safety ofthese systems. An ever-increasing proportion of the approachesto tackle both challenges are data-driven and therefore largeamounts of measurement data are required. For example, re-cent road user behaviour models, which are used for predictionor simulation, use probabilistic approaches based on largescale datasets [2], [11]. Furthermore, current approaches forsafety validation of highly automated driving such as scenario-based testing heavily rely on large-scale measurement data ontrajectory level [3], [5], [17].However, the widely used ground-level or on-boardmeasurement methods have several disadvantages. Theseinclude that road users can be (partly) occluded by otherroad users and do not behave naturally as they notice beingpart of a measurement due to conspicuous sensors [5].We propose to use camera-equipped drones to record roaduser movements at urban intersections (see Fig. 2). Droneswith high-resolution cameras allow to record traffic from aarXiv:1911.07602v1[cs.CV]18Nov2019inD Datasetysis. The red trajectories are single-future method predictions and the yellow-orange heatmaps arections. The yellow trajectories are observations and the green ones are ground truth multi-futureetails.Method Single-Future Multi-FutureOur full model 18.51 / 35.84 166.1 / 329.5The Forking Paths DatasetFigure 7: Example output of the motion prediction solutionsupplied as part of the software development kit. A convo-lution neural network takes rasterised scenes around nearbyvehicles as input, and predicts their future motion.ity and multi-threading to make it suitable for distributedmachine learning.Customisable scene visualisation and rasterisation.We provide several functions to visualise and rasteriseLyft Level 5 Dataset• サンプル数：300K• シーン数：113• 対象種類• 追加情報- car- 車線情報，地図データ，センサー情報• サンプル数：13K• シーン数：4• 対象種類- pedestrian, car, cyclist• サンプル数：3B• シーン数：170,000• 対象種類• 追加情報- pedestrian, car, cyclist- 航空情報- セマンティックラベル• サンプル数：0.7K• シーン数：7• 対象種類• 追加情報- pedestrian- 複数経路情報- セマンティックラベル一般道を撮影したデータセット交差点を撮影したデータセット一般道を撮影したデータセットシミュレータで作成されたデータセットJ. Houston, et al., “One Thousand and One Hours: Self-driving Motion Prediction Dataset,” CoRR, 2020.J. Bock, et al., “The inD Dataset: A Drone Dataset of Naturalistic Road User Trajectories at German Intersections,” CoRR, 2019.M.F. Chang, et al., “Argoverse: 3D Tracking and Forecasting with Rich Maps,” CVPR, 2019.J. Liang, et al., “The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction,” CVPR, 2020.

自動車前方の移動対象の経路予測を目的車載カメラ視点のデータセット43• サンプル数：1.8K• シーン数：53• 対象種類• 追加情報- pedestrian- 車両情報，インフラストラクチャ一般道を撮影したデータセットApolloscape DatasetFigure 3. Example scenarios of the TITAN Dataset: a pedestrian bounding box with tracking ID is shown in , vehicle bounding bwith ID is shown in , future locations are displayed in . Action labels are shown in different colors following Figure 2.centric views captured from a mobile platform.In the TITAN dataset, every participant (individuals,vehicles, cyclists, etc.) in each frame is localized us-ing a bounding box. We annotated 3 labels (person, 4-wheeled vehicle, 2-wheeled vehicle), 3 age groups for per-son (child, adult, senior), 3 motion-status labels for both 2and 4-wheeled vehicles, and door/trunk status labels for 4-wheeled vehicles. For action labels, we created 5 mutuallyexclusive person action sets organized hierarchically (Fig-ure 2). In the first action set in the hierarchy, the annota-tor is instructed to assign exactly one class label among 9atomic whole body actions/postures that describe primitiveaction poses such as sitting, standing, standing, bending,etc. The second action set includes 13 actions that involvesingle atomic actions with simple scene context such as jay-walking, waiting to cross, etc. The third action set includes7 complex contextual actions that involve a sequence ofatomic actions with higher contextual understanding, suchagent i at each past time step from 1 to Tobs, where (cu,and (lu, lv) represent the center and the dimension ofbounding box, respectively. The proposed TITAN framwork requires three inputs as follows: Iit=1:Tobsfor thetion detector, xit for both the interaction encoder and pobject location encoder, and et = {αt, ωt} for the egmotion encoder where αt and ωt correspond to the accelation and yaw rate of the ego-vehicle at time t, respectiveDuring inference, the multiple modes of future boundbox locations are sampled from a bi-variate Gaussian gerated by the noise parameters, and the future ego-motioêt are accordingly predicted, considering the multi-monature of the future prediction problem.Henceforth, the notation of the feature embedding fution using multi-layer perceptron (MLP) is as follows: Φwithout any activation, and Φr, Φt, and Φs are associawith ReLU, tanh, and a sigmoid function, respectively.TITAN DatasetLSTM 172 330 911 837 3352 289 569 155B-LSTM[5] 101 296 855 811 3259 159 539 153PIEtraj 58 200 636 596 2477 110 399 124Table 3: Location (bounding box) prediction errors over varying future time steps. Mpredicted time steps, CMSE and CFMSE are the MSEs calculated over the center ofpredicted sequence and only the last time step respectively.Method 0.5Linear 0.8LSTM 1.5PIEspeed 0.6Table 4: Speed predicton the PIE dataset. Lasresults are reported in kmis generally better on boudegrees of freedom.Context in trajectorPIE Dataset Figure 5: Illustration of our TrafficPredict (TP) method on camera-based images. Tconditions and traffic situations. We only show the trajectories of several instances indrawn in green and the prediction results of other methods (ED,SL,SA) are shown wtrajectories of our TP algorithm (pink lines) are the closest to ground truth in most ofstance layer to cainstances and uselarities of movemetype and guide thein spatial and temferred in our desigprevious state-of-tracy of trajectory pheterogeneous tra一般道を撮影したデータセット一般道を撮影したデータセット• サンプル数：81K• シーン数：100,000• 対象種類- pedestrian, car, cyclist• サンプル数：645K• シーン数：700• 対象種類• 追加情報- pedestrian, car, cyclist- 行動ラベル，歩行者の年齢Y. Ma, et al., “TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents,” AAAI, 2019.A. Rasouli, et al., “PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction, ” ICCV, 2019.S. Malla, et al., “TITAN: Future Forecast using Action Priors,” CVPR, 2020.

前方の歩行者の経路予測を目的●被験者にウェアラブルカメラを装着1人称視点のデータセット44• サンプル数：5K• シーン数：87• 対象種類• 追加情報- pedestrian- 姿勢情報，エゴモーション歩道を撮影したデータセット𝑡 𝑡0 6 𝑡 𝑡0 10PredictionsFirst-Person Locomotion DatasetT. Yagi, et al., “Future Person Localization in First-Person Videos,” CVPR, 2018.

予測された矩形領域と真の矩形領域の中心座標で評価●車載カメラ映像における経路予測で利用●矩形領域の重なり率からF値で評価もできる評価指標45Displacement Error Negative log-likelihoodMean Square Error Collision rate(a) Average Displacement Error (b) Final Displacement ErrorADE FDE真値と予測値とのユークリッド距離誤差●Average Displacement Error (ADE)：予測時刻間の平均誤差●Final Displacement Error (FDE)：予測最終時刻の誤差!"#$…!"#&#ofSamples:'Prediction Horizon: (Figure 5. An illustration of our probabilistic evaluation methodol-ogy. It uses kernel density estimates at each timestep to computethe log-likelihood of the ground truth trajectory at each timestep,averaging across time to obtain a single value.Figure 6. Mean NLL for each dataset. Error bars are bootstrapped95% conﬁdence intervals. 2000 trajectories were sampled permodel at each prediction timestep. Lower is better.ADE and FDE are useful metrics for comparing determinis-tic regressors, they are not able to compare the distributionsproduced by generative models, neglecting aspects such asvariance and multimodality [40]. To bridge this gap in eval-DatasetADE / FDE,SGAN [16]ETH 0.64 / 1.13Hotel 0.43 / 0.91Univ 0.53 / 1.12Zara 1 0.29 / 0.58Zara 2 0.27 / 0.56Average 0.43 / 0.86Table 1. Quantitative ADE andmetric where N = 100.Both of our methods signithe ETH datasets, the UCY(P <.001; two-tailed t-testand SGAN’s mean NLL). OnFull model is identical in pesame t-test). However, on thmodel performs worse thanWe believe that this is causedtions more often than in othertruth trajectories to frequentltions whereas SGAN’s higheto have density there. Acrosuration outperforms our zbesmodel’s full multimodal mofor strong performance on thWe also evaluated our mto determine how much the pprediction horizon. The resube seen, our Full model sigat every timestep (P <.001;ence between our and SGAN推定した分布の元での真値の対数尤度の期待値●ADEとFDEで複数経路を評価するのはマルチモーダル性を無視●Negative log-likelihoodで複数経路の予測の評価指標として利用truthpredictionMSEL2 normL2 norm真値予測値従来の評価指標提案する評価指標2つのDisplacement Errorで評価非線形経路のDisplacement Error，2つの物体との衝突率で評価L2 normL2 norm真値予測値従来の評価指標提案する評価指標2つのDisplacement Errorで評価Displacement Errorは全サンプルに対し平均を求める●インタラクション情報がどの予測経路に効果的か評価できない予測値が各物体と衝突したか否かの衝突率で評価●動的物体：映像中の他対象●静的物体：建物や木などの障害物動的物体静的物体

代表的なモデルを用いて，精度検証を行う評価実験47モデル名インタラクション Deep Learning 環境データセットLSTM - ✔ - ETH/UCY, SDDRED - ✔ - ETH/UCY, SDDConstVel - - - ETH/UCY, SDDSocial-LSTM ✔ ✔ - ETH/UCY, SDDSocial-GAN ✔ ✔ - ETH/UCY, SDDSTGAT ✔ ✔ - ETH/UCY, SDDTrajectron ✔ ✔ - ETH/UCY, SDDEnv-LSTM - ✔ ✔ SDDSocial-STGCNN ✔ ✔ - ETH/UCYPECNet ✔ ✔ - ETH/UCY精度比較を行うモデルAttentionモデルPoolingモデル

データセット●ETH/UCY●SDD- 歩行者限定エポック数：300バッチサイズ：64最適化手法：Adam●学習率：0.001観測時刻：3.2秒予測時刻：4.8秒評価指標●Displacement Error，Collision rate●複数の予測経路をサンプリングする手法では，サンプリングした中で最良のものを使用実験条件48

ETH/UCYにおけるDisplacement Error49SceneMethod C-ETH ETH HOTEL UCY ZARA01 ZARA02 AVGLSTM 0.56 / 1.15 0.91 / 1.57 0.29 / 0.56 0.84 / 1.56 1.33 / 2.69 0.77 / 1.50 0.78 / 1.51RED 0.58 / 1.22 0.70 / 1.46 0.17 / 0.33 0.61 / 1.32 0.45 / 0.99 0.36 / 0.79 0.48 / 1.02ConstVel 0.57 / 1.26 0.57 / 1.26 0.19 / 0.33 0.67 / 1.43 0.49 / 1.07 0.50 / 1.09 0.50 / 1.07Social-LSTM 0.90 / 1.70 1.30 / 2.55 0.47 / 0.98 1.00 / 2.01 0.92 / 1.69 0.78 / 1.61 0.90 / 1.76Social-GAN 0.49 / 1.05 0.53 / 1.03 0.35 / 0.77 0.79 / 1.63 0.47 / 1.01 0.44 / 0.94 0.51 / 1.07PECNet 0.68 / 1.12 0.71 / 1.14 0.14 / 0.21 0.63 / 1.19 0.47 / 0.83 0.36 / 0.67 0.50 / 0.86STGAT 0.48 / 1.05 0.51 / 1.01 0.19 / 0.31 0.61 / 1.33 0.47 / 1.00 0.39 / 0.78 0.46 / 0.91Trajectron 0.52 / 1.14 0.56 / 1.18 0.26 / 0.51 0.63 / 1.37 0.50 / 1.05 0.39 / 0.84 0.48 / 1.02Social-STGCNN 0.68 / 1.27 0.83 / 1.35 0.22 / 0.34 0.84 / 1.46 0.61 / 1.12 0.54 / 0.93 0.62 / 1.08SingleModel20OutputsSingle Model：インタラクションを考慮しないREDが最も誤差を低減20 Outputs：ADEでSTGAT，FDEでPECNetが最も誤差を低減ADE / FDE [m]

ETH/UCYにおけるDisplacement Error50SceneMethod C-ETH ETH HOTEL UCY ZARA01 ZARA02 AVGLSTM 0.56 / 1.15 0.91 / 1.57 0.29 / 0.56 0.84 / 1.56 1.33 / 2.69 0.77 / 1.50 0.78 / 1.51RED 0.58 / 1.22 0.70 / 1.46 0.17 / 0.33 0.61 / 1.32 0.45 / 0.99 0.36 / 0.79 0.48 / 1.02ConstVel 0.57 / 1.26 0.57 / 1.26 0.19 / 0.33 0.67 / 1.43 0.49 / 1.07 0.50 / 1.09 0.50 / 1.07Social-LSTM 0.90 / 1.70 1.30 / 2.55 0.47 / 0.98 1.00 / 2.01 0.92 / 1.69 0.78 / 1.61 0.90 / 1.76Social-GAN 0.49 / 1.05 0.53 / 1.03 0.35 / 0.77 0.79 / 1.63 0.47 / 1.01 0.44 / 0.94 0.51 / 1.07PECNet 0.68 / 1.12 0.71 / 1.14 0.14 / 0.21 0.63 / 1.19 0.47 / 0.83 0.36 / 0.67 0.50 / 0.86STGAT 0.48 / 1.05 0.51 / 1.01 0.19 / 0.31 0.61 / 1.33 0.47 / 1.00 0.39 / 0.78 0.46 / 0.91Trajectron 0.52 / 1.14 0.56 / 1.18 0.26 / 0.51 0.63 / 1.37 0.50 / 1.05 0.39 / 0.84 0.48 / 1.02Social-STGCNN 0.68 / 1.27 0.83 / 1.35 0.22 / 0.34 0.84 / 1.46 0.61 / 1.12 0.54 / 0.93 0.62 / 1.08SingleModel20OutputsADE / FDE [m]Poolingモデルと比較してAttentionモデルによる経路予測手法が有効

LSTM RED ConstVel Social-LSTM Social-GAN PECNet STGAT Trajectron Social-STGCNN動的物体 0.42 0.78 0.55 0.89 0.99 0.71 1.10 0.54 1.63静的物体 0.08 0.07 0.09 0.16 0.08 0.12 0.08 0.13 0.16ETH/UCYにおけるCollision rateと予測結果例51ModelObjectConstVelLSTMPECNetREDSocial-GANSocial-LSTMSTGATSocial-STGCNNTrajectron入力値真値予測値Collision rate [%]

LSTM RED ConstVel Social-LSTM Social-GAN PECNet STGAT Trajectron Social-STGCNN動的物体 0.42 0.78 0.55 0.89 0.99 0.71 1.10 0.54 1.63静的物体 0.08 0.07 0.09 0.16 0.08 0.12 0.08 0.13 0.16ETH/UCYにおけるCollision rateと予測結果例52ModelObject入力値真値予測値Collision rate [%]ConstVelLSTMPECNetREDSocial-GANSocial-LSTMSTGATSocial-STGCNNTrajectron予測誤差が低い手法 ≠ 衝突率が低い手法- 真値と類似しない場合に予測誤差は増加するが，衝突率は減少することが起こり得る動的物体に関するCollision rateで衝突していないと判定された経路動的物体に関するCollision rateで衝突したと判定された経路

SDDにおけるDisplacement Error53SceneMethod bookstore coupa deathCircle gates hyang little nexus quad AVGLSTM 7.00 / 14.8 8.44 / 17.5 7.52 / 15.9 5.78 / 11.9 8.78 / 18.4 10.8 / 23.1 6.61 / 13.1 16.1 / 30.2 8.88 / 18.1RED 7.91 / 17.1 9.51 / 20.4 8.22 / 17.8 5.72 / 12.0 9.14 / 19.5 11.8 / 25.8 6.24 / 12.7 4.81 / 10.9 7.92 / 17.0ConstVel 6.63 / 12.9 8.17 / 16.2 7.29 / 14.0 5.76 / 10.9 9.21 / 18.1 10.9 / 22.1 7.14 / 13.7 5.31 / 8.89 7.56 / 14.6Social-LSTM 33.6 / 74.0 34.8 / 76.0 33.4 / 74.7 35.6 / 83.2 35.4 / 75.9 36.7 / 77.5 32.3 / 71.3 32.4 / 71.3 34.3 / 75.5Env-LSTM 13.5 / 30.1 17.2 / 36.7 15.3 / 32.2 17.3 / 36.5 12.6 / 27.9 14.2 / 31.1 10.8 / 24.0 8.05 / 19.0 13.8 / 30.0Social-GAN 18.4 / 36.7 19.5 / 39.1 18.6 / 37.2 18.6 / 37.3 20.1 / 40.6 20.0 / 40.8 18.1 / 36.3 13.2 / 26.2 18.3 / 36.8STGAT 7.58 / 14.6 9.00 / 17.4 7.57 / 14.4 6.33 / 11.7 9.17 / 17.9 10.9 / 21.8 7.37 / 14.0 4.83 / 7.95 7.85 / 15.0Trajectron 6.18 / 13.2 7.24 / 15.5 6.43 / 13.6 6.29 / 13.0 7.72 / 16.5 9.38 / 20.8 6.55 / 13.4 6.80 / 15.1 7.07 / 15.1SingleModel20OutputsADE / FDE [pixel]Single Model：Deep Learningを使用しないConstVelが最も誤差を低減20 Outputs：ADEでTrajectron，FDEでSTGATが最も誤差を低減

撮影箇所に影響される●ETH/UCYは低所で撮影- 人の経路がsensitiveになる●SDDは高所で撮影- 人の経路がinsensitiveになる- 人の動きが線形になり，線形予測するConstVelの予測誤差が低下SDDでConstVelの予測誤差が低い要因は何か54ETH/UCY SDD

SDDにおけるCollision rateと予測結果例55ModelObjectCollision rate [%]Social-GANRED ConstVel TrajectronSTGATLSTM Social-LSTM真値入力値予測値Env-LSTMConstVelLSTM RED Social-GANSocial-LSTM STGAT Trajectron Env-LSTM入力値真値予測値LSTM RED ConstVel Social-LSTM Social-GAN STGAT Trajectron Env-LSTM動的物体 10.71 11.27 10.98 15.12 10.91 10.97 12.80 12.12静的物体 2.82 2.86 2.40 20.33 6.41 2.17 1.71 1.58

SDDにおけるCollision rateと予測結果例56ModelObjectSocial-GANRED ConstVel TrajectronSTGATLSTM Social-LSTM真値入力値予測値Env-LSTMConstVelLSTM RED Social-GANSocial-LSTM STGAT Trajectron Env-LSTM入力値真値予測値LSTM RED ConstVel Social-LSTM Social-GAN STGAT Trajectron Env-LSTM動的物体 10.71 11.27 10.98 15.12 10.91 10.97 12.80 12.12静的物体 2.82 2.86 2.40 20.33 6.41 2.17 1.71 1.58環境情報を導入することで，障害物との接触を避ける経路予測が可能Collision rate [%]

Deep Learningの発展によるデータセットの大規模化経路予測の評価指標の再考●予測精度が良い最も良い予測手法●コミュニティ全体で考え直す必要がある複数経路を予測するアプローチの増加●Multiverseを筆頭に複数経路を重視する経路予測手法が増加する？今後の経路予測は？572016interactionmultimodal pathsother2020Social-LSTM[A. Alahi+, CVPR, 2016]DESIRE[N. Lee+, CVPR, 2017]Conv.Social-Pooling[N. Deo+, CVPRW, 2018]SoPhie[A. Sadeghian+, CVPR, 2019]Social-BiGAT[V. Kosaraju+, NeurIPS, 2019]Social-STGCNN[A. Mohamedl+, CVPR, 2020]Social-GAN[A. Gupta+, CVPR, 2018]Next[J. Liang+, CVPR, 2019]STGAT[Y. Huang+, ICCV, 2019]Trajectron[B. Ivanovic+, ICCV, 2019]Social-Attention[A. Vemula+, ICRA, 2018]Multi-Agent Tensor Fusion[T. Zhao+, CVPR, 2019]MX-LSTM[I. Hasan+, CVPR, 2018]CIDNN[Y. Xu+, CVPR, 2018]SR-LSTM[P. Zhang+, CVPR, 2019]Group-LSTM[N. Bisagno+, CVPR, 2018]Reciprocal Network[S. Hao+, CVPR, 2020]PECNet[K. Mangalam+, ECCV, 2020]RSBG[J. SUN+, CVPR, 2020]STAR[C. Yu+, ECCV, 2020]Behavior CNN[S. Yi+, ECCV, 2016]Future localization in ﬁrst-person videos[T. Yagi+, CVPR, 2018]Fast and Furious[W. Luo+, CVPR, 2018]OPPU[A. Bhattacharyya+, CVPR, 2018]Object Attributes and Semantic Segmentation[H. Minoura+, VISAPP, 2019]Rule of the Road[J. Hong+, CVPR, 2019] Multiverse[J. Liang+, CVPR, 2020]Trajectron++[T. Salzmann+, ECCV, 2020]Multimodal paths + (interaction)・・・

Deep Learningを用いた経路予測手法の動向調査●各カテゴリに属した経路予測手法の特徴を調査- インタラクションあり• Poolingモデル• Attentionモデル- インタラクションなし (Other)●定量的評価のためのデータセット，評価指標を紹介- Deep Learningの発展により大規模なデータセットが増加●代表的モデルを使用して，各モデルの精度と予測結果について議論- AttentionモデルはPoolingモデルより予測誤差が低い- 最も衝突率が低いモデル予測誤差が低いモデル- SensitiveなデータセットでDeep Learningによる予測手法は効果的まとめ58

Deep Learningを用いた経路予測の研究動向

Movatterモバイル変換

Change Language

Deep Learningを用いた経路予測の研究動向

Recommended

More Related Content

What's hot

Similar to Deep Learningを用いた経路予測の研究動向

Recently uploaded

In this document

Deep Learningを用いた経路予測の研究動向