1. This article proposes a Multi-modal Fusion Network (M2FNet) for Emotion Recognition in Conversations (ERC).
2. The M2FNet extracts emotion-relevant features from visual, audio, and text modalities.
3. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data; a simplified sketch of such a loss is shown after this list.
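To make the third point concrete, the following is a minimal PyTorch-style sketch of what an adaptive-margin triplet loss can look like. The specific adaptation rule (scaling the margin by anchor-negative cosine similarity) and the function and parameter names are illustrative assumptions, not the exact formulation used in M2FNet.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(anchor, positive, negative,
                                 base_margin=0.2, scale=0.5):
    """Illustrative adaptive-margin triplet loss.

    anchor, positive, negative: (batch, dim) embedding tensors,
    e.g. audio or visual utterance features.
    NOTE: the adaptive rule below is an assumption for this sketch,
    not the margin definition from the M2FNet paper.
    """
    # Euclidean distances for anchor-positive and anchor-negative pairs.
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)

    # Hypothetical adaptive margin: enlarge the margin when the negative
    # is close to the anchor (a "hard" negative), otherwise keep it small.
    sim_an = F.cosine_similarity(anchor, negative)
    margin = base_margin + scale * torch.clamp(sim_an, min=0.0)

    # Standard hinge-style triplet objective with a per-sample margin.
    loss = torch.clamp(d_ap - d_an + margin, min=0.0)
    return loss.mean()

# Example usage with random embeddings standing in for modality features.
if __name__ == "__main__":
    a, p, n = (torch.randn(8, 128) for _ in range(3))
    print(adaptive_margin_triplet_loss(a, p, n))
```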
The article appears reliable and trustworthy: it supports its claims with experimental results on two benchmark datasets, MELD and IEMOCAP. The authors also describe the proposed method in detail, fusing audio, visual, and text features for emotion recognition in conversations, and they analyse their results against existing methods, which further adds to the article's credibility.
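As a rough illustration of how audio, visual, and text utterance features can be combined for emotion classification, here is a simple late-fusion baseline. It is not presented as M2FNet's actual fusion mechanism; the feature dimensions and the seven-class output (as in MELD) are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class SimpleFusionClassifier(nn.Module):
    """Minimal late-fusion baseline: project each modality, concatenate,
    and classify. Dimensions and class count are illustrative assumptions,
    not values taken from the M2FNet paper."""

    def __init__(self, text_dim=768, audio_dim=128, visual_dim=512,
                 hidden=256, num_emotions=7):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.classifier = nn.Linear(3 * hidden, num_emotions)

    def forward(self, text_feat, audio_feat, visual_feat):
        # Project each modality into a shared size, then concatenate.
        fused = torch.cat([
            torch.relu(self.text_proj(text_feat)),
            torch.relu(self.audio_proj(audio_feat)),
            torch.relu(self.visual_proj(visual_feat)),
        ], dim=-1)
        return self.classifier(fused)  # emotion logits per utterance

# Example usage with random per-utterance features for a batch of 4.
if __name__ == "__main__":
    model = SimpleFusionClassifier()
    logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
    print(logits.shape)  # (4, 7)
```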
However, some potential biases should be noted. The authors do not discuss possible risks of the proposed method or limitations that may arise from its reliance on multiple modalities. They also do not explore counterarguments or present both sides equally when comparing their results with existing methods. Finally, the article shows no sign of promotional content or partiality that would undermine its trustworthiness and reliability.