The paper studies margin-based loss functions from deep metric learning applied to the problem of recognizing a person's emotional state from voice data. The focus is on loss functions widely used in face recognition, such as ArcFace, CosFace, SphereFace, and AM-Softmax, which are rarely applied to speech emotion analysis. The experiments used a model combining LSTM and CNN layers, trained on the RAVDESS dataset. SphereFace achieved the highest Top-1 and Top-5 accuracy on both the training and test sets, outperforming the other loss functions. The results show that applying these loss functions improves the classification of emotional states, opening up prospects for their use in real-world applications such as health care and customer-service automation.
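Since the margin-based loss functions are central to the study, the sketch below illustrates an additive angular margin (ArcFace-style) classification head in PyTorch. This is a minimal illustration, not the paper's implementation: the class name, embedding size, and the scale `s` and margin `m` defaults are assumptions chosen for clarity.

```python
# Minimal sketch of an ArcFace-style additive angular margin head.
# Hyperparameters (s, m) and names are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    """Produces s * cos(theta + m) logits for the target class, where theta is
    the angle between the L2-normalized embedding and class weight vector."""

    def __init__(self, embed_dim: int, num_classes: int,
                 s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized embeddings and class weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        target_mask = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target_mask, torch.cos(theta + self.m), cos)
        return self.s * logits  # pass to nn.CrossEntropyLoss

# Usage: logits = head(embeddings, labels); loss = F.cross_entropy(logits, labels)
```

The additive angular margin penalizes the target-class angle directly, which is what gives these losses tighter intra-class clustering than plain softmax; CosFace/AM-Softmax and SphereFace differ mainly in whether the margin is applied additively to the cosine or multiplicatively to the angle.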