Response to Reviewer 3

Q: I have several concerns regarding to validity of the proposed approach. - P1. Comparing Table 2 (training with full set) and Table 3 (training with distilled data), it has a large gap for the UAR performance. Since the model architectures are still the same (CNN-6, ResNet-9 and VGG-15), I see neither efficiency benefit (e.g., memory reduction) nor performance improvements. Why don’t we just deploy the CNN-6 model trained with full dataset?

A: In this manuscript, we employ dataset distillation, rather than model distillation, to achieve comparable model performance with far fewer data samples. The model architectures are unchanged, but the training set is much smaller, which means lower data storage consumption and a faster model development process. To make this clearer, we have added ‘IPC = 150 means merely 28.2 % of the training data samples and 14.9 % of the training plus validation data samples’ to the caption of Table 3.
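
As a rough illustration of how such a percentage relates to IPC, the sketch below assumes the distilled set holds exactly IPC samples for each class, so its share of the full set is IPC times the number of classes divided by the full set size. The function name `distilled_fraction` and the class count and set size used in the loop are hypothetical placeholders, not the values from the manuscript.

```python
def distilled_fraction(ipc: int, num_classes: int, full_size: int) -> float:
    """Return the fraction of `full_size` samples kept when the distilled
    set holds `ipc` samples for each of `num_classes` classes."""
    if ipc < 0 or num_classes <= 0 or full_size <= 0:
        raise ValueError("ipc must be >= 0; num_classes and full_size must be > 0")
    return (ipc * num_classes) / full_size


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT the manuscript's dataset sizes.
    NUM_CLASSES = 4
    TRAIN_SIZE = 2000
    for ipc in (1, 5, 10, 50, 100, 150):
        share = distilled_fraction(ipc, NUM_CLASSES, TRAIN_SIZE)
        print(f"IPC={ipc:>3}: {share:.1%} of the training samples")
```

Substituting the actual number of emotion classes and the actual training set size gives the share that corresponds to each IPC value.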

Q: - P2. When it comes to real-world deployment on edge devices, the model generalization capability is critical. Based on the observation from P1, the drops could be more significant when conduct out-of-domain data evaluation.

A: Table 3 shows that the distilled data are transferable: data distilled under the guidance of one model can be used to develop another model efficiently. This observation is discussed above Table 3 in the original manuscript. In this manuscript, we focus on data distillation and the transferability/generalisation of the distilled data, rather than on model distillation/generalisation.

Q: - P3. No baseline comparison with other data distillation or efficient ML (e.g., pruning) approaches. Table 2 only shows the teacher model performance trained on the full dataset. Everything else are only self-compared to the proposed method.

A: We compare with a model distillation method on the DEMoS dataset in Table 2. To the best of the authors’ knowledge, this is the first dataset distillation work in the Speech Emotion Recognition area; we therefore focus on comparison with models developed on the full dataset (Tables 2 and 3) and on analysis of the distillation process/settings (Figures 2 and 3).

Q: Difficult to read, requires major revision

A: We apologise for the confusion. We have checked the manuscript and made changes (in red) to improve readability.

Q: The experimental results are difficult to follow, there is no clear efficiency/accuracy trade-off comparison when it claims as a resource constraint ML topic.

A: We have added the percentage of data samples under different IPC values to the manuscript to emphasise the small volume of the distilled dataset. The trade-off between data volume and model performance can be seen directly in Tables 2 and 3.

Q: I am not convinced for the effectiveness of proposed method due to lacks of experiments and performance degeneration.

A: Performance does drop as a result of dataset distillation [loo2023dataset, sajedi2023datadam, liu2023dream]. This is the trade-off discussed above.

Response to Reviewer 4

Q: More configure about IPC could be set. I worried about 150 are not optimum

A: There is a trade-off between IPC and efficiency. A larger IPC would of course lead to better model performance, but it also requires a larger distilled dataset, which runs counter to the goal of dataset distillation. In many dataset distillation works [loo2023dataset, sajedi2023datadam, liu2023dream], IPC is sampled from {1, 10, 50} only; in this work, we extend the IPC values to {1, 5, 10, 50, 100, 150} for a better evaluation of the proposed method in SER.

Q: More mainstream database such as IEMOCAP could be considered.

A: We focus on the efficiency of the proposed method in the speech emotion recognition area and analyse the models’ performance in depth under different settings. We also run experiments to test the transferability of the distilled data. Because of the page limit, we cannot add another dataset at this time; we will verify the method on other databases in future work.

Q: Proposed methods could be explained according to flowchart or pseudo code.

A: A flowchart of the proposed framework is shown in Figure 1.

Response to Reviewer 6

Q: This study proposes a data distillation method using Teacher & Student Trajectories in SER, which takes into account the issues of (1) insufficient processing capability of edge devices, and (2) privacy protection in speech data by transmitting only model parameters from the edge device to the cloud. I believe this framework has novelty in the research field.

A: Thank you very much.

Q: The results of recognition experiments demonstrate that, for (1), even using a synthesized dataset with data distillation for trainings does not significantly decrease recognition performance.

A: Thank you very much.

Q: However, there is a lack of explanation and discussion regarding (2), which raises the following questions: - What needs to be prepared in advance for both the Cloud and edge device? - What computations are performed at recognition time on both the Cloud and edge device? - Specifically, what information is sent from the edge device to the Cloud using this method? - Does this process effectively protect user privacy? Descriptions addressing these points are also necessary.

A: Data distillation is carried out during model training, which is performed on edge devices in an edge computing setting. Data distillation not only extracts a ‘compressed’ dataset but also runs on edge devices rather than on a server, which protects user privacy.

The second issue you mentioned mainly concerns the computing resource constraints on edge devices discussed in the Introduction. Training models on the smaller-scale distilled, synthesised datasets diminishes the likelihood of privacy leakage, which effectively protects user privacy. However, the ‘cloud’ and ‘model parameters’ you mentioned relate more to federated learning, which we mention only in the Related Work section, in the context of prior work combining federated learning with data distillation.

Q: In this paper, the recognition results are favorable. I would like the authors to sincerely address the issues raised even in the Introduction in Discussion, leveraging these positive outcomes.

A: Thank you very much.