An Empirical Comparison of Generative Approaches
for Product Attribute-Value Identification

Kassem Sabeh Free University of Bozen-Bolzano, Italy
{ksabeh, jgamper}@unibz.it
Robert Litschko LMU Munich, Germany
{robert.litschko, b.plank}@lmu.de
Mouna Kacimi Wonder Technology Srl, Italy
[email protected]
Barbara Plank LMU Munich, Germany
{robert.litschko, b.plank}@lmu.de
Johann Gamper Free University of Bozen-Bolzano, Italy
{ksabeh, jgamper}@unibz.it
Abstract

Product attributes are crucial for e-commerce platforms, supporting applications like search, recommendation, and question answering. The task of Product Attribute and Value Identification (PAVI) involves identifying both attributes and their values from product information. In this paper, we formulate PAVI as a generation task and provide, to the best of our knowledge, the most comprehensive evaluation of PAVI so far. We compare three different attribute-value generation (AVG) strategies based on fine-tuning encoder-decoder models on three datasets. Experiments show that the end-to-end AVG approach, which is computationally efficient, outperforms the other strategies. However, there are differences depending on model size and the underlying language model. The code to reproduce all experiments is available at: https://github.com/kassemsabeh/pavi-avg



1 Introduction

Product attributes are a crucial component of e-commerce platforms, facilitating applications such as product search Chen et al. (2023), product recommendation Truong et al. (2022), and product-related question answering Deng et al. (2023). They provide useful details about product features, enabling customers to compare products and make informed purchasing decisions. Product attribute and value identification (PAVI) refers to the task of identifying both the attributes and their corresponding values from an input context, such as a product title or description. For example, given the product title "Fossil Men’s Watch Analog Display Slim Case Design with Brown Leather Band" (see Figure 1), a model should identify the attributes Brand, Band Color, and Band Material, with the corresponding values Fossil, Brown, and Leather.

Most existing work focuses on product attribute-value extraction (PAVE) Zheng et al. (2018); Xu et al. (2019); Wang et al. (2020); Yang et al. (2022), which extracts the value of a given attribute from the input context. Despite extensive research on PAVE Blume et al. (2023); Yang et al. (2023); Brinkmann et al. (2023), PAVI is a more realistic and complex task, since the attribute must be generated rather than assumed to be part of the input. While recent studies have explored generative models for PAVI, these efforts are limited in scope and often lack comprehensive evaluation across different datasets and settings Roy et al. (2024); Shinzato et al. (2023). Moreover, existing work focuses primarily on end-to-end models without exploring alternative generative strategies. Consequently, it remains unclear which types of PAVI models are effective in practice, as comprehensive experiments and comparisons are lacking.

Figure 1: An example of a product title with tagged attribute-value pairs.

In this paper, we address these gaps by proposing three generative approaches for PAVI and conducting a comprehensive evaluation across multiple datasets. Inspired by recent advances in question and answer generation methods Bartolo et al. (2021), we compare three strategies based on fine-tuning encoder-decoder language models such as T5 Raffel et al. (2020) and BART Lewis et al. (2020). Our proposed approaches are: (1) pipeline attribute-value generation (AVG), which decomposes the task into value extraction and attribute generation, and builds a separate model for each sub-task; (2) multitask AVG, which uses a single shared model that is trained on both sub-tasks; (3) end2end AVG, which uses a single model to generate the attribute-value pairs. We evaluate the performance of these approaches on three real-world product datasets: AE-110K, OA-Mine, and MAVE. All the models and datasets are publicly released on HuggingFace (https://huggingface.co/av-generation) and available as a demo (https://bit.ly/4bWFjNV).

2 Related Work

Pipe. Multi. E2E AE-110k OA-MINE MAVE Open
Shinzato et al. (2023)
Roy et al. (2024)
Ours
Table 1: Comparison between our work and prior studies for generative-based PAVI.

Most existing approaches for attribute-value extraction use sequence tagging Huang et al. (2015); Xu et al. (2019); Yan et al. (2021); Zheng et al. (2018) or question answering Wang et al. (2020); Yang et al. (2022); Ding et al. (2022); Hu et al. (2022); Sabeh et al. (2022); Yang et al. (2023) methods. However, such approaches carry a closed-world assumption, as they require the set of attributes as input in order to extract the corresponding values. More recently, researchers have explored the capabilities of generative models to tackle the PAVI task in an open-world setting. Roy et al. (2024) proposed a generative framework for joint attribute and value extraction. They conduct experiments on the AE-110k dataset and show that generative approaches surpass question-answering-based methods. Shinzato et al. (2023) fine-tune a pre-trained T5 generative model Raffel et al. (2020) to decode a set of target attribute-value pairs from the input product text of the MAVE dataset Yang et al. (2022). They show that the generative approach outperforms extraction- and classification-based methods Chen et al. (2022).

However, all of the above studies use an end-to-end generative approach and do not explore other generative strategies for attribute-value identification (i.e., pipeline and multi-task). In addition, these approaches are not directly comparable, as they differ in datasets, settings, and evaluation metrics. Finally, none of the models proposed above has been made publicly available. In this work, we propose three generative approaches for PAVI and empirically compare them on three real-world datasets. We summarize how our approach differs from prior work in Table 1. As can be seen, we evaluate all three approaches on all three datasets.

3 Proposed Methods

Figure 2: Overview of the proposed AVG approaches.

Given input product data (title or description) $x=\{x_{1},x_{2},\ldots,x_{|x|}\}$, attribute-value generation aims to generate attribute-value pairs $\mathcal{Q}_{x}$ related to the information in $x$:

$$\mathcal{Q}_{x}=\{(a^{1},v^{1}),(a^{2},v^{2}),(a^{3},v^{3}),\ldots\} \tag{1}$$

For instance, if $x$ = ["Fossil", …, "Band"], then $\mathcal{Q}_{x}$ = {("Brand","Fossil"), ("Band Color","Brown"), ("Band Material","Leather")}.

We formulate the attribute-value identification problem as an attribute-value generation (AVG) task and propose three approaches based on fine-tuning language models, as depicted in Figure 2.

3.1 Pipeline AVG

The AVG task can be decomposed into two simpler sub-tasks: value extraction (VE) and attribute generation (AG). The VE model $P_{ve}$ first generates the value candidate $\tilde{v}$ from $x$. Then, the AG model $P_{ag}$ generates an attribute $\tilde{a}$ whose value is $\tilde{v}$ in the input $x$. The VE and AG models can be trained independently on a product dataset consisting of triplets $(x,a,v)$ by maximizing the conditional log-likelihood:

$$\tilde{v}=\operatorname*{arg\,max}_{v}P_{ve}(v\mid x) \tag{2}$$
$$\tilde{a}=\operatorname*{arg\,max}_{a}P_{ag}(a\mid x,v) \tag{3}$$

In practice, the VE model input is $[x_{1},x_{2},\ldots,x_{|x|}]$, where $x_{i}$ is the $i$-th token of the product input $x$ and $|\cdot|$ represents the number of tokens in the sequence. The input to the AG model takes the value into account by highlighting it inside the input. Specifically, following previous work Chan and Fan (2019); Ushio et al. (2023), we introduce a highlight token <hl> that marks the value span:

$$[x_{1},\ldots,\texttt{<hl>},v_{1},\ldots,v_{|v|},\texttt{<hl>},\ldots,x_{|x|}]$$

where $v_{i}$ is the $i$-th token of $v$. At inference, we simply replace the gold value $v$ of the AG model by the prediction from the VE model, and run the inference over the product context $x$. For example, if the VE model extracts "Leather" from the input $x$, we highlight "Leather" and feed it to the AG model as: ["Fossil", …, <hl>, "Leather", <hl>, …, "Band"]. Thus, the pipeline approach generates at most one attribute-value pair per product context $x$.
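The highlighting step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper name `highlight_value` is hypothetical, and whitespace tokenization is assumed purely for readability (the actual models operate on subword tokens).

```python
HL = "<hl>"  # highlight token named in the text


def highlight_value(tokens, value_tokens, hl=HL):
    """Wrap the first occurrence of value_tokens inside tokens with highlight tokens."""
    n = len(value_tokens)
    if n == 0:
        raise ValueError("value must contain at least one token")
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == value_tokens:
            return tokens[:start] + [hl] + value_tokens + [hl] + tokens[start + n:]
    raise ValueError("value not found in the input tokens")


# Assumption: simple whitespace tokenization of the running example title.
title = "Fossil Men's Watch Analog Display Slim Case Design with Brown Leather Band".split()
ag_input = highlight_value(title, ["Leather"])
print(" ".join(ag_input))
```

At inference, the same helper would be called with the value predicted by the VE model instead of the gold value.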

To allow the pipeline approach to generate multiple attribute-value pairs, we can convert the values into a flattened sentence $y$, and fine-tune a sequence-to-sequence model to generate $y$ from $x$. Formally, we define a function $\mathcal{L}$ that maps $\mathcal{Q}_{x}$ to a sentence as:

$$\mathcal{L}(\mathcal{Q}_{x})="v_{1}|v_{2}|v_{3}\ldots". \tag{4}$$

In this case, the VE model generates a set of possible values, and for each value we run the AG model to obtain a set of attribute-value pairs.
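The two-stage inference loop can be sketched as follows. The function `pipeline_avg` and the toy stand-ins `toy_ve` and `toy_ag` are hypothetical; in practice each callable would wrap a fine-tuned seq2seq model. Deduplicating repeated values is also an assumption, not something the text specifies.

```python
from typing import Callable, List, Tuple


def pipeline_avg(x: str,
                 extract_values: Callable[[str], str],
                 generate_attribute: Callable[[str, str], str],
                 sep: str = "|") -> List[Tuple[str, str]]:
    """Run VE once, split the flattened output, then run AG once per value."""
    flat = extract_values(x)
    values: List[str] = []
    for v in flat.split(sep):
        v = v.strip()
        if v and v not in values:  # assumption: drop empty and repeated values
            values.append(v)
    return [(generate_attribute(x, v), v) for v in values]


# Toy stand-ins for the two fine-tuned models (illustration only).
def toy_ve(x: str) -> str:
    return "Fossil | Brown | Leather"


def toy_ag(x: str, v: str) -> str:
    lookup = {"Fossil": "Brand", "Brown": "Band Color", "Leather": "Band Material"}
    return lookup[v]


pairs = pipeline_avg("Fossil Men's Watch with Brown Leather Band", toy_ve, toy_ag)
print(pairs)
```

Note that one product requires one VE call plus one AG call per extracted value, which is why inference is costlier than a single end-to-end pass.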

3.2 Multitask AVG

Instead of training a separate generative model for each sub-task, we can use a single shared model that is fine-tuned in a multi-task learning setting. Specifically, we mix the training instances for the VE and AG tasks together, and randomly sample a batch at each iteration of seq2seq fine-tuning. We distinguish the tasks by adding a prefix to the beginning of the input text: "extract value" for the VE task, and "generate attribute" for the AG task.
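A minimal sketch of how such a mixed training set could be assembled is shown below. The two prefixes come from the text; the `": "` delimiter after the prefix, the `" | "` value separator, the string-level highlighting, and the helper names `build_instances` and `sample_batches` are assumptions made for illustration.

```python
import random
from collections import OrderedDict

VE_PREFIX = "extract value"        # prefix for value extraction
AG_PREFIX = "generate attribute"   # prefix for attribute generation


def build_instances(triplets, hl="<hl>", sep=" | "):
    """Turn (x, a, v) triplets into prefixed (source, target) pairs for both tasks."""
    by_context = OrderedDict()
    for x, a, v in triplets:
        by_context.setdefault(x, []).append((a, v))
    instances = []
    for x, pairs in by_context.items():
        # VE instance: one per context, target is the flattened value list.
        instances.append((f"{VE_PREFIX}: {x}", sep.join(v for _, v in pairs)))
        # AG instances: one per value, with the value highlighted in the input.
        for a, v in pairs:
            highlighted = x.replace(v, f"{hl} {v} {hl}", 1)
            instances.append((f"{AG_PREFIX}: {highlighted}", a))
    return instances


def sample_batches(instances, batch_size, seed=0):
    """Shuffle the mixed instances so each batch can contain both tasks."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


x = "Fossil Men's Watch with Brown Leather Band"
triplets = [(x, "Brand", "Fossil"), (x, "Band Color", "Brown"), (x, "Band Material", "Leather")]
instances = build_instances(triplets)
batches = sample_batches(instances, batch_size=2)
```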

3.3 End2End AVG

Instead of breaking the AVG task into two sub-tasks, we can model it directly by transforming the target attribute-value pairs into a flattened sentence $z$ and fine-tuning a seq2seq model to generate $z$ from $x$. We define a function $\mathcal{T}$ that maps the target $\mathcal{Q}_{x}$ to a sentence as:

$$\mathcal{T}(\mathcal{Q}_{x})="\{t(a^{1},v^{1})|t(a^{2},v^{2})|\ldots\}". \tag{5}$$
$$t(a,v)="\texttt{attribute}:\{a\},\texttt{value}:\{v\}" \tag{6}$$

We use the template $t$ to textualize the attribute-value pairs and separate them using a separator |. The end2end AVG model $P_{avg}$ is optimized by maximizing the conditional log-likelihood:

$$\tilde{z}=\operatorname*{arg\,max}_{z}P_{avg}(z\mid x) \tag{7}$$
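The linearization in Equations (5) and (6), together with the inverse step needed to recover pairs from a generated sentence, can be sketched as follows. The exact spacing around the separator, the reading of the braces as placeholders rather than literal characters, and the parsing regex are assumptions; the function names `linearize` and `parse` are hypothetical.

```python
import re


def t(a, v):
    """Template from Eq. (6): textualize a single attribute-value pair."""
    return f"attribute: {a}, value: {v}"


def linearize(pairs, sep=" | "):
    """Eq. (5): join the textualized pairs into one target sentence z."""
    return sep.join(t(a, v) for a, v in pairs)


# Assumption: generated chunks follow the template closely enough for this regex.
_PAIR = re.compile(r"^attribute:\s*(.+?),\s*value:\s*(.+)$")


def parse(z, sep="|"):
    """Recover (attribute, value) pairs from a generated sentence, skipping malformed chunks."""
    pairs = []
    for chunk in z.split(sep):
        match = _PAIR.match(chunk.strip())
        if match:
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs


gold = [("Brand", "Fossil"), ("Band Color", "Brown"), ("Band Material", "Leather")]
z = linearize(gold)
print(z)
```

A value that itself contains the separator character would be split incorrectly by this sketch, so a real implementation would need an escaping scheme or a rarer separator.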

4 Experimental Settings

Datasets. We use three real-world datasets.

  • AE-110K Xu et al. (2019): This dataset contains tuples of product titles, attributes, and values from the AliExpress Sports & Entertainment category. Instances with NULL values are removed, resulting in 39,505 products with 2,045 unique attributes and 10,977 unique values.

  • MAVE Yang et al. (2022): This is a large and diverse dataset compiled from the Amazon Review Dataset Ni et al. (2019). We remove negative examples from the MAVE dataset, i.e., examples in which the attributes have no values. The final dataset contains around 2.9M attribute-value annotations from 2.2M cleaned Amazon products.

  • OA-Mine Zhang et al. (2022): We use the human-annotated dataset, which contains 1,943 products from 10 product categories. No further processing is applied to this dataset.

We randomly split all datasets in train:val:test = 8:1:1. The splits are stratified by product category. Appendix A shows statistics of the three datasets.
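One way to realize a category-stratified 8:1:1 split is sketched below using only the standard library. The function name `stratified_split`, the fixed seed, and the rounding rule (floor for train and validation, remainder to test) are assumptions; the exact splits used in the experiments are those released in the repository.

```python
import random
from collections import defaultdict


def stratified_split(items, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split items into train/val/test so each category is divided in the same proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    train, val, test = [], [], []
    for category in sorted(groups):
        members = groups[category]
        rng.shuffle(members)
        n_train = int(len(members) * ratios[0])
        n_val = int(len(members) * ratios[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy data: 20 products in one category and 10 in another.
products = [("watches", i) for i in range(20)] + [("shoes", i) for i in range(10)]
train, val, test = stratified_split(products, key=lambda p: p[0])
```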

| Model | Approach | AE-110k P | AE-110k R | AE-110k F1 | OA-Mine P | OA-Mine R | OA-Mine F1 | MAVE P | MAVE R | MAVE F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| T5 Small | Pipeline | 94.61 | 70.62 | 80.88 | 69.85 | 76.10 | 72.84 | 91.51 | 89.60 | 90.55 |
| T5 Small | Multitask | 94.94 | 73.00 | 82.53 | 73.70 | 79.46 | 76.48 | 94.88 | 92.87 | 93.86 |
| T5 Small | End2End | 94.07 | 70.45 | 80.56 | 65.12 | 49.57 | 56.29 | 90.22 | 90.29 | 90.25 |
| T5 Small | Ensemble | 93.25 | 79.74 | 85.97 | 72.38 | 86.24 | 78.71 | 91.49 | 95.82 | 93.60 |
| T5 Base | Pipeline | 94.93 | 73.74 | 83.01 | 78.82 | 87.46 | 82.92 | 92.10 | 91.52 | 91.80 |
| T5 Base | Multitask | 95.50 | 74.55 | 83.74 | 79.83 | 89.22 | 84.26 | 96.19 | 94.10 | 95.14 |
| T5 Base | End2End | 95.61 | 74.44 | 83.71 | 79.63 | 82.36 | 80.98 | 90.31 | 91.01 | 90.65 |
| T5 Base | Ensemble | 93.82 | 81.27 | 87.10 | 79.11 | 94.58 | 86.15 | 91.72 | 96.76 | 94.18 |
| T5 Large | Pipeline | 94.15 | 73.83 | 82.76 | 78.76 | 88.70 | 83.43 | 92.34 | 91.32 | 91.82 |
| T5 Large | Multitask | 94.89 | 69.73 | 80.38 | 81.43 | 90.30 | 85.63 | 96.21 | 92.51 | 94.32 |
| T5 Large | End2End | 95.21 | 75.62 | 84.29 | 82.69 | 90.20 | 86.28 | 96.39 | 94.01 | 95.19 |
| T5 Large | Ensemble | 92.75 | 81.57 | 86.80 | 80.63 | 95.79 | 87.56 | 91.95 | 96.89 | 94.36 |
| BART Base | Pipeline | 95.00 | 70.73 | 81.09 | 76.25 | 85.05 | 80.41 | 91.20 | 89.87 | 90.53 |
| BART Base | Multitask | 95.07 | 71.66 | 81.72 | 78.78 | 87.27 | 82.81 | 89.92 | 90.74 | 90.33 |
| BART Base | End2End | 83.33 | 51.86 | 63.93 | 50.85 | 39.04 | 44.17 | 79.46 | 87.40 | 83.24 |
| BART Base | Ensemble | 92.71 | 78.82 | 85.21 | 77.30 | 92.16 | 84.08 | 90.53 | 96.20 | 93.27 |
| BART Large | Pipeline | 94.81 | 68.40 | 79.47 | 78.18 | 86.84 | 82.29 | 92.13 | 90.21 | 91.16 |
| BART Large | Multitask | 94.42 | 72.52 | 82.04 | 78.62 | 87.96 | 83.03 | 90.47 | 91.41 | 90.94 |
| BART Large | End2End | 63.02 | 46.66 | 53.62 | 48.83 | 37.24 | 42.26 | 77.29 | 86.45 | 81.61 |
| BART Large | Ensemble | 92.47 | 79.10 | 85.26 | 77.86 | 93.90 | 85.14 | 91.34 | 96.47 | 93.85 |

Table 2: Evaluation results of different attribute-value generation methods. The best score among the approaches for each language model is underlined, and the best result in each dataset across all models is in boldface.

Base Models. For all approaches (pipeline, multitask, and end2end), we experiment with the base language models T5 Raffel et al. (2020) and BART Lewis et al. (2020). We also compare the model weights t5-{small,base,large} and facebook/bart-{base,large} from HuggingFace (https://huggingface.co/). See Appendix B for hyper-parameter details.

Evaluation Metrics. Following previous works Yang et al. (2022); Shinzato et al. (2023), we use precision $P$, recall $R$, and $F_{1}$ score as evaluation metrics. The datasets may contain missing attribute-value pairs that the model might generate. To reduce the impact of such missing attribute-value pairs Shinzato et al. (2023), we discard predicted attribute-value pairs if there are no ground truth labels for the generated attributes.
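The evaluation protocol above can be sketched as follows. This is an illustrative reading rather than the authors' evaluation script: exact string matching of pairs, micro-averaging over products, and the function name `evaluate` are assumptions.

```python
def evaluate(golds, preds):
    """Micro-averaged P, R, F1 over products.

    Predicted pairs whose attribute has no ground-truth label for that
    product are discarded before scoring, as described in the text.
    """
    tp = n_pred = n_gold = 0
    for gold, pred in zip(golds, preds):
        gold = set(gold)
        gold_attrs = {a for a, _ in gold}
        kept = {(a, v) for a, v in pred if a in gold_attrs}
        tp += len(kept & gold)      # assumption: exact match on (attribute, value)
        n_pred += len(kept)
        n_gold += len(gold)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


golds = [[("Brand", "Fossil"), ("Band Color", "Brown"), ("Band Material", "Leather")]]
preds = [[("Brand", "Fossil"), ("Band Color", "Black"), ("Gender", "Men")]]
p, r, f1 = evaluate(golds, preds)
```

In this toy case the ("Gender", "Men") prediction is dropped because "Gender" has no gold label, so it neither helps nor hurts precision, whereas the wrong value for "Band Color" still counts as an error.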

5 Results

Table 2 provides the main results. In addition to the three approaches (i.e., pipeline, multitask, and end2end), we also provide an ensemble model that combines the generated attribute-value pairs from these approaches. Overall, among the individual approaches, T5 large (end2end) achieves the best scores across the three datasets. Additionally, the multitask approach performs well, often ranking second. There are several interesting observations in Table 2. First, while the end2end approach generally excels, there are instances where the pipeline or multitask approach outperforms it, especially with smaller model sizes. For example, for T5 small on the OA-Mine dataset, the multitask approach outperforms end2end with an $F_{1}$ score of 76.48 compared to 56.29. By analyzing the errors, we found that the end2end approach makes more errors in detecting attributes, which the multitask approach mitigates. This improvement is mainly because the multitask approach has been specifically trained on the task of attribute generation. Second, the influence of model size on performance is evident, with larger models generally achieving better results. For instance, T5 base outperforms T5 small on every dataset and approach, and T5 large does so in all cases except multitask AVG on AE-110k (80.38 vs. 82.53 $F_{1}$). A similar but weaker trend is seen with BART models. Third, among the AVG approaches, T5 works considerably better with end2end AVG than BART, which is not well suited to end2end use. A possible explanation is that T5 has observed sentences with structured observation due to its multitask pre-training objective, while BART did not encounter such training instances as it was trained only on a denoising sequence-to-sequence objective. Finally, there are notable differences in performance across the datasets. For instance, the MAVE dataset sees higher overall $F_{1}$ scores compared to the AE-110k and OA-Mine datasets. The higher results on the MAVE dataset can be attributed to its uniform annotation process using an ensemble of models, unlike the more varied human annotations in AE-110k and OA-Mine (see Appendix D for cross-dataset evaluation).

Ensemble models, which combine the generated attribute-value pairs across the three approaches, improve $F_{1}$ over every individual approach on AE-110k and OA-Mine; on MAVE, they improve over the best individual approach for BART but not for T5. For instance, in AE-110k, ensembling trades off a small amount of precision for substantial gains in recall, while in OA-Mine, precision remains stable with improved recall. In general, ensembling helps to identify more attributes and therefore enhances the $F_{1}$ score by increasing the recall. However, it slightly reduces precision due to challenges in extracting accurate values for these new attributes.
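The text does not spell out how the ensemble merges predictions; a minimal sketch under the assumption that it takes the deduplicated union of the pairs produced by the three approaches is given below. The function name `ensemble` is hypothetical.

```python
def ensemble(*prediction_lists):
    """Union of attribute-value pairs from several approaches, keeping first-seen order.

    Assumption: the ensemble is a plain set union with exact-match deduplication.
    """
    seen = set()
    merged = []
    for preds in prediction_lists:
        for pair in preds:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


pipeline_out = [("Brand", "Fossil"), ("Band Material", "Leather")]
multitask_out = [("Brand", "Fossil"), ("Band Color", "Brown")]
end2end_out = [("Brand", "Fossil"), ("Band Color", "Black")]
combined = ensemble(pipeline_out, multitask_out, end2end_out)
```

Under this reading, the union can only add pairs, which matches the reported pattern: recall cannot decrease, while conflicting values for the same attribute (here "Brown" and "Black") can lower precision.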

6 Conclusion

In this paper, we formalized PAVI as an attribute-value generation task and established three different AVG approaches. Using T5 and BART base models, we conducted experiments on three benchmark product datasets. Our evaluation demonstrates that end2end AVG, which generates attributes and values simultaneously, is generally more reliable. However, the pipeline and multitask approaches can offer advantages, particularly for smaller models and when using language models like BART.

Limitations

Our study has two main limitations. First, the datasets used in our experiments do not have standard splits. We randomly split the datasets as discussed in Section 4, but we have provided the exact data splits in our repository to ensure reproducibility and comparability. Second, the evaluation measures employed do not penalize over-generated attribute-value pairs. We assume that the datasets do not have all possible annotations, so the generative models might correctly identify new attribute-value pairs. However, in our evaluation, we discard these newly generated attribute-value pairs. As future work, we plan to develop methods for the automatic evaluation of newly generated attribute-value pairs.

References

  • Bartolo et al. (2021) Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, and Douwe Kiela. 2021. Improving question answering model robustness with synthetic adversarial data generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8830–8848.
  • Blume et al. (2023) Ansel Blume, Nasser Zalmout, Heng Ji, and Xian Li. 2023. Generative models for product attribute extraction. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 575–585.
  • Brinkmann et al. (2023) Alexander Brinkmann, Roee Shraga, and Christian Bizer. 2023. Product attribute value extraction using large language models. arXiv preprint arXiv:2310.12537.
  • Chan and Fan (2019) Ying-Hong Chan and Yao-Chung Fan. 2019. A recurrent bert-based model for question generation. In Proceedings of the 2nd workshop on machine reading for question answering, pages 154–162.
  • Chen et al. (2022) Wei-Te Chen, Yandi Xia, and Keiji Shinzato. 2022. Extreme multi-label classification with label masking for product attribute value extraction. In Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), pages 134–140.
  • Chen et al. (2023) Zhiyu Chen, Jason Choi, Besnik Fetahu, Oleg Rokhlenko, and Shervin Malmasi. 2023. Generate-then-retrieve: Intent-aware faq retrieval in product search. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 763–771.
  • Deng et al. (2023) Yang Deng, Wenxuan Zhang, Qian Yu, and Wai Lam. 2023. Product question answering in e-commerce: A survey. In The 61st Annual Meeting Of The Association For Computational Linguistics.
  • Diederik (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Ding et al. (2022) Yifan Ding, Yan Liang, Nasser Zalmout, Xian Li, Christan Grant, and Tim Weninger. 2022. Ask-and-verify: Span candidate generation and verification for attribute value extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 110–110.
  • Hu et al. (2022) Miaobo Hu, Jun Xiao, Yunfei Liu, Weihu Guo, and Xipeng Fan. 2022. Fusing attribute type features for attribute value extraction from product via question answering. In Proceedings of the 2022 5th International Conference on Machine Learning and Natural Language Processing, pages 179–184.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
  • Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 188–197.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67.
  • Roy et al. (2024) Kalyani Roy, Pawan Goyal, and Manish Pandey. 2024. Exploring generative frameworks for product attribute value extraction. Expert Systems with Applications, 243:122850.
  • Sabeh et al. (2022) Kassem Sabeh, Mouna Kacimi, and Johann Gamper. 2022. Cave: Correcting attribute values in e-commerce profiles. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 4965–4969.
  • Shinzato et al. (2023) Keiji Shinzato, Naoki Yoshinaga, Yandi Xia, and Wei-Te Chen. 2023. A unified generative approach to product attribute-value identification. arXiv preprint arXiv:2306.05605.
  • Truong et al. (2022) Quoc-Tuan Truong, Tong Zhao, Changhe Yuan, Jin Li, Jim Chan, Soo-Min Pantel, and Hady W Lauw. 2022. Ampsum: Adaptive multiple-product summarization towards improving recommendation captions. In Proceedings of the ACM Web Conference 2022, pages 2978–2988.
  • Ushio et al. (2023) Asahi Ushio, Fernando Alva-Manchego, and Jose Camacho-Collados. 2023. An empirical comparison of lm-based question and answer generation methods. In Findings of the Association for Computational Linguistics: ACL 2023, pages 14262–14272.
  • Wang et al. (2020) Qifan Wang, Li Yang, Bhargav Kanagal, Sumit Sanghai, D Sivakumar, Bin Shu, Zac Yu, and Jon Elsas. 2020. Learning to extract attribute value from product via question answering: A multi-task approach. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 47–55.
  • Xu et al. (2019) Huimin Xu, Wenting Wang, Xinnian Mao, Xinyu Jiang, and Man Lan. 2019. Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5214–5223.
  • Yan et al. (2021) Jun Yan, Nasser Zalmout, Yan Liang, Christan Grant, Xiang Ren, and Xin Luna Dong. 2021. Adatag: Multi-attribute value extraction from product profiles with adaptive decoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4694–4705.
  • Yang et al. (2023) Li Yang, Qifan Wang, Jingang Wang, Xiaojun Quan, Fuli Feng, Yu Chen, Madian Khabsa, Sinong Wang, Zenglin Xu, and Dongfang Liu. 2023. Mixpave: Mix-prompt tuning for few-shot product attribute value extraction. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9978–9991.
  • Yang et al. (2022) Li Yang, Qifan Wang, Zac Yu, Anand Kulkarni, Sumit Sanghai, Bin Shu, Jon Elsas, and Bhargav Kanagal. 2022. Mave: A product dataset for multi-source attribute value extraction. In Proceedings of the fifteenth ACM international conference on web search and data mining, pages 1256–1265.
  • Zhang et al. (2022) Xinyang Zhang, Chenwei Zhang, Xian Li, Xin Luna Dong, Jingbo Shang, Christos Faloutsos, and Jiawei Han. 2022. Oa-mine: Open-world attribute mining for e-commerce products with weak supervision. In Proceedings of the ACM Web Conference 2022, pages 3153–3161.
  • Zheng et al. (2018) Guineng Zheng, Subhabrata Mukherjee, Xin Luna Dong, and Feifei Li. 2018. Opentag: Open attribute value extraction from product profiles. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1049–1058.
| Counts | AE-110K | OA-Mine | MAVE |
|---|---|---|---|
| # products | 39,505 | 1,943 | 2,226,509 |
| # attribute-value pairs | 88,915 | 11,008 | 2,987,151 |
| # unique categories | 10 | 10 | 1,257 |
| # unique attributes | 2,045 | 51 | 705 |
| # unique values | 10,977 | 5,201 | 79,199 |

Table 3: Statistics of AE-110K, OA-Mine, and MAVE datasets.
| Model | Approach | Epochs | LR | Batch Size |
|---|---|---|---|---|
| T5 small | Pipeline (VE) | 9 | 5e-5 | 128 |
| T5 small | Pipeline (AG) | 11 | 5e-5 | 128 |
| T5 small | Multitask | 16 | 5e-4 | 256 |
| T5 small | End2End | 18 | 5e-4 | 256 |
| T5 base | Pipeline (VE) | 8 | 5e-4 | 64 |
| T5 base | Pipeline (AG) | 7 | 5e-4 | 64 |
| T5 base | Multitask | 8 | 5e-4 | 128 |
| T5 base | End2End | 11 | 5e-4 | 64 |
| T5 large | Pipeline (VE) | 6 | 5e-5 | 128 |
| T5 large | Pipeline (AG) | 5 | 5e-4 | 64 |
| T5 large | Multitask | 5 | 1e-4 | 64 |
| T5 large | End2End | 8 | 1e-4 | 64 |
| BART base | Pipeline (VE) | 5 | 5e-5 | 64 |
| BART base | Pipeline (AG) | 4 | 1e-4 | 128 |
| BART base | Multitask | 4 | 1e-4 | 64 |
| BART base | End2End | 6 | 5e-4 | 128 |
| BART large | Pipeline (VE) | 6 | 5e-5 | 64 |
| BART large | Pipeline (AG) | 4 | 5e-5 | 128 |
| BART large | Multitask | 3 | 1e-5 | 64 |
| BART large | End2End | 7 | 1e-5 | 64 |

Table 4: Hyper-parameter details, including number of training epochs, learning rate (LR), and batch size, for different AVG approaches.

Appendix A Datasets

Table 3 shows the statistics of the three datasets: AE-110K Xu et al. (2019), OA-Mine Zhang et al. (2022), and MAVE Yang et al. (2022).

Appendix B Hyper-Parameters

All models are implemented using PyTorch and are trained on NVIDIA Tesla A100 GPUs. We use the validation set of each dataset to select the optimal hyper-parameters for all models, and we report our final results on the test set. During training, optimization is performed using the Adam optimizer Diederik (2014). We perform early stopping if there is no improvement in the loss on the validation set for 3 epochs. The maximum input length is fixed at 512. The maximum output length is 256 for the end2end models and 64 for the others. The details of all hyper-parameters for each approach are reported in Table 4.
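The early-stopping rule can be sketched independently of any training framework. The class `EarlyStopping` and the helper `epochs_run` are hypothetical names, and treating only a strict decrease in validation loss as an improvement is an assumption.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:  # assumption: only a strict decrease counts
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return the 1-based epoch at which training stops for a sequence of losses."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)


# Toy loss curve: the best loss is at epoch 3, followed by three epochs without improvement.
stopped_at = epochs_run([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6])
```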

Appendix C Model Comparison

Approach | Training Cost | Inference Cost | Memory | Generated AV
Pipeline | 3.9× | 2.7× | 2× | 2.3×
Multitask | 2.5× | 2.7× | 1× | 2.3×
End2End | 1× | 1× | 1× | 1×
Table 5: Training cost, inference cost, memory requirements, and number of generated attribute-value pairs (Generated AV) of the three proposed AVG approaches, normalized to the End2End approach. The comparison is performed for T5-large. Generated AV are averaged across the three datasets.

In addition to performance, computational cost and usability are crucial factors when selecting an AVG approach. Table 5 reports the training cost, inference cost, memory requirements, and number of generated attribute-value pairs for the pipeline, multitask, and end2end approaches, all normalized to the end2end approach. The end2end approach is the cheapest to train because it fine-tunes a single integrated model. The pipeline approach has the highest training cost (3.9×) because it requires training two separate models, one for each of its sequential stages. The multitask approach falls in between (2.5×): it uses a shared model, which reduces redundancy and lowers the training cost compared to the pipeline approach. For inference, end2end AVG is the fastest because it generates all attribute-value pairs in a single pass. The pipeline and multitask approaches both incur 2.7× the inference cost, since they handle each task independently and a single prediction requires two steps: value extraction and attribute generation. Regarding memory requirements, end2end and multitask AVG each employ a single model, while pipeline AVG uses two separate models, effectively doubling the memory footprint. Although the end2end approach is the most efficient overall, the pipeline and multitask approaches generate more attribute-value pairs on average (2.3× as many). They also offer more flexibility, as the value extraction and attribute generation sub-tasks can be performed independently.

(a) Handbag product from the MAVE dataset.
(b) Coffee product from the OA-Mine dataset.
Figure 3: Examples of cross-domain attribute-value identification. Correct predictions are highlighted in green, and wrong ones are highlighted in red. In the first example, the T5 model trained on OA-Mine incorrectly predicts food-related attributes, showing domain bias, while the in-domain T5 model, trained on the MAVE dataset, correctly identifies all attribute-value pairs. In the second example, the T5 models trained on MAVE and AE-110K (both cross-domain) fail to identify the Flavor attribute.

Appendix D Cross-dataset Evaluation

Train \ Test | AE-110K | OA-Mine | MAVE
AE-110K | 84.29 | 1.58 | 6.40
OA-Mine | 16.42 | 86.28 | 4.31
MAVE | 6.42 | 3.18 | 95.19
Table 6: $F_1$ scores of cross-dataset predictions of the T5-large end2end model. Rows indicate the training dataset and columns the evaluation dataset.

Table 6 presents the $F_1$ scores of the T5-large end2end model evaluated on cross-dataset predictions. We chose the T5-large end2end model because it demonstrated the best in-domain performance, as shown in Table 4. The models exhibit high performance when evaluated on the same dataset used for training, with $F_1$ scores of 84.29, 86.28, and 95.19 on AE-110K, OA-Mine, and MAVE, respectively. This indicates the models' strong ability to fit the training data. However, performance drops sharply when the models are tested on different datasets. For instance, when the model trained on AE-110K is evaluated on OA-Mine and MAVE, the $F_1$ scores drop to 1.58 and 6.40, respectively. Similarly, the models trained on OA-Mine and MAVE show reduced performance on other datasets, with $F_1$ scores as low as 4.31 and 3.18. These results highlight a significant challenge in the models' ability to generalize across datasets. The poor cross-dataset performance can be attributed to the different attribute names and categories/domains present in each dataset, which the models struggle to generate. Figure 3 provides two examples illustrating these challenges. In the first example, a handbag product from the MAVE dataset, the model trained on OA-Mine predicts (Gender, "Woman"), (Flavor, "Wine"), and (Color, "Red"), while the model trained on AE-110K predicts (Brand Name, "COCIFIER") and (Gender, "Women"). This example demonstrates the domain bias of the model trained on OA-Mine, which includes food-related items. When applied to a fashion product, the model erroneously assigns the food-related attribute "Flavor" to the value "Wine".

In the second example, a coffee product from the OA-Mine dataset, the model trained on MAVE predicts (Brand, "Lola Savannah") and (Container Type, "Bag"), while the model trained on AE-110K predicts (Brand Name, "Lola Savannah") and (Type, "Ground"). The differences in attribute names across datasets, such as "Brand" versus "Brand Name", lead to incorrect predictions. Additionally, since the MAVE and AE-110K datasets do not include products from food categories, the models trained on them fail to identify the "Flavor" attribute, which is specific to the OA-Mine dataset.
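To make the effect of mismatched attribute names concrete, the sketch below computes an $F_1$ score over sets of (attribute, value) pairs. It rests on several assumptions: scoring is taken to be exact match on both attribute name and value, which may differ from the evaluation protocol actually used; the function name `pair_f1` is hypothetical; and the gold annotation is invented for illustration (the text above does not list the gold pairs, and the flavor value in particular is made up). Only the predicted pairs are taken from the second example.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (attribute name, value)


def pair_f1(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Exact-match F1 over attribute-value pairs (assumed metric definition)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # Also covers empty prediction or gold sets.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Predictions of the AE-110K-trained model, as reported in the text.
predicted = {("Brand Name", "Lola Savannah"), ("Type", "Ground")}

# HYPOTHETICAL gold annotation, for illustration only.
gold = {("Brand", "Lola Savannah"), ("Flavor", "Vanilla")}

strict_score = pair_f1(predicted, gold)

# The brand value itself was found; only the attribute name differs.
matched_values = {v for _, v in predicted} & {v for _, v in gold}
```

Under these assumptions the strict score is zero even though the brand value was extracted correctly, which is consistent with the very low off-diagonal scores in Table 6.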