An Empirical Comparison of Generative Approaches
for Product Attribute-Value Identification

Kassem Sabeh, Free University of Bozen-Bolzano, Italy, {ksabeh, jgamper}@unibz.it
Robert Litschko, LMU Munich, Germany, {robert.litschko, b.plank}@lmu.de
Mouna Kacimi, Wonder Technology Srl, Italy, [email protected]
Barbara Plank, LMU Munich, Germany, {robert.litschko, b.plank}@lmu.de
Johann Gamper, Free University of Bozen-Bolzano, Italy, {ksabeh, jgamper}@unibz.it

Abstract

Product attributes are crucial for e-commerce platforms, supporting applications like search, recommendation, and question answering. The task of Product Attribute and Value Identification (PAVI) involves identifying both attributes and their values from product information. In this paper, we formulate PAVI as a generation task and provide, to the best of our knowledge, the most comprehensive evaluation of PAVI so far. We compare three different attribute-value generation (AVG) strategies based on fine-tuning encoder-decoder models on three datasets. Experiments show that the end-to-end AVG approach, which is computationally efficient, outperforms the other strategies. However, there are differences depending on model size and the underlying language model. The code to reproduce all experiments is available at: https://github.com/kassemsabeh/pavi-avg

1 Introduction

Product attributes are a crucial component of e-commerce platforms, facilitating applications such as product search Chen et al. (2023), product recommendation Truong et al. (2022), and product-related question answering Deng et al. (2023). They provide useful details about product features, enabling customers to compare products and make informed purchasing decisions. Product attribute and value identification (PAVI) refers to the task of identifying both the attributes and their corresponding values from an input context, such as a product title or description. For example, given the product title "Fossil Men’s Watch Analog Display Slim Case Design with Brown Leather Band" (see Figure 1), a model should identify the attributes Brand, Band Color, and Band Material, with the corresponding values Fossil, Brown, and Leather.

Most existing work focuses on product attribute-value extraction (PAVE) Zheng et al. (2018); Xu et al. (2019); Wang et al. (2020); Yang et al. (2022), which extracts the value of a given attribute from the input context. Despite extensive research on PAVE Blume et al. (2023); Yang et al. (2023); Brinkmann et al. (2023), PAVI is a more realistic and complex task since it requires the attribute to be generated rather than assumed to be part of the input. While recent studies have explored generative models for PAVI, these efforts are limited in scope and often lack comprehensive evaluation across different datasets and settings Roy et al. (2024); Shinzato et al. (2023). Moreover, existing work focuses primarily on end-to-end models without exploring alternative generative strategies. Consequently, it remains unclear which types of PAVI models are effective in practice, as comprehensive experiments and comparisons are lacking.

Figure 1: An example of a product title with tagged attribute-value pairs.

In this paper, we address these gaps by proposing three generative approaches for PAVI and conducting a comprehensive evaluation across multiple datasets. Inspired by recent advances in question and answer generation methods Bartolo et al. (2021), we compare three strategies based on fine-tuning encoder-decoder language models such as T5 Raffel et al. (2020) and BART Lewis et al. (2020). Our proposed approaches are: (1) pipeline attribute-value generation (AVG), which decomposes the task into value extraction and attribute generation, and builds a separate model for each sub-task; (2) multitask AVG, which uses a single shared model that is trained on both sub-tasks; (3) end2end AVG, which uses a single model to generate the attribute-value pairs directly. We evaluate the performance of these approaches on three real-world product datasets: AE-110K, OA-Mine, and MAVE. All the models and datasets are publicly released on HuggingFace (https://huggingface.co/av-generation) and available as a demo (https://bit.ly/4bWFjNV).

2 Related Work

	Pipe.	Multi.	E2E	AE-110k	OA-MINE	MAVE	Open
Shinzato et al. (2023)	✗	✗	✓	✗	✗	✓	✗
Roy et al. (2024)	✗	✗	✓	✓	✗	✗	✗
Ours	✓	✓	✓	✓	✓	✓	✓

Table 1: Comparison between our work and prior studies for generative-based PAVI.

Most existing approaches for attribute-value extraction use sequence tagging Huang et al. (2015); Xu et al. (2019); Yan et al. (2021); Zheng et al. (2018) or question answering Wang et al. (2020); Yang et al. (2022); Ding et al. (2022); Hu et al. (2022); Sabeh et al. (2022); Yang et al. (2023) methods. However, such approaches carry a closed-world assumption, as they require the set of attributes as input to extract the corresponding values. More recently, researchers have explored the capabilities of generative models to tackle the PAVI task in an open-world setting. Roy et al. (2024) propose a generative framework for joint attribute and value extraction. They conduct experiments on the AE-110k dataset and show that generative approaches surpass question-answering-based methods. Shinzato et al. (2023) fine-tune a pre-trained T5 generative model Raffel et al. (2020) to decode a set of target attribute-value pairs from the input product text of the MAVE dataset Yang et al. (2022). They show that the generative approach outperforms extraction and classification-based methods Chen et al. (2022).

However, all of the above studies use an end-to-end generative approach. They do not explore other generative strategies for attribute-value identification (i.e., pipeline and multi-task). In addition, their results are not directly comparable, as they differ in datasets, settings, and evaluation metrics. Finally, none of the models proposed in these studies have been made publicly available. In this work, we propose three generative approaches for PAVI and empirically compare them on three real-world datasets. We summarize how our approach differs from prior work in Table 1. As the table shows, we evaluate all three approaches on all three datasets.

3 Proposed Methods

Given input product data (a title or description) $x=\{x_{1},x_{2},\ldots,x_{|x|}\}$ , attribute-value generation aims to generate the attribute-value pairs $\mathcal{Q}_{x}$ related to the information in $x$ :

	$\displaystyle\mathcal{Q}_{x}=\{(a^{1},v^{1}),(a^{2},v^{2}),(a^{3},v^{3}),\ldots\}$		(1)

For instance, if $x$ = ["Fossil", …, "Band"], then $\mathcal{Q}_{x}$ = {("Brand", "Fossil"), ("Band Color", "Brown"), ("Band Material", "Leather")}.

We formulate the attribute-value identification problem as an attribute-value generation (AVG) task and propose three approaches based on fine-tuning language models, as depicted in Figure 2.

3.1 Pipeline AVG

The AVG task can be decomposed into two simpler sub-tasks: value extraction (VE) and attribute generation (AG). The VE model $P_{ve}$ first generates the value candidate $\tilde{v}$ from $x$ . Then, the AG model $P_{ag}$ generates an attribute $\tilde{a}$ whose value is $\tilde{v}$ in the input $x$ . The VE and AG models can be trained independently on a product dataset consisting of triplets $(x,a,v)$ by maximizing the conditional log-likelihood of the target given the input. At inference, the predictions are obtained as:

	$\displaystyle\tilde{v}=\operatorname*{arg\,max}_{v}P_{ve}(v\mid x)$		(2)
	$\displaystyle\tilde{a}=\operatorname*{arg\,max}_{a}P_{ag}(a\mid x,v)$		(3)

In practice, the VE model input is $[x_{1},x_{2},\ldots,x_{|x|}]$ , where $x_{i}$ is the i-th token of the product input $x$ and ${|\cdot|}$ represents the number of tokens in the sequence. The input to the AG model additionally marks the value inside the product text. Specifically, following previous work Chan and Fan (2019); Ushio et al. (2023), we surround the value with a highlight token <hl>:

	$\displaystyle[x_{1},\ldots,\texttt{<hl>},v_{1},\ldots,v_{|v|},\texttt{<hl>},\ldots,x_{|x|}]$

where $v_{i}$ is the i-th token of $v$ . At inference, we simply replace the gold value $v$ of the AG model by the prediction from the VE model, and run the inference over the product context $x$ . For example, if the VE model extracts "Leather" from the input $x$ , we highlight "Leather" and feed it to the AG model as: ["Fossil",…,<hl>,"Leather",<hl>,…,"Band"]. Thus, the pipeline approach generates at most one attribute-value pair per product context $x$ .
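The highlighting step can be illustrated with a minimal sketch. It assumes whitespace tokenization and that the value occurs verbatim in the input; the function name `highlight_value` and the fallback behaviour when the value is not found are illustrative choices, not part of the described method.

```python
from typing import List

HL = "<hl>"  # highlight token named in the text


def highlight_value(tokens: List[str], value: List[str]) -> List[str]:
    """Wrap the first verbatim occurrence of `value` in `tokens` with <hl> markers."""
    n = len(value)
    if n == 0:
        return list(tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == value:
            return tokens[:start] + [HL] + value + [HL] + tokens[start + n:]
    # Assumption: if the value is not found verbatim, leave the input unchanged.
    return list(tokens)


title = "Fossil Men’s Watch Analog Display Slim Case Design with Brown Leather Band".split()
ag_input = highlight_value(title, ["Leather"])
print(" ".join(ag_input))
# prints: Fossil Men’s Watch Analog Display Slim Case Design with Brown <hl> Leather <hl> Band
```

In a real setup, <hl> would presumably be registered as a special token in the model's tokenizer so that it is not split into sub-words.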

To allow the pipeline approach to generate multiple attribute-value pairs, we can convert the values into a flattened sentence $y$ , and fine-tune a sequence-to-sequence model to generate $y$ from $x$ . Formally, we define a function $\mathcal{L}$ that maps $\mathcal{Q}_{x}$ to a sentence as:

	$\displaystyle\mathcal{L}(\mathcal{Q}_{x})="v^{1}\,|\,v^{2}\,|\,v^{3}\ldots".$		(4)

In this case, the VE model generates a set of possible values, and for each value we run the AG model to obtain a set of attribute-value pairs.
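The multi-value pipeline can be sketched as follows. The two models are passed in as plain callables, and the toy stand-ins at the bottom are hypothetical placeholders for the fine-tuned VE and AG models; the de-duplication of repeated values is also an assumption made for illustration.

```python
from typing import Callable, List, Tuple


def flatten_values(values: List[str], sep: str = " | ") -> str:
    """Training target for the VE model: values joined by the separator."""
    return sep.join(values)


def split_values(y: str, sep: str = "|") -> List[str]:
    """Split a generated sentence back into distinct, non-empty values."""
    seen, out = set(), []
    for part in y.split(sep):
        v = part.strip()
        if v and v not in seen:
            seen.add(v)
            out.append(v)
    return out


def pipeline_avg(
    x: str,
    ve_model: Callable[[str], str],
    ag_model: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Run VE once, then AG once per extracted value."""
    return [(ag_model(x, v), v) for v in split_values(ve_model(x))]


# Toy stand-ins for the two fine-tuned models (hypothetical, for illustration only).
toy_ve = lambda x: "Fossil | Brown | Leather"
toy_ag = lambda x, v: {"Fossil": "Brand", "Brown": "Band Color", "Leather": "Band Material"}[v]

title = "Fossil Men's Watch Analog Display Slim Case Design with Brown Leather Band"
print(pipeline_avg(title, toy_ve, toy_ag))
```

Note that the AG model is called once per extracted value, so inference cost grows with the number of values the VE model produces.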

3.2 Multitask AVG

Instead of training a separate generative model for each sub-task, we can use a single shared model that is fine-tuned in a multi-task learning setting. Namely, we mix the training instances for the VE and AG tasks together, and randomly sample a batch at each iteration of seq2seq fine-tuning. We distinguish the tasks by adding a prefix to the beginning of the input text: "extract value" for the VE task, and "generate attribute" for the AG task.
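A minimal sketch of how the mixed training instances might be built is given below. The two prefixes come from the text; the colon after the prefix, the `source`/`target` field names, and sampling without replacement within a batch are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

VE_PREFIX = "extract value"        # prefix for the value extraction task
AG_PREFIX = "generate attribute"   # prefix for the attribute generation task


def build_multitask_instances(x: str, pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Create one VE instance and one AG instance per (attribute, value) pair."""
    instances = [{
        "source": f"{VE_PREFIX}: {x}",
        "target": " | ".join(v for _, v in pairs),
    }]
    for a, v in pairs:
        # Highlight the first occurrence of the value in the raw text.
        highlighted = x.replace(v, f"<hl> {v} <hl>", 1)
        instances.append({"source": f"{AG_PREFIX}: {highlighted}", "target": a})
    return instances


def sample_batch(instances: List[Dict[str, str]], batch_size: int, rng: random.Random):
    """Draw a random batch from the mixed pool of VE and AG instances."""
    return rng.sample(instances, k=min(batch_size, len(instances)))


pool = build_multitask_instances(
    "Fossil Watch Brown Leather Band",
    [("Brand", "Fossil"), ("Band Color", "Brown"), ("Band Material", "Leather")],
)
for inst in sample_batch(pool, batch_size=2, rng=random.Random(0)):
    print(inst["source"], "->", inst["target"])
```

Because both tasks share one set of weights, only the prefix tells the model which output format is expected.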

3.3 End2End AVG

Instead of breaking the AVG task into two sub-tasks, we can model it directly by transforming the target attribute-value pairs into a flattened sentence $z$ , and fine-tuning a seq2seq model to generate $z$ from $x$ . We define a function $\mathcal{T}$ that maps the target $\mathcal{Q}_{x}$ to a sentence as:

	$\displaystyle\mathcal{T}(\mathcal{Q}_{x})="\{t(a^{1},v^{1})\,|\,t(a^{2},v^{2})\,|\,\ldots\}".$		(5)
	$\displaystyle t(a,v)="\texttt{attribute}:\{a\},\texttt{value}:\{v\}"$		(6)

We use the template $t$ to textualize the attribute-value pairs and separate them using the separator |. The end2end AVG model $P_{avg}$ is optimized by maximizing the conditional log-likelihood:

	$\displaystyle\tilde{z}=\operatorname*{arg\,max}_{z}P_{avg}(z\mid x)$		(7)
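The linearization in Equations (5) and (6), together with the inverse step needed to recover pairs from a generated sentence, can be sketched as follows. The exact whitespace in the template and the parsing routine are assumptions; the sketch also assumes that attributes and values do not themselves contain the separator.

```python
import re
from typing import List, Tuple

SEP = " | "
# Matches one textualized pair of the form "attribute: {a}, value: {v}".
PAIR_RE = re.compile(r"attribute:\s*(.+?),\s*value:\s*(.+)")


def textualize(a: str, v: str) -> str:
    """Template t(a, v) from Equation (6)."""
    return f"attribute: {a}, value: {v}"


def linearize(pairs: List[Tuple[str, str]]) -> str:
    """Function T from Equation (5): join textualized pairs with the separator."""
    return SEP.join(textualize(a, v) for a, v in pairs)


def parse(z: str) -> List[Tuple[str, str]]:
    """Recover (attribute, value) pairs from a generated sentence, skipping malformed chunks."""
    pairs = []
    for chunk in z.split("|"):
        m = PAIR_RE.fullmatch(chunk.strip())
        if m:
            pairs.append((m.group(1).strip(), m.group(2).strip()))
    return pairs


gold = [("Brand", "Fossil"), ("Band Color", "Brown"), ("Band Material", "Leather")]
z = linearize(gold)
print(z)
print(parse(z) == gold)
```

Skipping malformed chunks rather than failing is one possible design choice for handling imperfect generations; a stricter evaluation could count them as errors instead.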

4 Experimental Settings

Datasets. We use three real-world datasets.

• AE-110K Xu et al. (2019): This dataset contains tuples of product titles, attributes, and values from the AliExpress Sports & Entertainment category. Instances with NULL values are removed, resulting in 39,505 products with 2,045 unique attributes and 10,977 unique values.

• MAVE Yang et al. (2022): This is a large and diverse dataset compiled from the Amazon Review Dataset Ni et al. (2019). We remove negative examples from the MAVE dataset, i.e., those where the attributes have no values. The final dataset contains around 2.9M attribute-value annotations from 2.2M cleaned Amazon products.

• OA-Mine Zhang et al. (2022): We use the human-annotated dataset, which contains 1,943 products from 10 product categories. No further processing is applied to this dataset.

We randomly split each dataset into training, validation, and test sets with a ratio of 8:1:1. The splits are stratified by product category. Appendix A shows statistics of the three datasets.
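One way to realize a category-stratified 8:1:1 split is sketched below. The record layout (a dict with a `category` field), the fixed seed, and the rounding rule (remainders go to the test set) are assumptions, since these details are not specified here.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple


def stratified_split(
    items: List[Dict],
    key: Callable[[Dict], Hashable],
    ratios: Tuple[int, int, int] = (8, 1, 1),
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split items into train/val/test so that each category follows the given ratio."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item in items:
        by_category[key(item)].append(item)

    total = sum(ratios)
    train, val, test = [], [], []
    for category in sorted(by_category):
        group = by_category[category]
        rng.shuffle(group)
        n_train = len(group) * ratios[0] // total
        n_val = len(group) * ratios[1] // total
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # any remainder ends up in the test set
    return train, val, test


# Toy data: 100 products in two equally sized categories.
products = [{"id": i, "category": "Watches" if i < 50 else "Shoes"} for i in range(100)]
train, val, test = stratified_split(products, key=lambda p: p["category"])
print(len(train), len(val), len(test))  # prints: 80 10 10
```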

Model	Approach	AE-110k			OA-Mine			MAVE
		$P$	$R$	$F_{1}$	$P$	$R$	$F_{1}$	$P$	$R$	$F_{1}$
T5 Small	Pipeline	94.61	70.62	80.88	69.85	76.10	72.84	91.51	89.60	90.55
	Multitask	94.94	73.00	82.53	73.70	79.46	76.48	94.88	92.87	93.86
	End2End	94.07	70.45	80.56	65.12	49.57	56.29	90.22	90.29	90.25
	Ensemble	93.25	79.74	85.97	72.38	86.24	78.71	91.49	95.82	93.60
T5 Base	Pipeline	94.93	73.74	83.01	78.82	87.46	82.92	92.10	91.52	91.80
	Multitask	95.50	74.55	83.74	79.83	89.22	84.26	96.19	94.10	95.14
	End2End	95.61	74.44	83.71	79.63	82.36	80.98	90.31	91.01	90.65
	Ensemble	93.82	91.27	87.10	79.11	94.58	86.15	91.72	96.76	94.18
T5 Large	Pipeline	94.15	73.83	82.76	78.76	88.70	83.43	92.34	91.32	91.82
	Multitask	94.89	69.73	80.38	81.43	90.30	85.63	96.21	92.51	94.32
	End2End	95.21	75.62	84.29	82.69	90.20	86.28	96.39	94.01	95.19
	Ensemble	92.75	81.57	86.80	80.63	95.79	87.56	91.95	96.89	94.36
BART Base	Pipeline	95.00	70.73	81.09	76.25	85.05	80.41	91.20	89.87	90.53
	Multitask	95.07	71.66	81.72	78.78	87.27	82.81	89.92	90.74	90.33
	End2End	83.33	51.86	63.93	50.85	39.04	44.17	79.46	87.40	83.24
	Ensemble	92.71	78.82	85.21	77.30	92.16	84.08	90.53	96.20	93.27
BART Large	Pipeline	94.81	68.40	79.47	78.18	86.84	82.29	92.13	90.21	91.16
	Multitask	94.42	72.52	82.04	78.62	87.96	83.03	90.47	91.41	90.94
	End2End	63.02	46.66	53.62	48.83	37.24	42.26	77.29	86.45	81.61
	Ensemble	92.47	79.10	85.26	77.86	93.90	85.14	91.34	96.47	93.85

Counts	AE-110K	OA-Mine	MAVE
# products	39,505	1,943	2,226,509
# attribute-value pairs	88,915	11,008	2,987,151
# unique categories	10	10	1,257
# unique attributes	2,045	51	705
# unique values	10,977	5,201	79,199

Model	Approach	Epochs	LR	Batch Size
T5 Small	Pipeline (VE)	9	$5\times10^{-5}$	128
	Pipeline (AG)	11	$5\times10^{-5}$	128
	Multitask	16	$5\times10^{-4}$	256
	End2End	18	$5\times10^{-4}$	256
T5 Base	Pipeline (VE)	8	$5\times10^{-4}$	64
	Pipeline (AG)	7	$5\times10^{-4}$	64
	Multitask	8	$5\times10^{-4}$	128
	End2End	11	$5\times10^{-4}$	64
T5 Large	Pipeline (VE)	6	$5\times10^{-5}$	128
	Pipeline (AG)	5	$5\times10^{-4}$	64
	Multitask	5	$1\times10^{-4}$	64
	End2End	8	$1\times10^{-4}$	64
BART Base	Pipeline (VE)	5	$5\times10^{-5}$	64
	Pipeline (AG)	4	$1\times10^{-4}$	128
	Multitask	4	$1\times10^{-4}$	64
	End2End	6	$5\times10^{-4}$	128
BART Large	Pipeline (VE)	6	$5\times10^{-5}$	64
	Pipeline (AG)	4	$5\times10^{-5}$	128
	Multitask	3	$1\times10^{-5}$	64
	End2End	7	$1\times10^{-5}$	64

Approach	Training Cost	Inference Cost	Memory	Generated AV
Pipeline	3.9 $\times$	2.7 $\times$	2 $\times$	2.3 $\times$
Multitask	2.5 $\times$	2.7 $\times$	1 $\times$	2.3 $\times$
End2End	1 $\times$	1 $\times$	1 $\times$	1 $\times$

	AE-110K	OA-Mine	MAVE
AE-110k	84.29	1.58	6.40
OA-Mine	16.42	86.28	4.31
MAVE	6.42	3.18	95.19
