Abstract

How deep is your art: an experimental study on the limits of artistic understanding in a single-task, single-modality neural network

Mahan Agha Zahedi^1¶, Niloofar Gholamrezaei^2¶, Alex Doboli^1¶,

1 Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, New York, United States of America

2 work done when at School of Art, Texas Tech University, Lubbock, Texas, United States of America (now at the Department of Humanities, Regis College, Weston, Massachusetts, United States of America )

¶These authors contributed equally to this work.

[email protected] (MAZ)

Abstract

Computational modeling of artwork meaning is complex and difficult. This is because art interpretation is multidimensional and highly subjective. This paper experimentally investigated the degree to which a state-of-the-art Deep Convolutional Neural Network (DCNN), a popular Machine Learning approach, can correctly distinguish modern conceptual art work into the galleries devised by art curators. Two hypotheses were proposed to state that the DCNN model uses Exhibited Properties for classification, like shape and color, but not Non-Exhibited Properties, such as historical context and artist intention. The two hypotheses were experimentally validated using a methodology designed for this purpose. VGG-11 DCNN pre-trained on ImageNet dataset and discriminatively fine-tuned was trained on handcrafted datasets designed from real-world conceptual photography galleries. Experimental results supported the two hypotheses showing that the DCNN model ignores Non-Exhibited Properties and uses only Exhibited Properties for artwork classification. This work points to current DCNN limitations, which should be addressed by future DNN models.

Introduction

While the study of art has traditionally been the focus of art history, aesthetics, philosophy, psychology and other related areas, advances in Artificial Intelligence (AI) and Machine Learning (ML) have enabled new avenues of inquiry, like devising novel computational models, such as Deep Neural Networks (DNNs), to automatically classify, recognize, and generate artwork [1]. It has been reported that DNNs can identify art genres, artists, and the time range of an art object’s creation [2, 3, 1, 4, 5, 6]. Applications of these DNN models include hel** art curators and historians understand, explore, and navigate through the numerous artworks in museums, galleries, and online sources. Investigating AI/ML models also offers insight on how low-level visual features can lead towards the discovery of high-level semantic knowledge, like image content and object significance, and thus possibly lead to unsupervised knowledge discovery, including tacit knowledge, abstractions, and conceptual reasoning.

Any attempt to mechanically analyze artwork should reflect the nature of art and how it differs from other types of images. This work employed Jerrold Levinson’s philosophy of art as it offers a concrete definition of art, inclusive of both traditional and conceptual works of art. Levinson considers a work of art to incorporate two major properties, Exhibited Properties (EXPs) and Non-Exhibited Properties (NEXPs) [7]. EXPs are the visible elements of art objects, such as color, texture, and form. NEXPs are the ones that are essential artistic aspects of artwork, though they are not simply visible in art objects. NEXPs are accessible by relating art objects and EXPs to human history, culture, and individuals who created the work [8]. Fig 1 summarizes the two kinds of properties.

Refer to caption — Fig 1: Art analysis according to Levinson’s definition of the elements of arts. (a) Art consists of Exhibited Properties (EXPs), illustrated in a darker color, and None-Exhibited Properties (NEXP), depicted in a lighter color. (b) NEXPs are accessed by relating EXPs to the historical discourse of art. (c) The difficulty in art interpretation is shown as a spectrum with an example for each end, top: “Fountain” by Marcel Duchamp, and bottom: “The Accident” by William Geets.

EXPs sometimes directly point to NEXPs but other times they do not. Rather, understanding NEXPs may require complex contextualization and interpretation. Moreover, some artwork contains more EXPs than NEXPs, while some work, particularly artwork identified as conceptual art, is highly loaded with NEXPs. For example, the painting “The Accident” by William Geets (1899) (Fig 1(c)-bottom) is a narrative figurative work that can be understood to a great extent just by looking at the picture, as it contains more EXPs than NEXPs. In contrast, the famous work “Fountain” by Marcel Duchamp (1917) is meaningful mainly based on its NEXPs (Fig 1(c)-top). Based on Levinson’s theory, what makes Duchamp’s urinal art, and hence different from other mass-produced urinals, is not its shape, color, or style but the intention of the artist toward the object in relation to the historical discourse of art [9].

DNN-based computational methods used for automated art-related activities rely on EXP processing. While EXPs might be sufficient to tackle some art genres, like iconoclasm and medieval European religious art [1], it is unclear if EXPs are sufficient to identify NEXPs in modern artwork, e.g., intention and historical conditions. A recent model of visual aesthetic experience suggests two parallel, quasi-independent processing modes: bottom-up, perceptual processing universal among all humans (similar to EXP processing) and top-down, cognitive processing that accounts for contextual information, artist intention, and artwork presentation circumstances (similar to NEXP processing) . As summarized in Section Related Work, previous work suggests that DNN models can gain some insight on artwork meaning (semantics) starting only from EXPs, like color, texture, and shapes [2, 3, 4, 11, 12]. However, there are no comprehensive studies on the degree to which NEXP recognition can emerge during DNN training using artwork images, and whether such NEXPs are sufficient to distinguish art objects from non-art objects or other artwork, especially in case of conceptual arts. Such studies are important not only to identify and characterize the limitations of DNN models but also to understand if NEXPs of art objects can be sufficiently well distinguished using only their EXPs, thus if an art object is fully specified within its body of similar work, e.g., gallery or art show, or if NEXPs depend to a significant degree on elements not embodied into an art object, like contextual elements, the artist’s intention, and the viewer’s interpretation.

This paper presents a comprehensive experimental study that tackled understanding the degree to which a state-of-the-art Deep Convolutional Neural Network (DCNN) learns EXPs and NEXPs and then uses the learned knowledge to classify such artworks in the same galleries and exhibitions as artists and curators. As discussed in Section Related Work, existing work does not consider the boundary between the AI/ML’s perception of a computer image and an image’s interpretation as art. Thus, computational modeling of art is often not grounded in the theory of art. For example, AI/ML models are trained on art images labeled with their styles, genres, or authors but without information about their contexts, intentions, interpretations, or emotions. It is unknown the degree to which AI/ML models, like DCNNs, can automatically pick up these essential details. To address this limitation, this work devised an experimental study integrating semantic and conceptual ideas in aesthetics with AI/ML modeling and experimentation. Given that EXPs of art objects are the basis of DCNN model training while NEXPs are likely to be learned, the following two hypotheses were defined to experimentally study the importance of EXPs and NEXPs in the automated classification of artwork into different galleries:

Hypothesis I: Deep Convolutional Neural Network (DCNN) models do not capture NEXPs well for art gallery classification. Therefore, their classification results are not influenced by NEXPs.

Hypothesis II: The similarities and differences of EXPs within and between art galleries determine the difficulty level of classification using DCNN models.

The experimental method devised to study the two hypotheses includes three experiments that used datasets assembled by an art expert. The used DCNN model was the VGG- 11 [13] pre-trained on ImageNet database [14]. The model was then retrained using art images. The datasets designed for the study included images of contemporary photography of diverse style and conceptual orientation. Our art expert chose exhibitions of artists from different countries and photographs that reflect different approaches toward fine art photography, like realism, abstraction, commercial, and conceptual photography. As already discussed, conceptual art, like Duchamp’s “Fountain”, often represents ordinary ready-made or mass-produced objects (or their photographs) as a work of art, which exclusively relies on the ideas intended towards the artwork rather than artistic style. Hence, the two above hypotheses suggested that including conceptual photography increases the difficulty of classifying a dataset into galleries. A high EXP diversity within a gallery also increases the difficulty level. To further explore the degree to which the model learns beyond EXPs to classify artwork based on their NEXPs, e.g., concept, ideas, and historical context, the study added a gallery of none-art images of ordinary objects that resembled in their appearance the conceptual fine art photography exhibitions included in the experiment. The two hypotheses indicate that the model should show poor performance for the dataset with conceptual photography and non-art images because of the high EXP similarity of their galleries. Experimental results were analyzed using statistical and classification metrics. The results of the three experiments confirmed the validity of the two hypotheses.

The paper has the following structure. Related work was summarized next, followed by the presentation of the experimental methodology. Results and their discussion were described next. The paper ends with conclusions and further research directions.

Related Work

Recent work has proposed using modern AI/ML methods for automated analysis of artworks, such as style recognition, classification, and generation [2, 3, 1, 4]. A comprehensive overview paper discusses recent computational and experimental approaches to visual aesthetics, including modern AI/ML methods [15]. The AI/ML methods often use Deep Convolutional Neural Networks (DCNNs), a DNN type devised for computer image processing [2, 3, 4, 11]. To address the need of large datasets for DCNN training [13], which is often difficult to meet for art objects, the traditional solution is to pre-train a DCNN using large databases of images, e.g., ImageNet, and then retrain only the output and intermediate layers using art images [2, 3, 4, 11, 16, 12]. Fig 2 summarizes the reported using of artwork EXPs and NEXPs for automated art understanding activities.

Automated style recognition attempts to identify the artistic style of art objects, like paintings and porcelain objects [2, 3, 4, 11, 12]. This work uses EXPs, like color and texture. For example, [17] examines the classification of artistic styles into their respective historical periods based on the ideas of Heinrich Wölfflin, a prominent art historian (1846-1945). Wölfflin explains that different artistic styles reveal their respective historical contexts. Therefore, a machine could classify artwork into historical periods by relying on the stylistic characteristics of artwork [17]. Different edge orientations are characteristic to traditional artwork from different cultures [18]. Statistical differences in image composition are presented between traditional art, bad art, and twentieth century abstract art [19]. Seven DCNN models were tested for three art datasets to classify genres (i.e. landscapes, portraits, abstract paintings, etc.), styles (e.g., Renaissance, Baroque, Impressionism, Cubism, Symbolism, etc.), and artists [2]. Classification uses mostly color information to achieve for some styles a recognition accuracy similar to human experts. However, the authors state that some EXPs, like gestural brushstrokes, can be misleading. Also, certain styles are hard to be automatically differentiated from each other, like Post Impressionism and Impressionism, Abstract Expressionism and Art Informel, or Mannerism and Baroque [3]. A dual-path DCNN model recognizes both artistic style and painting content [20]. DCNN are proposed to recognize non-traditional art styles too, like Outsider Art style [12]. Adding more features to DCNN training does not improve classification accuracy, which is due, in the author’s analysis, to the curse of dimensionality.

Work on uncovering semantic information about art stems from the goal to understand the content of art objects, including the orientation of an object, the objects in a scene, and the central figures of a scene [13, 21, 22, 23]. Only EXPs are used in this work. Object orientation, e.g., if a painting is correctly displayed, uses low-level features, like simple, local cues, image statistics, and explicit rules [6, 21, 24, 25]. For example, using low-level features to train DNNs has been reported to be as effective as human interpretation across different granularities and styles [21]. The method performs better for portrait paintings than for abstract art, as portraits arguably include more reliable and repetitive cues, which improves DCNN learning. Distinguishing image classes seems to focus on localized parts of a few, large objects. Low intra-class variability of the parts is important in a part being selected. Different semantic parts might be selected for objects of related classes, like wheels for cars and windows for buses. Generative Adversarial Networks (GANs) are proposed for hierarchical scene understanding [26]. Analysis shows that the early layers learn physical positions, like the spatial layout and configuration, the intermediate layers detect categorical objects, and the latter layers focus on scene attributes and color schemes.

DCNN have been also used to recognize an artist that authored an artwork from a group of possible artists by learning the artist-specific, visual features (hence EXPs) of his/her work [4]. During DCNN training, various regions of an art image are occluded, so that the sensitivity of that region for correct classification can be established [4]. Experiments suggest that artist recognition uses low-level features, like material textures, color, edges, and the empty areas used to create visual patterns [4]. Other work advocates for using intermediate level features, like localized regions, and semantic features, e.g., scene content and composition [5]. Performance decreases if the pool of artists from which selection is made increases.

In summary, current work on using DCNN models for the automated study of art utilizes EXPs, including features (e.g., color, space, texture, form, shape), principles of art (like movement, unity, harmony, variety, balance, contrast, proportion and patterns), and subject topics (i.e. composition, pose, brushstrokes and historical context). DCNNs are suggested to have the capability of unsupervised learning to relate low-level EXPs to higher-level semantics, like objects and object parts, importance of visual cues, hierarchical compositions, and extraction of hidden structures [1, 5, 21, 22, 26]. Therefore, DCNNs seem to learn at least some facets of the NEXPs - EXPs relationships, but it is unclear the degree to which this learning happens for situations in which NEXPs are the principal features in deciding the results, like modern artwork inclusion in a gallery. Arguably, the broader theoretical problem is if reductionist approaches can explain NEXPs only based on EXPs of art objects in similar and dissimilar time periods, styles, genres, and galleries, or whether NEXPs cannot be fully understood without considering subjective factors, like artist intention, viewer interpretation, and social context.

Methodology

Three experiments were conducted to assess the validity of the two hypotheses presented in the Introduction section. Fig 3(c) summarizes the experiments. The experimental work included devising the required datasets to check the hypotheses for a comprehensive set of cases. Then, the designed datasets were verified with respect to their intended purpose for the study. A trained and fine-tuned DCNN was used to classify the images of the datasets into galleries. The effectiveness of DCNN model was evaluated using statistical measures and classification metrics. The gaps (labeled as $\Delta$ in Fig 3(a)) between artwork classification using DCNN model and artwork understanding by human experts were also studied. The components of the experimental study are discussed next.

Dataset Design

The validity of the two hypotheses stated in the Introduction section was verified by studying how different mixtures of EXPs and NEXPs of artwork determined DCNN classification performance. Two EXP- and NEXP-related factors were expected to set the level of classification difficulty of an art dataset: (1) the similarity of EXPs between different galleries (of the same dataset) while their NEXPs are different, and (2) the diversity of EXPs within a single gallery. Fig 3(a) reflects how the similarities and dissimilarities of EXPs and NEXPs decided the expected difficulty level in art classification. Fig. 3(b) summarizes the related parameters. A dataset was considered to be more difficult to classify, if its images from multiple galleries (with different themes, and hence likely with distinct NEXPs) were more similar with respect to their EXPs, and/or if its images in a gallery had a high diversity of EXPs. For instance, two galleries that contain black and white photography look more similar with respect to their EXPs (e.g., their colors). If the two hypotheses are correct, then this similarity makes it harder for the DCNN model to classify the images into their correct gallery, even if these galleries have different NEXPs, i.e. their different historical context and meaning. Moreover, a high diversity of EXPs within a single gallery increases the classification difficulty because it is harder for the DCNN model to find common EXPs to identify the images that belong to the same gallery, even if the gallery’s artworks have similar NEXPs. This observation is especially evident for group exhibitions, since they were curated around similar themes and meanings but they contain artworks that look different from each other as distinct artists created them. Hence, separate datasets studied the using of EXPs and NEXPs in solo and group galleries for classification.

Fig 3(b) and Fig 4 summarize the datasets utilized in training and testing the DCNN classification model. The datasets comprehensively covered artwork EXPs and NEXPs to examine the degree to which the model learns and uses NEXPs to classify images into their correct galleries. Second, insights into the limits of DCNN classifiaction as compared to human art experts were also explored. The assembled datasets were input to DCNN model with random and handpicked test/train splits to create difficult, average, and easy classification situations. These splits targeted different mixtures of EXPs and NEXPs sets. Four subsets, called difficult, average, easy, and random were created for each datasets. Third, to further describe the model’s ability to use NEXPs, a dataset was assembled to test the capacity of the model to separate non-art images (with no NEXPs) from artwork (with NEXPs). Fourth, to observe the degree to which the dataset size influences classification results, the devised datasets included galleries of different sizes, e.g., number of gallery images.

The considered art galleries were chosen from existing online exhibitions curated by established art curators, and not by the art expert that participated to our study. To limit the scope of the datasets, we picked artwork that uses photography as an underlying medium, so that the galleries reflect diverse approaches towards fine arts photography in contemporary art. The artwork pursues different artistic attitudes, like highly conceptual, representational, and abstract. Certain art objects mix photography with paint, some used digital, and others are analog photography. We also included group exhibitions, as group exhibitions are curated around a common theme or concept but are vastly different in their styles and formal features. Table 1 summarizes the characteristics of the galleries in Dataset S1 (see the Appendix for the other datasets). Table 2 details the EXPs and Table 3 the NEXPs of the galleries in this dataset. Table 4 shows the outliers of the galleries, such as the images that are different from the rest.

Table 1: Characteristics of Dataset S1:

the related galleries (e.g., names of the exhibitions or shows), the number of images per gallery, and the total number of images per dataset. Dataset Related galleries Number of Images Total S1 Heat + High Fashion
Mukono
My Mother’s Clothes
Scene
Trigger 6
16
22
13
7 64

Table 2: EXPs of the galleries in Dataset S1

Gallery	EXPs
	Medium, Color	Shape, Form, Texture	Composition ${}^{1}$	Subject Matter ${}^{2}$
Heat + High Fashion	- Black and white photography - Monochromatic - Dark value - Light value - Midtones - High contrast	- Figurative - Organic - Plain - Open forms - Painterly	- Closed compositions - Open composition - Tendency toward symmetry - A-symmetrical - Centered - Alignment of the subject matter - Horizontal frames - Vertical frames - Emphasis on a single subject matter - Empty and quiet composition - Busy and crowded compositions	- Female body - Female torso - Portraits - Hidden human faces - Fashion - Interior space - Shadows, reflections
Mukono	- Black and white photography - Monochromatic - Dark value - Light value - Midtones - High contrast - Medium contrast - Low contrast - Low saturation - Neutral colors	- Figurative - Organic - Plain - Open forms - Closed forms	- Closed composition - Tendency toward symmetry - Asymmetrical - Centered alignment of the subject matter - Horizontal frame - Vertical frame - Emphasis on a single subject matter - Empty and quiet composition	- Human torso/male torso - Female torso - Portraits - Hidden human faces - Nature/landscape - Human body - Everyday objects - Still life - Animals
My Mother’s Clothes	- Color photography - High saturation - Medium saturation - Low saturation - Cool colors - Warm colors - Neutral colors - Dark value - Light value - Midtones - High contrast - Low contrast - Medium contrast - Chromatic	- Organic - Geometric - Textured - Plain - Open form - Linear - Closed form - Decorative - Pattern - Floral - Text	- Closed composition - Tendency toward symmetry - Asymmetrical - Centered alignment of the subject matter - Square frames - Emphasis on a single subject matter - Busy and crowded compositions - Empty and quiet composition	- Female clothes - Still life - Everyday objects - Domestic space
Scene	- Black and white photography - Monochromatic - Dark value - Light value - Midtones - High contrast - Medium contrast	- Figurative - Textured - Plain - Open form - Closed-form - Pattern	- Closed composition - Tendency toward symmetry - Asymmetrical - Centered alignment of the subject matter - Square frame - Emphasis on a single subject matter	- Human body - Female body - Male body - Human torso - Male torso - Female torso - Interior space - Shadows, reflections - Artists
Trigger	- Color photography - Medium saturation - Low saturation - Neutral colors - Cool colors - Dark value - Light value - Mid-tones - High contrast - Low contrast - Medium contrast - Chromatic	- Plain - Organic - Geometric - Closed form	- Closed composition - Tendency toward symmetry - Asymmetrical - Centered alignment of the subject matter - Horizontal frame - Vertical frame - Emphasis on a single subject matter - Empty and quiet composition	- Interior space - Domestic space - Still life - Animals

${}^{1}$ Arrangements of visual elements within a frame
${}^{2}$ What we are looking at

Table 3: NEXPs of the galleries in Dataset S1

Gallery	NEXPs
	Context ${}^{3}$	Intention ${}^{4}$	Meaning
Heat + High Fashion	- Modern and contemporary - New York “bohemian.”	- Engaging female fashion through photography - Realism yet a sense of ambiguity through blurred images and painterly qualities of the medium of photography (in that sense, it can place itself against modernist photography and its notion of medium specificity)	- Photographs of dancer Isadora Duncan - Fashion photography that reflects the time and historical context - Female identity through fashion and clothing - Realism yet a sense of ambiguity through blurred images and painterly qualities of the medium of photography
Mukono	- Contemporary photography - Cultural studies	- Realism and documentary photography	- Documenting people around the world - Race, ethnicity, culture-racial and cultural identity
My Mother’s Clothes	- Contemporary photography conceptual art (using ready-made objects as works of art)	- Conceptual photography inspired by conceptual art and the use of readymade/ordinary objects and blurring the boundary between art and life -Photographing her mothers’ clothes and personal items as a form of portraits or chronology of her mother’s life [27] - Using art and photography to cope with the loss of her mother and her mother’s suffering from Alzheimer[27]	- Her mother’s clothes and personal items as her mother’s portrait/body= clothes as a metonymy of the person - Remembering the past, memories of her mother, perhaps a sense of nostalgia - Co** with Truma of loss of her mother and her suffering from Alzheimer - Gender expression/identity - Social class in America
Scene	- 1960s underground /avant-garde artists’ scenes in the United States	- Realistic photographs of Avant-garde artists in New York during the 1960s - Documentary photography	- Photography and realism - Indexicality - Representation of avant-garde artists in NYC during the c1960s - Human emotion and psychological expression
Trigger	- Contemporary photography - Conceptual photography	- Conceptual photography - Photographing everyday objects and domestic space (her hometown) - Capturing time passing through photography	- Artist’ hometown and the lives of people who has lived there [28]. - Passage of time [28] - Her personal experiences [28] - Collision of past and present [28] - Domestic space and meaningful—perhaps personal everyday objects [28] - Time, temporality

${}^{3}$ Historical, social, political, cultural conditions in which the work is created
${}^{4}$ Intention refers to artists’ intention to make a work of art with a meaningful connection to previous works of art and the history of art.

[Uncaptioned image] — Table 4: Gallery outliers for Dataset S1. Galleries *Heat + High Fashion*, *My Mother’s Clothes*, and *Scene*
do not have any outlier.

${}^{5}$ An image that is different from the rest due to multiple differences or one major distinction.

Design Verification

Principal Component Analysis (PCA) was used to visualize the designed datasets to be verified with respect to their purpose for the experimental study. Each image in a dataset underwent the same pre-processing as the images used with the DCNN model (except for data augmentation transformations), such as to resize and center-crop to a 224 $\times$ 224-pixel image in RGB (color images) and L (grayscale images) color space. The first three principal components of each image were then plotted in a 3D scatter plot. Each gallery was shown using a separate color. The formation of distinct clusters with points of the same color indicates the success of image classification based on the chosen features. Random placement of points of the same color indicates the opposite.

DCNN Model

The three experiments used a VGG-11 DCNN model [13] pre-trained on ImageNet dataset [14] and discriminatively fine-tuned [29]. The model came from the PyTorch framework. To avoid overfitting, batch normalization [30] was used as a regularization technique along with data augmentation using methods Random Rotation, Random Horizontal Flip, and Random Crop with Padding from Torchvision library [31]. The learning rate obtained by method Cyclical Learning Rates [32] for the transferred features was an order of magnitude lower than that of the output classifier layer. DCNN model training utilized Adam Optimizer [33] with a cross-entropy loss function.

Model Evaluation

For comprehensive evaluation without uncertainty and with the least possible overstatement of the results [34], the trained DCNN model was tested using statistical measures and classification metrics. Statistical measures assessed the model behavior in response to the intended experimental setup. Unimodal distributions indicate if the model has a single behavior, and multimodal distributions show if the model responds in multiple ways or if there are undetected, underlying variables present. Comparison of means through ANOVA tests can identify interpretable characteristics of the model’s responses to the designed inputs. Classification metrics evaluate a model’s response to a specific task, and can facilitate an unbiased and more balanced assessment of the DCNN model.

Statistical measures. A total of 1400 trials were collected for each experiment to sample the space of possibilities, and unique seeds to torch and all other random processes were used to maximize the likelihood of finding outliers. A one-way ANOVA test with a confidence interval of 0.99 ( $\alpha$ = 0.01) was performed as the primary measure for statistical comparison of the model’s overall performance. Although ANOVA test is robust to the non-normality of the distribution and to some degrees of the heterogeneity of variances with equal sample sizes [35, 36], we still performed Leven’s test of homogeneity of variances [37] and Shapiro-Wilk normality test [37], and visually verified these assumptions by assessing the histograms and normal Q-Q plots. For large sample sizes, like ours, minuscule derivations from normality can be flagged as statistically significant by parametric tests [38, 39, 40], hence the need to visually inspect the distributions. To pinpoint the different pairs and to consider the deviations from normal distributions of homogeneous variance when using ANOVA test for comparison of means, Games-Howell (nonparametric) [41] and Dunnett’s T3 (parametric) [42] post-hoc tests were carried out to account for the violation of homoscedasticity or equality of variances, and Tukey’s test [43] for controlling Type I error, or the likelihood of an incorrect rejection of the hypothesis. All statistical tests were performed by SPSS and plots were created by seaborn API.

Classification metrics. Eight class-wise measures were computed to observe the DCNN model’s performance for each art gallery: Positive Predictive Value (PPV, Precision), True Predictive Rate (TPR, Recall, Sensitivity), True Negative Rate (TNR, Specificity), Negative Predictive Value (NPV), False Positive Rate (FPR), False Negative Rate (FNR), False Discovery Rate (FDR), and Class-wise Accuracy (Acc). In addition to the overall accuracy (ACC), Matthew’s Correlation Coefficient (MCC) was utilized to avoid overemphasized (inflated) results [44]. For simplicity and without information loss, only measures primarily evaluating True values (True Positive, True Negative), PPV, TPR, TNR and NPV were used to analyze the DCNN model’s performance. The full report of the discussed statistical measures and plotted results were presented in the supporting materials section.

Experiments

Experiment I. The importance of EXPs Vs. NEXPs in classification of solo shows

Dataset S1 description. As summarized in Table 2 and Table 3, this dataset was designed by our art expert to have the most dissimilar NEXPs and the most similar EXPs among its galleries as compared to datasets S2 and S3. If Hypotheses I and II were true, then the high diversity of NEXPs (Hypotheses I) and high similarity of EXPs among galleries (Hypotheses II) would result in a poor performance of the DCNN model in classifying the artwork images into their correct galleries. Strong classification performance rejects at least one of the hypotheses.

Results. The PCA 3D plots in Fig 5(a) depict the similarity of the data points (e.g., art images) of the three subsets, difficult, average, and easy. Different colors indicate different galleries. Moving from subset difficult to subset easy, the plots showed the gradual formation of single-color clusters of points, with fewer occlusions and mixtures of images from different galleries. However, there were no fully homogeneous clusters. The PCA plots, with three principal components variance of 79.5%, 67.82%, and 54.44% for the three subsets, validate their purpose with respect to studying the impact of the training/test sets on the DCNN performance.

The performance results obtained for classifying Dataset S1 into galleries using the DCNN model were as follows. ANOVA post-hoc tests indicated that there was no statistically significant difference between the ACC of subsets difficult and average (p_{Tukey HSD} = 0.236, 99% C.I = [-0.0049, 0.0199], p_{Dunnett T3} = 0.374, 99% C.I = [-0.0057, 0.0207], p_Games-Howell = 0.283, 99% C.I = [-0.0056, 0.0206]), even though the box plots in Fig 5(b) and the bar plots of the class-wise metrics in Fig 6 suggested otherwise. ANOVA post-hoc tests for the rest of the subset pairs showed a statistically significant difference (p $<$ 0.01 for all tests). Hence, the DCNN model had a distinct behaviour for each of the four subsets.

Next, the capacity of the DCNN model to correctly identify a specific gallery was studied for the four subsets of Dataset S1. The class-wise metrics in Fig 6 displayed distinct distributions for the four subsets. The closeness of MCC averages (i.e. 0.69, 0.61, 0.62, and 0.76) and of ACC averages (e.g., 0.75, 0.68, 0.68, and 0.80) for subsets random, difficult, average, and easy suggest a very low possibility of random assignment of gallery labels by the model. Also, the closeness of MCC averages and ACC averages in addition to the ascending trends also support the intention about the desired difficulty levels of the four subsets. However, metric Acc was not reliable in understanding the model’s ability to find the correct gallery labels, as it barely changed for any gallery. More insight was obtained by analyzing the other metrics. As NPV and TNR measure True Negatives (TNs), their relatively continuous high values suggest that DCNN model was quite successful in differentiating galleries, such as indicating that a certain artwork image is not part of a gallery. PPV and TPR measure True Positives (TP), hence the DCNN’s ability to identify the correct gallery of an artwork image. As Fig 6 indicates, PPV and TPR were consistently low for galleries “Trigger” and “Heat + High Fashion”, and low for subsets difficult and average for galleries “Scene” and “Mukono”. PPV and TPR were consistently high only for Gallery “My Mother’s Clothes” and for subsets random of galleries “Mukono” and “Scene”. Hence, with the exception of one gallery, “My Mother’s Clothes”, DCNN model struggled with but finding the correct gallery of the artwork images. An additional experiment was performed to clarify whether the high performance obtained for Gallery “My Mother’s Clothes” was due to EXPs or NEXPs being learned by the model. The experiment, called Experiment 4 is discussed at the end of this section. In this experiment, additional images were added as part of the new gallery “Non-Art”, so that these images were very similar in their EXPs with galleries “Trigger” and “My Mother’s Clothes” but had no NEXPs, as they did not represent artwork. As shown in Figure 14, the new gallery worsened the classification performance, which suggests that the DCNN model did not learn NEXPs but was negatively affected by the increased EXP similarity between distinct galleries. In conclusion, the experiments using Dataset S1 confirmed the two hypotheses.

A detailed analysis was then performed by the art expert on the classification results to understand how NEXPs and EXPs influenced the performance. Even though Gallery “Trigger” was expected to be the hardest to classify among all galleries, DCNN results showed that it was the second hardest to classify. Instead, Gallery “Heat+ High Fashion” had the lowest performance. This is likely because it is one of the three grayscale galleries. It shares similar EXPs with Gallery “Scene”, and its size is slightly smaller than the other two galleries (see the supporting materials section). Assuming that the model learned NEXPs and used them for classification, distinguishing the three grayscale galleries with dissimilar NEXPs (due to their historical differences) would be easier. However, this situation was not observed. As summarized in Table 2, Gallery “My Mother’s Clothes” had the most the diverse EXPs as compared to the other galleries (and the most similar EXPs within the gallery), which explains the high performance in correctly finding the gallery for its images. For the situations with high TP performance, the values were similar for subsets random and easy. Thus, randomly selecting training images for these situations is likely to include sufficient features to support a relatively correct classification, such as having enough diverse EXPs, as estimated by our art expert. Moreover, DCNN performance is less linked to NEXP diversity.

Cross-referencing PCA and class-wise metrics showed that inter-class similarities of principal components represented by Euclidian distance in a 3D space pose more challenges than within-class dissimilarities. This was observed in three instances for Dataset S1: (i) The performance for Gallery “Heat+ High Fashion” was the lowest, even though two test images in subset average were close to each another. This is likely because other classes’ datapoints were concentrated nearby. (ii) The low performance for Gallery “Trigger” is due to the occluded points in subset difficult and the distant points in subset average. The best performance was obtained for subset easy, as there were no points from other classes present between the two test image datapoints. (iii) Gallery “Scene” ‘s performance was the lowest for subset average and about similarly high for the other subsets. This was likely due to the placement in close proximity of its three test images as well as the closeness of subset average’s points to other galleries. The gallery size was not critical on its own in setting the difficulty level of a gallery, however, it biased the classifier in some cases towards the larger galleries.

Dataset S2 description. Dataset S2 was designed to have the most dissimilar EXPs and most similar NEXPs between galleries as compared to datasets S1 and S3. If Hypotheses I and II were true, then the high similarity of NEXPs (Hypotheses I) and high diversity of EXPs between galleries (Hypotheses II) would result in strong performance of the DCNN model in classifying the artwork images into their correct galleries. Poor classification performance rejects at least one of the hypotheses.

Results. PCA 3D plots in Fig 7(a) illustrate the similarity of the datapoints (e.g., art images) of the three subsets. As observed for Dataset S1 too, clusters of points of the same color (hence, pertaining to the same gallery) were observed as the DCNN model classified subsets difficult, average, and then easy. There were multiple instances of fully homogeneous clusters. PCA plots, with three principal components variance of 61.18%, 60.63%, and 64.06% for the three subsets, validate their purpose with respect to studying the impact of the training/test sets on the DCNN performance.

The resulting performance for classifying Dataset S2 into galleries using the DCNN model were discussed next. ANOVA post-hoc tests showed a statistically significant difference between the ACC values of all subsets (p $<$ 0.01 for all tests). Although box plots of subsets random and average in Fig 7(b) are almost identical, the two subsets are still distinguishable through their outliers. This is supported by their class-wise metrics in Fig 8 too. Hence, the DCNN model had a distinct behavior for each of the four subsets.

The capacity of the DCNN model to correctly identify a specific gallery was studied using the four subsets of Dataset S2. The class-wise metrics in Fig 8 displayed distinct distributions for the subsets. The minimal changes of metric Acc show that it is unreliable in finding the correct gallery labels. The closeness of MCC averages (i.e. 0.92, 0.80, 0.94, and 0.99) and of ACC averages (e.g., 0.93, 0.82, 0.94 and 0.99) for subsets random, difficult, average, and easy indicate a very low possibility of random assignment of gallery labels by the model. The ascending trends support the desired difficulty levels of the four subsets. Classification performance was strong for all galleries. This supports the validity of Hypothesis I and II.

The detailed analysis of the other metrics produced the following observations. Subset easy had almost perfect values for all metrics and for all galleries. Slightly worse, Subset average had close to perfect metric values for all galleries with the exception of galleries ”Painted Nudes” and “Heat + High Fashion”. The relatively low value of TPR for Gallery ”Painted Nudes” indicates that False Negatives (FNs) are the cause of the poorer performance, e.g., the DCNN model did not recognize well the images of this gallery. False Positives (FPs) reduce the classification performance for galleries “Heat + High Fashion” and “Fall of Spring Hill”, i.e. the DCNN model mistook images in Gallery ”Painted Nudes” as being in one of the two galleries. For subset random, the DCNN model struggled with recognizing images for galleries “Persephone” and “Bullets”. It misclassified their images as being in Gallery ”Painted Nudes”. Galleries “Heat + High Fashion” and “The Fall of Spring Hill” had the lowest and second lowest PPV and TRP values. For subset difficult, the images in galleries “Fall of Spring Hill” and “Bullets” were hard to classify. A likely reason for the model confusing images in the two galleries as being in galleries “Heat + High Fashion” or ”Painted Nudes” is the high concentration of human figures in these galleries.

Dataset S3 description. Dataset S3 was designed to have its amounts of inter and within similarities and dissimilarities of EXPs and NEXPs between those of datasets S1 and S2. If Hypotheses I and II were true, then its in between amount of dissimilarities and similarities of NEXPs (Hypotheses I) and of EXPs (Hypotheses II) would result a classification performance that is between those for datasets S1 and S2. A strong or a poor performance rejects at least one of the hypotheses.

Results. PCA 3D plots in Fig 9(a) depict the similarity of the data points (e.g., art images) of the three subsets. Similar to datasets S1 and S2, more homogeneous point clusters were formed while processing subsets difficult, average, and easy. Like for dataset S1, there were no fully homogeneous clusters. The PCA plots, with three principal components variance of 81.12%, 79.28%, and 73.05% for the three subsets, validate their purpose with respect to studying the impact of the training/test sets on the DCNN performance.

The resulting performance for classifying Dataset S3 into galleries using the DCNN model was presented next. ANOVA post-hoc tests showed a statistically significant difference between the ACC of all subsets (p $<$ 0.01 for all tests). The box plots of all subsets in Fig 9(b) indicated three distinct distributions. Hence, the DCNN model had a distinct behavior for each of the four subsets.

The capacity of the DCNN model to correctly identify a specific gallery was studied using the four subsets of Dataset S3. The class-wise metrics in Fig 10 displayed distinct distributions for the four subsets. The minimal changes of metric Acc suggest that it is unreliable in finding the correct gallery labels. The closeness of MCC averages (i.e. 0.73, 0.53, 0.68 and 0.84) and of ACC averages (e.g., 0.77, 0.62, 0.74 and 0.869) for subsets random, difficult, average, and easy indicate a very low possibility of random assignment of gallery labels by the model. The ascending trends support the desired difficulty levels of the four subsets. TNR and NPV values are high for all situations, hence the DCNN model can often correctly indicate if an artwork does not pertain to a gallery. PPV and TPR values are superior than for Dataset S1 but lower than for Dataset S2. Results supprt the two hypotheses in the introductory section.

The detailed analysis of the metrics showed that the DCNN model consistently succeeded in classifying galleries ”Boarding House” and ”Painted Nudes” for all subsets. This was due to the high EXP dissimilarity of the two galleries and the rest of the dataset. Gallery ”Boarding House” was the only grayscale gallery, and ”Painted Nudes” was the only mixed medium gallery with textures of paint and brush on top of photography. The low performance for all subsets of Gallery “Trigger” was because of its heavy using of NEXPs, as the gallery presents conceptual art. For the cases with a low performance, like Gallery “Trigger” and subset difficult of Gallery “Private”, TPR values are often less than PPV values, hence, the artwork in these galleries were incorrectly assigned to other galleries. Images in galleries “Private” and “Boarding House” were assigned to Gallery “My Mother’s clothes”.

Experiment II. The importance of EXPs Vs. NEXPs in classification of group shows

Dataset G1 description. This experiment investigated the impact of the EXPs diversity within a gallery on DCNN classification performance while NEXPs similarity remained high for a gallery. To that end, the devised dataset included two group exhibitions by multiple artists with distinctive styles, hence diverse EXPs, while their artwork shared NEXPs that allowed the curator to assemble them in a single group exhibition. If hypotheses I and II were true then the classification performance for Dataset G1 should be worse than the performance in Experiment I due to the increased EXP diversity within a gallery. Otherwise, at least one of the tw hypotheses should be rejected.

Results. The PCA 3D plots in Fig 11(a) present similar trends as the trends observed for Experiment I. However, the obtained overall clusters of images are less homogeneous as the experiment shifted from subsets difficult to subset easy. The data variance was 67.79%, 64.69%, and 62.02% for subsets difficult, average, and easy. This confirmed the validity of the subsets with respect to their purpose for the experiments.

The performance results about DCNN model’s capacity to correctly classify Dataset G1 into galleries were summarized next. ANOVA post-hoc analysis exhibited four distinct behaviors (p $<$ 0.01 or p = 0 in all tests). The box plot in Fig 11(b) of the overall metrics along with their numerical values, e.g., MCC was 0.54, 0.30, 0.56, and 0.64 and ACC was 0.65, 0.46, 0.66, and 0.71, also confirmed the desired levels of difficulties of the three subsets. MCC values were 0.7 to 0.16 larger than ACC values.

Class-wise metrics in Fig 12 show a decreased DCNN performance as compared to the classification performance obtained for Dataset S1. PPV and TPR values were consistently low for all galleries, except Gallery “The Unknown”, which offered the best performance for Dataset G1. Galleries “Trigger”, “Epilogue”, and “Bullets” were the hardest, second hardest, and third hardest to classify, as Experiment II went from considering subset random to subset easy. Hence, the increased diversity of the within-gallery EXPs had an important influence on lowering the DCNN model’s capacity to correctly identify a gallery. Moreover, the high within-gallery EXPs diversity of subset difficult made this subset to be the hardest to classify among all the subsets used in the three experiments of this work. Note, however, that the increased EXP diversity did not affect the expected difficulty level of subsets difficult, average, and easy. The lower capacity of DCNN model to correctly classify Dataset G1 support hypotheses I and II.

A more detailed analysis showed that for subset difficult of Gallery “The Unknown”, PPV dropped while its TPR stayed high, suggesting that the number of FP increased. Images from other galleries were misclassified to this gallery. Also, our expectation for Gallery “30 Years of Women” was incorrect, as DCNN model had an average performance for this gallery. One possible reason could be its very large number of data points as compared to the other galleries. Our expectation about Gallery “The Unknown” were correct despite its sample size being about 2.8 times smaller than for Gallery “30 Years of Women”. Future work will address the two unexpected situations.

Experiment III. The Importance of EXPs Vs. NEXPs in Distinguishing Art Images from Non-Art Images

Dataset S4 description. This experiment investigated the validity of the two hypotheses depending on the size of the datasets, including images that were not art. In addition to the galleries in Dataset S1, Dataset S4 included a new gallery of non-art images of ready-made, ordinary objects, like human clothes. These images were similar in their EXPs with galleries “Trigger” and “My Mother’s Clothes”, but had no NEXPs. If Hypotheses I and II were true then Dataset S4 would be more difficult to classify than Dataset S1, including having a lower performance for galleries “Trigger” and “My Mother’s Clothes” than their classification performance obtained for Dataset S1. The performance obtained for Gallery “Non-Art” should be also low. These outcomes were expected due to Gallery “Non-Art” having no NEXPs but presenting similar EXPs as the two galleries above. Otherwise, at least one of the two hypotheses is rejected.

Experiments were run for two versions of Gallery “Non-Art”, a 34-image version and an 18-image version. This experiment also addressed the question obtained after the experiment for Set SF1, which is that the gallery size does not influence classification using NEXPs, but if two galleries are similar in their EXPs, the larger gallery offers better performance.

Results. The PCA 3D plots in Fig 13(a) show similar trends as the trends for Experiment I (Fig 5(a)) and Experiment II (Fig 7(a)), and confirm the designed levels of difficulty of the four subsets. The data variance was 73.27%, 68.97%, and 60.62% for subsets difficult, average, and easy.

The performance results for classifying Dataset S4 into galleries using the DCNN models were as follows. One-way ANOVA results for the two versions of Dataset S4 (e.g., with 34 and 18 extra non-art images) showed a statistically significant difference (p $<$ 0.01). However, based on the box plot in Fig 14(b), the statistical sensitivity was negligible. The class-wise analysis in Fig 14(a) of the two versions showed identical trends aside from the expected bias because of Gallery “Non-Art”. Hence, the rest of the experiments focused only on the 34-image version of this gallery.

ANOVA post-hoc analysis showed no statistically significant difference between subsets average and random (p ${}_{Tukey~{}HSD}$ = 0.001, 99% C.I = [0.0021, 0.0249]), p ${}_{Dunnett~{}T3}$ = 0.002, 99% C.I = [0.0015, 0.0255]), and p ${}_{Games-Howell}$ = 0.002, 99% C.I = [0.0017, 0.0254]), and a strongly significant difference for the other subsets (p $<$ 0.01 or p = 0 in all tests). However, the box plots for subsets average and random present minimal differences, such as outliers and IQR, similar to the previous two experiments. Thus, the DCNN model had distinct behavior for the four subsets.

The capacity of DCNN model to identify a certain gallery for the four subsets of Dataset S4 was discussed next. The class-wise metrics in Fig 16(a) indicated different distributions for the four subsets. Similar to set G1, MCC were 0.55, 0.35, 0.57, and 0.62, and ACC values were 0.63, 0.49, 0.64, and 0.68. The overall and class-wise performance confirmed the expectation that the non-art gallery confused the DCNN model. PPV and TPR values decreased for galleries “Trigger” and “My Mother’s Clothes” as compared to their performance for Dataset S1. PPV and TPR values for Gallery “Non-Art” were low except subsets random and easy for which is was higher.

More specifically, DCNN classification performance was the highest for galleries “Mukono” and “Scene” (for all their subsets), as they had distinct EXPs while not having similar EXPs with Gallery “Non-Art”. The large size of “Mukono” and “Scene” also explains why the model performed better for these galleries as compared to Gallery “Heat + High Fashion”. The lowest performance among all experiments in this work was obtained for subset difficult of Gallery “Trigger”. Galleries “My Mother’s Clothes”, “Heat+ High Fashion”, and “Non-Art” also produced a low classification performance. The performance for subsets easy was better for all galleries, aside galleries “Trigger” and “Heat+ High Fashion”. Art galleries contain detectable EXPs that should differentiate art objects from non-art images of the same object. However, Experiment III showed that DCNN model’s understanding of EXPs is not sufficient yet.

Discussion

Human experts assembled galleries and exhibitions based on their interpretations grounded in mental processing of the visual art images through their explicit and tacit knowledge, obtained through formal training and experience, as well as the ideas specific to their context [45]. Some of their analysis and decisions can be explained through rules, like those summarized in art history [45], but other are subjective interpretations. It can be argued that there is currently no formalized, quantitatively-defined art ontology and procedural analysis method that could serve as the theoretical backbone for automatically understanding art, including artwork grou** into galleries based on its meaning, artist intention, and viewer interpretation. Instead, art galleries reflect a qualitative, narrative interpretation of art objects based on assembling EXPs into NEXPs that define the meaning, intention, and interpretation of artwork. A conclusion of this work is that using general-purpose vision databases have likely only a limited role in curating art, such as to use them to train DNNs to recognize low-level features, because their meaning is absent (i.e. NEXPs). Experiments showed that differentiating between difficulty levels (e.g., subsets difficult, average, and easy) is not cumulative, so that it can be easily quantified statistically, as there are no significant statistical differences between the subsets distinguished by the art expert.

While other research suggests that DCNNs can reliably learn object fragments and then use these fragments in some scene understanding [26], this work argues that learning does not include all object features needed to group related artwork into galleries. Features that define an object’s uniqueness within an artwork are likely not learned, if they are not critical in recognizing the object from other objects. For example, a unique but repetitive combination of color on a grayscale image can be specific to an artist and help distinguish his work from other artwork. Due to its repetitive nature, a DCNN might learn the specific feature. However, rare features (e.g., EXPs) are not learned if they pertain to repetitive, high-level concepts (i.e NEXPs). Experiments showed that an artist’s signature was not picked up by DCNN model unless it was based on repetitive EXPs that could be learned, like having a yellow stripe over a greyscale image. A consequence of this observation is that aggregated, statistical metrics can observe global, systematic differences but not individual features. Histograms and outlier analysis, e.g., the number, position, and type of outliers, could address this limitation. New metrics are required to capture the assessment by experts, like novelty, craftmanship, and viewer perception of artwork. These metrics must be conditioned by the cultural context of the expert’s assessment.

Fig 15 summarizes the themes of some of the galleries used in the experments. They include genres, like human figures and landscapes. Art objects having, but not necessarily, female figures as one of their central pieces addressed themes, like female identity, female fashion, African identity in the sixties and seventies, eighties, and contemporary. Fig 15 shows an ontology fragment of these concepts, in which arrows indicate the general – to specific relation and dashed lines the combination of concepts that co-occur in an art object. These relations are one kind of possible associated meanings, but other interpretations exist too. Extracting possible meanings for an art object includes identifying the symbolism of the concepts as well as conceptual interpretations, like analogies and metaphors, for the relations among concepts. Moreover, object EXPs, like color, shapes, texture, position, hues, illumination, and so on, can have a certain symbolism, interpretation, or induce a certain feeling to the viewer [45]. For example, common objects in an art composition could point to everyday life, and possibly to the collision of present and past [45]. Or, the relative positioning of objects or their unusual postures, e.g., a chair’s position, can serve a certain purpose in an artwork’s theme narrative [45]. Experiments suggest that DCNN classification difficulty relates to the ambiguity of how EXPs, i.e. the visuals of physical objects, relate to NEXPs and their higher-level semantics, like intention and interpretation, which is an artwork’s projection into the idea space. The difficulty increases with the abstraction levels of the ontology (Fig 15) where ambiguities occur. Having multiple narratives for an object is also part of the possible ambiguities.

The analysis of the differences between human art curation and DCNN classification shows several limitations of DCNN models in learning and understanding higher-level semantics. All learned differences are based on visual EXPs, like texture, tones, shapes, and objects. However, models are not capable to dynamically reprioritize the importance of EXPs depending on the process that would lead to understanding NEXPs and the meaning of an art object. For example, Baxandall explains that understanding art is a problem-solving process that constructs a narrative that expresses an object’s meaning [45]. The object must be reinterpreted and reprioritized in the context of the narrative, while possibly drop** significant amounts of general-purpose learning using generic image databases, like ImageNet.

Another limitation of DCNNs related to NEXP learning refers to creating plausible narratives expressing the theme of an artwork. Narratives are based on the connections to the artist’s or observer’s context (including previous art), and the causal relationships between objects or their symbolic meaning, as well the map** relations in the case of meanings based on analogies, metaphors, and abstractions [46, 47]. Some insight related to the historical context can be inferred using details, like clothing, hair style, or furniture. However, some of these details might not be captured during DCNN learning as they are less frequent than other features. Also, while recent methods can identify and learn some analogical map**s [22], these methods are symbolic and use numeric metrics to establish the map**s. Current DCNNs cannot learn well map**s, including some with qualitative, subjective, social, and emotional knowledge. A possibility would be to collect such data through surveys and then incorporate it to the DCNN learning process [12]. However, surveys are likely to be ineffective in hel** to find which EXPs and NEXPs are the cause for the survey inputs, even though a human expert can indicate quite accurately how visual cues, like color and pattern, produce a certain interpretation or emotion.

Finally, there is similarity between art creation defined as open-ended problem solving and other creative processes, like engineering design. Problem framing in design relates to theme selection in art, while creating the structure (architecture) of an engineering solution corresponds to creating the structure of a painting scene. The two solution spaces are constrained by various design rules and aesthetic rules, respectively, e.g., proportions, projections, coloring, and so on [45]. However, there are major differences too. While engineering is mostly guided by numerical performance values that express the objective quality of a design and to a much lesser degree by subjective factors, like preference for some functions, art creation is guided by arguably no quantitative analysis, being subjected only to qualitative, subjective evaluations. Besides, an engineering solution has a well-defined meaning and purpose, which is perceived in the same way by all. In contrast, the meaning of art depends on the artists and viewers, gets shaped by different cultures, and evolves over time.

Conclusions

Modern theories of art suggest that Exhibited Properties (EXPs) and Non-Exhibited Properties (NEXPs) characterize any work of art. EXPs are visible features, like color, texture and form, and NEXPs are artistic aspects that result by relating an art object to human history, culture, the artist’s intention, and the viewer’s perception. Current work on using Deep Neural Network (DNN) models to computationally characterize artwork suggests that DNNs can learn EXPs and can gain some insight on meaning aspects tightly related to EXPs, but there are no extensive studies about the degree to which NEXPs are learned during DNN training, and then used for automated activities, like classifying artwork into galleries. To address this limitation, this work conducted a comprehensive set of experiments about the degree to which Deep Convolutional Neural Network (DCNN) models learn NEXPs of artwork. Two hypotheses were formulated to answer this question: The first hypothesis states that DCNN models do not capture NEXPs well for art gallery classification. The second hypotheses states that EXP similarities and differences within and between art galleries determine the difficulty level of DCNN classification.

Three experiments were devised and performed to verify the two hypotheses using datasets about art galleries assembled by an art expert. Experiments used the VGG-11 DCNN pretrained on ImageNet database, and then retrained using art images. The three experiments considered the following situations: (1) using EXPs and NEXPs for classification of art objects in solo (single artist) galleries, (2) utilizing EXPs and NEXPs for classification of art objects in group galleries, and (3) distinguishing art objects from non-art objects, and the impact of dataset size on classification results. Datasets were put together for each situation and for different difficulty levels of DCNN classification. Results were analyzed using statistical and classification measures.

The experimental study validated the two hypotheses. VGG-11 DCNN did not learn NEXPs sufficiently well to support accurate classification of modern artwork into galleries similar to those curated by human experts, and EXPs were insufficient for understanding, interpreting, and classifying artwork. Higher EXP similarity among galleries or higher EXP diversity within a gallery increased the difficulty level of classification in spite of their NEXP values, which suggests that EXPs were the determining factor in classification. Dataset size was not a main factor in improving DCNN classification, but increasing dataset size can help galleries with similar EXPs. This work suggests that any attempt to automate art understanding should be equipped with mechanisms to capture well EXP and NEXP of artwork.

The three experimental studies are useful not only to characterize the general limitations of DCNN models, but also to understand if NEXPs of art objects can be distinguished only using their EXPs, thus if an art object is fully specified within its body of similar work, e.g., gallery, or if NEXPs depend to a significant degree on elements not embodied into an art object, like contextual elements, the artist’s intention, and the viewer’s interpretation. Experimental results support the second perspective.

Further Research Directions

The DCNN model studied in this work can arguably be a rough, qualitative predictor of artwork understanding by a person without artistic training. The model’s art “knowledge” comes by superimposing features learned from a few art galleries on the features learned using images from the general vision domain. Experiments with DCNN models that would aggressively transfer knowledge from the art domain (and not only a few galleries) would add to the understanding of how well DCNNs can learn NEXPs. Another avenue of future work would consider other DNN models, such as VisionTransformer [47] and ConvNeXt [48], alongside with Transfer Learning techniques with a higher learning capacity, e.g., cascaded network architectures. Finally, the design and analysis of the datasets and experiments could explore the DCNN’s preferences and biases, i.e. whether shape or color are more important in classification, or which features are tend to be misclassified by measuring the frequency of the misclassification instances at the image level.

References

1. Spratt EL, Elgammal A. Computational Beauty: Aesthetic Judgment at the Intersection of Art and Science. In: Agapito L, Bronstein MM, Rother C, editors. Computer Vision - ECCV 2014 Workshops. Cham: Springer International Publishing; 2015. p. 35–53.
2. Zhao W, Zhou D, Qiu X, Jiang W. Compare the performance of the models in art classification. PLOS ONE. 2021;16(3):1–16. doi:10.1371/journal.pone.0248414.
3. Lecoutre A, Negrevergne B, Yger F. Recognizing Art Style Automatically in Painting with Deep Learning. In: Zhang ML, Noh YK, editors. Proceedings of the Ninth Asian Conference on Machine Learning. vol. 77 of Proceedings of Machine Learning Research. Yonsei University, Seoul, Republic of Korea: PMLR; 2017. p. 327–342. Available from: https://proceedings.mlr.press/v77/lecoutre17a.html.
4. Van Noord N, Hendriks E, Postma E. Toward Discovery of the Artist’s Style: Learning to recognize artists by their artworks. IEEE Signal Processing Magazine. 2015;32(4):46–54. doi:10.1109/MSP.2015.2406955.
5. Saleh B, Abe K, Arora RS, Elgammal A. Toward automated discovery of artistic influence. Multimedia Tools and Applications. 2016;75(7):3565–3591. doi:10.1007/s11042-014-2193-x.
6. Rodriguez CS, Lech M, Pirogova E. Classification of Style in Fine-Art Paintings Using Transfer Learning and Weighted Image Patches. 2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS). 2018; p. 1–7.
7. Levinson J. DEFINING ART HISTORICALLY. The British Journal of Aesthetics. 1979;19(3):232–250. doi:10.1093/bjaesthetics/19.3.232.
8. Levinson J. Aesthetic Contextualism. Postgraduate Journal of Aesthetics. 2007;4(3):1–12.
9. MoMa. Readymade: Moma; 2023. Available from: https://www.moma.org/collection/terms/readymade.
10. Redies C. Combining universal beauty and cultural context in a unifying model of visual aesthetic experience. Frontiers in Human Neuroscience. 2015;9. doi:10.3389/fnhum.2015.00218.
11. Tan WR, Chan CS, Aguirre HE, Tanaka K. Ceci n’est pas une pipe: A deep convolutional network for fine-art paintings classification. 2016 IEEE International Conference on Image Processing (ICIP). 2016; p. 3703–3707.
12. Roberto J, Ortego D, Davis B. Toward the Automatic Retrieval and Annotation of Outsider Art images: A Preliminary Statement. In: Proceedings of the 1st International Workshop on Artificial Intelligence for Historical Image Enrichment and Access. Marseille, France: European Language Resources Association (ELRA); 2020. p. 16–22. Available from: https://aclanthology.org/2020.ai4hi-1.3.
13. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR. 2015;abs/1409.1556.
14. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009. p. 248–255.
15. Brachmann A, Redies C. Computational and Experimental Approaches to Visual Aesthetics. Frontiers in Computational Neuroscience. 2017;11. doi:10.3389/fncom.2017.00102.
16. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? ArXiv. 2014;abs/1411.1792.
17. Elgammal A, Mazzone M, Liu B, Kim D, Elhoseiny M. The Shape of Art History in the Eyes of the Machine. In: AAAI; 2018. p. 2183–2191.
18. Redies C, Brachmann A, Wagemans J. High entropy of edge orientations characterizes visual artworks from diverse cultural backgrounds. Vision Research. 2017;133:130–144. doi:https://doi.org/10.1016/j.visres.2017.02.004.
19. Redies C, Brachmann A. Statistical Image Properties in Large Subsets of Traditional Art, Bad Art, and Abstract Art. Frontiers in Neuroscience. 2017;11. doi:10.3389/fnins.2017.00593.
20. Mao H, Cheung M, She J. DeepArt: Learning Joint Representations of Visual Arts. In: Proceedings of the 25th ACM International Conference on Multimedia. MM ’17. New York, NY, USA: Association for Computing Machinery; 2017. p. 1183–1191. Available from: https://doi.org/10.1145/3123266.3123405.
21. Lelièvre P, Neri P. A deep-learning framework for human perception of abstract art composition. Journal of Vision. 2021;21(5):9–9. doi:10.1167/jov.21.5.9.
22. Hamilton M, Fu S, Lu M, Bui J, Bopp D, Chen Z, et al. MosAIc: Finding Artistic Connections across Culture with Conditional Image Retrieval. In: NeurIPS 2020 Competition and Demonstration Track. PMLR; 2021. p. 133–155.
23. Gonzalez-Garcia A, Modolo D, Ferrari V. Do Semantic Parts Emerge in Convolutional Neural Networks? International Journal of Computer Vision. 2018;126(5):476–494. doi:10.1007/s11263-017-1048-0.
24. Liu J, Dong W, Zhang X, Jiang Z. Orientation judgment for abstract paintings. Multimedia Tools and Applications. 2017;76(1):1017–1036. doi:10.1007/s11042-015-3104-5.
25. Zafar B, Ashraf R, Ali N, Ahmed M, Jabbar S, Chatzichristofis SA. Image classification by addition of spatial information based on histograms of orthogonal vectors. PLOS ONE. 2018;13(6):1–26. doi:10.1371/journal.pone.0198175.
26. Yang C, Shen Y, Zhou B. Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis. International Journal of Computer Vision. 2021;129(5):1451–1466. doi:10.1007/s11263-020-01429-5.
27. Wolf A. Review: Jeanette Montgomery Barron’s “My Mother’s Clothes” at Jackson Fine Art; 2010. Available from: https://www.artsatl.org/a-mother-remembered-in-jeanette-montgomery-barrons-my-mothers-clothes-at-jackson-fine-art-by-alana-wolf.
28. Angela West;. Available from: https://www.jacksonfineart.com/artists/angela-west.
29. Howard J, Ruder S. Universal Language Model Fine-tuning for Text Classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics; 2018. p. 328–339. Available from: https://aclanthology.org/P18-1031.
30. Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In: Bach F, Blei D, editors. Proceedings of the 32nd International Conference on Machine Learning. vol. 37 of Proceedings of Machine Learning Research. Lille, France: PMLR; 2015. p. 448–456. Available from: https://proceedings.mlr.press/v37/ioffe15.html.
31. maintainers T, contributors. TorchVision: PyTorch’s Computer Vision library; 2016. Available from: https://github.com/pytorch/vision.
32. Smith LN. Cyclical Learning Rates for Training Neural Networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017; p. 464–472.
33. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. CoRR. 2015;abs/1412.6980.
34. Agarwal R, Schwarzer M, Castro PS, Courville AC, Bellemare M. Deep Reinforcement Learning at the Edge of the Statistical Precipice. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Vaughan JW, editors. Advances in Neural Information Processing Systems. vol. 34. Curran Associates, Inc.; 2021. p. 29304–29320. Available from: https://proceedings.neurips.cc/paper/2021/file/f514cec81cb148559cf475e7426eed5e-Paper.pdf.
35. Rogan JC, Keselman HJ. Is the ANOVA F-Test Robust to Variance Heterogeneity When Sample Sizes are Equal?: An Investigation via a Coefficient of Variation. American Educational Research Journal. 1977;14(4):493–498. doi:10.3102/00028312014004493.
36. Blanca MJ, Alarcón R, Arnau J, Bono R, Bendayan R. Non-normal data: Is ANOVA still a valid option? Psicothema. 2017;29(4):552–557.
37. Brown MB, Forsythe AB. Robust Tests for the Equality of Variances. Journal of the American Statistical Association. 1974;69(346):364–367. doi:10.1080/01621459.1974.10482955.
38. Field A. Discovering statistics using SPSS. 3rd ed. SAGE Publications; 2009.
39. Öztuna D, Elhan AH, Tüccar E. Investigation of Four Different Normality Tests in Terms of Type 1 Error Rate and Power under Different Distributions. Turkish Journal of Medical Sciences. 2006;36:171–176.
40. Kadane JB. Principles of uncertainty. Whittles Publishing; 2011.
41. Pairwise Multiple Comparison Procedures with Unequal N’s and/or Variances: A Monte Carlo Study. Journal of Educational Statistics. 1976;1(2):113–125.
42. Dunnett CW. Pairwise Multiple Comparisons in the Unequal Variance Case. Journal of the American Statistical Association. 1980;75(372):796–800.
43. Tukey JW. Comparing Individual Means in the Analysis of Variance. Biometrics. 1949;5(2):99–114.
44. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. doi:10.1186/s12864-019-6413-7.
45. Spangler S, Wilkins AD, Bachman BJ, Nagarajan M, Dayaram T, Haas P, et al. Automated Hypothesis Generation Based on Mining Scientific Literature. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 1877–1886. Available from: https://doi.org/10.1145/2623330.2623667.
46. Gil Y, Garijo D, Ratnakar V, Mayani R, Adusumilli R, Boyce H, et al. Automated Hypothesis Testing with Large Scientific Data Repositories; 2016.
47. Kolesnikov A, Dosovitskiy A, Weissenborn D, Heigold G, Uszkoreit J, Beyer L, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale; 2021.
48. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. arXiv preprint arXiv:220103545. 2022;.

Abstract

Introduction

Related Work

Methodology

Dataset Design

Design Verification

DCNN Model

Model Evaluation

Experiments

Experiment I. The importance of EXPs Vs. NEXPs in classification of solo shows

Experiment II. The importance of EXPs Vs. NEXPs in classification of group shows

Experiment III. The Importance of EXPs Vs. NEXPs in Distinguishing Art Images from Non-Art Images

Discussion

Conclusions

Further Research Directions

References

Supporting information

S1 Appendix: Metrics Definition

S2 Appendix: Dataset Descriptions

S3 Appendix: Follow-Up Experiments: The Effect of Dataset Size on Classification

S4 Appendix: Histogram Plots

S5 Appendix: Normal Q-Q Plots

S6 Appendix: Summary Box Plots

S7 Appendix: Class-wise Metrics (FPR, FNR, and FDR)

S8 Appendix: Statistical Details

S9 Appendix: Dataset Metadata

Gallery	Outlier ${}^{5}$	Gallery	Outlier
Mukono	Most images in this gallery are human portraits with a high contrast between the figure and the background. However, these images have different features from those.	Trigger	Each image in this gallery has distinct characteristics. It is hard to find an outlier due to the high diversity within the gallery. However, this image is slightly more different from the rest as it is different in terms of its subject and composition. While the other images in the gallery are close to a symmetrical arrangement, the composition in this image is asymmetrical.